Introduction to IT Operations and Maintenance Monitoring System

1. Overview of the IT Operations and Maintenance Monitoring System

IT operations and maintenance covers many areas, such as installation and deployment, configuration management, and monitoring. As the saying goes, "No monitoring, no operations and maintenance," which shows how important monitoring is to IT operations. This article focuses on the operations and maintenance monitoring system, which can be divided into three dimensions: performance metrics (Metrics), traces (Traces), and logs (Logs), as shown in Figures 1, 2, 3, and 4:

Figure 1: Three dimensions of enterprise IT operations and maintenance


Source: Love Mathematics Academy

Figure 2: Manifestation of the three dimensions


Source: Love Mathematics Academy

Figure 3: Different problems solved by the three dimensions


Source: Love Mathematics Academy

Figure 4: How to use the three dimensions to solve problems


(Source: Love Mathematics Academy)

Metrics were the earliest focus of operations and maintenance. They mainly answer whether the system has a problem, and the market for them is a crowded red ocean.
Traces are developing rapidly. They mainly answer where in the call chain a problem occurs and where it originates. They rely mainly on APM (Application Performance Management) tools to monitor, alert on, and optimize key business systems, which continuously improves business reliability and stability, provides good service to customers, and enhances core competitiveness.
Logs carry comparatively more security and operations information. They mainly answer why a problem occurred, so the barrier to entry is relatively high, and the market is still a relatively blue ocean.

2. Metrics

No discussion of Metrics is complete without Zabbix, an open-source operations and maintenance tool that supports distributed monitoring and is used by many Internet companies. Zabbix works by establishing communication with the monitored objects and collecting data from them. The communication methods include agent, SSH/telnet, SNMP (commonly used for network devices such as switches), IPMI (commonly used for hardware such as power supplies and fans), and JMX (commonly used for JVMs). The drawback of Zabbix is that it stores its data in a database, so it is not suited to storing or retrieving logs frequently and in large volumes; its focus remains on Metrics. In addition, Zabbix has weak monitoring capabilities for containers and microservices.
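
The collect-then-store principle described above can be illustrated with a minimal Python sketch. It is not Zabbix's actual implementation or schema: the `history` table, the item keys, and the lambda checks (standing in for agent, SNMP, or JMX queries) are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3
import time


def collect(checks):
    """Run each check once and return (item_key, timestamp, value) rows."""
    now = int(time.time())
    return [(key, now, float(fn())) for key, fn in checks.items()]


def store(conn, rows):
    """Append collected values to a relational history table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS history (item TEXT, clock INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO history VALUES (?, ?, ?)", rows)
    conn.commit()


def latest(conn, item):
    """Return the most recent value stored for one item, or None if absent."""
    row = conn.execute(
        "SELECT value FROM history WHERE item = ? "
        "ORDER BY clock DESC, rowid DESC LIMIT 1",
        (item,),
    ).fetchone()
    return row[0] if row else None


# Hypothetical checks standing in for agent, SNMP, or JMX queries.
checks = {"cpu.load": lambda: 0.42, "fan.rpm": lambda: 3100}
conn = sqlite3.connect(":memory:")
store(conn, collect(checks))
```

Small numeric rows like these fit a relational table well, which is consistent with the point above that this storage model targets Metrics rather than high-volume log text.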


In addition to Zabbix, other commonly used tools include Nagios, Cacti, and Prometheus. With the rise of cloud-native architectures in recent years, Prometheus, which is well suited to cloud-native monitoring, has become popular. Alibaba Cloud has also launched ARMS Prometheus, a managed Prometheus service that integrates fully with the open-source Prometheus ecosystem, supports monitoring of a wide range of components, and provides pre-configured monitoring dashboards out of the box. However, Prometheus still focuses on the Metrics level; its alerting function is incomplete, and its analysis capabilities are weaker still.

3. Traces

With the development of enterprise business and the expansion of scale, more and more components are being used, such as microservices, message processing, distributed databases, distributed object storage, distributed caching, cross-domain invocation, etc. These components together form a complex distributed network. A business request may involve the collaborative processing of several or dozens of services. In this case, we need to use a tool that can dynamically display the service chain, analyze the bottlenecks in the service chain, optimize them, and quickly locate faulty service chains. This is where APM tools come into play. APM tools can monitor both the front-end, such as mobile apps and browsers, and the back-end of applications.
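
To make the idea of locating a bottleneck in a service chain concrete, here is a minimal Python sketch. The `Span` fields and the service names are hypothetical and do not follow any particular APM product's data model. It also assumes that the child calls of a span run one after another, so a span's own time is its duration minus the durations of its direct children.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed operation in a request's call chain (hypothetical fields)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float


def self_times(spans):
    """Map span_id -> time spent in the span itself, excluding direct children."""
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def bottleneck(spans):
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# A made-up trace: gateway -> order-service -> (inventory-db, cache).
trace = [
    Span("a", None, "gateway", 120.0),
    Span("b", "a", "order-service", 100.0),
    Span("c", "b", "inventory-db", 70.0),
    Span("d", "b", "cache", 5.0),
]
```

In this made-up trace, `bottleneck(trace)` points at the `inventory-db` span, even though the gateway span has the longest total duration, because most of the gateway's time is spent waiting on downstream services.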

APM data is generally obtained through probe-based instrumentation, also known as the Agent approach. This approach can collect very complete and fine-grained monitoring data and locate problems down to the code level. However, it is invasive to applications: if the instrumentation code misbehaves, it affects the performance and stability of the application itself. The approach can be further divided into two categories, code-invasive and bytecode-enhanced. Representative products of the former include Zipkin and CAT; representative products of the latter include Pinpoint and SkyWalking. For a comparison of these products, you can refer to the following articles:

In recent years, domestic vendors providing Traces monitoring in the form of SaaS have also emerged. Although they also obtain data through probe-based instrumentation, their business models have changed. Currently, well-known vendors in China include Tingyun, Cloudwise, OneAPM, etc.
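
The difference between the code-invasive and bytecode-enhanced styles of probe-based instrumentation described earlier can be sketched loosely in Python. This is only an analogy: the bytecode-enhanced products mentioned above modify compiled classes, whereas the sketch below replaces a method at runtime. The `traced` decorator, the `timings` list, and `install_agent` are hypothetical names, not part of any APM tool.

```python
import functools
import time

timings = []  # (function name, elapsed seconds); a stand-in for an APM backend


def traced(fn):
    """Wrap fn so that each call records its elapsed time, even if fn raises."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            timings.append((fn.__name__, time.perf_counter() - start))
    return wrapper


# Code-invasive style: the developer edits the source to add the probe.
@traced
def place_order(item):
    return f"ordered {item}"


# Agent style (analogy): the target's source is untouched; a separate
# component wraps it at runtime.
class PaymentClient:
    def charge(self, amount):
        return amount * 2


def install_agent():
    """Attach the probe to PaymentClient.charge without editing the class."""
    PaymentClient.charge = traced(PaymentClient.charge)
```

Either way, the probe runs inside the application process, which is why faulty instrumentation can affect the application itself.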

In addition, there is a growing trend of APM tools that do not obtain data through probe-based instrumentation. A representative product in China is RStone. Its data acquisition method mainly adopts bypass deployment, without any changes to the network topology or system, and without the need to install any software. Only devices need to be deployed at key nodes, as shown in Figure 5:

Figure 5: RStone monitoring method


(Source: RStone official website)

4. Logs

Traditional operations and maintenance monitoring mainly focuses on Metrics, and in recent years, Traces have begun to receive more attention. However, the following issues still exist with the above two approaches:

Monitoring gap: Monitoring of the IT infrastructure layer and the application layer is not connected, or is handled by different teams.
Lack of unity: Different monitoring objects require different monitoring tools.
Alarm flooding: Intelligent operations and maintenance capabilities such as alarm convergence and fault recovery are missing.
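
Alarm convergence, mentioned in the last item, can take many forms. The sketch below shows one simple variant under stated assumptions: repeats of the same (host, check) pair within a fixed time window are collapsed into a single entry with a counter. The `Alarm` fields and the 300-second default window are hypothetical choices for illustration, not taken from any product.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    timestamp: float  # seconds since some reference point
    host: str
    check: str


def converge(alarms, window=300.0):
    """Collapse repeats of the same (host, check) within `window` seconds.

    Returns a list of [first_alarm, repeat_count] pairs in time order.
    """
    groups = []
    open_group = {}  # (host, check) -> index of its current group in `groups`
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.host, alarm.check)
        idx = open_group.get(key)
        if idx is not None and alarm.timestamp - groups[idx][0].timestamp <= window:
            groups[idx][1] += 1  # same problem, still inside the window
        else:
            open_group[key] = len(groups)  # start a new group for this key
            groups.append([alarm, 1])
    return groups


# Made-up alarms: three disk alarms on db-1 and one CPU alarm on web-1.
alarms = [
    Alarm(0, "db-1", "disk"),
    Alarm(60, "db-1", "disk"),
    Alarm(120, "web-1", "cpu"),
    Alarm(400, "db-1", "disk"),
]
converged = converge(alarms)
```

Here the window is measured from the first alarm in each group, so the disk alarm at 400 seconds opens a new group instead of extending the first one.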

Logs can effectively solve the above problems. Currently, common log monitoring tools on the market include Splunk, ELK, LogYi, and AnyRobot.

Splunk is a leader among log monitoring products. It collects and analyzes an organization's log files and allows quick statistical analysis and queries through a centralized application. It can also generate a variety of reports that help evaluate performance across the whole data set. Beyond log file analysis, Splunk, as a SaaS company, provides enterprise customers with software for searching, monitoring, analyzing, and interpreting large volumes of machine-generated data. Its users include not only IT companies and DevOps teams but also organizations in telecommunications, energy, finance, government, and other industries. Splunk's strength attracted the attention of ARK, which invested $55 million in Splunk in a single day and added it to its four major ETFs.

However, Splunk is not invincible. The use of Splunk products involves multiple components and tools, and each component incurs costs, making it relatively expensive. Mature open-source alternatives are gradually eroding Splunk's market, with the most famous being ELK.

ELK stands for Elasticsearch, Logstash, and Kibana, which provide search, data ingestion, and visualization, respectively, and together form Elastic's application stack. Although the three are separate open-source projects, Elastic develops them under a single, cohesive roadmap. The scoring mechanism for search results is superior to that of Splunk. Because it is open source, ELK has attracted support from many developers and gives them more room to participate than Splunk does.
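
The division of labor between ingestion and search can be sketched in plain Python. This is a toy illustration of the idea, not the Logstash or Elasticsearch API. The log line format, the field names, and the sample lines are assumptions, and visualization (Kibana's role) is left out.

```python
import re
from collections import defaultdict

# Assumed line format: "<timestamp> <LEVEL> <service> <free-text message>"
LINE = re.compile(r"(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)")


def ingest(raw_lines):
    """Ingestion step: parse raw lines into structured events, skipping bad lines."""
    events = []
    for line in raw_lines:
        m = LINE.match(line)
        if m:
            events.append(m.groupdict())
    return events


def build_index(events):
    """Build an inverted index: lowercase token -> set of event positions."""
    index = defaultdict(set)
    for pos, event in enumerate(events):
        for token in re.findall(r"\w+", event["message"].lower()):
            index[token].add(pos)
    return index


def search(index, events, query):
    """Search step: return events whose message contains every query token."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return [events[i] for i in sorted(hits)]


raw = [
    "2024-01-01T10:00:00Z ERROR payment timeout calling bank gateway",
    "2024-01-01T10:00:01Z INFO order order created",
    "not a log line",
    "2024-01-01T10:00:02Z ERROR order timeout writing to database",
]
events = ingest(raw)
index = build_index(events)
```

With these sample lines, `search(index, events, "timeout")` returns the two ERROR events, and `search(index, events, "timeout database")` narrows the result to the one from the order service. A real search engine additionally ranks hits by relevance, which this sketch does not attempt.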

In the Chinese market, there are also two excellent log monitoring products: LogYi and AnyRobot. Both are log-based big data analysis products for operations and maintenance. Some differences between them are listed below:

LogYi can be regarded as a faithful imitator of Splunk, always imitating but never surpassing it. AnyRobot is developed based on ELK and fully leverages the advantages of ELK.
LogYi has both software and SaaS versions, while AnyRobot has software, SaaS, and all-in-one versions.
LogYi charges based on data traffic, while AnyRobot charges based on the computing units consumed by the actual data analysis needs of users.
LogYi has a separate product called "Data Factory" for data flow management, while AnyRobot does not have this feature.
LogYi has more cases in the financial industry, while AnyRobot has more cases in the government, education, and healthcare industries.
LogYi faces obstacles in selling to industries with information security requirements, while AnyRobot has the credentials to sell to such industries.
LogYi's strategy toward Splunk is replacement: it develops products that resemble Splunk in both engineering and user experience. AnyRobot's strategy is governance rather than replacement: it does not aim to replace Splunk but can govern it.

5. Conclusion

Metrics monitoring is still the main focus of most operations and maintenance work, and tools like Zabbix will continue to be used.
Monitoring based on Traces, and even on Logs, will be increasingly valued and accepted, and traditional operations and maintenance monitoring is transitioning to intelligent operations and maintenance, namely AIOps.
Operations and maintenance tools are moving from installed software to SaaS.
In China, domestic vendors and products are emerging as the market grows.