Product VP at Cloudcy.cn Introduced How AIOps and Observability Streamline Cloud-native Oper

Techplur
In this article, we invited Mr. Zhang Huaipeng, product VP of Cloudcy.cn, to share his experience and expertise on what it takes to build a digital observation tool in an era of cloud computing.

While cloud computing brings a number of benefits such as intensification, efficiency, elasticity, and business agility, it also poses unprecedented challenges to the field of cloud operations. In this regard, adapting to new technology trends, creating an intelligent monitoring platform in the cloud era, and achieving better protection for cloud-based applications have become imperative for enterprises today.


Operational challenges under digital transformation

We live in a digital age in which digitalization influences every aspect of our daily lives, including how we work, consume, shop, and travel. It is fair to say that we have shifted from the IT era into the "DT" (digital transformation) era.

Since the advent of digital transformation, enterprises and their customers have experienced a fundamental change in how they conduct business. However, as the digital revolution advances across a range of industries, a growing number of incidents related to digital applications have been reported.

A recent survey reported that 60% of CEOs believe digital transformation is essential and that businesses should make significant progress towards digital transformation and the adoption of artificial intelligence. At the same time, 95% of enterprise applications are not monitored effectively and are therefore a potential source of problems.

Most current digital operation tools were developed in the era of traditional data centers, and a significant number of these tools and technologies do not take cloud computing scenarios into account. With the popularity of cloud computing, the information technology landscape has changed dramatically. Applications are becoming more complex, more distributed, and more interdependent at a rapid pace. Enterprises must therefore build DT solutions around business and data flows to succeed in a competitive market.

Cloud-native technology is among the many new technologies and scenarios emerging in the DT era. With its introduction, operational practices have evolved at an accelerated pace. Traditional scenarios involve a large amount of physical infrastructure, which requires enterprises to take care of conventional server room management, low-voltage power management, bare-metal hardware monitoring, UPS power distribution, and temperature and humidity control.

However, once the business runs in the cloud, the infrastructure is managed by operators or providers, so enterprises no longer need to worry about these issues.

Traditional equipment operations have thus evolved into site reliability work, which means that enterprise investment in old-fashioned operations will diminish over time.

Currently, we are undergoing a transition to AIOps. Now is the time to make digital operations and IT operations more lightweight, more efficient, and less costly. Operations teams must focus on the enterprise business, which is the key to success.


The road to AIOps for enterprises

What is AIOps?

As defined by research firms such as Forrester and Gartner, AIOps is a software system that applies artificial intelligence and data science to business and operations data in order to establish correlations and provide real-time prescriptive and predictive guidance. As a software system, AIOps can be delivered as a commercial product. AIOps may enhance and partially replace traditional core IT operations functions, such as monitoring availability and performance, organizing and analyzing events, and automatically managing IT services.

AIOps, as the name suggests, is concerned with operations, such as observation, management, and disposal. As Forrester suggested in their reports, AIOps promises greater observability and stability, both of which are important in this field.

According to Forrester, one of the core values of current AIOps is the enhancement and extension of ex-ante capabilities, that is, the ability to act before a problem occurs.

What is observability?

The term "observability" was first used in the field of cybernetics, where it describes how well the internal state of a system can be inferred from the system's outputs. IT research firm Gartner defines observability as a characteristic of software and systems. In particular, it refers to the ability to determine a system's current state and condition based on the telemetry data it generates.

Why is observability a key concept?

Observability is essential to improving the control of complex systems. Traditional monitoring techniques and tools have difficulty tracking the communication paths and dependencies of today's increasingly distributed architectures. Meanwhile, the dependencies of cloud-native or cloud-based applications are much more complex than those of traditional monolithic applications. In addition, the three pillars of observability (metrics, logs, and traces) facilitate an intuitive understanding of all aspects of a complex system.

Operations, development, SRE, marketing, and business departments can all benefit from observability. Therefore, if AIOps and observability are combined in a single platform, the resulting product accomplishes two things at once.


The paths to AIOps for enterprises

AIOps can be achieved exogenously and endogenously in enterprises. An exogenous AIOps platform is integrated into an IT operations environment as a sidecar platform. In this case, AIOps is an independent algorithm platform that accesses heterogeneous data from various sources. Data engineers process the data using big data analytics to resolve interdependencies between data sources and produce project-based results.

Endogenous AIOps, by contrast, emphasizes an integrated technology route. It closes the loop of the whole data processing pipeline without involving data engineers. This is like sending a package by courier, but with data as the "items". Data is encapsulated, stored, scheduled, and transported by the "courier", so neither the sender nor the final recipient needs to be involved in these tasks. Endogenous AIOps delivers this capability by embedding AI into one single integrated observation platform.

Exogenous vs. endogenous AIOps in technological implementation

Generally, exogenous AIOps uses traditional machine learning techniques. These are essentially statistical approaches that correlate and analyze information such as metrics, logs, and events to reduce alert noise. Machine learning yields a set of correlated alerts, but only after a certain period of data accumulation. Exogenous AIOps also needs manual input or historical records to establish a recommended or probable root cause.
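To make the idea of correlating alerts to reduce noise concrete, here is a minimal Python sketch. It is deliberately simpler than the machine learning approaches described above: it only groups alerts that occur close together in time. The `Alert` class, the `group_alerts` function, the 60-second window, and the sample data are all illustrative assumptions, not part of any vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds, on any consistent clock
    source: str       # hypothetical entity name (host, service, ...)
    message: str


def group_alerts(alerts, window_seconds=60.0):
    """Group alerts separated by at most window_seconds from the previous one."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        # Extend the current group if this alert follows closely enough.
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


# Hypothetical sample data: three alerts in a burst, one much later.
alerts = [
    Alert(0.0, "db-1", "high latency"),
    Alert(20.0, "api-1", "error rate up"),
    Alert(45.0, "web-1", "5xx responses"),
    Alert(600.0, "cache-1", "evictions"),
]
groups = group_alerts(alerts)
for group in groups:
    print([a.source for a in group])
```

Four raw alerts collapse into two groups, so an operator looks at two incidents instead of four notifications. Note that the grouping says nothing about which alert in a group is the cause, which is why this style of correlation still needs manual input or history to name a root cause.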

Meanwhile, exogenous AIOps depends heavily on external data, and vendors usually design only the algorithms. Data cleansing, the dependencies between entities within a CMDB, and so on all rely on external data. Exogenous AIOps therefore requires a mature IT operations system, APM products, access to call data as a prerequisite, and excellent observability.

Endogenous AIOps provides deterministic AI analysis that targets deterministic results. The root cause can therefore be determined in real time once a problem occurs. Endogenous AIOps maintains a matrix dependency map in real time. The technology does not depend on a static CMDB but on a dependency map that acts like a real-time CMDB, so that dependencies can change in real time and analysis can be carried out on the relationships the platform itself has captured.
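The following Python sketch illustrates why a dependency map can make root cause analysis deterministic. It is a toy under stated assumptions, not the algorithm of any product: the `depends_on` dictionary stands in for the real-time dependency map, the set of unhealthy entities is given, and the rule used (an unhealthy entity whose own dependencies are all healthy is a root cause candidate) is a simplification chosen for illustration.

```python
def root_causes(depends_on, unhealthy):
    """Return unhealthy entities that have no unhealthy dependency of their own."""
    return sorted(
        entity
        for entity in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(entity, ()))
    )


# Hypothetical dependency map: each entity lists what it depends on.
depends_on = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}
unhealthy = {"web", "api", "db"}

print(root_causes(depends_on, unhealthy))  # ['db']
```

Because the answer follows directly from the current map and the current health states, the same inputs always give the same result, and no training period is needed. The quality of the result depends entirely on how accurate and up to date the dependency map is.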

How to choose your technological pathway?

Managers must weigh the trade-offs between cost, stability, and efficiency, along with fundamentals such as budget and team. Used sensibly, AIOps can optimize the stability and efficiency of the enterprise business while keeping costs to a minimum.

According to a report from Forrester, enterprises should focus on the following key aspects when implementing AIOps:

•Whether the AIOps platform integrates seamlessly with the ITOM toolchain and is highly automated;

•Whether it emphasizes native data, including cloud-native dependencies and machine data;

•Whether it maps full-service dependencies automatically and comprehensively;

•Whether it provides intelligent observation awareness and automated implementation;

•Whether it automates root cause analysis and remediation planning for incidents;

•Whether it delivers the intelligence and automation that technical operations require today.

Data processing differences:

Traditional AIOps platforms (exogenous AIOps platforms) use a variety of tools to assemble a rickety big data processing system. In such a system, team members who resign are likely to leave their successors with a significant amount of technical debt.

The typical workflow is as follows. First, data is collected with a variety of open-source and commercial tools.

Second, the collected data is injected into the big data platform.

Third, the data relationships are sorted and cleaned manually, which requires a lot of time and effort.

Fourth, the issues to be addressed are identified, usually with the AIOps vendor working on site. The vendor asks for specifications and provides services according to those specifications.

Fifth, the dashboard is developed.

Finally, the system is scaled up. The platform therefore grows linearly as the application system scales.

It is common to see data engineers spend almost 80% of their time collecting, cleaning, and organizing data. The process requires highly skilled operations staff who understand operations, algorithms, and development. AIOps is meant to be a tool that helps solve problems, yet exogenous AIOps may increase the operations workload and require a dedicated maintenance team.

With endogenous AIOps, data processing is much simpler, and one tool can handle all aspects of data collection. As a highly commercialized product, endogenous AIOps offers out-of-the-box capabilities, such as a dashboard and an analysis engine, and does not require business engineers to understand algorithms or SRE.

In addition, as the enterprise business system scales, endogenous AIOps grows non-linearly, that is, not in proportion to the business system. This applies to the entire system, including both the user's team and the product. Using Cloudcy as an example, once the solution has been deployed, enterprises only need to install Databuff OneAgent, and many of the subsequent tasks can be automated. Consequently, operations personnel can devote their attention to the enterprise's core business.

Rather than tools that merely present raw data, the industry requires a new generation of AI software platforms that cover the entire data processing pipeline. Of the two paths, exogenous and endogenous, endogenous AIOps belongs to this new paradigm and is the recommended one.


Endogenous AIOps facilitate cloud-native operations

The objective of the endogenous AIOps platform is to build an integrated platform that combines AIOps and observability. For it to be observable, it needs to be centered on application monitoring, which is the phenomenon layer for end users. Meanwhile, it is necessary to integrate infrastructure monitoring, including monitoring of cloud platforms as well as black boxes. Lastly, it is essential to provide a digital experience that is focused on the front end.

The new AIOps platform should provide continuous automation from data access to results output. In addition, it needs to be capable of prediction and early warning.

AIOps platforms must provide enterprises with high-level observability. Rather than delivering only raw data and raw parts, they should focus on phenomena and experience and provide accurate results, minimizing the impact of massive noise on enterprises.

Endogenous AIOps platforms can differ in their data processing models. One example is the strength that Databuff OneAgent brings to the data collection process. Data processing emphasizes the metrics system, and our implementation of the metrics system differs from that of traditional AIOps platforms, which is what makes the platform truly endogenous.

Endogenous AIOps platforms will simplify cloud-native operations in the following five areas:

•Direct access to high-quality observation data;

•Continuous automation that makes operations more efficient;

•Real-time matrix topologies that can be queried;

•Instant assessment of the impact surface;

•Disclosure of root causes to prove results.

1.Direct access to high-quality observation data

High-quality back-end analysis requires high-quality front-end telemetry data. Trace data, metrics, log data, and critical topology and code data are essential for high-level observability and endogenous AIOps analysis. Data quality directly sets the ceiling on what a model can achieve.

Monitoring data that can be accessed directly must be collected non-invasively and automatically, without modifying source code, and it must be tied to business and application context. Context is an indispensable component of real root cause analysis. It helps extract accurate background information and helps the platform build real-time service flow and topology diagrams of dependencies, including the matrix relationship topology.

These diagrams display the dependencies of the application environment, primarily in the form of vertical and horizontal stacks. A service flow diagram provides an overview of an entire transaction from the perspective of a service or request. Both the service flow diagram and the topology diagram can show the calling sequence among services. The service flow diagram illustrates all transactions in order, whereas the topology diagram represents dependencies more abstractly.

While there are already many powerful free and open-source monitoring tools on the market, the commercial Databuff OneAgent technology has several advantages that open-source tools lack, including:

•Agent probes with guaranteed stability, security, and reliability;

•Controlled resource overhead and performance impact on core business servers;

•Less manual labor required for deployment, insertion, and changes;

•Automatic monitoring of dynamic methods and container classes;

•High-fidelity native sampling of metrics;

•Sufficient information and context to construct a unified data model.

These are some of the advantages that many free tools do not have. Endogenous AIOps relies on OneAgent technology, which is designed to perform a great deal of aggregation and cleaning work at the endpoints using edge computing.
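As a rough illustration of what aggregating at the endpoint can mean, the Python sketch below collapses raw metric samples into per-interval summaries before anything is sent to a back end. It is not the OneAgent implementation, whose internals are not described here; the function name `aggregate_samples`, the 10-second interval, the summary fields, and the sample data are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_samples(samples, interval_seconds=10):
    """Collapse raw (timestamp, metric, value) samples into per-interval summaries."""
    buckets = defaultdict(list)
    for timestamp, metric, value in samples:
        # Align each sample to the start of its interval.
        bucket_start = int(timestamp // interval_seconds) * interval_seconds
        buckets[(bucket_start, metric)].append(value)
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in buckets.items()
    }


# Hypothetical raw samples collected on one host.
samples = [
    (0.5, "cpu", 10.0),
    (3.2, "cpu", 30.0),
    (9.9, "cpu", 20.0),
    (12.0, "cpu", 50.0),
]
summary = aggregate_samples(samples)
for key in sorted(summary):
    print(key, summary[key])
```

Four raw samples become two summary records. Doing this work on the endpoint reduces the volume of data that has to be transported and cleaned centrally, at the cost of a small amount of CPU and memory on the monitored host.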

2.Continuous automation

Endogenous AIOps platforms are designed to enable continuous automation, which is essential for monitoring cloud-native environments. This includes automating deployment, adaptation, discovery, monitoring, injection, cleaning, and so on. End-to-end business processes in a complex cloud-native environment are difficult to understand through human effort alone, so automated operations are a necessary aid.

3.Construct a real-time matrix diagram

Endogenous AIOps platforms are capable of building real-time topology matrices. Read horizontally, the diagram shows the dependencies at the service, container, host, and process levels. Read vertically, it shows which container a service is running in, which process that container corresponds to, and which cloud host the service is running on.
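One way to picture such a matrix is as a small data structure with a vertical stack per service and horizontal call edges that can be projected onto any layer. The Python sketch below is a simplified model under assumptions: the `ServiceNode` class, its field names, and the sample topology are invented for illustration and do not describe how any product stores its topology.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceNode:
    """One column of the matrix: a service and the stack it runs on."""
    service: str
    container: str
    process: str
    host: str
    calls: list = field(default_factory=list)  # names of services this one calls


def vertical_stack(node):
    """Vertical view: the service, then its container, process, and host."""
    return [node.service, node.container, node.process, node.host]


def horizontal_edges(topology, layer):
    """Horizontal view: project service calls onto one layer, e.g. 'host'."""
    edges = set()
    for node in topology.values():
        for callee in node.calls:
            source = getattr(node, layer)
            target = getattr(topology[callee], layer)
            if source != target:  # skip calls that stay inside one entity
                edges.add((source, target))
    return sorted(edges)


# Hypothetical topology: web calls api, api calls db.
topology = {
    "web": ServiceNode("web", "web-c1", "nginx", "host-a", calls=["api"]),
    "api": ServiceNode("api", "api-c1", "java", "host-a", calls=["db"]),
    "db": ServiceNode("db", "db-c1", "mysqld", "host-b"),
}

print(vertical_stack(topology["api"]))
print(horizontal_edges(topology, "service"))
print(horizontal_edges(topology, "host"))
```

The same call relationships appear as two edges at the service level but as a single edge at the host level, because `web` and `api` share a host in this sample. Keeping both directions in one structure is what lets a query move from "which services talk to each other" to "which hosts are involved" without a separate lookup.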

4.Instant assessment of the impact surface

Impact surface assessment is similar to the analysis performed in network security, but applied to operations. If a system failure occurs, the team needs to determine which users, services, and applications are affected and what the root cause of the failure is. When this process is automated, users can view the results without performing manual analysis.
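A minimal Python sketch of an automated impact assessment follows. It assumes that a dependency map is already available as a plain dictionary and that everything which depends, directly or indirectly, on a failed entity counts as impacted. The function name `impact_surface` and the sample services are hypothetical.

```python
from collections import deque


def impact_surface(depends_on, failed):
    """Return every entity that directly or transitively depends on `failed`."""
    # Invert the map: for each entity, who depends on it?
    dependents = {}
    for entity, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(entity)

    impacted = set()
    queue = deque([failed])
    while queue:  # breadth-first walk up the dependency chain
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return sorted(impacted)


# Hypothetical dependency map: each entity lists what it depends on.
depends_on = {
    "checkout": ["api"],
    "search": ["api"],
    "api": ["db"],
    "reports": ["warehouse"],
    "db": [],
    "warehouse": [],
}

print(impact_surface(depends_on, "db"))  # ['api', 'checkout', 'search']
```

In this sample a database failure reaches `api`, `checkout`, and `search`, while `reports` is unaffected because it depends only on `warehouse`. Mapping impacted services on to the users behind them would need additional data that this sketch does not model.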

5.Disclose root causes to prove results

Lastly, it is vital to identify the root causes of the problem to prove the results. AIOps offers a solution based on endogenous root cause location instead of traditional methods such as knowledge base, CMDB, and causal inference. In addition, it can bridge data dependencies between objects and data types, such as call chains, logs, and metrics. With low overhead, it provides a real-time root cause location with high adaptability and high accuracy. Furthermore, its unsupervised learning capability requires little human intervention.


Conclusion

For digital transformation to be successful, enterprises must ensure that all applications, digital services, and the dynamic multi-cloud platforms that underpin them work flawlessly.

Compared to traditional scenarios, these highly dynamic, distributed, cloud-native technologies present different challenges. Microservices, containers, and software-defined cloud infrastructure all contribute to the current complexity. This complexity far exceeds what a team can manage manually, and it is growing exponentially. It is therefore necessary to strengthen observability and AIOps capabilities in order to keep pace with these rapidly changing environments.

Cloud-native operations must be made more lightweight, more efficient, and less costly through a high degree of automation and artificial intelligence, so that enterprise teams can focus on the core business and truly transition into the era of AI-assisted operations.


Guest Introduction

Mr. Zhang Huaipeng is the Product Vice President for Cloudcy.cn. He joined the company in 2017 and is responsible for the daily management of the DataBuff Integrated Observation and AIOps product line. As manager of the IPD integrated product development team, he is in charge of market management, requirements analysis, team collaboration, process structuring, quality assurance, etc.

Editor: Pang Guiyu  Source: 51CTO