A. N. Harutyunyan, N. M. Grigoryan, A. V. Poghosyan, S. Dua, H.Antonyan, K. Aghajanyan, and B. Zhang
|Identifying actual root causes of a performance issue within a modern cloud infrastructure with high level of scale, sophistication, and complexity, is a hard task. It is especially complicated to diagnose a service or infrastructure degradation of an unknown nature, when no active alert is enough indicative about potential sources (be it an object, its metric, property, or an associated event) of the problem. In such a situation, the data center administration is intuitively looking for changes in the system that might reveal the causative factors. This requires costly investigations and results in business-critical losses. Cloud management vendors are building visions around AI Ops-enabled automation of the entire workflow of root cause analysis and troubleshooting. We propose a solution towards such a vision which is based on hypothesis testing and machine learning approaches for automatic mining “important changes” of various kinds in behavior of data center objects across time and infrastructure topology. Those are the most relevant evidence patterns expected to explain the performance issue. Our current implementation which is integrated into vRealize Operations runs on the available three sorts of monitoring data – metrics, properties, and events. However, the full vision is to extensively include more observability provided by other cloud management tools vertically scaled to capture the depth of a specific dimension of the data center administration. The implemented module produces lists of recommended patterns across those three dimensions rank ordered subject to different criteria for each, such as confidence (p-value) provided by hypothesis testing and magnitude of change in the metric data, event’s sentiment score or abnormality degree, unexpectedness/entropy of property variations, etc. We describe the main analytical concepts behind the solution and demonstrate its validation in an application troubleshooting scenario.
Discussion Room: Intelligent Troubleshooting in Data Centers with Mining Evidence of Performance Problems