Stepping into AIOps: IT Operations meet Artificial Intelligence

Digitization has transformed the enterprise IT landscape of large organizations. The speed, scale, and complexity of multi-cloud infrastructure call for a new way of meeting ever-increasing operational demands, one in which a single IT support engineer can handle a much broader scope than before. This is where AIOps comes into the picture. AIOps applies artificial intelligence, in the form of machine learning algorithms, to automate IT operational tasks and to provide remediation or solutions.

AIOps was a major buzzword among CIOs at the Gartner Data Center, Infrastructure & Operations Management Conference in December 2018. But unlike many buzzwords that have come and gone, AIOps looks set to stay. According to Gartner, AIOps is an enabler of digital transformation and will take IT operations by storm.

As businesses become more complex and smarter, they want to prevent critical errors proactively rather than react to them. Enabling this proactivity in IT operations is one of the major advantages of AIOps. In addition, AIOps offers the following advantages:

  • Identification of patterns and clusters of events
  • Better root cause analysis of critical events
  • Implementation of knowledge bases for faster remediation

Now let’s take a look at how IT operations have become analytics-driven and the role Data Science and Machine Learning play in that transformation.

Data Science and Machine Learning in Log Monitoring

The term Artificial Intelligence for IT Operations (aka AIOps) was coined by Gartner in 2014. Operational analytics has been able to automate reports, trigger alerts, and perform analysis. Traditional log management now has to move toward a state where a program can act on its own without being explicitly programmed or monitored.

With multi-cloud infrastructures that have thousands of endpoints, IT Operations and associated teams face major challenges in monitoring, configuring, and maintaining them. IT Operations Analytics (ITOA) and Application Performance Management (APM) are two technologies that address these problems. Even though they sound similar, the underlying processes are different: APM is proactive, whereas ITOA is reactive. AIOps covers several major areas: analyzing past patterns, understanding how real-time changes mimic the events that lead to major issues, and ultimately automating actions to respond to those issues in advance.

The algorithmic approach looks for answers to questions such as: When is an event going to happen next? What are the most effective and efficient actions to minimize or prevent it? Ultimately, it automates the identification of anomalous events and their healing. This requires expertise in extracting information from big data that arrives at high speed and in high volume, as well as expertise in predicting events. A rule-based method cannot keep up with the growing complexity of these systems. Machine learning models, supervised or unsupervised, play a vital role in detecting the patterns that exist in such large amounts of data.

In this scenario, unsupervised techniques play a vital role because the patterns that need to be identified are typically complex. Once the patterns have been recognized, supervised models can be built on top of them.
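As a rough illustration of finding patterns without labels, the sketch below groups raw log lines by structure. It is only a minimal stand-in: it uses a template-masking heuristic rather than a learned clustering model, and the mask patterns, function names, and sample log lines are all assumptions made for this example.

```python
import re
from collections import Counter

# Masks for the variable parts of a log line (assumed patterns, not exhaustive).
# Order matters: IP addresses must be masked before plain numbers.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable tokens so lines with the same structure collapse together."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line


def cluster_logs(lines):
    """Count how many lines fall into each template (an unlabeled grouping)."""
    return Counter(to_template(line) for line in lines)


# Hypothetical log lines, invented for this example.
sample_logs = [
    "Connection from 10.0.0.5 timed out after 30 s",
    "Connection from 10.0.0.9 timed out after 45 s",
    "Disk usage at 91 percent on volume 2",
    "Connection from 192.168.1.20 timed out after 30 s",
]

clusters = cluster_logs(sample_logs)
for template, count in clusters.most_common():
    print(count, template)
```

The three timeout lines collapse into one template, so a sudden jump in that template's count becomes a single pattern to watch instead of many unrelated lines.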

In our experience, a hybrid of supervised and unsupervised models, combined with reinforcement learning, gives higher accuracy in detecting and predicting events. Supervised learning methods alone cannot meet the requirement, because the dynamics and complexity of log events require a model that can learn new patterns and adapt itself.
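The hybrid idea can be sketched in a few lines, under heavy simplifying assumptions. Here an unsupervised robust score (median absolute deviation) produces labels, and a one-feature decision stump is then fitted to those labels as the supervised step. The data, the cutoff of 3.5, and all function names are hypothetical, and the reinforcement learning part is not covered.

```python
from statistics import median


def robust_scores(values):
    """Unsupervised step: distance from the median, scaled by the
    median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0  # avoid division by zero
    return [abs(v - med) / mad for v in values]


def pseudo_labels(values, cutoff=3.5):
    """Turn unsupervised scores into labels: True means 'looks anomalous'."""
    return [score > cutoff for score in robust_scores(values)]


def fit_stump(values, labels):
    """Supervised step: pick the threshold on the raw value that best
    reproduces the labels (a one-feature decision stump).
    Only unusually high values are treated as anomalous in this sketch."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(set(values)):
        correct = sum((v >= candidate) == y for v, y in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Hypothetical errors-per-minute counts taken from historical logs.
history = [4, 5, 3, 6, 5, 4, 48, 5, 6, 4, 52, 5]
labels = pseudo_labels(history)
threshold = fit_stump(history, labels)


def is_anomalous(errors_per_minute):
    """Score a new observation with the fitted supervised model."""
    return errors_per_minute >= threshold


print(threshold, is_anomalous(50), is_anomalous(6))
```

The supervised model is cheap to evaluate on live data, while the unsupervised step can be rerun periodically on fresh history so that the labels, and therefore the threshold, follow new patterns.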

Benefits of AIOps

Automated anomaly detection helps IT teams in many ways. Teams can work on the important alerts (aka smart alerts) rather than attending to millions of unimportant ones (aka fatigue alerts). Fatigue alerts are a big headache for monitoring teams, but with automated anomaly detection only smart alerts are triggered. If the actions taken for a given abnormal event are recorded, support teams get not only the reason for the anomaly but also the best available recommendations for remediation. This increases the productivity of the IT team by reducing repetitive tasks and speeding up root cause analysis, which in turn helps in managing IT operations efficiently and lays the stepping stones for IT services automation.
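The sketch below shows one simple way repeated raw alerts could be collapsed into smart alerts that carry a recorded remediation. It is an illustration only: the knowledge base contents, the alert fields (`host`, `signature`), and the function name are hypothetical and do not come from any particular monitoring product.

```python
from collections import Counter

# Hypothetical knowledge base: alert signature -> recorded remediation.
KNOWLEDGE_BASE = {
    "disk_full": "Rotate and compress old logs, then re-check free space.",
    "service_down": "Restart the service and review its most recent deploy.",
}


def smart_alerts(raw_alerts):
    """Collapse repeated raw alerts into one smart alert per (host, signature)
    and attach a recorded remediation when one is known."""
    counts = Counter((a["host"], a["signature"]) for a in raw_alerts)
    result = []
    for (host, signature), count in sorted(counts.items()):
        result.append({
            "host": host,
            "signature": signature,
            "occurrences": count,
            # None signals that no remediation has been recorded yet.
            "remediation": KNOWLEDGE_BASE.get(signature),
        })
    return result


# Hypothetical stream of raw alerts: many repeats, few distinct problems.
raw = (
    [{"host": "web-1", "signature": "disk_full"}] * 4
    + [{"host": "db-1", "signature": "service_down"}] * 2
    + [{"host": "web-2", "signature": "cpu_spike"}]
)

alerts = smart_alerts(raw)
for alert in alerts:
    print(alert)
```

Seven raw alerts become three smart alerts here, and the one without a recorded remediation is an obvious candidate for a new knowledge base entry.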

With smart alerts correlated with critical events that happened earlier, machine learning models can be built to proactively identify critical events that are about to happen and estimate when they will occur. This helps prevent undesirable events, and therefore outages, which leads to uninterrupted operations: the dream of any ITOps team.
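To make the idea of estimating when an event will happen concrete, the sketch below fits a straight line to a metric and extrapolates to a critical threshold. A plain least-squares trend is swapped in here for a real predictive model, and the disk usage readings, the 90 percent threshold, and the function names are assumptions for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until(threshold, hours, usage):
    """Estimate how many hours remain after the last reading until the
    fitted trend crosses the threshold. Returns None if the metric is
    not trending upward."""
    slope, intercept = fit_line(hours, usage)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - hours[-1])


# Hypothetical hourly disk usage readings, in percent.
hours = [0, 1, 2, 3, 4]
usage = [70.0, 72.0, 74.0, 76.0, 78.0]

remaining = hours_until(90.0, hours, usage)
print(remaining)
```

With these readings the trend grows by two points per hour, so the estimate gives the team a concrete window in which to act before the threshold is reached.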

This series consists of four articles. The second article will discuss “Unsupervised Anomaly Detection”.

Stay tuned to learn more about the automated anomaly detection framework!

Photo by Shahadat Rahman on Unsplash

Hansa Perera

Associate Architect of Data Science