Artificial Intelligence for IT Operations (AIOps) describes the combination of big data and machine learning to automate IT operation processes, including event correlation, anomaly detection and root cause analysis.
What is AIOps?
AIOps stands for Artificial Intelligence for IT Operations: the use of AI technology to run IT operations. The term “AIOps” was coined by Gartner, the well-known global research firm. Gartner describes an AIOps platform as one that uses machine learning to manage the big data generated by IT systems, and explains that such a platform improves IT operations performance.
How does AIOps relate to IT Operations?
Today, IT Operations covers many different areas, such as:
- IT Infrastructure Management (ITIM)
- IT Service Management (ITSM)
- Network Performance Monitoring and Diagnostics (NPMD)
- Security Information and Event Management (SIEM)
- Application Performance Monitoring (APM)
- Digital Experience Monitoring (DEM)
Three main challenges can be observed in digitally powered IT operations:
- The volume of events, metrics, traces, network flow data and telemetry data generated by IT infrastructure often exceeds what teams can manage. Without proper tools to analyze these massive data lakes, the really important information can be overlooked.
- Having too many tools to manage IT systems (ITIM, ITSM, NPMD, SIEM, APM, DEM) not only adds complexity but also increases both CapEx and OpEx.
- IT Operations often focuses only on the IT department itself and misses valuable insights that can affect business growth, competitiveness and the digital experience of users or clients.
IT Operations Management can make the difference between success and failure on your digital transformation journey. To maximize the efficiency of existing IT Operations Management, it is highly recommended to use an AIOps platform to create the best possible user experience and business results.
AIOps platforms enhance existing monitoring systems, service management and task automation. AIOps covers three main areas:
- Observe (Monitoring): The AIOps platform processes both real-time and historical data, e.g. events, metrics, traces and topology, from other IT systems. This data supports historical analysis, anomaly detection, performance analysis, and correlation and contextualization.
- Engage (ITSM): The AIOps platform receives data on incidents, dependencies and changes to support task automation, change risk analysis, service desk (SD) agent performance analysis and knowledge management.
- Act (Automation): The AIOps platform analyzes and runs playbook processes automatically: “Self-Diagnostic” for analysis, “Self-Healing” for resolving issues, “Self-Recovery” for recovery and “Self-Prevention” for preventing future problems. Recurring tasks in particular can be automated to reduce incidents, downtime and errors and to improve SLA performance.
The benefit of AIOps is that it enables your IT operations team to identify, target and resolve slowdowns and outages faster. Specific benefits include:
- Reduce noise, such as false alarms.
- Determine causes or probable causes using topology or machine learning and link them to the customer journey.
- Detect anomalies across many variables (multivariate anomalies) that are beyond the capabilities of static thresholds or numeric outlier checks. This can be used to detect unusual conditions and behaviors that affect the business.
- Identify trends so that problems can be prevented before they occur.
- Drive automation for low to medium risk tasks.
- Use chatbots or virtual support assistants to access knowledge and drive automation for repetitive tasks or incidents.
- Prioritize incidents automatically and suggest remedies from past incidents.
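To illustrate the multivariate anomaly detection mentioned in the list above, here is a minimal sketch using the squared Mahalanobis distance on two correlated metrics. The metric names, the sample history and the threshold are all assumptions made for illustration (the threshold also assumes roughly Gaussian data); a real platform would typically use richer models.

```python
from statistics import mean

def fit_bivariate(samples):
    """Estimate the means and covariance terms of (x, y) samples."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    mx, my = mean(xs), mean(ys)
    n = len(samples) - 1  # sample (unbiased) covariance
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in samples) / n
    return mx, my, sxx, syy, sxy

def mahalanobis_sq(point, model):
    """Squared Mahalanobis distance of a 2-D point from the fitted model."""
    mx, my, sxx, syy, sxy = model
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2  # determinant of the covariance matrix
    # Closed-form inverse of a 2x2 symmetric matrix applied to (dx, dy).
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Hypothetical history: (requests per second, CPU %), strongly correlated.
HISTORY = [(100, 12), (200, 19), (300, 31), (400, 42), (500, 48),
           (600, 61), (700, 69), (800, 82), (900, 88), (1000, 101)]

# Assumed cut-off: about the 99% quantile of chi-square with 2 degrees of freedom.
THRESHOLD = 9.21

model = fit_bivariate(HISTORY)
for point in [(550, 55), (200, 85)]:
    flag = mahalanobis_sq(point, model) > THRESHOLD
    print(point, "anomalous" if flag else "normal")
```

The second point, (200, 85), lies inside the observed range of each metric taken on its own, so a static per-metric threshold would not catch it. It is flagged only because high CPU at low request volume breaks the usual relationship between the two metrics.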
A domain-based monitoring tool for NPMD, ITIM, APM or DEM mainly provides data within its own domain. It cannot provide holistic information when a digital service spans multiple domains. For example, a mobile application hosted in the cloud involves NPMD, ITIM, APM and DEM. In such cases, the AIOps platform can perform holistic (cross-domain) analysis, such as root cause analysis to determine the reasons for performance problems.
The AIOps Platform can also help manage IT services or IT Service Management (ITSM) tasks, including:
- Assisting the staff responsible for handling incidents and service requests.
- Automating tasks such as software installation, password resets, or validating an email in order to open a service request.
- Helping to analyze past incident data to support and increase service provider productivity.
- Helping to find insights at strategic levels within change management, including predicting whether a Change Request will be successful, finding conflicts in changes, or determining the best time to implement system patches.
- Helping to predict which incidents are unlikely to be resolved within SLA timeframes.
- Using natural language processing (NLP) to aid the functionality of chatbots and virtual support assistants in order to reduce the basic work required by employees.
In addition to ITSM, the AIOps Platform is an important tool for automating workflows for DevOps, Continuous Integration/Continuous Delivery (CI/CD) and Site Reliability Engineering (SRE). This is carried out on a continual basis from development and application testing, to application deployment and monitoring.
Today, IT monitoring has a broader scope than in the past. It covers management areas such as IT Infrastructure Management (ITIM), Network Performance Monitoring and Diagnostics (NPMD), Application Performance Monitoring (APM) and Digital Experience Monitoring (DEM), each with its own monitoring tools. A monitoring tool for ITIM, NPMD, APM or DEM mainly provides data within its own domain.
It cannot, however, provide holistic information when a digital service involves multiple domains. For example, a mobile application that needs to access a cloud-based server touches ITIM, NPMD, APM and DEM. An AIOps platform is therefore needed to provide a holistic view (cross-domain analysis) and find out why the digital experience is poor, which may be caused by the network, the server or the application.
According to customer surveys by Gartner (Gartner, Inc.), having multiple monitoring tools in an organization may lead to slower responses and longer resolution times. The question is how I&O leaders can improve IT operations efficiency and shorten the duration of impact despite this variety of monitoring tools. AIOps may be the answer, in the following ways:
1. Improve collaboration by using the AIOps platform to collect data from monitoring tools, such as telemetry data and logs, and visualize it on a central dashboard that shows what the operations team wants to monitor.
Data collection may reduce the risks of having multiple monitoring tools by:
- Creating centralized visibility of events from various IT systems that impact the business.
- Finding and correlating information from various systems to reduce ambiguity and duplication.
- Improving team collaboration by using a central data set for decision-making and problem solving.
In addition, data collection reduces the time spent gathering data manually and improves analysis and decision-making capabilities. It also lays the groundwork for automation.
2. Deliver the insights gained from analysis to stakeholders by integrating raw data from various tools, and increase confidence in the value of the results by using techniques such as the following:
- Manage active event data using Event Correlation Analysis (ECA). The challenge with this technique is that the rules are expected to be adjusted manually from time to time.
- Handle active and archived event data using pattern recognition and machine learning to optimize event correlation and reduce manual rule updates.
- Manage events and metrics by placing the data on the same time-series axis. This makes it easier to find the root cause of a problem.
- Handle metrics by placing the data on the same time-series axis for anomaly detection.
3. Establish realistic expectations of what data collection techniques can do for indicators that affect IT operations, such as mean time to repair (MTTR), i.e. the time taken to resolve problems. Effective pattern recognition requires a lot of historical data. Machine learning also requires a large amount of data, together with people who can select algorithms and build models that give accurate results.
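The rule-based event correlation described in step 2 can be sketched roughly as follows. This is only an illustration of one simple, manually tuned rule (same host, events close together in time); the field names `host` and `ts` and the 60-second window are assumptions, not part of any specific product.

```python
from collections import defaultdict

def correlate(events, window=60):
    """Group events per host; an event joins the host's latest group when it
    arrives within `window` seconds of that group's most recent event."""
    by_host = defaultdict(list)  # host -> list of groups (each a list of events)
    for ev in sorted(events, key=lambda e: e["ts"]):
        groups = by_host[ev["host"]]
        if groups and ev["ts"] - groups[-1][-1]["ts"] <= window:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return [group for groups in by_host.values() for group in groups]

# Hypothetical events; `ts` is seconds since an arbitrary start time.
events = [
    {"host": "db-01", "ts": 0, "msg": "disk latency high"},
    {"host": "db-01", "ts": 30, "msg": "query timeout"},
    {"host": "db-01", "ts": 50, "msg": "connection pool exhausted"},
    {"host": "web-01", "ts": 10, "msg": "HTTP 500 rate high"},
    {"host": "db-01", "ts": 500, "msg": "backup started"},
]

for group in correlate(events):
    print(group[0]["host"], [e["msg"] for e in group])
```

Five raw events collapse into three groups here. The `window` value is exactly the kind of rule parameter that has to be adjusted by hand over time, which is the limitation that pattern recognition and machine learning aim to reduce.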
Adopting AIOps technology for IT monitoring can bring benefits and value to IT departments, end users, customers and businesses by making IT operations more efficient, integrating information, reducing data duplication, providing valuable information to support problem analysis and root cause identification, and shortening problem-solving time. These results improve the digital experience for both users and customers, reducing churn and increasing revenue, which are critical goals of today’s businesses.
In the IT industry, we often use the ITIL framework, a set of best practices for IT service management that covers practices such as Incident Management, Service Request Management, Change Management, Knowledge Management, IT Asset Management and Service Configuration Management.
For Incident Management and Service Request Management, AIOps technology can support tasks such as:
- Helping to define incident attributes such as category, urgency, impact and priority.
- Suggesting the right staff member or team for a specific incident.
- Helping to analyze past incidents to support and increase service provider productivity.
- Predicting which incidents are unlikely to be resolved within the specified SLA.
- Automatically resolving easy to moderate cases that recur often.
- Automating basic routine tasks such as software installation, password resets, or verifying an email’s content in order to open a request ticket.
- Using natural language processing (NLP) to support chatbots and virtual support agents (VSAs), reducing the generic tasks that service staff perform regularly and freeing their time for more complex work.
- Monitoring responses and activities on an incident, and escalating or performing tasks automatically according to the assigned workflow.
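As a rough illustration of predicting which incidents are unlikely to meet their SLA, the sketch below uses a simple baseline heuristic rather than a trained model: it assumes the remaining effort is about the median resolution time of past incidents in the same category, which is a deliberately pessimistic simplification. All field names and figures are hypothetical.

```python
from statistics import median

def likely_to_breach(incident, history, now):
    """Return True when elapsed time plus the typical resolution time for the
    incident's category exceeds its SLA (all times in minutes)."""
    durations = [h["resolve_minutes"] for h in history
                 if h["category"] == incident["category"]]
    if not durations:
        return False  # no past data for this category, so make no prediction
    elapsed = now - incident["opened_at"]
    return elapsed + median(durations) > incident["sla_minutes"]

# Hypothetical resolved incidents.
history = [
    {"category": "network", "resolve_minutes": 30},
    {"category": "network", "resolve_minutes": 45},
    {"category": "network", "resolve_minutes": 50},
    {"category": "network", "resolve_minutes": 60},
    {"category": "network", "resolve_minutes": 120},
    {"category": "email", "resolve_minutes": 15},
]

incident = {"category": "network", "opened_at": 0, "sla_minutes": 90}
print(likely_to_breach(incident, history, now=20))
print(likely_to_breach(incident, history, now=60))
```

With this data the incident is not flagged 20 minutes after opening but is flagged after 60 minutes, which could then trigger the escalation step in the workflow.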
In Change Management, AIOps technology can help with tasks such as:
- Increasing the change success rate by analyzing patterns in identical or similar historical RFCs (requests for change) that were not successful: what was the risk, and what was the effect?
- Automatically assessing a change’s risk level by analyzing the data in the RFC together with past RFCs.
- Predicting whether an RFC will succeed.
- Finding the best time window in which to perform the change.
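One simple way to picture change risk assessment from past RFCs is a smoothed failure rate over similar changes. The sketch below is an assumption-laden illustration: "similar" is defined here as the same change type on the same CI, and the field names and the low/medium/high cut-offs are hypothetical.

```python
def failure_risk(rfc, past_rfcs):
    """Smoothed share of similar past RFCs (same type and CI) that failed."""
    similar = [p for p in past_rfcs
               if p["type"] == rfc["type"] and p["ci"] == rfc["ci"]]
    failures = sum(1 for p in similar if not p["success"])
    # Laplace (add-one) smoothing: with no history the estimate is 0.5.
    return (failures + 1) / (len(similar) + 2)

def risk_level(risk, medium=0.3, high=0.6):
    """Map a risk estimate to a label using assumed thresholds."""
    if risk >= high:
        return "high"
    if risk >= medium:
        return "medium"
    return "low"

# Hypothetical RFC history.
past_rfcs = (
    [{"type": "patch", "ci": "db-01", "success": False}] * 3
    + [{"type": "patch", "ci": "db-01", "success": True}]
    + [{"type": "patch", "ci": "web-01", "success": True}] * 8
)

for ci in ["db-01", "web-01", "mail-01"]:
    risk = failure_risk({"type": "patch", "ci": ci}, past_rfcs)
    print(ci, round(risk, 2), risk_level(risk))
```

The smoothing keeps a CI with no change history at a neutral "medium" estimate instead of reporting zero risk, which seems a safer default for an automated assessment.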
In IT Asset Management and Service Configuration Management, which deal with the Configuration Items (CIs) in the CMDB, AIOps technology can help with tasks such as:
- Grouping unsuccessful changes and finding the reasons they failed, such as misconfiguration or incomplete identification of the impact on CIs.
- Finding the dependencies or relationships between the CIs of incidents and changes.
- Integrating configuration or infrastructure changes from events sent by monitoring or orchestration tools in order to notify service providers.
- Accessing CIs to automate the resolution of basic to medium incidents.
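Finding the relationship between the CI of an incident and the CIs of recent changes can be pictured as a graph traversal over CMDB dependencies. The following is a minimal sketch, assuming the dependencies are available as a plain mapping from each CI to the CIs it depends on; the CI names and change IDs are invented for the example.

```python
from collections import deque

def upstream_cis(ci, depends_on):
    """Return every CI that `ci` depends on, directly or transitively."""
    seen, queue = set(), deque([ci])
    while queue:  # breadth-first walk over the dependency graph
        current = queue.popleft()
        for parent in depends_on.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def suspect_changes(incident_ci, depends_on, changes):
    """List IDs of changes made to the incident's CI or anything it depends on."""
    scope = upstream_cis(incident_ci, depends_on) | {incident_ci}
    return [c["id"] for c in changes if c["ci"] in scope]

# Hypothetical CMDB relationships and recent changes.
depends_on = {
    "web-app": ["app-server"],
    "app-server": ["db", "core-switch"],
    "db": ["core-switch"],
}
changes = [
    {"id": "CHG-1", "ci": "core-switch"},
    {"id": "CHG-2", "ci": "mail-server"},
    {"id": "CHG-3", "ci": "web-app"},
]

print(suspect_changes("web-app", depends_on, changes))
```

For an incident on `web-app`, the change to `mail-server` is ignored because nothing in the dependency chain links the two.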
The adoption of AIOps technology for IT Service Management can bring benefits and value to IT departments, users, customers and businesses as follows:
- Reduces response time (mean time to respond).
- Reduces problem-solving time (mean time to repair).
- Helps find the cause of a problem faster.
- Significantly improves uptime and SLA performance.
- Increases productivity and customer satisfaction and reduces churn.
- Helps manage and maximize the use of resources, reducing headcount, time and expenses.
The main AIOps platform functions we discussed are:
- Data Ingestion: bringing information from many sources into the system.
- Data Analytics: analyzing the data using machine learning technology.
- Prescription: determining what to do based on the results of the data analytics.
- Action: carrying out the operation automatically.
Gartner (Gartner, Inc.) analysts describe how the AIOps platform works in three areas: Observe (Monitoring), Engage (ITSM) and Act (Automation). Automation is what IT professionals most urgently want from an AIOps platform. It gives us a digital workforce with technician or engineering skills that is ready to work on our behalf to diagnose (Self-Diagnostic), troubleshoot (Self-Healing), recover (Self-Recovery) and prevent problems (Self-Prevention) in IT systems automatically, especially for recurring tasks, in order to reduce incidents, downtime and errors and to improve SLA performance.
The four steps of creating AI-assisted automation:
- Start with what we know: create a knowledge base by collecting solutions that have successfully solved problems, classified by problem category. Such solutions are often not stored in any system but kept by specific people, as so-called “tribal knowledge”.
- Find comparable problems in our internal and external (crowdsourced) knowledge.
- Suggest methods or solutions to solve the problem.
- Perform automatic troubleshooting by accessing the device or application and submitting commands to fix the problem or perform preventive actions as recommended. Evaluate whether the problem was resolved in order to make the problem-solving methods more effective.
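The four steps above can be sketched in a few lines. This is a toy illustration only: it matches problems by word overlap (Jaccard similarity) instead of real NLP, it does not connect to any device, and the knowledge base entries, field names and `min_score` cut-off are all hypothetical.

```python
def tokens(text):
    """Lowercased set of words in a problem description."""
    return set(text.lower().split())

def suggest(problem, knowledge_base, min_score=0.2):
    """Return the knowledge base entry most similar to `problem`, or None
    when no entry reaches `min_score` (Jaccard overlap of word sets)."""
    query = tokens(problem)
    best, best_score = None, 0.0
    for entry in knowledge_base:
        words = tokens(entry["problem"])
        union = query | words
        score = len(query & words) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = entry, score
    return best if best_score >= min_score else None

def record_outcome(entry, resolved):
    """Step 4 feedback: track how often a suggested solution actually worked."""
    entry["attempts"] += 1
    entry["successes"] += int(resolved)

# Step 1: a tiny knowledge base of captured "tribal knowledge".
knowledge_base = [
    {"problem": "disk full on database server", "solution": "purge old logs",
     "attempts": 0, "successes": 0},
    {"problem": "user password expired", "solution": "reset password",
     "attempts": 0, "successes": 0},
]

# Steps 2 and 3: find a comparable problem and suggest its solution.
match = suggest("database server disk is full", knowledge_base)
print(match["solution"])

# Step 4: after running the fix, record whether it resolved the problem.
record_outcome(match, resolved=True)
```

The recorded success counts could later be used to rank competing solutions, which is one way the evaluation step can make the method more effective over time.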
In addition to driving automation, the AIOps platform can help us analyze the big data generated by IT systems using machine learning technology, including:
- Pattern search and prediction: looking for patterns in past data to predict what will happen in the future.
- Anomaly detection: learning normal data patterns and finding abnormal ones. For example, an organization’s usage rate averages 50% with a maximum of 70%, but over a certain period usage exceeds that maximum. The platform should support seasonal, holiday and end-of-month analysis, because behavior during those periods may exceed the normal range and still be considered normal.
- Root cause identification: detecting patterns and correlations in the data that indicate causes and effects.
- Topological analysis: using a topology diagram in the root cause search to give precise and efficient results, for example by locating the points where events occurred and searching upstream and downstream from those points.
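The point about seasonal and end-of-month behavior can be illustrated with a per-season baseline: the same usage value is judged against the history of its own period rather than a single global range. This is a minimal sketch; the season labels, the usage figures, the multiplier `k` and the floor on the standard deviation are assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

def build_baseline(history):
    """history: iterable of (season, value). Returns {season: (mean, stdev)}."""
    by_season = defaultdict(list)
    for season, value in history:
        by_season[season].append(value)
    return {s: (mean(v), pstdev(v)) for s, v in by_season.items()}

def is_anomalous(season, value, baseline, k=3.0):
    """Flag values more than k standard deviations from the season's mean.
    The deviation is floored at 1.0 so a flat history does not flag everything."""
    mu, sigma = baseline[season]
    return abs(value - mu) > k * max(sigma, 1.0)

# Hypothetical usage rates (%): ordinary days versus month-end days.
history = (
    [("normal", v) for v in [48, 50, 52, 50, 55, 45]]
    + [("month_end", v) for v in [80, 85, 82, 78, 84]]
)
baseline = build_baseline(history)

print(is_anomalous("normal", 82, baseline))     # 82% on an ordinary day
print(is_anomalous("month_end", 82, baseline))  # 82% at month end
```

A usage rate of 82% is flagged on an ordinary day but accepted at month end, where the historical data shows that such levels are routine.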
With AIOps technology, automated IT operations can be performed more efficiently, reducing workload, time and the human errors that often occur and cause damage. In addition, the AIOps platform can analyze enormous amounts of data at the device and application level, giving us quick insights into what went wrong and what caused the problem.
Netka AIOps Director, or N-AIOps, is an AIOps platform that provides data ingestion, data analytics using AI technologies, and intelligent automation. N-AIOps has a workflow designer, a tool for creating automation processes that can flexibly design workflows with complex conditions. N-AIOps requires data from IT management systems, e.g. ITIM, ITSM, NPMD, SIEM, APM and DEM, for cross-domain analysis and to drive automation. N-AIOps supports the following data for processing:
- Log data e.g. Syslog, SNMP Trap, Windows event
- Telemetry data e.g. metrics, traces
- Network data e.g. packet analysis data, flow analysis data, topology, inventory
- ITSM data e.g. incidents, changes, problems, CIs
- IoT data or sensor values e.g. temperature, humidity, AC/DC voltage, current, watt, relay, contact, access door’s status
N-AIOps can work with third-party applications that send data in Syslog, SNMP Trap or JSON format, and works seamlessly with Netka products including NetkaView Network Manager (NNM), NetkaQuartz Service Desk (NSD), NetkaView Logger (NLG) and NetkaView IoT (NIoT).