Annotated Bibliography
Part 1: Summary of the Research Topic
As adoption of Artificial Intelligence grows, major technology companies are developing new strategies and techniques for using AI to improve operations that affect production-ready systems. This research paper will focus on key areas where Artificial Intelligence is changing how IT, DevOps, and SRE teams work, leading to improved productivity, resource optimization, cost savings, automation, and enhanced cloud strategies.
Through this research, I plan to dive into each topic to show how AIOps can enhance IT/DevOps/SRE practices, focusing on AI-driven disaster recovery (DR), resource management optimization (improving performance), cost optimization through AI’s identification of inefficiencies, and automation for enhanced productivity (eliminating operational burden). Moreover, the research will show how AIOps can contribute to smarter cloud strategies (fault prediction, resource usage, cost management, etc.) by providing real-time insights.
AIOps matters because of its potential to improve the reliability of production-ready systems, speed up disaster recovery, optimize resource management for cost savings, and automate routine tasks, which frees teams to be more productive and to focus on innovation. AI’s contribution to smarter cloud strategies also makes cloud infrastructure more efficient and responsive. Furthermore, AIOps promotes accessibility by enabling businesses of all sizes to use advanced technology solutions.
My end goal with this research is to highlight real-world implementations and case studies that provide insight into AIOps adoption, benefits, and challenges. By showcasing the potential of AI-powered solutions, the research aims to give readers a comprehensive understanding of how AIOps drives efficiency, stability, and innovation in modern IT and engineering operations.
Part 2: Annotated Bibliography
“Impact of Artificial Intelligence-enabled Software-defined Networks in Infrastructure and Operations: Trends and Challenges”
Citation:
M. R. Belgaum, “Impact of Artificial Intelligence-enabled Software-defined Networks in Infrastructure and Operations: Trends and Challenges,” International Journal of Advanced Computer Science & Applications, pp. 66-73, 2021.
Annotation:
This paper discusses the impact of integrating artificial intelligence (AI) with software-defined networking (SDN) in enterprise infrastructure and operations (I&O). The research explores the benefits of this fusion, including process automation and quick decision-making for management. However, it acknowledges the challenges that need to be addressed. The study highlights the trends and issues affecting I&O while outlining potential future directions for implementing AI-enabled SDN.
“Applications of Artificial Intelligence in IT Disaster Recovery”
Citation:
V. Skala, T. P. Singh, T. Choudhury, R. Tomar, and M. Abul Bashar, “Applications of Artificial Intelligence in IT Disaster Recovery,” in Machine Intelligence and Data Science Applications: Proceedings of MIDAS 2021, vol. 132, pp. 663-677, Springer, 2022. [Online]. Available: https://doi.org/10.1007/978-981-19-2347-0_52
Annotation:
The paper discusses the impact of AI in disaster recovery planning, highlighting its benefits and challenges. AI enables quick initiation of the DR plan during adverse events, providing insights for effective handling. Utilizing AI ensures faster initiation, reliability, and availability, enhancing business continuity. The study covers various use cases for AI implementation in pre-disaster, implementation, and aftermath phases, streamlining the DR process while addressing potential challenges.
“AIOps Architecture in Data Center Site Infrastructure Monitoring”
Citation:
W. Dong and B. Ding, “AIOps Architecture in Data Center Site Infrastructure Monitoring,” Computational Intelligence and Neuroscience, vol. 2022, pp. 1-12, 2022. [Online]. Available: https://doi.org/10.1155/2022/1988990
Annotation:
The paper presents an AIOps architecture designed for data center infrastructure monitoring, emphasizing its uniqueness and custom solutions. Core modules like technical architecture, machine learning algorithms, big data, and business applications are tailored for this domain. The focus is on technical aspects rather than nonfunctional requirements. The paper aims to drive AIOps advancements in data center infrastructure.
“Artificial Intelligence Enabled Effective Fault Prediction Techniques in Cloud Computing Environment for Improving Resource Optimization”
Citation:
J. H. Abro, C. Li, M. Shafiq, A. Vishnukumar, S. Mewada, K. Malpani, J. Osei-Owusu, and P. Gupta, “Artificial Intelligence Enabled Effective Fault Prediction Techniques in Cloud Computing Environment for Improving Resource Optimization,” Scientific Programming, vol. 2022, pp. 1-7, 2022. [Online]. Available: https://doi.org/10.1155/2022/7432949
Annotation:
The paper emphasizes proactive failure prediction for virtual machines (VMs) in cloud computing to optimize resources, reduce downtime, and enhance scalability. By leveraging artificial intelligence, effective fault prediction techniques and a safe resource migration strategy were developed, resulting in improved system performance and reliability. The paper underscores the significance of timely VM failure prediction for efficient resource optimization in cloud environments.
“AIOPs based Predictive Alerting for System Stability in IT Environment”
Citation:
P. P. Teggi, H. N, and B. Malakreddy, “AIOPs based Predictive Alerting for System Stability in IT Environment,” in 2022 International Conference on Innovative Trends in Information Technology (ICITIIT), pp. 1-7, IEEE, 2022. [Online]. Available: https://doi.org/10.1109/ICITIIT54346.2022.9744236
Annotation:
The paper introduces AIOps for digital transformation, employing advanced analytics to enhance IT operations. It presents an automated predictive alerting system based on logistic regression to reduce alert noise in the Micro Focus Operations Bridge. By identifying abnormal operational data, the system raises targeted alerts, enabling proactive IT activities and improving application and service management.
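To make the idea of logistic-regression-based alerting concrete, here is a minimal sketch in Python. It is not the paper's implementation and does not use any Micro Focus Operations Bridge API: the metric names, weights, bias, and threshold are all hypothetical values chosen for illustration, whereas a real system would learn the weights from labelled operational history.

```python
import math

# Hypothetical, hand-picked parameters (assumptions for illustration only).
# A real system would fit these on labelled historical operational data.
WEIGHTS = {"cpu_pct": 0.06, "error_rate": 0.9, "latency_ms": 0.004}
BIAS = -8.0
ALERT_THRESHOLD = 0.5


def incident_probability(metrics):
    """Logistic model: sigmoid of a weighted sum of the metric values."""
    z = BIAS + sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def should_alert(metrics, threshold=ALERT_THRESHOLD):
    """Raise an alert only when the predicted probability reaches the threshold."""
    return incident_probability(metrics) >= threshold


# Two hypothetical metric snapshots.
normal = {"cpu_pct": 35.0, "error_rate": 0.2, "latency_ms": 120.0}
degraded = {"cpu_pct": 92.0, "error_rate": 4.5, "latency_ms": 900.0}

if __name__ == "__main__":
    for name, snapshot in (("normal", normal), ("degraded", degraded)):
        print(name, round(incident_probability(snapshot), 4), should_alert(snapshot))
```

Because an alert fires only when the combined score crosses the threshold, a single noisy metric is less likely to page anyone, which is the noise-reduction effect the annotation describes.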
“Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps”
Citation:
A. Saha and S. C. Hoi, “Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps,” in 2022 IEEE/ACM 44th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 2825–206, IEEE, 2022. https://doi.org/10.1109/ICSE-SEIP55303.2022.9793994.
Annotation:
The paper discusses using natural language documentation from past incident investigations (PRB data) to improve Root Cause Analysis (RCA) in the cloud industry. The authors propose an Incident Causation Analysis (ICA) engine that uses NLP techniques to extract structured information from PRB documents. This forms the basis of a retrieval-based RCA system for new incidents, which was evaluated and deployed at Salesforce with positive results.
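The retrieval step can be illustrated with a small Python sketch. This is an assumption-laden stand-in, not the ICA engine from the paper: it swaps the paper's NLP pipeline for plain bag-of-words cosine similarity, and the incident records and function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical past incident records: (symptom text, documented root cause).
PAST_INCIDENTS = [
    ("database connection pool exhausted, requests timing out",
     "connection leak in service release"),
    ("disk full on log volume, writes failing",
     "log rotation job disabled"),
    ("certificate expired, tls handshake errors",
     "certificate renewal not automated"),
]


def tokenize(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_root_causes(new_incident, top_k=1):
    """Return root causes of the past incidents most similar to the new one."""
    query = tokenize(new_incident)
    scored = [(cosine(query, tokenize(symptom)), cause)
              for symptom, cause in PAST_INCIDENTS]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop records with no lexical overlap at all.
    return [cause for score, cause in scored[:top_k] if score > 0]


if __name__ == "__main__":
    print(retrieve_root_causes("tls handshake errors after certificate expired"))
```

A production system would likely need richer text representations than word counts, since two engineers can describe the same failure in entirely different words.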
“Evolving from Traditional Systems to AIOps: Design, Implementation and Measurements”
Citation:
S. Shen, J. Zhang, D. Huang, and J. Xiao, “Evolving from Traditional Systems to AIOps: Design, Implementation and Measurements,” in 2020 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA), pp. 437–280, IEEE, 2020. https://doi.org/10.1109/AEECA49918.2020.9213650.
Annotation:
The paper introduces AIOps (Artificial Intelligence for IT Operations) and its benefits in enhancing IT performance. It proposes Proton, a novel AIOps system with five key abilities: perception, detection, location, action, and interaction. Proton is designed to be compatible with traditional systems and has been deployed successfully in a large IT environment with a fault self-healing rate exceeding 80% for server ping failures.
“Log Anomaly to Resolution: AI Based Proactive Incident Remediation”
Citation:
R. Mahindru, H. Kumar, and S. Bansal, “Log Anomaly to Resolution: AI Based Proactive Incident Remediation,” in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 102–1357, IEEE, 2021. https://doi.org/10.1109/ASE51524.2021.9678815.
Annotation:
The paper cites a 2020 SRE report finding that 80% of SREs work on postmortem incident analysis because of insufficient information, and that 16% of their work involves investigating false positives and false negatives. To proactively reduce outages and resolution time, the paper proposes an AIOps-based approach to identify log anomalies and their resolutions. The method involves preparing an augmented dataset from various sources, predicting metadata, and retrieving contextual resolutions for log-triggered signals. Early evaluation achieved 78.57% accuracy in metadata prediction and 65.7% accuracy in resolution retrieval.
“Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution”
Citation:
Y. Li, Z. M. Jiang, H. Li, A. E. Hassan, C. He, R. Huang, Z. Zeng, M. Wang, and P. Chen, “Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution,” ACM Trans. Softw. Eng. Methodol., vol. 29, no. 2, Article 13, April 2020, 24 pages. https://doi.org/10.1145/3385187.
Annotation:
The article discusses the challenges of predicting node failures in cloud computing platforms and the importance of proactive measures to minimize their impact. AIOps (Artificial Intelligence for IT Operations) is introduced as a promising approach to enhance computing platform quality through data analytics and machine learning. However, successful adoption of AIOps solutions requires more than just a high-performing machine learning model; they must also be trustable, interpretable, maintainable, scalable, and evaluated in context. The article presents the process of building an AIOps solution for predicting node failures in an ultra-large-scale cloud computing platform at Alibaba. The experiences shared in the article are intended to benefit researchers and practitioners working on AIOps solutions for large-scale cloud platforms.
“Towards AIOps in Edge Computing Environments”
Citation:
S. Becker, F. Schmidt, A. Gulenko, A. Acker, and O. Kao, “Towards AIOps in Edge Computing Environments,” in 2020 IEEE International Conference on Big Data (Big Data), pp. 73–3475, IEEE, 2020. https://doi.org/10.1109/BigData50022.2020.9378038.
Annotation:
The paper introduces edge computing as a solution for the demanding requirements of new network technologies like 5G. It aims to distribute computational resources to the network edge, overcoming challenges of centralized cloud computing. AIOps (Artificial Intelligence for IT Operations) is proposed to assist human operators in managing complex infrastructures using machine learning. The paper describes the system design of an AIOps platform applicable in distributed environments. It evaluates the overhead of high-frequency monitoring on edge devices and conducts performance experiments for anomaly detection algorithms, showing feasibility with reasonable resource utilization.
Grouping of these:
- Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution
- AIOPs based Predictive Alerting for System Stability in IT Environment
- Artificial Intelligence Enabled Effective Fault Prediction Techniques in Cloud Computing Environment for Improving Resource Optimization
- Log Anomaly to Resolution: AI Based Proactive Incident Remediation
- Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps
- Applications of Artificial Intelligence in IT Disaster Recovery
- Evolving from Traditional Systems to AIOps: Design, Implementation and Measurements
- AIOps Architecture in Data Center Site Infrastructure Monitoring
- Towards AIOps in Edge Computing Environments
- Impact of Artificial Intelligence-enabled Software-defined Networks in Infrastructure and Operations: Trends and Challenges

