Microsoft Research explored using large language models (LLMs) to automate cloud incident management in Microsoft 365 services. The study focused on using GPT-3 and GPT-3.5 models to analyze incident reports and generate recommendations for root cause analysis and mitigation steps. Through rigorous evaluation of over 40,000 incidents across 1000+ services, they found that fine-tuned GPT-3.5 models significantly outperformed other approaches, with over 70% of on-call engineers rating the recommendations as useful (3/5 or better) in production settings.
This case study from Microsoft Research focuses on applying large language models (LLMs) to automate cloud incident management at hyperscale, specifically for services like Microsoft 365 and Microsoft Teams. The research was published at the prestigious IEEE/ACM International Conference on Software Engineering (ICSE) 2023, building on earlier foundational work that won a Best Paper award at SoCC’22. The core challenge addressed is the operational complexity of managing production incidents in massive cloud services that support hundreds of thousands of organizations, where rapid detection, root cause analysis, and mitigation are critical to minimizing customer impact.
The research team, part of the Microsoft 365 Systems Innovation research group, demonstrated that LLMs can be effectively leveraged to generate automated recommendations for root cause analysis and mitigation steps when incidents occur. This represents a practical application of AIOps (Artificial Intelligence for IT Operations) principles to reduce manual effort and accelerate incident resolution in production environments.
Building reliable hyperscale cloud services presents significant operational challenges. When production incidents occur, engineering teams must quickly detect the issue, perform root cause analysis, and implement mitigation steps. This largely manual process is time-consuming and requires significant human expertise. Microsoft’s prior research on Microsoft Teams incidents provided a comprehensive understanding of the incident lifecycle, common root causes, and mitigation patterns, which motivated the exploration of AI-powered automation.
The research identified that incident management could benefit from automation in two key ways: supporting incident diagnosis to identify root causes and mitigation steps quickly, and leveraging past lessons to build resilience for future incidents. This dual goal aligns well with the capabilities of modern LLMs, which can understand and reason over large volumes of natural language data.
The system architecture takes incident tickets as input, specifically using the title and summary fields that incident authors populate when creating tickets. These fields typically contain error messages, anomalous behavior descriptions, and other diagnostic details. The LLMs then generate two key outputs: root cause analysis and recommended mitigation steps.
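As a minimal sketch of this input/output flow, the snippet below assembles a ticket's title and summary into two generation prompts, one per output. The field names and prompt template are illustrative assumptions, not the paper's exact format:

```python
# Illustrative sketch: build one prompt for root cause analysis and one for
# mitigation recommendations from an incident ticket's title and summary.
# The template wording is a hypothetical stand-in for the study's format.

def build_prompts(title: str, summary: str) -> dict:
    context = f"Title: {title}\nSummary: {summary}\n"
    return {
        "root_cause": context + "Root cause:",
        "mitigation": context + "Mitigation steps:",
    }

prompts = build_prompts(
    "Elevated 5xx errors in EU region",
    "Error rate spiked after the 14:02 UTC config rollout; retries exhausted.",
)
```

In practice the same ticket context feeds both generations, so the two outputs can be produced from a single upstream preprocessing step.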
The research conducted a rigorous comparative study across multiple LLM variants, including GPT-3 models (Curie, Codex, Davinci) and GPT-3.5 (Davinci-002), as well as encoder-based models like RoBERTa and CodeBERT. The study evaluated these models in three configurations: zero-shot (direct inference on pretrained models without task-specific training), fine-tuned (models trained on historical incident data), and multi-task settings.
A particularly notable aspect of this case study is the comprehensive evaluation framework employed. The team used both automated metrics and human evaluation, reflecting best practices in LLMOps evaluation:
Automated Metrics: The evaluation employed multiple similarity metrics (lexical measures such as BLEU and ROUGE, and semantic measures such as BERTScore) to compare generated recommendations against ground truth from the incident management (IcM) portal.
The evaluation also considered both Top-1 and Top-5 predictions, allowing for a more nuanced assessment of model performance.
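A hedged sketch of what Top-1 versus Top-5 scoring means in practice: score each of the model's top-k candidate generations against the reference and keep the best. A simple token-level F1 is used here purely to keep the example dependency-free; it is not one of the study's metrics, which are standard measures like BLEU and ROUGE:

```python
# Illustrative Top-k evaluation with a simple lexical-overlap metric
# (token-level F1). Plain F1 stands in for BLEU/ROUGE so the sketch
# needs no external libraries.
from collections import Counter

def token_f1(candidate: str, reference: str) -> float:
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def top_k_score(candidates: list[str], reference: str, k: int = 5) -> float:
    """Best similarity among the model's top-k generations."""
    return max(token_f1(c, reference) for c in candidates[:k])
```

Because Top-5 takes the maximum over five candidates, it is always at least as high as Top-1, which is why reporting both gives a more nuanced picture of model performance.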
Human Evaluation: The team interviewed actual incident owners (on-call engineers) to evaluate the practical usefulness of generated recommendations. This human-in-the-loop evaluation is crucial for LLMOps deployments, as automated metrics may not fully capture real-world utility.
The fine-tuned GPT-3.5 model (Davinci-002) demonstrated the strongest performance across most metrics, outperforming both the GPT-3 variants and the encoder-based baselines.
An important finding related to fine-tuning: The fine-tuned GPT-3.5 model showed a 45.5% improvement in average lexical similarity for root cause generation and a remarkable 131.3% improvement for mitigation generation compared to zero-shot settings. This highlights the significant value of domain-specific fine-tuning for production LLM deployments, particularly in specialized domains like cloud operations.
The research also revealed performance differences based on incident type. Machine-reported incidents (MRIs) showed better LLM performance than customer-reported incidents (CRIs), attributed to the more repetitive and structured nature of automated incident reports. This finding has practical implications for deployment strategies, suggesting that automation may be more immediately effective for certain incident categories.
The human evaluation results are particularly noteworthy for production deployment considerations: over 70% of on-call engineers rated the recommendations at 3 out of 5 or higher for usefulness in real-time production settings. This validation by actual practitioners provides strong evidence for the practical viability of the approach.
The case study touches on several important LLMOps challenges for production deployment:
Model Selection and Fine-tuning: The comparison across multiple model variants and configurations demonstrates the importance of systematic model evaluation before production deployment. The significant performance gains from fine-tuning suggest that organizations should invest in creating domain-specific training datasets from historical incident data.
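One way to picture building such a domain-specific training set: convert historical incident records into prompt/completion pairs in the JSONL format commonly used for fine-tuning jobs. The field names below are hypothetical, and real tickets from an IcM portal would need cleaning and deduplication first:

```python
# Hedged sketch: turn historical incidents into fine-tuning records for
# root cause generation. Field names ("title", "summary", "root_cause")
# are illustrative assumptions, not the study's schema.
import json

def to_finetune_records(incidents: list[dict]) -> list[dict]:
    records = []
    for inc in incidents:
        prompt = f"Title: {inc['title']}\nSummary: {inc['summary']}\nRoot cause:"
        # Leading space in the completion follows a common fine-tuning convention.
        records.append({"prompt": prompt, "completion": " " + inc["root_cause"]})
    return records

def to_jsonl(records: list[dict]) -> str:
    # One JSON object per line, the usual upload format for fine-tuning APIs.
    return "\n".join(json.dumps(r) for r in records)
```

A parallel dataset with mitigation steps as completions would support the second generation task.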
Staleness and Model Retraining: The research explicitly acknowledges the challenge of model staleness, noting that models would need to be frequently retrained with the latest incident data to remain effective. This reflects the ongoing operational overhead of LLMOps in production environments where the underlying data distribution evolves over time.
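The staleness concern can be operationalized as a simple freshness gate that triggers retraining when the model is too old or too much new incident data has accumulated. The thresholds below are illustrative assumptions, not values from the study:

```python
# Hedged sketch: decide when a fine-tuned incident model should be
# retrained. Both thresholds are arbitrary illustrative defaults.
from datetime import datetime, timedelta, timezone

def needs_retraining(last_trained: datetime,
                     new_incidents_since: int,
                     max_age: timedelta = timedelta(days=30),
                     max_new_incidents: int = 5000) -> bool:
    stale_by_time = datetime.now(timezone.utc) - last_trained > max_age
    stale_by_data = new_incidents_since > max_new_incidents
    return stale_by_time or stale_by_data
```

Tying the gate to data volume as well as age matters because incident patterns shift with deployments, not just with the calendar.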
Context Integration: The researchers identify incorporating additional context (discussion entries, logs, service metrics, dependency graphs) as an open research question. This speaks to the broader challenge in LLMOps of effectively grounding LLMs with relevant operational context to improve accuracy and reduce hallucinations.
Retrieval-Augmented Approaches: Looking forward, the team outlines plans to leverage retrieval-augmented generation (RAG) approaches combined with the latest LLMs to improve incident diagnosis. The described workflow shows a conversational interface where relevant documents and logs are retrieved to provide context for generating more accurate and grounded responses.
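The described workflow can be sketched as a two-step retrieve-then-generate loop: rank candidate documents against the incident text, then prepend the best matches to the diagnosis prompt. Token-overlap scoring stands in here for a real retriever; a production system would likely use dense embeddings:

```python
# Illustrative RAG sketch: retrieve the most relevant past documents by
# token overlap and prepend them to the diagnosis prompt. Scoring and
# template are assumptions, not the team's actual pipeline.

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def rag_prompt(incident: str, docs: list[str]) -> str:
    context = "\n".join(f"- {d}" for d in retrieve(incident, docs))
    return f"Relevant context:\n{context}\n\nIncident: {incident}\nRoot cause:"
```

Grounding the generation in retrieved logs and documents in this way is what the team expects to reduce hallucinations relative to generating from the ticket text alone.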
The research team envisions an evolution toward a conversational interface for incident diagnosis, where ChatGPT-style models can actively participate in incident discussions. This approach would collect evidence from available documents and logs to generate contextual, natural-sounding responses that facilitate diagnosis and accelerate resolution.
The acknowledgment that future LLM versions may reduce the need for fine-tuning while improving performance reflects the rapidly evolving nature of the LLM landscape and the need for LLMOps practices to adapt accordingly.
While the results are promising, several considerations merit attention. The evaluation focuses primarily on text similarity metrics, which may not fully capture the actionable quality of recommendations. The 70% satisfaction rate, while positive, also implies that nearly 30% of engineers did not find the recommendations useful enough, suggesting room for improvement before full automation.
The research is positioned as an initial exploration with “many open research questions,” appropriately tempering expectations about immediate production readiness. The staleness challenge, while acknowledged, represents a significant operational burden that organizations would need to address for sustained deployment.
The study also originates from Microsoft Research examining Microsoft’s own services, which provides unique access to real production data but also means the findings are specific to Microsoft’s operational context and incident management practices. Generalization to other cloud environments would require further validation.
This case study presents lessons learned from deploying generative AI applications in production, with a specific focus on Flo Health's implementation of a women's health chatbot on the Databricks platform. The presentation addresses common failure points in GenAI projects, including poor constraint definition, over-reliance on LLM autonomy, and insufficient engineering discipline. The solution emphasizes deterministic system architecture over autonomous agents, comprehensive observability and tracing, rigorous evaluation frameworks using LLM judges, and proper DevOps practices. Results demonstrate that successful production deployments require treating agentic AI systems as modular architectures built on established software engineering principles rather than as monolithic applications, with particular emphasis on cost tracking, quality monitoring, and end-to-end deployment pipelines.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
A comprehensive overview of how enterprises are implementing LLMOps platforms, drawing from DevOps principles and experiences. The case study explores the evolution from initial AI adoption to scaling across teams, emphasizing the importance of platform teams, enablement, and governance. It highlights the challenges of testing, model management, and developer experience while providing practical insights into building robust AI infrastructure that can support multiple teams within an organization.