Datadog built Bits AI SRE, an autonomous agent designed to investigate and resolve production incidents in distributed systems. The agent addresses the challenge of increasing complexity in modern environments where failures span multiple services and generate noisy signals across large volumes of telemetry data. Bits AI SRE mimics human SRE investigation patterns by forming hypotheses, testing them against live telemetry data, and recursively following evidence to root causes. The solution uses a benchmark dataset of real production incidents for evaluation and has reportedly helped teams decrease time to resolution by up to 95%, moving beyond simple summarization to perform deep, causal investigations across multi-component systems.
Datadog developed Bits AI SRE, an autonomous AI agent designed to handle one of the most challenging aspects of operating distributed systems in production: investigating and resolving incidents. This case study represents a sophisticated application of LLMs in production environments, where the agent must operate autonomously on real-time telemetry data to diagnose complex, multi-component system failures. The agent marks a significant departure from simple LLM-based summarization tools, instead implementing a reasoning framework that mimics how experienced SREs approach incident investigation.
The fundamental problem Bits AI SRE addresses stems from the increasing complexity of modern distributed systems. As environments become more dynamic, failures now span multiple services, generate noisier signals, and produce vast volumes of telemetry data that make it difficult for on-call engineers to quickly identify root causes. Datadog claims their agent has helped teams reduce time to resolution by up to 95%, though as with any vendor-provided metrics, these numbers should be viewed as optimistic and likely representing best-case scenarios rather than average improvements.
The architecture of Bits AI SRE reflects a fundamental design decision that distinguishes it from simpler LLM applications: rather than treating the problem as one of summarization or information retrieval, Datadog built the agent to replicate the investigative reasoning process that human SREs use. This manifests in several key architectural choices.
The agent operates through an iterative hypothesis-testing loop. When responding to an incident or monitor alert, Bits AI SRE formulates hypotheses about potential root causes, validates or rejects these hypotheses using targeted queries against live telemetry data, and repeats this process until reaching a root cause. This approach addresses a critical challenge in LLMOps: how to manage context windows and prevent the model from being overwhelmed by irrelevant information or noise that could lead to incorrect conclusions.
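The loop described above can be sketched in a few lines. This is a minimal illustration of the pattern, not Datadog's implementation: `generate_hypotheses` and `test_hypothesis` are hypothetical stand-ins for LLM calls and targeted telemetry queries, and the string verdicts are assumed for clarity.

```python
from collections import deque

def investigate(alert, generate_hypotheses, test_hypothesis, budget=10):
    """Iterative hypothesis-testing loop (sketch): pop a hypothesis, test it
    against telemetry, stop on a confirmed root cause, expand supported
    hypotheses into sub-hypotheses, and discard rejected ones."""
    frontier = deque(generate_hypotheses(alert))
    while frontier and budget > 0:
        budget -= 1  # bound the number of targeted telemetry queries
        hypothesis = frontier.popleft()
        verdict = test_hypothesis(hypothesis)  # one targeted query, not a bulk dump
        if verdict == "root_cause":
            return hypothesis
        if verdict == "supported":
            # follow the evidence: explore this path before its siblings
            frontier.extendleft(reversed(generate_hypotheses(hypothesis)))
    return None  # inconclusive within budget
```

Note that each iteration sends only one hypothesis and one query result to the model, which is what keeps context bounded regardless of how long the investigation runs.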
The evolution of the agent’s architecture reveals important lessons about scaling LLM-based systems in production. Early versions of Bits AI SRE followed a more conventional approach: scaling by performing more tool calls across the platform and then prompting an LLM to summarize all the responses. This approach encountered a fundamental limitation that many production LLM systems face—as the number of tool calls increased, the input token count for the summarization prompt scaled linearly. This linear scaling meant that incorporating additional telemetry data either degraded model performance or exceeded context window limits.
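A back-of-the-envelope calculation makes the failure mode concrete. All token counts below are illustrative assumptions, not Datadog's figures:

```python
# Sketch of the scaling problem: a summarize-everything prompt grows linearly
# with the number of tool calls, while a hypothesis-driven loop keeps each
# prompt bounded. Numbers are illustrative assumptions only.
BASE_PROMPT_TOKENS = 1_000        # instructions + alert context
TOKENS_PER_TOOL_RESPONSE = 4_000  # one logs/traces/metrics query result
CONTEXT_LIMIT = 128_000

def summarize_all_tokens(n_tool_calls: int) -> int:
    # the summarization prompt carries every tool response at once
    return BASE_PROMPT_TOKENS + n_tool_calls * TOKENS_PER_TOOL_RESPONSE

def per_hypothesis_tokens() -> int:
    # each prompt carries only the current hypothesis and its one query result
    return BASE_PROMPT_TOKENS + TOKENS_PER_TOOL_RESPONSE
```

Under these assumed numbers, 12 tool calls already cost 49,000 input tokens, and roughly 32 calls would exceed the context limit, while the per-hypothesis prompt stays constant no matter how many queries the investigation issues.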
Datadog provides a concrete example of this failure mode: in a Kafka incident where lag was caused by a spike in commit latency, an early version issued 12 tool calls across logs, traces, and metrics. While one tool call correctly identified the root cause, other responses included suspicious but ultimately irrelevant signals like critical application errors in an upstream service. The summarization approach failed to distinguish between causal relationships and correlated noise, leading to an incorrect root cause determination.
The current version of Bits AI SRE addresses this limitation by focusing on causal relationships rather than attempting to synthesize all available data at once. This represents a significant architectural shift in how the agent processes information. Rather than viewing the problem as “collect all potentially relevant data and summarize,” the agent now operates with a hypothesis-driven approach that explicitly seeks to establish causal links between the monitor alert and specific telemetry data pertaining to each hypothesis.
This focus on causality over correlation is particularly important in production environments where distributed systems generate enormous amounts of correlated but non-causal signals. The ability to distinguish between a symptom, a correlation, and an actual cause represents one of the most sophisticated reasoning tasks that SREs perform, and building this capability into an autonomous agent required moving beyond standard RAG (retrieval-augmented generation) patterns toward more sophisticated reasoning frameworks.
A particularly sophisticated aspect of Bits AI SRE’s design is its ability to handle multi-component issues that require multiple investigative steps. The agent implements a recursive decomposition strategy where complex hypotheses are broken down into sub-hypotheses. When a sub-hypothesis is supported by evidence, the agent continues digging deeper along that investigative path. If evidence doesn’t support a sub-hypothesis, the agent pivots to explore alternative explanations.
The case study provides an illustrative example: when investigating pods in CrashLoopBackOff state, an early version of the agent correctly identified that the alert fired because a pod ran out of memory. While technically accurate, this represents a surface-level diagnosis. The current version recursively generates deeper root cause hypotheses, ultimately discovering that the out-of-memory conditions were caused by an influx of abnormally large payloads, which caused a single pod to crash and trigger the alert. This deeper investigation provides actionable insight rather than merely confirming the immediate symptom.
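The recursive dig-deeper-or-pivot behavior described above can be sketched as follows. `decompose` and `is_supported` are hypothetical stand-ins for LLM hypothesis generation and telemetry evidence checks; the CrashLoopBackOff labels mirror the example in the case study:

```python
def causal_chain(hypothesis, decompose, is_supported, max_depth=5):
    """Recursive decomposition (sketch): dig deeper along the first
    supported sub-hypothesis; if none is supported, the current
    hypothesis is the deepest cause found."""
    if max_depth == 0:
        return [hypothesis]
    for sub in decompose(hypothesis):
        if is_supported(sub):  # evidence supports it: keep digging
            return [hypothesis] + causal_chain(sub, decompose,
                                               is_supported, max_depth - 1)
        # evidence doesn't support it: pivot to the next alternative
    return [hypothesis]
```

The `max_depth` parameter is one simple way to cap how long an investigation chain can grow; a production system would presumably need a richer stopping criterion.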
This recursive investigation capability represents a significant technical challenge in LLMOps. The agent must maintain investigative state across multiple reasoning steps, decide when to explore deeper versus when to backtrack, and manage the complexity of multi-step reasoning chains without losing coherence or exceeding computational budgets. The implementation details aren’t fully disclosed in the source material, but this likely involves sophisticated prompting strategies, potentially multiple specialized models or agents working in concert, and careful management of context to maintain coherent reasoning across extended investigation chains.
One of the most significant aspects of this case study from an LLMOps perspective is Datadog’s emphasis on evaluation using real production incidents. The team worked across hundreds of internal teams to collect and label actual incidents, creating a benchmark dataset of test scenarios with ground truth labels. The agent’s performance is regularly measured by scoring its output against these labeled real-world scenarios.
This evaluation approach represents a best practice in LLMOps that deserves emphasis. Rather than relying on synthetic benchmarks or proxy metrics, Datadog built their evaluation framework around the actual distribution of problems the agent would encounter in production. This is particularly valuable in the SRE domain where the complexity and diversity of real-world incidents is difficult to capture in synthetic scenarios.
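The core of such an evaluation framework is simple to sketch: run the agent on each labeled incident and score its output against the ground-truth label. The case schema and the `agent` callable below are assumptions for illustration, and real scoring would likely use fuzzier matching than string equality:

```python
def benchmark_accuracy(agent, labeled_incidents):
    """Evaluation harness (sketch): fraction of labeled real-world
    incidents where the agent's root-cause output matches ground truth."""
    correct = sum(
        1 for case in labeled_incidents
        if agent(case["incident"]) == case["root_cause"]
    )
    return correct / len(labeled_incidents)
```

Regularly re-running a harness like this against the same labeled scenarios is what lets a team detect regressions as prompts, models, or tools change.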
The case study explicitly states that having “the largest dataset of production telemetry data in the industry” positions Datadog uniquely to build this type of agent effectively. While this is clearly a marketing claim, there’s an important underlying truth: production LLM systems for complex reasoning tasks benefit enormously from access to large volumes of domain-specific training and evaluation data. The quality and realism of the evaluation framework are likely among the most important factors in the agent’s effectiveness, potentially more so than specific model architecture choices or prompting strategies.
While the case study doesn’t extensively detail the production deployment infrastructure, several aspects can be inferred. The agent must integrate with Datadog’s existing observability platform, querying across logs, traces, metrics, and other telemetry sources. This integration layer represents a significant engineering challenge—the agent needs to translate high-level investigative hypotheses into specific queries against various data sources, execute those queries efficiently, and interpret the results in context.
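The shape of such an integration layer can be sketched as a dispatcher that routes a (source, query) pair produced by the reasoning loop to the right telemetry backend. The source names and query callables here are hypothetical:

```python
def make_query_runner(backends):
    """Integration-layer sketch: route queries from the reasoning loop
    to the appropriate telemetry backend (logs, traces, metrics, ...)."""
    def run(source, query):
        if source not in backends:
            raise ValueError(f"unknown telemetry source: {source!r}")
        return backends[source](query)
    return run
```

Keeping this routing layer separate from the reasoning loop also makes it straightforward to add new data sources without touching the agent's core logic.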
The agent also produces “audit-ready root cause analyses,” which suggests significant attention to explainability and documentation. In production SRE contexts, it’s not sufficient for an agent to simply identify a root cause; it must also provide a coherent explanation of its reasoning process that can be reviewed by human engineers and potentially included in post-incident reports. This requirement for audit trails and explainability adds significant complexity to the LLMOps implementation.
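One plausible way to support such audit trails is to record every investigative step as structured data and serialize the whole trail for review. The field names below are illustrative assumptions, not Datadog's schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class InvestigationStep:
    """One audited step: the hypothesis, the query used to test it,
    and the resulting verdict."""
    hypothesis: str
    query: str
    verdict: str

def render_audit_trail(steps):
    # Serialize the full reasoning trail for post-incident review
    return json.dumps([asdict(s) for s in steps], indent=2)
```

A structured trail like this doubles as raw material for post-incident reports and for debugging the agent itself when its conclusion turns out to be wrong.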
While Datadog presents Bits AI SRE as a significant advancement, several considerations warrant a balanced perspective. The “up to 95% reduction in time to resolution” claim should be treated with appropriate skepticism. This represents an upper bound rather than an average case, and no information is provided about the distribution of improvements across different incident types or complexity levels. It’s likely that the agent performs very well on certain classes of well-defined incidents while struggling with novel or particularly complex scenarios.
The case study doesn’t discuss failure modes, accuracy rates, or situations where the agent provides incorrect or misleading root cause analyses. In production LLM systems, understanding failure modes is at least as important as understanding success cases. Engineers relying on the agent’s output need to understand when to trust its conclusions and when to apply additional scrutiny.
The recursive hypothesis generation and testing approach, while sophisticated, could potentially lead to very long investigation chains that consume significant computational resources or time. The case study doesn’t discuss how the agent manages these trade-offs or how it decides when an investigation is complete versus when to continue exploring.
Datadog indicates they’re expanding Bits AI SRE to cover additional data sources and integrating it with other specialized agents across their platform. This points toward a multi-agent architecture where different specialized agents handle different aspects of the SRE workflow, potentially coordinating to drive end-to-end resolution. This architectural direction aligns with broader trends in LLMOps toward compositional systems where multiple specialized agents or models work together rather than relying on a single monolithic agent.
The mention of “expert investigator and optimization agents” suggests Datadog is building a broader ecosystem of AI capabilities, potentially with different agents specialized for different types of investigations or optimization tasks. This compositional approach offers advantages in terms of maintainability, specialization, and potentially cost management, though it introduces coordination challenges.
This case study illustrates several important principles for production LLM systems in complex domains. First, the shift from summarization-based approaches to hypothesis-driven reasoning represents a significant architectural evolution that addresses fundamental limitations of simpler approaches. Second, the emphasis on evaluation with real production data rather than synthetic benchmarks demonstrates the importance of grounding evaluation in actual use cases. Third, the focus on causal reasoning over correlation highlights how production LLM systems must often implement domain-specific reasoning patterns rather than relying solely on general-purpose summarization or question-answering capabilities.
The recursive investigation capabilities and multi-component issue handling demonstrate that sophisticated production LLM applications often require complex control flows and state management that go beyond simple prompt-response patterns. Finally, the integration challenges of connecting LLM reasoning to existing production infrastructure—in this case, the entire Datadog observability platform—represent a significant but often underappreciated aspect of LLMOps in enterprise contexts.