Komodor introduced Klaudia AI, a multi-agent architecture designed to address the complexity of incident management in modern cloud-native infrastructure. The problem stems from contemporary systems running hundreds of microservices across multi-cloud environments, where symptoms appear in one place while root causes lie elsewhere, making single-agent AI tools ineffective. Klaudia's solution employs a three-layer architecture with over 50 domain-specific expert agents (covering Kubernetes, GPU/NVIDIA, AWS, ArgoCD, Istio, and more) coordinated by workflow orchestrators, all underpinned by a knowledge graph that maps entity relationships across the stack. Reported results include an 80% reduction in MTTR for Kubernetes issues at Cisco Outshift, 55% faster pipeline failure diagnosis with the Airflow agent, and the ability to ship new domain agents in 2-4 weeks through the platform's extensible architecture.
Komodor unveiled Klaudia AI at KubeCon Europe 2026, representing a sophisticated approach to production LLM systems for site reliability engineering. This case study is particularly valuable for understanding multi-agent LLM architectures in production environments, though it’s important to note that the source material is marketing content from Komodor itself, so performance claims should be viewed as vendor-reported rather than independently verified.
The fundamental challenge Komodor addresses is one of the hardest problems in LLMOps for infrastructure: how to provide the right context to LLMs when dealing with massively complex, interconnected systems. Modern cloud-native environments running hundreds of microservices across Kubernetes, service meshes, GPUs, storage layers, and multiple cloud providers create scenarios where symptoms appear in one location while root causes exist in entirely different parts of the stack. Traditional single-agent AI approaches fail because they either drown in too much data or fabricate conclusions from insufficient context.
Klaudia’s production architecture is organized into three distinct layers that demonstrate sophisticated LLMOps engineering:
The Domain Agnostic Core serves as shared infrastructure powering all investigations. This layer includes the Planner, Enricher, Action Executor, Validation Engine, Guardrail Engine, Knowledge Graph, Eval Engine, and Continuous Learning components. This separation of concerns is a noteworthy architectural pattern in production LLM systems, where generic reasoning machinery is decoupled from domain-specific knowledge, enabling consistent behavior and reliability guarantees across all agents.
The Agentic Workflows layer contains orchestrator agents that manage the reliability engineering lifecycle: Detect → Investigate → Remediate → Optimize → Prevent. These workflow agents make strategic decisions about what to examine, which specialists to consult, and how to synthesize findings before transitioning to the next phase. The handoff capability between workflow agents represents a meaningful architectural distinction from monolithic agent systems, allowing specialization at the workflow level while maintaining coherent investigations.
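The lifecycle handoff described above can be sketched as a simple linear state machine. The phase names come from the text; the class and function names below are hypothetical, not Komodor's actual API:

```python
from enum import Enum
from typing import Optional

class Phase(Enum):
    DETECT = "detect"
    INVESTIGATE = "investigate"
    REMEDIATE = "remediate"
    OPTIMIZE = "optimize"
    PREVENT = "prevent"

# Handoff order from the text: each workflow agent synthesizes its
# findings before transitioning the investigation to the next phase.
HANDOFF_ORDER = [Phase.DETECT, Phase.INVESTIGATE, Phase.REMEDIATE,
                 Phase.OPTIMIZE, Phase.PREVENT]

def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase the current workflow agent hands off to,
    or None once the Prevent phase completes."""
    i = HANDOFF_ORDER.index(current)
    return HANDOFF_ORDER[i + 1] if i + 1 < len(HANDOFF_ORDER) else None
```

The point of modeling handoffs explicitly is that each phase can be owned by a specialized orchestrator while the overall investigation stays coherent, in contrast to a single monolithic agent looping over everything.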
The Domain Specific Expertise layer contains over 50 Subject Matter Expert (SME) agents covering technologies including GPU/NVIDIA, AWS, ArgoCD, Istio, Cilium, Airflow, Redis, Kafka, Postgres, and dozens more. Each SME is an autonomous module with deep expertise in exactly one domain. Cloud-layer coverage is in progress and APM support is planned, which will extend coverage to observability platforms such as Datadog, Grafana, New Relic, Splunk, Prometheus, and OpenTelemetry.
Komodor’s architecture explicitly frames the problem as context engineering rather than model selection, which represents an important insight for production LLM systems. The text argues that most AI operations tools fail not because of LLM limitations but because of poor context management—either overwhelming the model with noise or providing insufficient information for accurate reasoning.
The system addresses this through several mechanisms. First, each SME agent receives only the context it needs for its specific domain, reducing hallucination risk through scoped input. Second, agents are designed with isolation properties where poorly-performing or uncertain agents don’t contaminate the broader investigation. Third, the architecture employs iterative refinement where agents form hypotheses, gather targeted evidence, evaluate findings, and refine conclusions—mirroring how senior SREs actually debug production systems.
Each agent is built around a consistent methodology: defining the agent’s goal and scope, specifying investigation patterns and failure modes, defining available tools and data sources, and standardizing output formats for downstream consumption. This consistency of structure across agents is what enables extensibility and reliability rather than just complexity.
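The four-part methodology above maps naturally onto a declarative agent specification. The following is a minimal sketch under that assumption; all field names and the example values are hypothetical, not Komodor's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class SMEAgentSpec:
    # 1. Goal and scope: exactly one domain per agent
    domain: str
    goal: str
    # 2. Investigation patterns and known failure modes
    failure_modes: list = field(default_factory=list)
    # 3. Tools and data sources the agent may call
    tools: list = field(default_factory=list)
    # 4. Standardized output schema for downstream consumption
    output_schema: dict = field(default_factory=lambda: {
        "finding": str, "evidence": list, "confidence": float})

# Illustrative instance loosely based on the GPU agent described
# in the text (DCGM metrics, kernel logs); values are invented.
gpu_agent = SMEAgentSpec(
    domain="gpu/nvidia",
    goal="Diagnose GPU failures from DCGM metrics and kernel logs",
    failure_modes=["xid_errors", "thermal_throttling", "ecc_errors"],
    tools=["dcgm_metrics", "kernel_logs"],
)
```

Because every agent conforms to the same shape, the orchestration layer can consume any SME's output uniformly, which is what makes adding the 51st agent no harder than the 50th.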
One of the most sophisticated LLMOps components is Komodor’s relationship engine, which maintains a dynamic knowledge graph mapping entities and connections across the cloud-native stack. Agents can traverse this graph bidirectionally (forward: A uses B, reverse: B is used by A) to follow incident cascades from symptoms to root causes.
The knowledge graph serves two critical functions in production. First, it preserves context window space by enabling targeted retrieval of connected, relevant data rather than fetching everything. Second, it compresses investigation time by allowing agents to follow relationship chains directly (alert → service → pod → node → GPU → deployment) rather than searching blindly.
Importantly, the graph is extensible—when new agents join the platform, they naturally extend existing mappings. The ArgoCD agent extended Deployment → ReplicaSet → Pod mappings into Application CRDs. The Airflow agent extended the graph to DAG → TaskInstance → Worker Pod → Node. This demonstrates a production-ready approach to evolving LLM system capabilities without requiring retraining or rebuilding existing components.
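The bidirectional traversal and relationship chains described above can be illustrated with a tiny adjacency structure. The entity names come from the text's example chain (alert → service → pod → node → GPU → deployment); the class itself is a hypothetical sketch, not Komodor's relationship engine:

```python
from collections import defaultdict

class RelationshipGraph:
    """Minimal bidirectional entity graph: forward edges mean
    'A uses B', reverse edges mean 'B is used by A'."""
    def __init__(self):
        self.uses = defaultdict(set)
        self.used_by = defaultdict(set)

    def add(self, a, b):
        self.uses[a].add(b)
        self.used_by[b].add(a)

    def chain_from(self, start):
        """Follow forward edges from a symptom entity toward a
        root cause, instead of searching the whole stack blindly."""
        path, seen, node = [start], {start}, start
        while self.uses[node]:
            node = next(iter(self.uses[node] - seen), None)
            if node is None:
                break
            path.append(node)
            seen.add(node)
        return path

g = RelationshipGraph()
for a, b in [("alert", "service"), ("service", "pod"),
             ("pod", "node"), ("node", "gpu"), ("gpu", "deployment")]:
    g.add(a, b)
```

A new agent extends the graph simply by calling `add` with its own entity types (e.g. the Airflow agent's DAG → TaskInstance → Worker Pod → Node chain), which is why extension requires no rebuild of existing mappings.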
The production system employs multi-track investigation where the Main RCA Track and SME Agents Track run in parallel, enriching the investigation iteratively across multiple passes. In each iteration, the main track accumulates evidence while SME agents surface domain-specific findings (Secrets SME checking certificates, Storage/PVC SME examining mount state, GPU agent pulling DCGM metrics and kernel logs).
Critically, a Knowledge Base Query Agent runs alongside both tracks, implementing a RAG pattern by pulling relevant content from indexed customer documentation, runbooks, and postmortems stored in a vector database. The system also surfaces historical learnings from past investigations through vector database queries, enabling the system to apply patterns observed in previous incidents to new situations.
This RAG implementation is particularly noteworthy because it combines multiple knowledge sources: indexed documentation (external knowledge), historical incident data (system memory), and real-time metrics (live observability). The multi-source retrieval strategy represents a sophisticated approach to context augmentation in production LLM systems.
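The multi-source retrieval strategy can be sketched as a single fan-out over three backends. The interfaces below are hypothetical stand-ins (the text does not name the vector database or metrics APIs used):

```python
class _StubIndex:
    """Stand-in for a vector index over docs or past incidents."""
    def __init__(self, hits):
        self.hits = hits
    def search(self, query_embedding, top_k):
        return self.hits[:top_k]

class _StubMetrics:
    """Stand-in for a live observability client."""
    def snapshot(self):
        return {"pod_restarts": 4, "gpu_xid_errors": 1}

def gather_context(query_embedding, doc_index, incident_index,
                   metrics_client, k=3):
    """Sketch of multi-source context augmentation: indexed docs
    (external knowledge), past incidents (system memory), and live
    metrics (observability)."""
    return {
        "docs": doc_index.search(query_embedding, top_k=k),
        "history": incident_index.search(query_embedding, top_k=k),
        "live": metrics_client.snapshot(),
    }

ctx = gather_context(
    [0.12, 0.98],
    doc_index=_StubIndex(["runbook: pod crash loops", "postmortem: OOM"]),
    incident_index=_StubIndex(["incident-42: similar GPU fault"]),
    metrics_client=_StubMetrics(),
    k=2,
)
```

Keeping the three sources distinct in the returned structure matters: downstream agents can weight vetted runbooks differently from automatically accumulated incident memory.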
Komodor addresses a critical gap in production AI systems: the difference between understanding infrastructure in general versus understanding a specific organization’s infrastructure. The system integrates three context sources that work together:
The Blueprint is always loaded and contains architectural truth about the specific customer environment—service dependencies, topology, constraints, and compliance rules. The Knowledge Base is queried on-demand using semantic search to surface relevant content from customer Confluence pages, runbooks, and postmortems. The Self-Learning Memory accumulates automatically over time, capturing root causes and remediation patterns from every investigation.
This continuous learning mechanism is a key LLMOps capability. Past root causes, remediations, and environment patterns are captured and automatically indexed per customer, meaning the tribal knowledge that normally exists only in experienced engineers’ heads accumulates in the system over time. Each customer gets an increasingly tailored expert that understands their specific infrastructure, not just infrastructure in general.
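The learning loop described above can be sketched as a per-customer context object: the Blueprint is unconditionally included, while memory is written after each investigation and queried on the next one. Everything here (class names, the keyword matching standing in for vector similarity) is a hypothetical illustration:

```python
class SelfLearningMemory:
    """Hypothetical per-customer index of past investigation outcomes."""
    def __init__(self):
        self.entries = []
    def add(self, entry):
        self.entries.append(entry)
    def search(self, symptom):
        # Naive substring match standing in for semantic search.
        return [e for e in self.entries if symptom in e["root_cause"]]

class CustomerContext:
    """Sketch of two of the three sources: Blueprint (always loaded)
    and Self-Learning Memory (accumulated per customer)."""
    def __init__(self, blueprint):
        self.blueprint = blueprint
        self.memory = SelfLearningMemory()

    def learn(self, root_cause, remediation):
        # Called after each investigation, so tribal knowledge
        # accumulates in the system rather than in engineers' heads.
        self.memory.add({"root_cause": root_cause,
                         "remediation": remediation})

    def for_incident(self, symptom):
        return {"blueprint": self.blueprint,
                "past_patterns": self.memory.search(symptom)}

ctx = CustomerContext(blueprint={"checkout": ["billing", "redis"]})
ctx.learn("redis eviction under memory pressure", "raise maxmemory")
```

The effect is the one the text claims: the same incident type is resolved faster at a customer where it has occurred before, because the prior remediation is retrieved as context.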
The platform demonstrates impressive agent development velocity, though again these are vendor-reported metrics. The GPU agent went from zero to GA in four weeks, moving through research into NVIDIA failure modes and DCGM metrics, building GPU-specific tooling, prototyping in Klaudia Lab, shadow testing on production, and finally A/B validation and customer beta. The ArgoCD agent reached GA in two weeks. The Airflow agent shipped in four weeks and reportedly delivered 55% faster pipeline failure diagnosis against baseline.
The rapid development is attributed to the mature platform being ready before new agents are built. Workflow agents already know how to investigate, remediate, and learn. The relationship engine already understands entity connections. Each new domain becomes an extension rather than a rebuild from scratch.
From an LLMOps perspective, this demonstrates the value of investing in platform infrastructure. The Domain Agnostic Core—with its validation engine, guardrail engine, eval engine, and continuous learning components—provides the scaffolding that makes new agent development primarily about encoding domain expertise in a structured format rather than solving orchestration, evaluation, and safety problems repeatedly.
While details are limited, the text mentions several production LLMOps practices. New agents go through Klaudia Lab (presumably a testing environment), shadow testing on production workloads, and A/B validation before customer beta and GA release. This staged rollout approach is standard practice for production LLM systems but worth noting as evidence of operational maturity.
The Eval Engine mentioned in the Domain Agnostic Core suggests systematic evaluation infrastructure, though specifics about metrics, benchmarks, or evaluation methodologies aren’t provided. The text also mentions that all actions from Bring Your Own Agent implementations run in sandboxed environments with auditing, indicating security and observability considerations in the production architecture.
The Cisco Outshift deployment provides concrete evidence of production usage. Cisco built JARVIS, an AI Platform Engineer for automating developer workflows, which calls Klaudia as a subagent via A2A (agent-to-agent) protocol when developers encounter Kubernetes issues like CrashLoopBackOff. Klaudia investigates, returns root cause and remediation, and developers get answers without leaving their workflow.
Cisco reported up to 80% reduction in MTTR for Kubernetes issues, with query response time dropping from hours to seconds. While these are impressive claims, they come through Komodor’s marketing materials and lack independent verification. The architectural pattern of embedding Klaudia as a subagent in larger workflows is interesting and demonstrates the composability of the multi-agent system.
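The composition pattern, a parent agent delegating a diagnostic task to a subagent and consuming a structured answer, can be sketched as follows. This is NOT the actual A2A wire format; the payload shape and the resource name are invented for illustration:

```python
import json

def build_subagent_request(symptom: str, resource: str) -> str:
    """Illustrative delegation payload from a parent agent (like
    Cisco's JARVIS) to a diagnostic subagent. The real A2A protocol
    defines its own message format; this sketch only shows the
    delegate-and-consume pattern."""
    return json.dumps({
        "task": "investigate",
        "input": {"symptom": symptom, "resource": resource},
        # The parent expects a structured answer it can surface
        # inside the developer's existing workflow.
        "expected_output": ["root_cause", "remediation"],
    })

# Hypothetical pod name; the symptom comes from the text's example.
req = build_subagent_request("CrashLoopBackOff", "pod/checkout-7f9c")
```

The design choice worth noting is that the subagent returns root cause and remediation as data, not prose, so the parent agent can embed the answer without the developer leaving their workflow.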
The formal Bring Your Own Agent capability represents an important production feature for enterprise LLM systems. Organizations can define custom agents for their own services, tools, and workflows using Model Context Protocol (MCP) or OpenAPI specifications. They specify trigger conditions, external systems the agent can query, the expertise it encodes, and output formats. A single Python file, validation in Klaudia Lab, and the agent joins investigations alongside Komodor’s native SMEs.
This extensibility pattern is significant for production LLMOps because it allows customers to augment the base system with proprietary knowledge and tooling without requiring access to Komodor’s core platform. The standardized interfaces (MCP, OpenAPI) and validation workflow (Klaudia Lab testing, sandboxed execution, auditing) demonstrate production-grade extension mechanisms.
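The text says a custom agent is a single Python file declaring triggers, queryable systems, and output format. A hypothetical sketch of what such a file might contain follows; the declaration structure and the billing-service scenario are invented, since the source does not show Komodor's actual interface:

```python
# Hypothetical single-file custom agent. The text says agents are
# defined via MCP or OpenAPI specs plus a Python file; the exact
# registration structure below is invented for illustration.

CUSTOM_AGENT = {
    "name": "billing-service-sme",
    # Trigger conditions: when this agent joins an investigation
    "triggers": ["service == 'billing'", "error_rate > 0.05"],
    # External systems the agent may query (declared up front,
    # then executed in a sandboxed, audited environment)
    "tools": [{"type": "openapi",
               "spec_url": "https://example.com/billing.yaml"}],
    # Standardized output format for downstream synthesis
    "output_format": {"finding": "str", "evidence": "list",
                      "confidence": "float"},
}

def investigate(context: dict) -> dict:
    """Encodes the proprietary expertise; receives only the scoped
    context relevant to this domain."""
    suspicious = [e for e in context.get("events", [])
                  if "payment_gateway_timeout" in e]
    return {
        "finding": "gateway timeouts" if suspicious else "no billing fault",
        "evidence": suspicious,
        "confidence": 0.8 if suspicious else 0.3,
    }
```

Because the output format matches what native SMEs emit, the orchestrator can treat a customer-authored agent and a Komodor-authored agent identically during synthesis.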
Several aspects of this case study warrant balanced assessment:
The architectural sophistication is genuinely impressive—the separation of domain-agnostic infrastructure from workflow orchestrators and domain experts represents thoughtful system design. The knowledge graph for relationship-aware context retrieval addresses a real problem in infrastructure AI. The multi-track investigation with parallel SME agents and RAG-augmented context is a sensible approach to complex diagnostic workflows.
However, all performance claims come from vendor marketing materials. The 80% MTTR reduction at Cisco, 55% faster pipeline diagnosis with the Airflow agent, and 2-4 week agent development timelines are presented without methodology details, baseline definitions, or independent verification. Production LLM systems are notoriously difficult to evaluate, and without understanding what constitutes “success” in root cause analysis or what baseline comparison was used, these numbers should be viewed as directional rather than definitive.
The text doesn’t address several important LLMOps challenges. What happens when agents disagree or provide conflicting diagnoses? How does the system handle uncertainty quantification and communicate confidence levels? What are the failure modes when the knowledge graph has incomplete relationship data? How does the system prevent prompt injection or adversarial inputs in the Bring Your Own Agent framework? What are the cost implications of running 50+ specialized agents with parallel investigation tracks and vector database queries?
The emphasis on “context engineering” over model selection is both a strength and potential limitation. While it’s true that context matters enormously, the text doesn’t discuss which LLMs power the agents, how prompts are engineered for each domain, whether the system uses fine-tuned models for specific domains versus general-purpose models with RAG, or how the system handles model updates and versioning across 50+ agents.
The continuous learning mechanism is described at a high level but lacks implementation details. How is feedback captured from investigations? What prevents the system from learning incorrect patterns? How is the self-learning memory validated before being applied to new incidents? These are critical questions for production LLM systems where automated learning loops can compound errors.
Despite the marketing context and unverified claims, this case study represents a sophisticated approach to production LLM systems for infrastructure operations. The multi-agent architecture with specialized domain experts, workflow orchestrators, and shared infrastructure components demonstrates mature thinking about how to build reliable, extensible AI systems for complex technical domains.
The emphasis on context engineering, knowledge graphs for relationship-aware retrieval, multi-source RAG, and organizational-specific learning addresses real challenges in production LLM deployments. The extensibility mechanisms (Bring Your Own Agent, A2A protocol integration) suggest a platform designed for enterprise adoption rather than a closed system.
For LLMOps practitioners, the architectural patterns here—particularly the three-layer design separating domain-agnostic infrastructure from orchestration and expertise, the use of knowledge graphs for targeted context retrieval, and the multi-track investigation with parallel specialist agents—offer valuable reference points for building production LLM systems in other complex technical domains beyond infrastructure operations.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as their evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.
Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.