## Overview
Cleric AI is building an AI-powered SRE (Site Reliability Engineering) agent that aims to automate the diagnosis, and eventually the remediation, of production infrastructure issues. The company's CTO, Willem, presented an in-depth look at their architecture, learnings, and challenges during a technical talk. The core premise is that modern infrastructure has become too complex for human engineers to monitor and debug effectively at scale, and that AI agents represent a compelling solution to this problem.
The company is currently live with enterprise customers, operating in a VPC deployment model where the agent integrates with existing observability tools rather than introducing new data collection mechanisms. This is an important architectural decision that speaks to their approach of augmenting existing tooling rather than replacing it.
## The Problem Space
The presentation begins by framing the infrastructure monitoring problem in stark terms. Modern production environments, particularly Kubernetes clusters, have become extraordinarily complex. Even a "small" Kubernetes cluster, when mapped as a graph of relationships between pods, deployments, and other resources, reveals an overwhelming number of connections and dependencies.
The speaker argues that traditional approaches to this problem have fundamental limitations:
- **Hiring more engineers** doesn't solve the underlying complexity because more engineers also produce more systems
- **Scripts and runbooks** cannot anticipate every unique situation requiring human judgment
- **Adding more observability tools** like Datadog only creates more dashboards and information without providing actionable insights
- **The rise of AI-assisted code generation** will accelerate the problem, as more code ships to production faster, often with less human understanding of how it works
This framing sets up the case for an AI agent that can operate at machine scale while applying human-like reasoning to infrastructure problems.
## Architecture
Cleric's architecture follows a pattern common to agentic AI systems but is tailored specifically for infrastructure operations. The key components include:
**Reasoning Engine**: This serves as the "brain" of the system, implementing planning, execution, and reflection loops. The agent can generate hypotheses about potential root causes, branch out to investigate multiple paths concurrently, and iteratively refine its understanding based on information gathered from infrastructure.
**Tool Access**: The agent is deployed within the customer's VPC and has access to existing observability tools, knowledge bases, logs, and metrics. Importantly, Cleric doesn't introduce new tools into the environment but operates like an engineer who can query existing systems. This is a critical architectural decision for enterprise adoption, as it minimizes security concerns and integration overhead.
**Knowledge Graph**: The system maintains a continuously updated graph of relationships within the organization—teams, clusters, VMs, deployment relationships, and service dependencies. This graph is used during investigations to efficiently navigate the infrastructure landscape and reduce the search space for root cause analysis.
**Memory and Learning Module**: The agent retains memories of past investigations and resolutions. When similar issues occur, it can retrieve relevant past experiences to inform its current investigation, leading to faster resolution times and improved accuracy over time.
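Taken together, these components suggest a fairly simple data shape. The sketch below is a minimal illustration of what the knowledge graph and memory module might hold; the class and field names are assumptions for illustration, not Cleric's actual interfaces.

```python
# Illustrative-only sketch: hypothetical structures for the knowledge graph
# and memory module; not Cleric's actual interfaces.
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    # Adjacency of org resources: teams, clusters, VMs, services, deployments.
    edges: dict[str, set[str]] = field(default_factory=dict)

    def related(self, resource: str) -> set[str]:
        # During an investigation, start from the alerting resource and only
        # explore its neighbors, rather than the whole environment.
        return self.edges.get(resource, set())


@dataclass
class Investigation:
    alert: str
    root_cause: str
    resolution: str
    tags: frozenset[str]


@dataclass
class Memory:
    past: list[Investigation] = field(default_factory=list)

    def recall(self, tags: frozenset[str]) -> list[Investigation]:
        # Retrieve past investigations whose tags overlap with the new alert.
        return [inv for inv in self.past if tags & inv.tags]
```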
## Investigation Flow
When an alert fires (e.g., a login failure), Cleric kicks into action with a structured investigation process:
The agent first generates multiple hypotheses about potential root causes. For a login failure, this might include auth service issues, database connection problems, or deployment issues. It then branches out to investigate these paths concurrently—something impossible for a human engineer to do effectively.
Each investigative branch involves calling tools, gathering information, and reasoning about the results. The agent uses pre-existing context from the knowledge graph to make more informed decisions about which paths to pursue. This iterative process continues until the agent reaches a conclusion with sufficient confidence.
The final diagnosis is presented to the human engineer in a concise format, typically delivered in Slack where engineers are already working. The key insight here is that the human only sees "the tip of the iceberg"—the agent may have touched hundreds or thousands of systems in its investigation but surfaces only the essential findings.
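The sketch below illustrates the fan-out described above for the login-failure example: generate hypotheses, investigate them concurrently, and surface only a short summary. The hypothesis list and the body of `check_hypothesis` are placeholders, not Cleric's implementation.

```python
# Hypothetical sketch of concurrent hypothesis investigation for an alert.
import asyncio


async def check_hypothesis(hypothesis: str) -> tuple[str, float, str]:
    # Placeholder for real tool calls: query logs, metrics, recent deploys, etc.
    await asyncio.sleep(0)                      # tool I/O would happen here
    return hypothesis, 0.5, "evidence summary"  # (hypothesis, confidence, evidence)


async def investigate_login_failure() -> str:
    hypotheses = [
        "auth service is returning 5xx",
        "database connection pool exhausted",
        "bad deploy of the login frontend",
    ]
    # Branch out: a human checks these one at a time; the agent fans out.
    results = await asyncio.gather(*(check_hypothesis(h) for h in hypotheses))
    best = max(results, key=lambda r: r[1])
    # Only the "tip of the iceberg" is shown to the engineer, e.g. in Slack.
    return f"Most likely cause: {best[0]} (confidence {best[1]:.0%})"


if __name__ == "__main__":
    print(asyncio.run(investigate_login_failure()))
```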
## Key Learnings: Trust But Verify
One of the most valuable sections of the presentation covers lessons learned in deploying an AI agent as a "teammate" in engineering organizations. These insights are relevant to anyone building production AI systems that interact with human experts.
**Confidence-Based Communication**: Cleric implements self-assessment of confidence, and crucially, if the system isn't confident in an answer, it doesn't waste the engineer's time by presenting uncertain findings. This is a departure from many AI systems that always provide an answer regardless of certainty. The rationale is that an AI teammate must provide net value, and overwhelming engineers with low-confidence information undermines trust.
**Concise Output with Progressive Disclosure**: After extensive experimentation with different information presentation formats (raw findings, dashboard links, detailed metrics), the team found that being very concise and allowing users to drill down was most effective. Engineers can open links to full investigation details if they want to verify the agent's reasoning and tool calls, but the primary interface is minimal.
**Feedback Loop Design**: Every interaction includes follow-up actions like "rerun with feedback" or "proposed solution" buttons. These serve as positive and negative signals that provide ground truth for training the agent. The team emphasizes that if humans don't interact with the agent at all (because they're overwhelmed or ignoring it), there's no opportunity for improvement.
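A minimal sketch of the gating and feedback ideas above, assuming a confidence threshold and signal names that the talk does not specify:

```python
# Hypothetical sketch: suppress low-confidence findings instead of posting
# them, and record button clicks as labels for later improvement.
CONFIDENCE_THRESHOLD = 0.75   # assumed value; below this, stay silent

feedback_log: list[dict] = []  # accumulated ground truth from engineer interactions


def maybe_post_finding(finding: str, confidence: float) -> str | None:
    if confidence < CONFIDENCE_THRESHOLD:
        return None  # don't spend the engineer's attention on uncertain results
    # Concise message with progressive disclosure: details live behind a link.
    return f"{finding}\nConfidence: {confidence:.0%} — full investigation: <link>"


def record_feedback(finding: str, action: str) -> None:
    # "proposed_solution" acts as a positive signal, "rerun_with_feedback" as a
    # corrective one; no interaction at all gives the agent nothing to learn from.
    feedback_log.append({"finding": finding, "signal": action})
```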
## Key Learnings: Engineers as Teachers
A significant insight is that engineers want to teach and guide the AI agent just as they would onboard a new human team member. Cleric accommodates this through several mechanisms:
**Natural Interaction**: Engineers can ask follow-up questions in Slack, and the system learns from this dialogue. It extracts service names, facts, and other context from these conversations to improve future performance.
**Control Services**: Teams can provide specific instructions for how Cleric should operate on their services, clusters, or in specific conditions. This gives engineers ownership and control, making the agent feel like a team member rather than an external black box.
**Generalizing Learnings**: When the team observes engineers building the same custom tools or following similar patterns across deployments and customers, they upstream those patterns into the core product. This creates a virtuous cycle where individual customizations become universal improvements.
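One plausible shape for the per-service instructions described above is a small configuration structure keyed by service or cluster. The format below is purely hypothetical; the talk does not describe Cleric's actual configuration syntax.

```python
# Hypothetical example of per-service operating instructions for the agent.
control_instructions = {
    "payments-service": {
        "owner_team": "payments",
        "instructions": [
            "Always check the upstream billing API before blaming the database.",
            "Never restart pods in this namespace without approval.",
        ],
    },
    "checkout-cluster": {
        "conditions": {"during": "business_hours"},
        "instructions": ["Escalate directly to the on-call channel."],
    },
}
```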
## Technical Challenges
The presentation identifies several "gnarly challenges" the team is actively working on:
**Tool Building**: The production environment is overflowing with APIs, CLIs, dashboards, and various data sources. While LLMs excel at processing logs and finding patterns in text, other data types present difficulties. Metrics are particularly challenging—engineers often debug issues by visually correlating graphs across hundreds of services on Grafana dashboards, building causal relationships in their heads. Replicating this process efficiently with an AI agent is non-trivial.
The team mentions drawing inspiration from the ACI (Agent-Computer Interface) layer pattern seen in systems like SWE-agent and OpenHands, which provides agents with a more uniform view of infrastructure.
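One piece of the metrics challenge, correlating a misbehaving service's signal with those of its dependencies, can be approximated programmatically. The sketch below ranks candidate services by correlation with the alerting metric; it is a generic illustration, not Cleric's approach.

```python
# Generic illustration: rank candidate services by how strongly their metrics
# correlate with the alerting service's metric over the same time window.
import numpy as np


def rank_correlated_services(target: np.ndarray,
                             candidates: dict[str, np.ndarray]) -> list[tuple[str, float]]:
    """Return candidate services sorted by |Pearson correlation| with the target metric."""
    scores = []
    for name, series in candidates.items():
        if series.std() == 0 or target.std() == 0:
            continue  # a flat series carries no correlation signal
        r = float(np.corrcoef(target, series)[0, 1])
        scores.append((name, r))
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)


# Toy example: error rate on the login service vs. latency on two dependencies.
t = np.linspace(0, 10, 200)
login_errors = np.sin(t) + 0.1 * np.random.randn(200)
candidates = {
    "auth-service-latency": np.sin(t) + 0.2 * np.random.randn(200),   # correlated
    "billing-service-latency": np.random.randn(200),                  # unrelated
}
print(rank_correlated_services(login_errors, candidates))
```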
**Confidence from Experience**: This is described as a "Holy Grail" problem. Simply using an LLM to assess its own confidence is unreliable because it's not grounded in actual experience. Instead, Cleric classifies and tags incoming alerts along multiple dimensions, then retrieves memories of successfully resolving similar issues in the past. If the system has solved a similar problem many times before, it has grounds for higher confidence. The challenge is determining which dimensions are relevant for similarity matching.
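A minimal sketch of that idea: tag the alert along a few dimensions, recall past investigations with overlapping tags, and score confidence by how often similar issues were actually resolved. The tagging dimensions and the scoring rule here are illustrative assumptions, not Cleric's.

```python
# Hypothetical sketch of confidence grounded in experience rather than
# in the model's self-report.
from collections import Counter


def tag_alert(alert: str) -> frozenset[str]:
    # Real tagging would use a classifier; keyword matching stands in here.
    dimensions = {"auth", "database", "network", "deploy", "latency"}
    return frozenset(w for w in alert.lower().split() if w in dimensions)


def grounded_confidence(alert: str, history: list[dict]) -> float:
    """history items look like {"tags": frozenset, "resolved": bool}."""
    tags = tag_alert(alert)
    similar = [h for h in history if tags & h["tags"]]
    if not similar:
        return 0.0  # never seen anything like this: no grounds for confidence
    outcomes = Counter(h["resolved"] for h in similar)
    return outcomes[True] / len(similar)
```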
**Generalizing Learnings**: Moving from localized learnings (within a customer or team) to universal patterns that benefit all customers is an ongoing challenge. The goal is for an issue solved in one place (for example, a zero-day vulnerability) to benefit everyone else, but implementing this while respecting customer data boundaries requires careful design.
## Roadmap to Autonomy
The ultimate goal is "full autonomy" or closed-loop resolution, where the agent can independently resolve issues without human intervention. The team frames this as a progression:
- **Current state**: Accurate diagnosis, with human approval required for any remediation actions
- **Near-term**: Closed-loop remediation for specific, well-understood classes of problems where the system has high confidence
- **Long-term vision**: Moving beyond reactive alerting toward preventative actions—identifying and addressing potential failures before they manifest
The team acknowledges that Black Swan events and unanticipated issues will always require human judgment, but the goal is to reliably handle known classes of problems automatically.
## Evaluation and Performance Measurement
When asked about performance measurement, the speaker notes this is a significant challenge for agent builders. Different aspects require different metrics: planning accuracy, finding accuracy, diagnosis accuracy, and resolution accuracy are all distinct KPIs. The team maintains an extensive offline evaluation bench for hyperparameter tuning and model evaluation, though they haven't implemented online hyperparameter optimization.
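A sketch of what such an offline bench might compute, assuming a hypothetical labeled-case format with one boolean per stage:

```python
# Hypothetical offline evaluation harness scoring each stage separately.
from statistics import mean


def evaluate(cases: list[dict]) -> dict[str, float]:
    """Each case records whether the agent got each stage right."""
    return {
        "planning_accuracy": mean(c["plan_correct"] for c in cases),
        "finding_accuracy": mean(c["findings_correct"] for c in cases),
        "diagnosis_accuracy": mean(c["diagnosis_correct"] for c in cases),
        "resolution_accuracy": mean(c["resolution_correct"] for c in cases),
    }


cases = [
    {"plan_correct": True, "findings_correct": True,
     "diagnosis_correct": True, "resolution_correct": False},
    {"plan_correct": True, "findings_correct": False,
     "diagnosis_correct": False, "resolution_correct": False},
]
print(evaluate(cases))
```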
## Key Considerations
It's worth noting that while the presentation is compelling, some claims should be viewed with appropriate skepticism as this is ultimately a product presentation. The actual accuracy rates, the degree of enterprise adoption, and the specific performance improvements over human-only workflows are not quantified in the transcript. The comparison to systems like Datadog positions Cleric as complementary rather than competitive, which may be strategically motivated.
The emphasis on "product and change management challenges" being more difficult than the LLM/AI technical challenges is a refreshingly honest perspective that aligns with observations from many production AI deployments. The sociotechnical aspects of introducing an AI agent into an established engineering workflow may indeed be the harder problem.