ZenML

Building an Agentic DevOps Copilot for Infrastructure Automation

Qovery 2025

Qovery developed an agentic DevOps copilot to automate infrastructure tasks and eliminate repetitive DevOps work. The solution evolved through four phases: basic intent-to-tool mapping; a dynamic agentic system that plans tool sequences; resilience and recovery mechanisms; and conversation memory. The copilot now handles complex multi-step workflows such as deployments, infrastructure optimization, and configuration management. It currently runs on Claude 3.7 Sonnet, with plans for self-hosted models and improved performance.

Industry

Tech

Overview

Qovery, a DevOps automation platform that enables organizations to deploy and manage Kubernetes clusters on any cloud, embarked on building an AI-powered DevOps Copilot in February 2025. The goal was to create an assistant that could understand developer intent and autonomously take action on infrastructure, effectively eliminating the manual grunt work of DevOps. The resulting Agentic DevOps Copilot, currently in Alpha, helps developers automate deployments, optimize infrastructure, and answer advanced configuration questions. This case study provides an interesting look at how the team iterated through multiple architectural phases to achieve a production-ready agentic AI system.

It’s worth noting that this is a promotional piece from Qovery describing their own product development, so some claims should be viewed with that context in mind. The technical details of the evolutionary approach, however, provide valuable insights into building production LLM systems.

Technical Architecture Evolution

Phase 1: Basic Intent-to-Tool Mapping

The initial implementation used a simple agent architecture that detected user intent and mapped it directly to predefined tools or actions. For example, a request like “Stop all dev environments after 6pm” would be mapped to a stop-env tool through explicit mapping logic.

This approach had clear advantages in terms of implementation simplicity, predictable behavior, and explicit control over each action. However, significant limitations emerged quickly.

The team discovered that this rigid approach failed when real users asked questions like “How can I optimize my Dockerfile?” or “Why is my deployment time high this week?” — questions that didn’t fit neatly into predefined categories.

Phase 2: Agentic Architecture

The second phase represented a significant architectural shift to an agentic system. Instead of hardcoded intent-to-tool mapping, the DevOps AI Agent Copilot receives user input, analyzes it, and dynamically plans a sequence of tool invocations to fulfill the request. This approach treats the available tools as a toolbox, allowing the AI to decide which tools to use and in what order.

Each tool in this architecture was designed with specific characteristics: a clear interface, versioning, statelessness, and independent testability.

This agentic approach proved far more scalable and flexible, capable of solving unanticipated user needs while encouraging clean tool abstraction. However, new challenges emerged around tool chaining fragility — outputs had to precisely match expected inputs, and if any tool in the chain failed or behaved unexpectedly, the entire plan would break.
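A minimal sketch of the Phase 2 pattern: a planner (in the real system, an LLM call; hardcoded here for illustration) emits a tool sequence, and an executor chains each tool's output into the next tool's input. All tool names are hypothetical.

```python
# Phase 2 sketch: the model plans a tool sequence instead of relying on a
# fixed intent mapping. Tool names and the planner are illustrative.

TOOLS = {
    "list_services": lambda _: ["api", "worker"],
    "get_deploy_time": lambda services: {s: 120 for s in services},
}

def plan(user_input: str) -> list[str]:
    # In the real system an LLM produces this plan from the user request.
    return ["list_services", "get_deploy_time"]

def execute(user_input: str):
    result = None
    for tool_name in plan(user_input):
        # Each tool's output becomes the next tool's input -- the chaining
        # fragility described above: a mismatch anywhere breaks the plan.
        result = TOOLS[tool_name](result)
    return result
```

The chaining in `execute` makes the Phase 2 weakness concrete: there is no validation or recovery between steps, so a single malformed output derails everything downstream.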

Phase 3: Resilience and Recovery

Recognizing that dynamic systems require graceful failure handling, the third phase introduced resiliency layers and robust retry logic into the agentic execution flow. When the agent misuses a tool or receives unexpected output, the system now retries the failed step, validates intermediate results, and re-plans the remaining workflow when necessary.

This implementation required tracking intermediate state throughout execution, running validation between tool steps, and enabling re-planning when execution fails. The team reports that without these resilience mechanisms, reliability would drop significantly. With them in place, they began seeing successful completions of multi-step workflows that weren’t even anticipated during development — a strong indicator of the system’s adaptability.
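The mechanisms described here (intermediate state tracking, per-step validation, retries, and re-planning on failure) can be sketched as a resilient executor. This is an assumed structure, not Qovery's implementation; `validate` and `replan` are hypothetical callbacks standing in for the real validation and planning layers.

```python
# Phase 3 sketch: retries, per-step validation, and re-planning on failure.

def run_plan(steps, tools, validate, replan, max_retries=2):
    state = {"outputs": []}  # intermediate state tracked across the run
    queue = list(steps)
    while queue:
        step = queue.pop(0)
        for _attempt in range(max_retries + 1):
            try:
                out = tools[step](state)
            except Exception:
                continue  # transient tool error: retry the step
            if validate(step, out):  # check output before chaining it onward
                state["outputs"].append(out)
                break  # step succeeded
        else:
            # Retries exhausted: ask the planner for a new plan from here.
            # (A production system would also bound total re-plans.)
            queue = replan(step, state)
    return state
```

The `for`/`else` idiom triggers re-planning only when every retry fails, so transient tool errors are absorbed without discarding the plan.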

Phase 4: Conversation Memory

The final phase addressed a critical limitation: each request was treated in isolation, which doesn’t match how humans naturally interact. Follow-up questions like “What about the staging cluster?” should relate to the previous question about the production cluster, but the system couldn’t make these connections.

The introduction of conversation memory enabled the Agentic DevOps Copilot to resolve follow-up questions against earlier context and carry state across turns within a session.

The team reports this dramatically improved user experience and opened doors to deeper, multi-step optimization and monitoring tasks.
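Session-scoped conversation memory of the kind described above can be sketched as prior turns prepended to each new prompt, so the model can resolve references like "the staging cluster". The `Session` class is a hypothetical illustration, not Qovery's design.

```python
# Phase 4 sketch: session-scoped conversation memory (illustrative only).

class Session:
    def __init__(self):
        self.history: list[tuple[str, str]] = []  # (user, assistant) turns

    def build_prompt(self, user_input: str) -> str:
        # Prior turns are prepended so follow-ups resolve against context.
        context = "\n".join(
            f"User: {u}\nAssistant: {a}" for u, a in self.history
        )
        return f"{context}\nUser: {user_input}" if context else f"User: {user_input}"

    def record(self, user_input: str, reply: str) -> None:
        self.history.append((user_input, reply))
```

Because the history lives only on the `Session` object, memory ends when the session does, which matches the session-scoped limitation noted later in the roadmap.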

Technology Stack

The case study names one technology directly: Anthropic's Claude 3.7 Sonnet, the model currently powering the copilot.

Current Capabilities and Use Cases

The Copilot, currently in Alpha, handles a range of DevOps tasks, including automating deployments, optimizing infrastructure, and answering advanced configuration questions.

These capabilities demonstrate a blend of informational queries, analytical tasks, and automated actions — suggesting the system has both read and write access to infrastructure components, with appropriate guardrails.

Production Challenges and Future Roadmap

The team is transparent about current limitations and future development priorities:

Planning Latency: Complex tasks can take up to 10 seconds to plan, which the team acknowledges is acceptable for testing but not optimal for production use. This is a common challenge in agentic AI systems where multi-step reasoning adds significant latency.

Self-Hosted Models: While currently using Claude 3.7 Sonnet, Qovery recognizes that, as a business solution, it needs to offer customers options that fit their compliance standards. The team notes that even though no sensitive information is sent to the model, self-hosted model options would address enterprise compliance requirements.

Long-Term Memory: The current conversation memory is session-scoped. Future plans include long-term memory across sessions to tailor the experience to each user and learn from previous interactions.

LLMOps Considerations

This case study illustrates several important LLMOps principles and challenges:

Iterative Architecture Development: The four-phase evolution shows how production LLM systems often require multiple iterations to achieve the right balance of flexibility, reliability, and user experience. Starting simple and adding complexity incrementally allowed the team to understand failure modes before adding more sophisticated capabilities.

Tool Design Philosophy: The emphasis on clear interfaces, versioning, statelessness, and independent testability reflects best practices for building robust tool ecosystems that LLMs can interact with reliably.
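The tool-design principles named here (clear interfaces, versioning, statelessness, independent testability) can be sketched as a small tool descriptor. This is an assumed shape, not Qovery's actual tool API; the `restart_service` tool and its schema are hypothetical.

```python
# Sketch of a versioned, stateless, independently testable tool descriptor.

from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    version: str                  # versioned so plans can pin behavior
    description: str              # what the LLM reads when choosing tools
    input_schema: dict            # declared interface for inputs
    run: Callable[[dict], dict]   # stateless: output depends only on input

def restart_service(args: dict) -> dict:
    # No hidden state: the same arguments always yield the same result shape.
    return {"service": args["service"], "status": "restarted"}

RESTART = Tool(
    name="restart_service",
    version="1.0",
    description="Restart a named service",
    input_schema={"service": "string"},
    run=restart_service,
)
```

Because `run` is a pure function of its arguments, each tool can be unit-tested in isolation, exactly the property that makes a toolbox safe for an LLM planner to compose.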

Error Handling and Recovery: The explicit focus on resilience layers, retry logic, and re-planning capabilities addresses one of the most challenging aspects of agentic AI systems — handling the inevitable failures that occur when LLMs make imperfect decisions or external tools behave unexpectedly.

State Management: Tracking intermediate state between tool invocations is crucial for enabling recovery from failures and maintaining execution context.

Model Selection and Compliance: The acknowledgment that enterprise customers may have specific compliance requirements around model hosting highlights a common tension in LLMOps between using the most capable cloud-hosted models and meeting enterprise security requirements.

Latency Considerations: The 10-second planning time for complex tasks illustrates the latency challenges inherent in multi-step agentic reasoning, which must be addressed for production readiness.

Critical Assessment

While the case study provides useful technical insights, readers should note several limitations: it is a first-party account of Qovery's own product, the system is still in Alpha, and the reliability and user-experience gains are reported by the team rather than independently measured.

Despite these caveats, the evolutionary approach and the specific architectural decisions documented here offer valuable lessons for teams building their own agentic AI systems for infrastructure automation or similar domains.
