ZenML

Building an Agentic DevOps Copilot for Infrastructure Automation

Qovery 2025

Qovery developed an agentic DevOps copilot to automate infrastructure tasks and eliminate repetitive DevOps work. The solution evolved through four phases: basic intent-to-tool mapping; a dynamic agentic system that plans tool sequences; resilience and recovery mechanisms; and conversation memory. The copilot now handles complex multi-step workflows such as deployments, infrastructure optimization, and configuration management. It currently runs on Claude 3.7 Sonnet, with plans for self-hosted models and improved performance.

Industry

Tech

Overview

Qovery, a DevOps automation platform that enables organizations to deploy and manage Kubernetes clusters on any cloud, embarked on building an AI-powered DevOps Copilot in February 2025. The goal was to create an assistant that could understand developer intent and autonomously take action on infrastructure, effectively eliminating the manual grunt work of DevOps. The resulting Agentic DevOps Copilot, currently in Alpha, helps developers automate deployments, optimize infrastructure, and answer advanced configuration questions. This case study provides an interesting look at how the team iterated through multiple architectural phases to achieve a production-ready agentic AI system.

It’s worth noting that this is a promotional piece from Qovery describing their own product development, so some claims should be viewed with that context in mind. The technical details of the evolutionary approach, however, provide valuable insights into building production LLM systems.

Technical Architecture Evolution

Phase 1: Basic Intent-to-Tool Mapping

The initial implementation used a simple agent architecture that detected user intent and mapped it directly to predefined tools or actions. For example, a request like “Stop all dev environments after 6pm” would be mapped to a stop-env tool through explicit mapping logic.

This approach had clear advantages in terms of implementation simplicity, predictable behavior, and explicit control over each action. However, significant limitations emerged quickly.

The team discovered that this rigid approach failed when real users asked questions like “How can I optimize my Dockerfile?” or “Why is my deployment time high this week?” — questions that didn’t fit neatly into predefined categories.

Phase 2: Agentic Architecture

The second phase represented a significant architectural shift to an agentic system. Instead of hardcoded intent-to-tool mapping, the DevOps AI Agent Copilot receives user input, analyzes it, and dynamically plans a sequence of tool invocations to fulfill the request. This approach treats the available tools as a toolbox, allowing the AI to decide which tools to use and in what order.

Each tool in this architecture was designed with specific characteristics: a clear interface, versioning, statelessness, and independent testability.

This agentic approach proved far more scalable and flexible, capable of solving unanticipated user needs while encouraging clean tool abstraction. However, new challenges emerged around tool chaining fragility — outputs had to precisely match expected inputs, and if any tool in the chain failed or behaved unexpectedly, the entire plan would break.
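A minimal sketch of the Phase 2 pattern: a planner (in the real system, an LLM call; hardcoded here for illustration) emits a tool sequence, and an executor chains each tool's output into the next tool's input. All tool names are hypothetical.

```python
# Phase 2 sketch: the model plans a tool sequence instead of relying on a
# fixed intent mapping. Tool names and the planner are illustrative.

TOOLS = {
    "list_services": lambda _: ["api", "worker"],
    "get_deploy_time": lambda services: {s: 120 for s in services},
}

def plan(user_input: str) -> list[str]:
    # In the real system an LLM produces this plan from the user request.
    return ["list_services", "get_deploy_time"]

def execute(user_input: str):
    result = None
    for tool_name in plan(user_input):
        # Each tool's output becomes the next tool's input -- the chaining
        # fragility described above: a mismatch anywhere breaks the plan.
        result = TOOLS[tool_name](result)
    return result
```

The chaining in `execute` makes the Phase 2 weakness concrete: there is no validation or recovery between steps, so a single malformed output derails everything downstream.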

Phase 3: Resilience and Recovery

Recognizing that dynamic systems require graceful failure handling, the third phase introduced resiliency layers and robust retry logic into the agentic execution flow. When the agent misuses a tool or receives unexpected output, the system now retries the failed step, validates intermediate results, and re-plans the remaining workflow when necessary.

This implementation required tracking intermediate state throughout execution, running validation between tool steps, and enabling re-planning when execution fails. The team reports that without these resilience mechanisms, reliability would drop significantly. With them in place, they began seeing successful completions of multi-step workflows that weren’t even anticipated during development — a strong indicator of the system’s adaptability.
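The mechanisms described here (intermediate state tracking, per-step validation, retries, and re-planning on failure) can be sketched as a resilient executor. This is an assumed structure, not Qovery's implementation; `validate` and `replan` are hypothetical callbacks standing in for the real validation and planning layers.

```python
# Phase 3 sketch: retries, per-step validation, and re-planning on failure.

def run_plan(steps, tools, validate, replan, max_retries=2):
    state = {"outputs": []}  # intermediate state tracked across the run
    queue = list(steps)
    while queue:
        step = queue.pop(0)
        for _attempt in range(max_retries + 1):
            try:
                out = tools[step](state)
            except Exception:
                continue  # transient tool error: retry the step
            if validate(step, out):  # check output before chaining it onward
                state["outputs"].append(out)
                break  # step succeeded
        else:
            # Retries exhausted: ask the planner for a new plan from here.
            # (A production system would also bound total re-plans.)
            queue = replan(step, state)
    return state
```

The `for`/`else` idiom triggers re-planning only when every retry fails, so transient tool errors are absorbed without discarding the plan.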

Phase 4: Conversation Memory

The final phase addressed a critical limitation: each request was treated in isolation, which doesn’t match how humans naturally interact. Follow-up questions like “What about the staging cluster?” should relate to the previous question about the production cluster, but the system couldn’t make these connections.

The introduction of conversation memory enabled the Agentic DevOps Copilot to resolve follow-up questions against earlier context and carry state across turns within a session.

The team reports this dramatically improved user experience and opened doors to deeper, multi-step optimization and monitoring tasks.
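Session-scoped conversation memory of the kind described above can be sketched as prior turns prepended to each new prompt, so the model can resolve references like "the staging cluster". The `Session` class is a hypothetical illustration, not Qovery's design.

```python
# Phase 4 sketch: session-scoped conversation memory (illustrative only).

class Session:
    def __init__(self):
        self.history: list[tuple[str, str]] = []  # (user, assistant) turns

    def build_prompt(self, user_input: str) -> str:
        # Prior turns are prepended so follow-ups resolve against context.
        context = "\n".join(
            f"User: {u}\nAssistant: {a}" for u, a in self.history
        )
        return f"{context}\nUser: {user_input}" if context else f"User: {user_input}"

    def record(self, user_input: str, reply: str) -> None:
        self.history.append((user_input, reply))
```

Because the history lives only on the `Session` object, memory ends when the session does, which matches the session-scoped limitation noted later in the roadmap.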

Technology Stack

The case study names one technology directly: Anthropic's Claude 3.7 Sonnet, the model currently powering the copilot.

Current Capabilities and Use Cases

The Copilot, currently in Alpha, handles a range of DevOps tasks, including automating deployments, optimizing infrastructure, and answering advanced configuration questions.

These capabilities demonstrate a blend of informational queries, analytical tasks, and automated actions — suggesting the system has both read and write access to infrastructure components, with appropriate guardrails.

Production Challenges and Future Roadmap

The team is transparent about current limitations and future development priorities:

Planning Latency: Complex tasks can take up to 10 seconds to plan, which the team acknowledges is acceptable for testing but not optimal for production use. This is a common challenge in agentic AI systems where multi-step reasoning adds significant latency.

Self-Hosted Models: While currently using Claude 3.7 Sonnet, Qovery recognizes that, as a business solution, it needs to offer customers options that fit their compliance standards. The team notes that even though no sensitive information is sent to the model, self-hosted model options would address enterprise compliance requirements.

Long-Term Memory: The current conversation memory is session-scoped. Future plans include long-term memory across sessions to tailor the experience to each user and learn from previous interactions.

LLMOps Considerations

This case study illustrates several important LLMOps principles and challenges:

Iterative Architecture Development: The four-phase evolution shows how production LLM systems often require multiple iterations to achieve the right balance of flexibility, reliability, and user experience. Starting simple and adding complexity incrementally allowed the team to understand failure modes before adding more sophisticated capabilities.

Tool Design Philosophy: The emphasis on clear interfaces, versioning, statelessness, and independent testability reflects best practices for building robust tool ecosystems that LLMs can interact with reliably.
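The tool-design principles named here (clear interfaces, versioning, statelessness, independent testability) can be sketched as a small tool descriptor. This is an assumed shape, not Qovery's actual tool API; the `restart_service` tool and its schema are hypothetical.

```python
# Sketch of a versioned, stateless, independently testable tool descriptor.

from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    version: str                  # versioned so plans can pin behavior
    description: str              # what the LLM reads when choosing tools
    input_schema: dict            # declared interface for inputs
    run: Callable[[dict], dict]   # stateless: output depends only on input

def restart_service(args: dict) -> dict:
    # No hidden state: the same arguments always yield the same result shape.
    return {"service": args["service"], "status": "restarted"}

RESTART = Tool(
    name="restart_service",
    version="1.0",
    description="Restart a named service",
    input_schema={"service": "string"},
    run=restart_service,
)
```

Because `run` is a pure function of its arguments, each tool can be unit-tested in isolation, exactly the property that makes a toolbox safe for an LLM planner to compose.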

Error Handling and Recovery: The explicit focus on resilience layers, retry logic, and re-planning capabilities addresses one of the most challenging aspects of agentic AI systems — handling the inevitable failures that occur when LLMs make imperfect decisions or external tools behave unexpectedly.

State Management: Tracking intermediate state between tool invocations is crucial for enabling recovery from failures and maintaining execution context.

Model Selection and Compliance: The acknowledgment that enterprise customers may have specific compliance requirements around model hosting highlights a common tension in LLMOps between using the most capable cloud-hosted models and meeting enterprise security requirements.

Latency Considerations: The 10-second planning time for complex tasks illustrates the latency challenges inherent in multi-step agentic reasoning, which must be addressed for production readiness.

Critical Assessment

While the case study provides useful technical insights, readers should note several limitations: it is a first-party account of Qovery's own product, the system is still in Alpha, and the reliability and user-experience gains are reported by the team rather than independently measured.

Despite these caveats, the evolutionary approach and the specific architectural decisions documented here offer valuable lessons for teams building their own agentic AI systems for infrastructure automation or similar domains.
