GitLab faced challenges delivering prompt improvements for its AI-powered issue description generation feature, particularly for self-managed customers who don't update frequently. They developed an Agent Registry system within their AI Gateway that abstracts provider models, prompts, and parameters, allowing for rapid prompt updates and model switching without requiring monolith changes or new releases. This system enables faster iteration on AI features and seamless provider switching while maintaining a clean separation of concerns.
This case study documents GitLab’s architectural evolution in managing AI-powered features, specifically focusing on their “Generate Issue Description” feature. The presentation, given by a GitLab engineer, walks through a proof-of-concept for an Agent Registry system designed to solve fundamental LLMOps challenges around prompt management, provider flexibility, and deployment velocity.
The core problem GitLab faced is a common one in enterprise LLM deployments: how do you iterate on prompts and AI behavior rapidly when your AI logic is tightly coupled to your main application’s release cycle? For GitLab, this is especially acute because while gitlab.com can deploy frequently, self-managed customers—enterprises running GitLab on their own infrastructure—only update to stable releases periodically. This creates a significant lag between prompt improvements and their delivery to a substantial portion of the user base.
In the original implementation, GitLab’s AI features like “Generate Issue Description” were structured with tightly coupled components within the Ruby monolith. The architecture consisted of service classes with specific references to LLM providers and parameters, and prompt classes where prompt templates were literally hardcoded into Ruby code.
This tight coupling created several operational challenges:
The solution GitLab developed involves moving AI logic out of the Ruby monolith and into a dedicated AI Gateway, with an Agent Registry that abstracts away the implementation details. The key architectural components include:
Rather than having AI logic scattered throughout the monolith, GitLab is creating dedicated endpoints in the AI Gateway for each AI operation. For the Generate Description feature, there’s now a specific endpoint that handles the entire AI interaction. The API logic becomes simple because most complexity is abstracted into agents.
The Agent Registry acts as a central coordination layer. When a request comes in, the API layer simply tells the registry to “fetch a specific agent for a specific use case.” The registry knows about all available agents and their configurations. This creates a clean separation of concerns:
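The registry pattern described above can be sketched in a few lines. This is a minimal illustration, not GitLab's actual implementation: the class names (`AgentRegistry`, `Agent`), the `get_agent` method, and the configuration fields are all assumptions chosen to mirror the description in the text.

```python
# Minimal sketch of an Agent Registry: the API layer asks for an agent by
# use case, and the registry resolves provider, prompt, and parameters.
# All names here are illustrative, not GitLab's actual code.
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    provider: str                     # e.g. "anthropic" or "vertex-ai"
    prompt_template: str              # template with request placeholders
    parameters: dict = field(default_factory=dict)

    def build_prompt(self, **inputs) -> str:
        # Render the prompt template with request-specific inputs.
        return self.prompt_template.format(**inputs)


class AgentRegistry:
    def __init__(self):
        self._agents: dict[str, Agent] = {}

    def register(self, use_case: str, agent: Agent) -> None:
        self._agents[use_case] = agent

    def get_agent(self, use_case: str) -> Agent:
        # The caller only knows the use case; everything else is config.
        return self._agents[use_case]


registry = AgentRegistry()
registry.register("generate_description", Agent(
    name="generate-description",
    provider="anthropic",
    prompt_template="Write an issue description for: {title}",
    parameters={"temperature": 0.2},
))

agent = registry.get_agent("generate_description")
print(agent.build_prompt(title="Add dark mode"))
```

The point of the indirection is that the API endpoint never touches provider names or prompt text directly, so both can change without touching the calling code.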
For the initial proof of concept, agent configurations are stored in YAML files. Each agent definition includes:
This YAML-based approach was explicitly described as a starting point. The presenter mentioned that this will eventually be replaced with a more dynamic system, potentially using GitLab itself as a “prompt lifecycle manager” to provide a dynamic backend for retrieving agent configurations.
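A definition in this style might look like the following sketch. The field names and values are hypothetical, chosen to match the components the presentation describes (provider, model, prompt template, parameters), not GitLab's actual schema:

```yaml
# Hypothetical agent definition; field names are illustrative,
# not GitLab's actual configuration schema.
name: generate_description
provider: anthropic
model: claude-2
prompt_template: |
  You are helping a user write a GitLab issue description.
  Title: {{title}}
  Write a clear, structured description.
parameters:
  temperature: 0.2
  max_tokens: 1024
```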
One of the most powerful aspects of the architecture is provider abstraction. The demo showed switching from Claude to Google’s Vertex AI (using Gemini/ChatBison) by simply:
Importantly, all input/output processing remains the same—the agent logic handles provider-specific nuances internally.
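The provider abstraction can be sketched as a common interface with interchangeable backends. The provider classes below are stubs for illustration only; real code would call the actual Anthropic and Vertex AI SDKs, whose APIs are not shown here.

```python
# Sketch of provider abstraction: callers see one complete() interface,
# and switching providers is a configuration change. Provider classes
# and method bodies are illustrative placeholders, not real SDK calls.
from abc import ABC, abstractmethod


class Provider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **params) -> str: ...


class AnthropicProvider(Provider):
    def complete(self, prompt: str, **params) -> str:
        # Real code would call the Anthropic API; stubbed here.
        return f"[claude] {prompt}"


class VertexProvider(Provider):
    def complete(self, prompt: str, **params) -> str:
        # Real code would call Vertex AI; stubbed here.
        return f"[chat-bison] {prompt}"


PROVIDERS = {"anthropic": AnthropicProvider(), "vertex-ai": VertexProvider()}


def run_agent(provider_name: str, prompt: str) -> str:
    # Input/output handling stays identical; only the backend differs.
    return PROVIDERS[provider_name].complete(prompt)


print(run_agent("anthropic", "Summarize this issue"))
print(run_agent("vertex-ai", "Summarize this issue"))
```

Because `run_agent` only depends on the `Provider` interface, the Claude-to-Vertex switch the demo showed reduces to changing one configuration value.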
The presenter walked through a live demonstration that illustrated several key operational capabilities:
The demo showed adding a new requirement to always end issue descriptions with a “/assign me” slash command. This was accomplished by simply modifying the YAML configuration and restarting the Gateway—no changes to the Ruby monolith required. The presenter emphasized that this restart would be equivalent to releasing a new version of the AI Gateway through their Runway deployment system.
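In the YAML-based scheme described earlier, a change like this would amount to appending one line to the prompt template. The snippet below is a hypothetical illustration using the same assumed schema as before, not the actual diff from the demo:

```yaml
# Hypothetical edit to the agent's prompt template (schema illustrative):
prompt_template: |
  You are helping a user write a GitLab issue description.
  Title: {{title}}
  Write a clear, structured description.
  Always end the description with the slash command: /assign me
```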
The demonstration also showed creating a new agent that uses Vertex AI instead of Claude. The switch was transparent to the user experience, though the presenter noted that “ChatBison seems to be more succinct” in its responses—an interesting observation about behavioral differences between providers that this architecture makes easy to experiment with.
The presenter mentioned that prompts are cached in the Gateway, which is why a restart was needed to pick up changes. This is a practical production consideration—caching improves performance but requires cache invalidation strategies for updates.
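The staleness behavior the presenter described falls out naturally from memoizing config loads. A minimal sketch, assuming an in-memory stand-in for the YAML files and illustrative function names:

```python
# Sketch of prompt caching in a gateway: configs are loaded once and
# memoized, so edits are invisible until the cache is cleared (or the
# process restarts). Names are illustrative, not GitLab's code.
from functools import lru_cache

# Stand-in for YAML files on disk.
AGENT_CONFIGS = {"generate_description": {"prompt": "v1"}}


@lru_cache(maxsize=None)
def load_agent_config(use_case: str) -> dict:
    # In a real gateway this would parse a YAML file from disk.
    return dict(AGENT_CONFIGS[use_case])


print(load_agent_config("generate_description")["prompt"])  # first call fills cache
AGENT_CONFIGS["generate_description"]["prompt"] = "v2"      # edit "on disk"
print(load_agent_config("generate_description")["prompt"])  # still stale
load_agent_config.cache_clear()                             # "restart" the gateway
print(load_agent_config("generate_description")["prompt"])  # fresh value
```

This is why the demo needed a Gateway restart: restarting is the bluntest possible cache-invalidation strategy, and a dynamic prompt backend would need a more targeted one.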
The presentation touched on upcoming work that would replace the static YAML configuration with a dynamic system. Interestingly, this involves using GitLab’s monolith as a “prompt lifecycle manager”—but in a fundamentally different way than before.
Rather than hardcoding prompts in Ruby classes, the monolith would provide a dynamic interface for prompt configuration that the AI Gateway can query. This creates a more sophisticated architecture where:
The presenter acknowledged this might seem contradictory (“taking the prompts out of the monolith and putting them back”) but clarified the crucial difference: the new interface is dynamic rather than hardcoded.
This architecture addresses several production LLMOps concerns:
By decoupling AI logic from the main application release cycle, teams can iterate on prompts and AI behavior independently. This is crucial for LLM-powered features where prompt engineering is an ongoing process.
The abstraction layer makes it straightforward to support multiple LLM providers, enabling:
The presenter explicitly mentioned that this architecture would enable support for custom models, which would require “specific agents using a specific provider and specific templates.”
Perhaps most importantly for GitLab’s business model, this architecture allows self-managed customers to receive AI improvements at the same pace as gitlab.com users, since the AI Gateway can be updated independently.
While the presentation was optimistic about this architecture, there are some considerations worth noting:
Overall, this case study represents a thoughtful approach to a common LLMOps challenge: how to maintain agility in prompt engineering and provider selection while operating at enterprise scale with diverse deployment models.