GitLab shares its experience of integrating and testing its AI-powered feature suite, GitLab Duo, within its own development workflows. The case study describes how different teams within GitLab use AI capabilities for tasks including code review, documentation, incident response, and feature testing. GitLab reports efficiency gains, reduced manual effort, and improved quality across its development processes as a result.
This case study from GitLab documents how the company internally uses its own AI-powered feature suite, GitLab Duo, across various engineering and product teams. The practice of “dogfooding” (using one’s own products) is common in tech companies, and GitLab applies it to test and demonstrate the value of its AI capabilities before and alongside customer adoption. The case study is part of a broader blog series aimed at showcasing how GitLab creates, tests, and deploys AI features integrated throughout its enterprise DevSecOps platform.
It is important to note that this case study is inherently promotional, coming directly from GitLab’s marketing and product teams. While it provides useful insights into how AI tools can be integrated into developer workflows, readers should approach the claimed benefits with appropriate skepticism, as specific quantitative metrics are largely absent from the discussion.
GitLab Duo encompasses multiple AI-powered capabilities designed to assist developers and other team members throughout the software development lifecycle. The features highlighted in this case study include Code Suggestions, Duo Chat, merge request summaries, and Code Explanation.
The case study describes how Staff Backend Developer Gosia Ksionek uses GitLab Duo to streamline code review processes. The AI summarizes merge requests, making it faster to review code changes, and answers coding questions while explaining complex code snippets. This represents a common LLMOps pattern where AI is integrated directly into developer tooling to reduce cognitive load during code review.
Senior Frontend Engineer Peter Hegman reportedly uses Code Suggestions for full-stack JavaScript and Ruby development, demonstrating the tool’s ability to work across different programming languages and frameworks. This multi-language support is important for production AI tools in heterogeneous development environments.
Several use cases focus on using LLMs for documentation and content generation tasks:
Taylor McCaslin, Group Manager for the Data Science Section, used GitLab Duo to create documentation for GitLab Duo itself, a meta use case that the company highlights as demonstrating the tool’s utility. Staff Technical Writer Suzanne Selhorn used GitLab Duo to optimize documentation site navigation, having it propose a workflow-based ordering of pages, and to draft Getting Started documentation more quickly than she could manually.
Senior Product Manager Amanda Rueda uses GitLab Duo to craft release notes, employing specific prompts like requesting “a two sentence summary of this change, which can be used for our release notes” with guidance on tone, perspective, and value proposition. This prompt engineering approach is a practical example of how production AI tools can be customized for specific content generation tasks through carefully crafted prompts.
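The release-notes prompt described above can be sketched as a small template. This is a minimal illustration, not GitLab's or Amanda Rueda's actual prompt: only the quoted request comes from the case study, while the `ReleaseNoteStyle` fields, their default values, and the function name are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseNoteStyle:
    """Hypothetical style guidance. The case study names these three
    dimensions (tone, perspective, value proposition) but not their values."""
    tone: str = "clear and friendly"
    perspective: str = "second person, addressing the user"
    value_focus: str = "what the user can now do that they could not before"


def build_release_note_prompt(change_description: str,
                              style: ReleaseNoteStyle = ReleaseNoteStyle()) -> str:
    """Combine the quoted request with style guidance and the change text."""
    if not change_description.strip():
        raise ValueError("change_description must not be empty")
    return "\n".join([
        # The request wording below is the phrase quoted in the case study.
        "Please create a two sentence summary of this change, "
        "which can be used for our release notes.",
        f"Tone: {style.tone}.",
        f"Perspective: {style.perspective}.",
        f"Value proposition: emphasize {style.value_focus}.",
        "",
        "Change:",
        change_description.strip(),
    ])


if __name__ == "__main__":
    # Made-up change description, used only to show the assembled prompt.
    print(build_release_note_prompt("Adds bulk editing for issue labels."))
```

Keeping the style guidance in a separate object means the same request can be reused across teams while the tone and perspective are adjusted per audience.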
The case study highlights non-coding applications of the AI tools. Engineering Manager François Rosé uses Duo Chat for drafting and refining OKRs (Objectives and Key Results), providing example prompts that request feedback on objective and key result formulations. Staff Frontend Engineer Denys Mishunov used Chat to formulate text for email templates used in technical interview candidate communications.
These use cases demonstrate that LLM-powered tools in production environments often extend beyond purely technical tasks into administrative and communication workflows.
Staff Site Reliability Engineer Steve Xuereb employs GitLab Duo to summarize production incidents and create detailed incident reviews. He also uses Chat to create boilerplate .gitlab-ci.yml files, which reportedly speeds up his workflow significantly. The Code Explanation feature provides detailed answers during incidents, which helps him understand the relevant parts of the codebase in time-critical situations.
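To make "boilerplate .gitlab-ci.yml" concrete, the sketch below renders the kind of skeleton one might ask a chat assistant for. It is hand-written for illustration and is not output from GitLab Duo; the function name, the `-job` naming scheme, and the placeholder `echo` commands are assumptions. It relies only on the standard `stages`, `stage`, and `script` keywords of GitLab CI/CD.

```python
import re

# Conservative pattern for stage names so the rendered YAML needs no quoting.
_STAGE_NAME = re.compile(r"^[a-z][a-z0-9_-]*$")


def render_ci_boilerplate(stages):
    """Return the text of a minimal .gitlab-ci.yml with one placeholder job per stage."""
    if not stages:
        raise ValueError("at least one stage is required")
    for name in stages:
        if not _STAGE_NAME.match(name):
            raise ValueError(f"unsupported stage name: {name!r}")

    lines = ["stages:"]
    lines += [f"  - {name}" for name in stages]
    for name in stages:
        lines += [
            "",
            f"{name}-job:",          # hypothetical job naming scheme
            f"  stage: {name}",
            "  script:",
            f"    - echo 'replace with {name} commands'",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_ci_boilerplate(["build", "test", "deploy"]))
```

The value of generating this kind of file is that the structure is repetitive and easy to verify by eye, so a reviewer can check the skeleton quickly before filling in the real commands.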
This incident response use case is particularly relevant to LLMOps, as it demonstrates AI assistance in operational contexts where speed and accuracy are critical.
Senior Developer Advocate Michael Friedrich uses GitLab Duo to generate test source code for CI/CD components, sharing this approach in talks and presentations. The case study also mentions that engineers test new features internally before release; for example, they exercised Markdown support in Code Suggestions by using GitLab Duo to write blog posts and documentation in VS Code.
The /explain feature is highlighted as particularly useful for understanding external projects imported into GitLab. This capability was demonstrated during a livestream with open source expert Eddie Jaoude, showcasing how AI can help developers quickly understand unfamiliar codebases, dependencies, and open source projects.
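GitLab claims several benefits from integrating GitLab Duo, chiefly efficiency gains, reduced manual effort, and improved quality across its development processes.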
However, these claims warrant scrutiny. The case study provides anecdotal evidence and user testimonials but lacks specific quantitative metrics such as percentage improvements in cycle time, reduction in bugs, or time savings measurements. The mention of an “AI Impact analytics dashboard” suggests GitLab is developing metrics capabilities, but concrete data from this dashboard is not provided in this case study.
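As an illustration of the kind of quantitative evidence the paragraph above finds missing, the following sketch computes a percentage improvement in cycle time from two samples. It is not based on GitLab's AI Impact analytics dashboard or any GitLab data; the function name, the choice of the median as the summary statistic, and the sample values are all assumptions.

```python
from statistics import median


def percent_reduction(before, after):
    """Percentage reduction in the median from `before` to `after`.

    Positive when the `after` median is lower (for example, shorter cycle
    times); negative when it is higher.
    """
    if not before or not after:
        raise ValueError("both samples must be non-empty")
    baseline = median(before)
    if baseline <= 0:
        raise ValueError("baseline median must be positive")
    return (baseline - median(after)) / baseline * 100.0


if __name__ == "__main__":
    # Made-up cycle times in hours, purely to exercise the function.
    hours_before = [10.0, 20.0, 30.0]
    hours_after = [5.0, 10.0, 15.0]
    print(percent_reduction(hours_before, hours_after))  # 50.0
```

A figure like this is only meaningful alongside the sample sizes and the period each sample covers, which is the level of detail the case study does not provide.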
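The self-referential nature of the case study, in which a company promotes its own products using internal testimonials, means the evidence should be weighed as vendor testimony rather than independent validation. Real-world enterprise adoption and independent benchmarks would provide more reliable validation of the claimed benefits.
While the case study does not delve deeply into technical architecture, several LLMOps-relevant aspects can be inferred from the use cases above: the AI features are embedded directly in existing developer tooling (merge requests, chat, and VS Code), they work across multiple programming languages, and new capabilities are exercised internally before release.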
The mention of validating and testing AI models at scale in related blog posts suggests GitLab has developed internal infrastructure for model evaluation, though details are not provided in this specific case study.
This case study provides a useful window into how a major DevOps platform company integrates AI capabilities throughout its internal workflows. The breadth of use cases, from code generation to documentation to incident response, demonstrates the versatility of LLM-powered tools in production software development environments. However, the promotional nature of the content and the absence of quantitative metrics mean the claimed benefits should be viewed as indicative rather than definitive. The case study is most valuable as a catalog of potential AI integration points in software development workflows rather than as proof of specific productivity improvements.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Anthropic developed a production multi-agent system for their Claude Research feature that uses multiple specialized AI agents working in parallel to conduct complex research tasks across web and enterprise sources. The system employs an orchestrator-worker architecture where a lead agent coordinates and delegates to specialized subagents that operate simultaneously, achieving 90.2% performance improvement over single-agent systems on internal evaluations. The implementation required sophisticated prompt engineering, robust evaluation frameworks, and careful production engineering to handle the stateful, non-deterministic nature of multi-agent interactions at scale.
This podcast discussion between Galileo and Crew AI leadership explores the challenges and solutions for deploying AI agents in production environments at enterprise scale. The conversation covers the technical complexities of multi-agent systems, the need for robust evaluation and observability frameworks, and the emergence of new LLMOps practices specifically designed for non-deterministic agent workflows. Key topics include authentication protocols, custom evaluation metrics, governance frameworks for regulated industries, and the democratization of agent development through no-code platforms.