ZenML

LLM Validation and Testing at Scale: GitLab's Comprehensive Model Evaluation Framework

GitLab 2024

GitLab developed a robust framework for validating and testing LLMs at scale for their GitLab Duo AI features. They created a Centralized Evaluation Framework (CEF) that uses thousands of prompts across multiple use cases to assess model performance. The process involves creating a comprehensive prompt library, establishing baseline model performance, iterative feature development, and continuous validation using metrics like Cosine Similarity Score and LLM Judge, ensuring consistent improvement while maintaining quality across all use cases.

Industry: Tech

Overview

GitLab provides an inside look at how they validate and test AI models at scale for their GitLab Duo AI features, which are integrated throughout their DevSecOps platform. GitLab Duo includes capabilities like intelligent code suggestions, conversational chatbots, code explanations, and vulnerability analysis, all powered by large language models (LLMs). The company uses a multi-model strategy, currently leveraging foundation models from Google and Anthropic, deliberately avoiding lock-in to a single provider.

This case study is valuable for understanding enterprise-grade LLMOps practices because it details the challenges of deploying LLMs in production where outputs are nuanced, diverse, and context-dependent. Unlike traditional software testing where inputs and outputs can be precisely defined, LLM testing requires comprehensive strategies that account for subjective interpretations of quality and the stochastic (probabilistic) nature of model outputs.

The Centralized Evaluation Framework (CEF)

At the core of GitLab’s LLMOps approach is their Centralized Evaluation Framework (CEF), which utilizes thousands of prompts tied to dozens of use cases. This framework is designed to identify significant patterns and assess the overall behavior of both foundational LLMs and the GitLab Duo features in which they are integrated.

The framework serves three primary purposes:

The Testing at Scale Process

Building a Representative Prompt Library

A notable aspect of GitLab’s approach is their commitment to privacy—they explicitly state they do not use customer data to train their AI features. This constraint required them to develop a comprehensive prompt library that serves as a proxy for both the scale and activity of production environments.

The prompt library consists of question/answer pairs where questions represent expected production queries and answers represent “ground truth” or target responses. These pairs can be human-generated or synthetically created. The key design principle is that the library must be representative of inputs expected in production, specific to GitLab features and use cases rather than relying on generic benchmark datasets that may not reflect their specific requirements.
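
As a rough illustration of the structure described above, a prompt library entry could be modeled as a small record pairing a question with its ground-truth answer. This is a minimal sketch, not GitLab's actual schema: the `PromptPair` class, its field names, the use-case labels, and the `coverage_by_use_case` helper are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    """One entry in a hypothetical prompt library (field names are assumed)."""
    use_case: str      # illustrative feature label, e.g. "code_explanation"
    question: str      # stands in for an expected production query
    ground_truth: str  # target response that model output is compared against
    source: str        # "human" or "synthetic"


def coverage_by_use_case(library):
    """Count pairs per use case to spot under-represented features."""
    return Counter(pair.use_case for pair in library)


# Tiny made-up library; real libraries would hold thousands of pairs.
library = [
    PromptPair("code_explanation", "What does this loop do?",
               "It sums the list.", "human"),
    PromptPair("code_explanation", "Explain this regex.",
               "It matches ISO dates.", "synthetic"),
    PromptPair("vulnerability_analysis", "Is this query injectable?",
               "Yes, it concatenates user input.", "human"),
]
coverage = coverage_by_use_case(library)
```

Counting pairs per use case is one simple way to check that the library stays representative of the features it is meant to cover.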

Baseline Model Performance Measurement

Once the prompt library is established, GitLab feeds questions into various models to test how well they serve customer needs. Each response is compared to ground truth and ranked using multiple metrics, including Cosine Similarity Score and LLM Judge.

This baseline measurement guides the selection of foundational models for specific features. GitLab acknowledges that LLM evaluation is not a solved problem and that the wider AI industry is actively researching new techniques. Their model validation team continuously iterates on measurement and scoring approaches.
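
To make the Cosine Similarity Score mentioned in the summary more concrete, the sketch below compares a model response to a ground-truth answer. It assumes simple word-count vectors as a stand-in for whatever text representation GitLab actually uses, which the source does not specify; the function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn text into a word-count vector (assumed stand-in for embeddings)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def score_response(response, ground_truth):
    """Score one model response against its ground-truth answer."""
    return cosine_similarity(bag_of_words(response), bag_of_words(ground_truth))
```

A score near 1 means the response uses nearly the same vocabulary as the target answer, and a score of 0 means no overlap. Word counts ignore meaning and word order, which is one reason a lexical score like this would usually be paired with other metrics such as an LLM judge.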

Feature Development with Confidence

With established baselines, GitLab can develop features knowing how changes affect model behavior. The article makes an important point about prompt engineering: focusing entirely on changing model behavior via prompting without validation means “operating in the dark and very possibly overfitting your prompting.” A change might solve one problem while causing a dozen others—without testing at scale, these regressions would go undetected.

During active development, GitLab re-validates feature performance on a daily basis. This continuous validation helps ensure that all changes improve overall functionality rather than causing unexpected degradation.
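
A daily re-validation of this kind could be reduced to comparing today's per-use-case scores with a stored baseline and flagging any drop. The following is a sketch under stated assumptions: the `find_regressions` name, the dictionary layout, and the `tolerance` parameter are hypothetical and not taken from the source.

```python
def find_regressions(baseline, today, tolerance=0.01):
    """Return use cases whose score dropped by more than `tolerance`.

    Both arguments map a use-case name to an aggregate score, where higher
    is better. Use cases missing from today's run are skipped, not flagged.
    """
    regressions = {}
    for use_case, base_score in baseline.items():
        new_score = today.get(use_case)
        if new_score is None:
            continue
        drop = base_score - new_score
        if drop > tolerance:
            regressions[use_case] = round(drop, 4)
    return regressions
```

The tolerance exists because model outputs are stochastic, so small day-to-day score movement does not necessarily indicate a real regression.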

Iterative Improvement Cycle

The iteration process involves examining scores from scale tests to identify patterns. The team looks for commonalities across weak areas, specific metrics or use cases where performance lags, and consistent errors in response to certain question types. These patterns emerge only through testing at scale, and they determine where experimentation should focus.
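
One simple way to surface lagging use cases is to group per-prompt scores by use case and rank the averages. This is a minimal sketch; the `weakest_use_cases` function, the `(use_case, score)` tuple layout, and the `threshold` value are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean


def weakest_use_cases(results, threshold=0.6):
    """Return (use_case, average_score) pairs below `threshold`, worst first.

    `results` is an iterable of (use_case, score) tuples, one per prompt.
    """
    by_use_case = defaultdict(list)
    for use_case, score in results:
        by_use_case[use_case].append(score)
    averages = {uc: mean(scores) for uc, scores in by_use_case.items()}
    weak = [(uc, avg) for uc, avg in averages.items() if avg < threshold]
    return sorted(weak, key=lambda item: item[1])
```

The same grouping could be keyed by metric or by question type instead of use case to look for the other kinds of patterns described above.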

Since testing at scale is both expensive and time-consuming, GitLab uses a tiered approach. They craft smaller-scale datasets as “mini-proxies”: a focused subset and a broader subset.

Changes are first validated against the focused subset, then the broader subset, and only when both show improvement (or at least no degradation) is the change pushed to production. The entire CEF is then run against the new prompt to validate that it has increased performance against the previous day’s baseline.
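
The tiered gating described above can be sketched as a function that checks the focused subset first, then the broader subset, and signals a full run only if neither degrades. The function names, the scoring callables, and the `tolerance` parameter are hypothetical; the source does not describe how GitLab implements this step.

```python
from statistics import mean


def mean_score(score_fn, dataset):
    """Average score of one prompt variant over a dataset."""
    return mean(score_fn(item) for item in dataset)


def tiered_check(baseline_fn, candidate_fn, focused, broader, tolerance=0.0):
    """Return the stage a candidate change reaches.

    `baseline_fn` and `candidate_fn` map a dataset item to a score for the
    current and the changed prompt. Cheap subsets run before the full suite.
    """
    # Stage 1: the focused subset must improve, or at least not degrade.
    if mean_score(candidate_fn, focused) < mean_score(baseline_fn, focused) - tolerance:
        return "rejected_focused"
    # Stage 2: the broader subset guards against regressions elsewhere.
    if mean_score(candidate_fn, broader) < mean_score(baseline_fn, broader) - tolerance:
        return "rejected_broader"
    # Both subsets held up, so the expensive full evaluation is worth running.
    return "run_full_suite"
```

Ordering the checks from cheapest to most expensive means most failing changes are rejected before any large-scale run is paid for.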

Multi-Model Strategy

GitLab explicitly states they are “not tied to a single model provider by design.” They currently use foundation models from Google and Anthropic but continuously assess which models are the right matches for specific GitLab Duo use cases. This approach provides flexibility and allows them to:

Different LLMs can be optimized for different characteristics, which explains why there are so many AI models actively being developed. GitLab’s evaluation framework allows them to systematically compare models for specific tasks rather than relying on generic benchmarks.
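
Given per-use-case scores for several candidate models, choosing a model per task reduces to a lookup of the top scorer. This is a sketch with placeholder model names and made-up scores; the `best_model_per_use_case` function and the data layout are assumptions, not GitLab's selection logic.

```python
def best_model_per_use_case(scores):
    """Map each use case to its highest-scoring model.

    `scores` has the shape {use_case: {model_name: score}}, higher is better.
    """
    return {
        use_case: max(model_scores, key=model_scores.get)
        for use_case, model_scores in scores.items()
    }


# Placeholder names and made-up numbers, for illustration only.
example_scores = {
    "code_suggestions": {"model_a": 0.81, "model_b": 0.77},
    "chat": {"model_a": 0.70, "model_b": 0.74},
}
selection = best_model_per_use_case(example_scores)
```

In practice a single score would rarely settle the choice; factors the source does not quantify, such as latency or cost, could be folded into the comparison.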

Transparency and Ethics Considerations

The article emphasizes GitLab’s commitment to transparency, referencing their AI Transparency Center and AI Ethics Principles for Product Development. They explicitly state that they do not view or use customer data to train AI features—a significant differentiator from some competitors.

Critical Assessment

While this case study provides valuable insights into production LLMOps practices, a few caveats are worth noting:

Despite these limitations, the case study offers a realistic and practical view of enterprise LLMOps, emphasizing the importance of systematic evaluation, baseline measurement, and iterative improvement rather than ad-hoc prompt engineering. The framework described represents a mature approach to deploying LLMs in production where reliability and quality assurance are paramount.
