GitLab: LLM Validation and Testing at Scale: GitLab's Comprehensive Model Evaluation Framework

Company: GitLab

Title: LLM Validation and Testing at Scale: GitLab's Comprehensive Model Evaluation Framework

Industry: Tech

Link: https://about.gitlab.com/blog/2024/05/09/developing-gitlab-duo-how-we-validate-and-test-ai-models-at-scale/

Year: 2024

Summary (short)

GitLab developed a robust framework for validating and testing LLMs at scale for their GitLab Duo AI features. They created a Centralized Evaluation Framework (CEF) that uses thousands of prompts across multiple use cases to assess model performance. The process involves creating a comprehensive prompt library, establishing baseline model performance, iterative feature development, and continuous validation using metrics like Cosine Similarity Score and LLM Judge, ensuring consistent improvement while maintaining quality across all use cases.

## Overview

GitLab provides an inside look at how they validate and test AI models at scale for their GitLab Duo AI features, which are integrated throughout their DevSecOps platform. GitLab Duo includes capabilities like intelligent code suggestions, conversational chatbots, code explanations, and vulnerability analysis, all powered by large language models (LLMs). The company uses a multi-model strategy, currently leveraging foundation models from Google and Anthropic, deliberately avoiding lock-in to a single provider.

This case study is valuable for understanding enterprise-grade LLMOps practices because it details the challenges of deploying LLMs in production where outputs are nuanced, diverse, and context-dependent. Unlike traditional software testing where inputs and outputs can be precisely defined, LLM testing requires comprehensive strategies that account for subjective interpretations of quality and the stochastic (probabilistic) nature of model outputs.

## The Centralized Evaluation Framework (CEF)

At the core of GitLab's LLMOps approach is their Centralized Evaluation Framework (CEF), which utilizes thousands of prompts tied to dozens of use cases. This framework is designed to identify significant patterns and assess the overall behavior of both foundational LLMs and the GitLab Duo features in which they are integrated.

The framework serves three primary purposes:

- **Quality Assurance:** Assessing quality and reliability across wide-ranging scenarios and inputs, identifying patterns while mitigating potential issues such as systematic biases, anomalies, and inaccuracies.
- **Performance Optimization:** Evaluating performance and efficiency under real-world conditions, including output quality, latency, and cost considerations for deployment and operation.
- **Risk Mitigation:** Identifying and addressing potential failure modes, security vulnerabilities, and ethical concerns before they impact customers in critical applications.

## The Testing at Scale Process

### Building a Representative Prompt Library

A notable aspect of GitLab's approach is their commitment to privacy: they explicitly state they do not use customer data to train their AI features. This constraint required them to develop a comprehensive prompt library that serves as a proxy for both the scale and activity of production environments.

The prompt library consists of question/answer pairs where questions represent expected production queries and answers represent "ground truth" or target responses. These pairs can be human-generated or synthetically created. The key design principle is that the library must be representative of inputs expected in production, specific to GitLab features and use cases rather than relying on generic benchmark datasets that may not reflect their specific requirements.

### Baseline Model Performance Measurement

Once the prompt library is established, GitLab feeds questions into various models to test how well they serve customer needs. Each response is compared to ground truth and ranked using multiple metrics:

- **Cosine Similarity Score:** Measuring vector similarity between generated and target responses
- **Cross Similarity Score:** Additional similarity measurement for validation
- **LLM Judge:** Using an LLM to evaluate the quality of responses
- **Consensus Filtering with LLM Judge:** Combining multiple evaluation signals for more robust scoring

This baseline measurement guides the selection of foundational models for specific features. GitLab acknowledges that LLM evaluation is not a solved problem and that the wider AI industry is actively researching new techniques. Their model validation team continuously iterates on measurement and scoring approaches.

### Feature Development with Confidence

With established baselines, GitLab can develop features knowing how changes affect model behavior.
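
As a rough illustration of how a baseline and a later comparison against it might be computed, the sketch below scores each response against its ground-truth answer with cosine similarity and lists the use cases whose score dropped. It is not GitLab's implementation: `PromptCase`, `score_by_use_case`, `regressions`, and the `generate` callable are hypothetical names, and word-count vectors stand in for the embedding vectors a real system would more likely use.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptCase:
    """One prompt-library entry: a question and its ground-truth answer."""
    use_case: str
    question: str
    ground_truth: str


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors (assumed stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(
        sum(c * c for c in va.values()) * sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def score_by_use_case(library, generate):
    """Mean similarity to ground truth per use case for one model or prompt."""
    scores = defaultdict(list)
    for case in library:
        response = generate(case.question)
        scores[case.use_case].append(
            cosine_similarity(response, case.ground_truth)
        )
    return {use_case: sum(v) / len(v) for use_case, v in scores.items()}


def regressions(baseline, candidate, tolerance=0.0):
    """Use cases where the candidate scores below the baseline."""
    return sorted(
        use_case
        for use_case, base in baseline.items()
        if candidate.get(use_case, 0.0) < base - tolerance
    )


# Toy demo with invented questions and answers (not GitLab data).
library = [
    PromptCase("chat", "How do I create a branch?", "use git switch -c"),
    PromptCase("explain", "What does len do?", "returns the number of items"),
]
before = {
    "How do I create a branch?": "use git switch -c",
    "What does len do?": "returns the number of items",
}
after = dict(before, **{"How do I create a branch?": "use git branch"})

baseline = score_by_use_case(library, before.get)
candidate = score_by_use_case(library, after.get)
print(regressions(baseline, candidate))  # prints ['chat']
```

Aggregating per use case rather than over the whole library is a deliberate choice in this sketch: a change that helps one feature while hurting another would otherwise average out and go unnoticed.
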
The article makes an important point about prompt engineering: focusing entirely on changing model behavior via prompting without validation means "operating in the dark and very possibly overfitting your prompting." A change might solve one problem while causing a dozen others; without testing at scale, these regressions would go undetected.

During active development, GitLab re-validates feature performance on a daily basis. This continuous validation helps ensure that all changes improve overall functionality rather than causing unexpected degradation.

### Iterative Improvement Cycle

The iteration process involves examining scores from scale tests to identify patterns. They look for commonalities across weak areas, specific metrics or use cases where performance lags, and consistent errors in response to certain question types. Only through testing at scale do these patterns emerge to focus experimentation.

Since testing at scale is both expensive and time-consuming, GitLab uses a tiered approach. They craft smaller-scale datasets as "mini-proxies" containing:

- A focused subset weighted toward question/answer pairs needing improvement
- A broader subset sampling other use cases and scores to ensure changes don't adversely affect the feature broadly

Changes are first validated against the focused subset, then the broader subset, and only when both show improvement (or at least no degradation) is the change pushed to production. The entire CEF is then run against the new prompt to validate that it has increased performance against the previous day's baseline.

## Multi-Model Strategy

GitLab explicitly states they are "not tied to a single model provider by design." They currently use foundation models from Google and Anthropic but continuously assess which models are the right matches for specific GitLab Duo use cases.
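
Matching models to use cases from baseline scores can be pictured with a minimal sketch like the one below. It assumes each candidate model already has a mean score per use case; the function name, model names, and numbers are invented placeholders, not figures from the article, and a real comparison would likely also weigh latency and cost.

```python
from typing import Dict

# Hypothetical layout: model name -> use case -> mean baseline score in [0, 1].
Scores = Dict[str, Dict[str, float]]


def best_model_per_use_case(scores: Scores) -> Dict[str, str]:
    """Pick, for each use case, the model with the highest baseline score."""
    best: Dict[str, str] = {}
    for model, by_use_case in scores.items():
        for use_case, score in by_use_case.items():
            current = best.get(use_case)
            # Ties keep the model that was seen first.
            if current is None or score > scores[current][use_case]:
                best[use_case] = model
    return best


# Placeholder numbers for illustration only.
example_scores: Scores = {
    "model_a": {"code_suggestions": 0.81, "chat": 0.74},
    "model_b": {"code_suggestions": 0.77, "chat": 0.79},
}
print(best_model_per_use_case(example_scores))
# prints {'code_suggestions': 'model_a', 'chat': 'model_b'}
```
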
This approach provides flexibility and allows them to:

- Match models to specific use cases based on performance characteristics
- Avoid vendor lock-in
- Adapt as the LLM landscape evolves rapidly

Different LLMs can be optimized for different characteristics, which explains why there are so many AI models actively being developed. GitLab's evaluation framework allows them to systematically compare models for specific tasks rather than relying on generic benchmarks.

## Transparency and Ethics Considerations

The article emphasizes GitLab's commitment to transparency, referencing their AI Transparency Center and AI Ethics Principles for Product Development. They explicitly state that they do not view or use customer data to train AI features, a significant differentiator from some competitors.

## Critical Assessment

While this case study provides valuable insights into production LLMOps practices, a few caveats are worth noting:

- The article is promotional in nature, published on GitLab's blog to highlight their AI capabilities. Specific performance metrics, error rates, or comparative benchmark results are not shared.
- The claim that they do not use customer data for training is notable, but the article doesn't detail how they ensure their synthetic prompt library truly represents production usage patterns.
- The evaluation metrics mentioned (Cosine Similarity, LLM Judge, etc.) are industry-standard, but the article acknowledges this remains an unsolved problem; there's no claim to have definitively solved LLM evaluation.
- The cost and infrastructure requirements for running daily evaluations across thousands of prompts are not discussed, though they acknowledge testing at scale is "expensive and time-consuming."

Despite these limitations, the case study offers a realistic and practical view of enterprise LLMOps, emphasizing the importance of systematic evaluation, baseline measurement, and iterative improvement rather than ad-hoc prompt engineering. The framework described represents a mature approach to deploying LLMs in production where reliability and quality assurance are paramount.
