This case study explores how GitHub developed and evolved their evaluation systems for Copilot, their AI code completion tool. Initially skeptical about the feasibility of code completion, the team built a comprehensive evaluation framework called "harness lib" that tested code completions against actual unit tests from open source repositories. As the product evolved to include chat capabilities, they developed new evaluation approaches, including LLM-as-judge for subjective assessments, along with A/B testing and algorithmic evaluations for function calls. This systematic approach to evaluation helped transform Copilot from an experimental project into a robust production system.
This case study details GitHub's journey in building and evolving evaluation systems for their Copilot AI code completion tool, as shared by two former GitHub engineers who worked directly on the project. The study provides deep insights into how a major tech company approached the challenge of evaluating AI systems in production, with a particular focus on code generation.
The evaluation system evolved through several distinct phases and approaches:
**Harness Lib - Verifiable Code Completion Testing**
The team developed a sophisticated evaluation framework called "harness lib" that tested code completions against real unit tests from open source repositories. The process involved:
* Collecting test samples from open source Python and JavaScript repositories
* Filtering for repos where all tests passed
* Using code coverage tools to associate functions with unit tests
* Creating candidate functions that met specific criteria (e.g., having unit tests and docstrings)
* Testing by removing function implementations and having Copilot regenerate them
* Running the original unit tests against the generated code
This approach provided objective measurements of code completion quality, reaching pass rates of 40-50% in early implementations. The team had to manage test repository selection carefully, ensuring the repos were not in the model's training data and that the evaluation set stayed consistent with production traffic patterns.
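As a rough Python sketch of the regenerate-and-test core of such a harness (the `split_at_body` helper, the `generate_completion` model hook, and the pytest invocation are illustrative assumptions, not GitHub's internal code):

```python
import ast
import subprocess
from pathlib import Path

def split_at_body(source: str, func_name: str) -> tuple[str, str]:
    """Split `source` around `func_name`'s body, keeping the signature
    (and docstring, if any) in the prefix that is sent to the model."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            body = node.body
            has_doc = (isinstance(body[0], ast.Expr)
                       and isinstance(body[0].value, ast.Constant)
                       and isinstance(body[0].value.value, str))
            first_stmt = body[1] if has_doc and len(body) > 1 else body[0]
            lines = source.splitlines(keepends=True)
            prefix = "".join(lines[: first_stmt.lineno - 1])
            suffix = "".join(lines[node.end_lineno:])
            return prefix, suffix
    raise ValueError(f"function {func_name} not found")

def evaluate_candidate(repo: Path, file: Path, func_name: str, generate_completion) -> bool:
    """Regenerate one function with the model and re-run the repo's own tests."""
    original = file.read_text()
    prefix, suffix = split_at_body(original, func_name)
    completion = generate_completion(prefix)  # model call (assumed hook)
    file.write_text(prefix + completion + suffix)
    try:
        # The repository's original unit tests act as the ground-truth oracle.
        result = subprocess.run(["pytest", "-q"], cwd=repo, capture_output=True)
        return result.returncode == 0
    finally:
        file.write_text(original)  # always restore the original source
```

A real harness would also need the coverage-based mapping from tests to candidate functions and the repository filtering described above; only the core loop is shown here.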
**A/B Testing and Key Metrics**
For production deployment, the team implemented comprehensive A/B testing with three key metrics:
* Completion acceptance rate (how often users accepted suggestions)
* Characters retained (how much of the accepted code remained after editing)
* Latency (response time for suggestions)
They also maintained dozens of "guardrail metrics" to monitor for unexpected behaviors or regressions. The team learned valuable lessons about metric trade-offs, such as how optimizing for acceptance rates could lead to many tiny completions that created a poor user experience.
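A rough sketch of how these three metrics might be computed from suggestion telemetry (the `SuggestionEvent` schema here is a hypothetical simplification of the real event data):

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class SuggestionEvent:
    """One completion shown to a user (hypothetical event schema)."""
    shown_chars: int        # characters in the suggestion as shown
    accepted: bool          # did the user accept it?
    retained_chars: int     # characters still present after later edits
    latency_ms: float       # time to produce the suggestion

def summarize(events: list[SuggestionEvent]) -> dict[str, float]:
    accepted = [e for e in events if e.accepted]
    return {
        # How often users accept suggestions at all.
        "acceptance_rate": len(accepted) / len(events),
        # Of the accepted code, how much survives subsequent editing.
        "characters_retained": sum(e.retained_chars for e in accepted)
                               / max(sum(e.shown_chars for e in accepted), 1),
        # Tail latency matters more than the mean for perceived responsiveness.
        "latency_p95_ms": quantiles([e.latency_ms for e in events], n=20)[-1],
    }
```

Guardrail metrics would be tracked alongside these, monitored for regressions rather than optimized as targets.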
**LLM-as-Judge Evolution**
With the introduction of Copilot Chat, the team faced new evaluation challenges that couldn't be addressed through traditional unit testing. They needed to evaluate the quality of conversations and explanations, not just code output. This led to the development of their LLM-as-judge approach:
* Initially tried human side-by-side evaluations
* Evolved to using GPT-4 as a judge with specific evaluation criteria (sketched after this list)
* Developed detailed rubrics to keep the judge focused on relevant aspects
* Created a system where humans could verify the judge's decisions efficiently
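A minimal sketch of this judge pattern using the OpenAI Python client (the rubric wording, JSON scoring contract, and model name are illustrative assumptions, not the team's actual prompts):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """\
Score the assistant's answer from 1-5 on each criterion:
- correctness: is the explanation and any code technically accurate?
- relevance: does it address the user's question and code context?
- conciseness: is it free of padding and unnecessary caveats?
Respond with JSON: {"correctness": n, "relevance": n, "conciseness": n, "reason": "..."}"""

def judge(question: str, answer: str, model: str = "gpt-4o") -> dict:
    """Ask a judge model to grade one chat answer against the rubric."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep grading as repeatable as possible
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer to grade:\n{answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Constraining the judge to a narrow rubric and a structured response is what makes human verification cheap: reviewers can scan the scores and the `reason` field instead of re-reading full transcripts.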
**Algorithmic Evaluations**
For specific features like function calls, they developed algorithmic evaluations to verify correct tool usage, as illustrated after the list below:
* Checking correct function call patterns
* Using confusion matrices to identify when tools were being confused or misused
* Monitoring tool selection accuracy
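A small sketch of the confusion-matrix idea (the tool names and report format are illustrative, not Copilot's actual tool set):

```python
from collections import Counter

def tool_confusion_report(cases: list[tuple[str, str]]) -> dict:
    """Summarize (expected_tool, called_tool) pairs over evaluation cases.

    Off-diagonal counts show which tools the model confuses with each other;
    the diagonal gives overall tool-selection accuracy.
    """
    matrix = Counter(cases)
    tools = sorted({name for pair in cases for name in pair})
    accuracy = sum(matrix[(t, t)] for t in tools) / len(cases)
    return {"matrix": dict(matrix), "tools": tools, "selection_accuracy": accuracy}

# Example: the model should have searched the codebase but ran the terminal once.
cases = [("search_code", "search_code"),
         ("search_code", "run_terminal"),
         ("read_file", "read_file")]
print(tool_confusion_report(cases)["selection_accuracy"])  # 0.666...
```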
**Key Learnings and Best Practices**
The case study revealed several important insights about AI evaluation in production:
* The importance of having both regression testing and experimental evaluation capabilities
* How evaluation needs evolve with product capabilities (code completion vs chat)
* The value of caching and optimization in evaluation pipelines (see the caching sketch after this list)
* The need to balance comprehensive evaluation with practical shipping decisions
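On the caching point, one common pattern is to key model outputs by prompt so that re-running a large evaluation suite is cheap and stable; here is a minimal file-based sketch (the `generate` hook stands in for the real model call):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".eval_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_completion(prompt: str, model: str, generate) -> str:
    """Return a cached output for (model, prompt), calling `generate` only on a miss."""
    key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["completion"]
    completion = generate(prompt, model)  # real model call happens only once per prompt
    path.write_text(json.dumps({"completion": completion}))
    return completion
```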
**Infrastructure and Tooling**
The team utilized various tools and approaches:
* Jupyter notebooks for experimentation
* Azure pipelines for automated testing
* SQL databases for storing evaluation data (a storage sketch follows the list)
* Custom dashboards for monitoring metrics
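As a simple illustration of the storage piece, using SQLite in place of whatever database the team actually ran (the table layout is an assumption for illustration):

```python
import sqlite3

def init_db(path: str = "evals.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS eval_results (
            run_id     TEXT,
            sample_id  TEXT,
            metric     TEXT,
            value      REAL,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )""")
    return conn

def record(conn: sqlite3.Connection, run_id: str, sample_id: str,
           metric: str, value: float) -> None:
    conn.execute(
        "INSERT INTO eval_results (run_id, sample_id, metric, value) VALUES (?, ?, ?, ?)",
        (run_id, sample_id, metric, value))
    conn.commit()
```

Dashboards can then be built directly on top of such a table, with one row per metric per evaluated sample per run.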
The case study emphasizes the importance of starting with simple "vibe-based" evaluation for early prototypes before building more sophisticated evaluation systems. The team found that evaluation needs to follow product development rather than precede it, with evaluation systems evolving based on actual production issues and user feedback.
An interesting finding was that evaluations done on high-resource languages like Python and JavaScript proved to be good proxies for performance in other languages. The team also discovered that users preferred shorter, more incremental code completions rather than long complete solutions, leading to adjustments in their prompting strategy.
The success of this evaluation system played a crucial role in transforming Copilot from an experimental project that many were initially skeptical about into a robust production tool. The combination of objective metrics, subjective assessments, and comprehensive testing frameworks provided the feedback loop necessary for continuous improvement and reliable deployment of AI capabilities.