Company: GitHub
Title: Evolving GitHub Copilot with LLM Experimentation Across the Developer Lifecycle
Industry: Tech
Year: 2023

## Summary (short)
GitHub details their internal experimentation process with GPT-4 and other large language models to extend GitHub Copilot beyond code completion into multiple stages of the software development lifecycle. The GitHub Next research team received early access to GPT-4 and prototyped numerous AI-powered features including Copilot for Pull Requests, Copilot for Docs, Copilot for CLI, and GitHub Copilot Chat. Through iterative experimentation and internal testing with GitHub employees, the team discovered that user experience design, particularly how AI suggestions are presented and how much control developers retain over them, is as critical as model accuracy for successful adoption. The experiments resulted in technical previews released in March 2023 that demonstrated AI integration across documentation, command-line interfaces, and pull request workflows, with key learnings around making AI outputs predictable, tolerable, steerable, and verifiable.
## Overview

This case study documents GitHub's comprehensive approach to experimenting with and deploying large language models in production as part of their evolution of GitHub Copilot. The article provides rare behind-the-scenes insights into how GitHub Next, the company's research and development division, received early access to OpenAI's GPT-4 model and rapidly prototyped multiple production features across different parts of the developer workflow. The case study is particularly valuable because it openly discusses both successful experiments and failed approaches, revealing critical lessons about LLM deployment that go beyond technical model performance to focus on user experience, workflow integration, and human-AI interaction patterns.

The experimentation period took place between late 2022 and March 2023, culminating in the public announcement of several technical previews that represented GitHub's vision for making AI ubiquitous, conversational, and personalized across the developer experience. The teams involved included researchers and engineers from GitHub Next working on distinct but complementary projects that would collectively expand GitHub Copilot from an IDE-based code completion tool to a platform-wide AI assistant.

## Strategic Framework for AI Experimentation

GitHub established four key design principles that guided all their LLM experimentation work and that together represent a thoughtful framework for production LLM deployment. These principles address fundamental challenges in making AI systems useful rather than merely impressive in demonstrations.

- **Predictability** - creating tools that guide developers toward end goals without surprising or overwhelming them. This acknowledges that while LLMs can generate unexpected outputs, production systems need to maintain consistent behavior patterns that users can rely upon.
- **Tolerability** - explicitly accepting that AI models will be wrong and designing interfaces where users can easily spot incorrect suggestions and address them at low cost to focus and productivity. This represents a pragmatic acceptance of current LLM limitations rather than optimistic assumptions about perfect accuracy.
- **Steerability** - ensuring that when responses aren't correct or aligned with user needs, developers can guide the AI toward better solutions. This principle recognizes that one-shot generation rarely produces perfect results and that interactive refinement is essential for practical utility.
- **Verifiability** - making solutions easy to evaluate so that users can leverage AI as a helpful tool while maintaining appropriate skepticism and oversight. This principle acknowledges that the human remains in the decision-making loop and must be empowered to assess AI outputs efficiently.

These principles collectively demonstrate a mature understanding of LLM capabilities and limitations, moving beyond simple accuracy metrics to consider the full user experience of working with AI systems in production environments.

## GPT-4 Access and Rapid Prototyping

In late 2022, GitHub Next researchers received early access to GPT-4 before its public release. According to Idan Gazit, senior director of research, this represented unprecedented capability - "no one had seen anything like this."
The access created what Gazit describes as "a race to discover what the new models are capable of doing and what kinds of applications are possible tomorrow that were impossible yesterday." The team followed their standard methodology of rapid experimentation - quickly prototyping numerous concepts, identifying those showing genuine value, and then intensively developing the most promising ideas. This approach, which Gazit characterizes as "classic GitHub Next fashion," involved spiking multiple ideas and doubling down on those that appeared likely to bear fruit.

The compressed timeline between receiving model access and the planned March 2023 announcement alongside Microsoft and OpenAI's GPT-4 launch created urgency that drove rapid iteration. Senior leadership at GitHub recognized that while GitHub Next's experiments weren't production-ready, they represented valuable future-focused investments that could inform a broader vision for GitHub Copilot's evolution. This led to strategic thinking about extending Copilot to be ubiquitous across developer tools, conversational by default through natural language interfaces, and personalized to individual, project, team, and community contexts.

## Copilot for Pull Requests: The Critical Importance of UX

The development of Copilot for Pull Requests provides perhaps the most instructive lesson in the entire case study regarding the relationship between AI capability and user acceptance. A team including Andrew Rice, Don Syme, Devon Rifkin, Matt Rothenberg, Max Schaefer, Albert Ziegler, and Aqeel Siddiqui experimented with adding AI capabilities to pull requests, GitHub's signature collaborative code review feature.

The team prototyped several features including automatic code suggestions for reviews, summarization, and test generation. As the March deadline approached, they focused specifically on the summary feature that would generate descriptions and walkthroughs of pull request code to provide context for reviewers. The initial implementation would automatically generate this content as a comment when developers submitted pull requests.

When deployed internally to GitHub employees (referred to as "Hubbers"), the response was notably negative. However, Rice's analysis of the feedback revealed something surprising: the problem wasn't the quality of the AI-generated content itself, but rather how it was presented and integrated into the workflow. Developers expressed concern that the AI might be wrong, but this concern was largely driven by the interface design rather than actual content quality.

The team made a pivotal change: instead of posting AI-generated descriptions as comments, they presented them as suggestions that developers could preview, edit, and optionally accept before finalizing their pull request. This seemingly subtle UX change transformed user reception - the exact same AI-generated content that received poor feedback as automatic comments was suddenly viewed as helpful when presented as editable suggestions.

This experiment demonstrates a crucial LLMOps insight: giving users agency and control over AI outputs dramatically improves acceptance even when the underlying model quality remains constant. The interface shifted the framing from "the AI is making authoritative statements about my code" to "the AI is offering helpful starting points I can refine," fundamentally changing the psychological relationship between developer and tool.
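The article doesn't describe GitHub's implementation, but the interaction change can be illustrated with a minimal sketch: the generated summary is held as a draft that the developer must review, optionally edit, and explicitly accept before anything is posted to the pull request. The function names and the `generate_summary` callable below are hypothetical stand-ins, not GitHub's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SuggestedSummary:
    """An AI-generated PR description held as an editable draft, never auto-posted."""
    text: str
    accepted: bool = False


def propose_summary(diff: str, generate_summary: Callable[[str], str]) -> SuggestedSummary:
    # generate_summary is a hypothetical call into an LLM backend; its output is
    # wrapped as a suggestion rather than written directly as a PR comment.
    return SuggestedSummary(text=generate_summary(diff))


def finalize(suggestion: SuggestedSummary,
             edited_text: Optional[str],
             accept: bool) -> Optional[str]:
    # The developer stays in control: edit the draft, accept it as-is, or discard it.
    # Only an explicit accept produces content that gets posted.
    if not accept:
        return None
    suggestion.accepted = True
    return edited_text if edited_text is not None else suggestion.text
```

The point of the sketch is the control flow, not the model call: the same generated text, routed through an accept/edit/reject step, is what the team found developers were willing to adopt.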
Rice's key takeaway emphasizes that how AI output is presented matters as much or more than the total accuracy of suggestions. Developer tolerance for AI imperfection exists on a spectrum depending on workflow integration. When developers maintain authority to accept, reject, or modify suggestions, they become more forgiving of occasional errors because the cost of verification and correction is low and the benefit of saved time remains high.

## Copilot for Docs: RAG Architecture and Reference Linking

Eddie Aftandilian led development of Copilot for Docs, which took a different technical approach by implementing retrieval-augmented generation (RAG) to ground LLM responses in actual documentation. In late 2022, Aftandilian and Johan Rosenkilde were experimenting with embeddings and retrieval systems, prototyping a vector database for another GitHub Copilot experiment. This work led them to consider whether retrieval could be applied to content beyond code. When GPT-4 access became available, the team realized they could use their retrieval engine to search large documentation corpora and compose search results into prompts that would elicit more accurate, topical answers grounded in actual documentation.

The team - Aftandilian, Devon Rifkin, Jake Donham, and Amelia Wattenberger - identified documentation search as a significant pain point in developer workflows. Developers spend substantial time searching documentation, the experience is often frustrating, and finding correct answers can be difficult. The technical architecture combined vector embeddings for semantic search across documentation with LLM-based answer generation that synthesized retrieved content into conversational responses. This RAG approach aimed to reduce hallucination and increase factual accuracy by grounding the model's responses in retrieved documentation snippets rather than relying purely on parametric knowledge.

The team deployed early versions to GitHub employees, extending Copilot to both internal GitHub documentation and public documentation for various tools and frameworks. A critical design decision emerged from user feedback: including references and links to source documentation alongside AI-generated answers. When testing reached public preview, Aftandilian discovered that developers were remarkably tolerant of imperfect answers as long as the linked references made it easy to evaluate the AI's output and find additional information.

Users were effectively treating Copilot for Docs as an enhanced search engine rather than an oracle. The chat-like modality made answers feel less authoritative than traditional documentation, which paradoxically increased tolerance for errors. Developers appreciated getting pointed in the right direction even when the AI didn't provide perfectly complete answers, because the combination of summarized response plus reference links accelerated their research compared to manual documentation searching.

Aftandilian's key learnings emphasize the importance of shipping early to gather real human feedback rather than optimizing endlessly in isolation. He notes that "human feedback is the true gold standard for developing AI-based tools." Additionally, the UX must be tolerant of AI mistakes - designers cannot assume the AI will always be correct. The initial team focus on achieving perfect accuracy proved less important than creating an interface that acknowledged uncertainty and empowered users to verify outputs efficiently.
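The article doesn't disclose GitHub's actual retrieval stack, but the pattern it describes (embed documentation, retrieve the most relevant chunks for a question, and compose them into a prompt whose answer ships with source links) can be sketched roughly as follows. The `embed`, `search_index`, and `llm_complete` callables are assumed interfaces for illustration, not GitHub's API.

```python
def answer_with_citations(question, embed, search_index, llm_complete, k=5):
    """Minimal RAG sketch: retrieve documentation chunks, ask the model to answer
    using only those chunks, and return reference links alongside the answer.

    `embed`, `search_index`, and `llm_complete` are hypothetical stand-ins for an
    embedding model, a vector store, and a chat-completion call respectively.
    """
    # 1. Embed the question and fetch the k nearest documentation chunks.
    query_vec = embed(question)
    chunks = search_index.nearest(query_vec, k=k)  # each chunk: {"text": ..., "url": ...}

    # 2. Compose retrieved text into the prompt so the answer is grounded in
    #    actual documentation rather than parametric knowledge alone.
    context = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    prompt = (
        "Answer the question using only the numbered excerpts below, "
        "and cite excerpt numbers in your answer.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    answer = llm_complete(prompt)

    # 3. Surface the source URLs so users can verify the answer themselves.
    return {"answer": answer, "references": [c["url"] for c in chunks]}
```

The returned `references` list is what makes the "enhanced search engine" framing work: even an imperfect answer points the developer at the underlying documentation.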
The RAG architecture represents a significant LLMOps pattern for production deployment - combining retrieval systems with generative models to improve accuracy and verifiability. The inclusion of source citations creates an audit trail that allows users to assess whether the AI correctly interpreted source material, partially addressing the black-box nature of LLM reasoning.

## Copilot for CLI: Structured Output and Multi-Purpose Features

Johan Rosenkilde pitched the concept for Copilot for CLI during an October 2022 GitHub Next team meeting in Oxford, England. His initial vision involved using LLMs to help developers figure out command-line interface commands through natural language prompts, possibly with a GUI to help narrow requirements. As Rosenkilde presented this idea, Matt Rothenberg simultaneously built a working prototype that demonstrated the concept's viability within approximately thirty minutes.

While the rapid prototype validated the core concept, it required substantial refinement to reach preview quality. The team carved out dedicated time to transform the rough demo into a polished developer tool that would bring GitHub Copilot capabilities directly into the terminal. By March 2023, they had a technical preview that allowed developers to describe desired shell commands in natural language and receive appropriate commands along with explanatory breakdowns - eliminating the need to search the web for command syntax.

Rosenkilde, who identifies as a backend-focused engineer drawn to complex theoretical problems, credits Rothenberg's UX expertise as critical to the product's success. Rothenberg iterated rapidly through numerous design options, and Rosenkilde came to appreciate how heavily the application's success depended on subtle UX decisions. He notes that since AI models aren't perfect, the key design challenge is minimizing the cost to users when the AI produces imperfect outputs.

A particularly important design element that emerged during development was the explanation field that breaks down each component of suggested shell commands. This feature wasn't part of the original interface but became central to the product's value. However, implementing it required significant prompt engineering effort - Rosenkilde describes hitting the LLM "with a very large hammer" to produce the structured, scannable explanations they desired rather than the long paragraphs that models naturally generate.

The explanation field serves multiple purposes, demonstrating efficient feature design where individual components provide several types of value. It serves as an educational tool helping developers learn about shell commands, a verification mechanism allowing developers to confirm they received the correct command, and a security feature enabling users to check in natural language whether commands will modify unexpected files. This multi-faceted utility allows the visually simple interface to package significant complexity.

The structured output challenge that Rosenkilde describes represents a common LLMOps problem - models trained primarily on natural language often require substantial prompt engineering to produce formatted outputs that integrate well with existing interfaces and workflows. Getting LLMs to generate consistently structured content rather than conversational prose often requires experimentation with prompts, examples, and output constraints.
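The article doesn't show GitHub's prompt, but the structured-output problem it alludes to is a familiar one: constraining the model to return a machine-parseable command plus a per-flag breakdown instead of free-form prose. The sketch below shows one common approach under assumed interfaces, requesting JSON and validating it before display; the prompt wording, schema, and `llm_complete` call are illustrative assumptions, not the Copilot for CLI implementation.

```python
import json

# Hypothetical prompt that forces a fixed JSON shape: one command string plus a
# list of part/meaning pairs the UI can render as a scannable explanation field.
PROMPT_TEMPLATE = """Translate the request into a single shell command.
Respond with JSON only, in this exact shape:
{{"command": "<shell command>",
  "explanation": [{{"part": "<token or flag>", "meaning": "<what it does>"}}]}}

Request: {request}
"""


def suggest_command(request: str, llm_complete) -> dict:
    """Ask the model for a command plus a per-part explanation, then validate the
    structure so the UI never has to render unparsed prose.

    `llm_complete` is a hypothetical completion call; real systems typically also
    retry or re-prompt when the model drifts from the requested format.
    """
    raw = llm_complete(PROMPT_TEMPLATE.format(request=request))
    data = json.loads(raw)  # raises if the model returned unstructured text
    if not isinstance(data.get("command"), str) or not isinstance(data.get("explanation"), list):
        raise ValueError("model output did not match the expected schema")
    return data
```

The validation step is where the "very large hammer" effort tends to land in practice: the prompt, examples, and retry logic all exist to keep the output in a shape the interface can rely on.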
## Common LLMOps Themes Across Experiments

Several consistent patterns emerge across the three major experiments that represent broader LLMOps principles applicable beyond GitHub's specific use cases.

The primacy of user experience over raw accuracy appears repeatedly. All three teams discovered that how AI outputs are presented, framed, and integrated into workflows matters as much or more than the technical quality of model predictions. The pull request team found identical content received vastly different reception based purely on interface framing. The documentation team discovered that reference links made users tolerant of imperfect answers. The CLI team learned that explanation fields transformed commands from opaque suggestions into educational, verifiable tools.

The importance of maintaining human agency and control represents another consistent theme. Successful designs positioned AI as a helpful assistant offering suggestions rather than an authoritative system making decisions. Giving users the ability to preview, edit, accept, or reject AI outputs proved essential for adoption. This aligns with the stated design principle of tolerability - explicitly accepting that AI will sometimes be wrong and designing for easy human oversight.

Rapid prototyping with real user feedback emerged as more valuable than extended isolated development. Multiple teams emphasized shipping quickly to gather human feedback rather than pursuing theoretical perfection. Aftandilian explicitly states that "you should ship something sooner rather than later to get real, human feedback to drive improvements." This iterative approach with fast feedback loops appears central to GitHub's experimentation methodology.

The value of grounding and verifiability appears particularly in the documentation work. The RAG architecture with citation links allowed users to verify AI responses against source material, addressing trust and accuracy concerns. This pattern of making AI reasoning more transparent and checkable represents an important production deployment strategy for high-stakes applications.

The challenge of structured output generation versus natural conversation emerged in the CLI work. While LLMs excel at generating natural language, production applications often require specific formats, structures, or presentation patterns that require significant prompt engineering to achieve reliably.

## Model Selection and Technical Architecture

While the case study focuses heavily on GPT-4, the documentation work also involved embeddings and vector databases for retrieval, suggesting a multi-model architecture. The RAG implementation for Copilot for Docs required separate embedding models to create vector representations of documentation content and a vector database to enable semantic search, with GPT-4 used for answer synthesis given retrieved context.

The case study doesn't provide detailed information about prompt engineering techniques, model fine-tuning approaches, or infrastructure requirements. It doesn't discuss latency requirements, cost optimization strategies, or scaling challenges. The focus remains primarily on product development and UX insights rather than detailed technical implementation.
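Since the article gives no implementation detail, the multi-model split it implies - a separate embedding model to index documentation offline, with GPT-4 used only for synthesis at query time - can only be sketched generically. The following complements the earlier retrieval sketch by showing the assumed indexing half; the chunk size and the `embed`/`vector_store` interfaces are illustrative, not GitHub's.

```python
def index_documentation(pages, embed, vector_store, chunk_chars: int = 1200):
    """Offline indexing sketch: split documentation pages into chunks, embed each
    chunk with a separate embedding model, and store vectors alongside source URLs
    so query-time answers can link back to the original docs.

    `pages` is an iterable of {"url": ..., "text": ...} dicts; `embed` and
    `vector_store` are hypothetical stand-ins for the embedding model and the
    vector database mentioned in the case study.
    """
    for page in pages:
        text = page["text"]
        # Fixed-size character chunking keeps each excerpt small enough that
        # several of them fit into a single synthesis prompt at query time.
        for start in range(0, len(text), chunk_chars):
            chunk = text[start:start + chunk_chars]
            vector_store.add(
                vector=embed(chunk),
                metadata={"url": page["url"], "text": chunk},
            )
```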
The teams appear to have used relatively straightforward prompting approaches with GPT-4 rather than complex fine-tuning or reinforcement learning from human feedback (RLHF), though Rosenkilde's comments about hitting the model "with a very large hammer" to achieve structured outputs suggest some prompt engineering complexity.

## Production Deployment Considerations

The case study describes releasing technical previews rather than generally available products, acknowledging that GitHub Next's work was "future-focused" rather than production-ready. This staged release approach allowed GitHub to gather user feedback and refine implementations before broader deployment.

The internal deployment to GitHub employees ("Hubbers") before public preview represents a valuable testing strategy, though it's worth noting that GitHub employees likely represent a specific demographic of highly technical, developer-focused users who may not be representative of the broader developer population. The negative initial feedback on pull request summaries demonstrates the value of honest internal testing, though organizations should be cautious about over-indexing on internal user preferences.

The staged rollout from internal testing to technical preview to eventual general availability allows for iterative refinement based on progressively larger and more diverse user populations. This approach manages risk while gathering increasingly representative feedback.

## Critical Assessment and Limitations

While this case study provides valuable insights, it's important to note that it represents GitHub's own perspective on their products and comes from a blog post intended to generate interest in their offerings. The narrative presents a relatively positive view of the experimentation process, though it does acknowledge failures like the initial pull request implementation.

The case study doesn't discuss potential negative consequences or concerns about AI-generated content in development workflows. It doesn't address questions about training data, copyright, code ownership, or security implications of AI-generated suggestions. There's no discussion of how errors in AI suggestions might introduce bugs or vulnerabilities into codebases.

The focus on developer experience and productivity gains doesn't include quantitative metrics about actual productivity improvements, error rates, or adoption statistics. Claims about user satisfaction come from qualitative feedback rather than controlled studies. While the insights about UX importance are valuable, they primarily reflect subjective developer preferences rather than measured outcomes.

The teams' emphasis on making AI mistakes "tolerable" and "low cost" acknowledges imperfection but doesn't deeply examine scenarios where even low-cost errors might accumulate or where developers might over-rely on AI suggestions without adequate verification. The responsibility remains with developers to verify outputs, but the case study doesn't address cognitive fatigue or verification burden as developers interact with multiple AI systems throughout their workflows.

The staged technical preview approach means these features were tested with early adopters who volunteered to try experimental features - a population likely more tolerant of rough edges and more capable of identifying and working around issues than the general developer population. Feedback from technical preview users may not fully represent challenges that average developers would experience.
## Broader Implications for LLMOps

Despite these limitations, the case study offers valuable lessons for organizations implementing LLMs in production. The emphasis on UX, human control, and workflow integration represents mature thinking about AI deployment that extends beyond simply achieving high benchmark scores or impressive demos.

The principles of predictability, tolerability, steerability, and verifiability provide a useful framework for evaluating LLM-powered features. These principles acknowledge current AI limitations while providing design guidance for creating practical, useful tools despite those limitations.

The case study demonstrates that successful LLM deployment requires cross-functional collaboration between ML researchers, product designers, and domain experts. The CLI team's partnership between backend engineers and UX designers, for instance, proved essential to creating a successful product.

The rapid prototyping methodology - quickly testing multiple concepts, gathering feedback, and iterating based on real usage - appears more effective than extended development in isolation. This aligns with broader software development principles but may require particular emphasis with AI systems where capabilities and limitations aren't always obvious until systems are tested with real users on real tasks.

The technical approaches demonstrated - particularly the RAG architecture for documentation search and the structured output formatting for CLI commands - represent reusable patterns applicable to other domains. Combining retrieval with generation improves accuracy and verifiability, while investing in prompt engineering to achieve consistent structured outputs enables better integration with existing tools and workflows.

Organizations considering LLM deployment should note GitHub's staged approach from early access experimentation through internal testing to technical preview to general availability. This measured rollout allows for learning and refinement while managing risk. The willingness to pivot based on feedback - as with the pull request summary reframing - demonstrates important organizational flexibility.
