## Overview
eBay, one of the world's largest e-commerce marketplaces, published this case study describing their systematic approach to deploying generative AI for developer productivity. The article, authored by Senthil Padmanabhan and published in February 2024, outlines three distinct but complementary tracks for leveraging LLMs within their engineering organization. Rather than positioning one approach as superior to others, eBay advocates for a multi-pronged strategy where each track addresses different productivity challenges and can be used in combination.
The case study is notable for its measured, practical approach to AI adoption. eBay acknowledges upfront that developer productivity is complex and cannot be captured by a single metric, employing both quantitative measures (Git metrics, DORA, Flow, code quality via Sonar) and qualitative measures (developer surveys). This methodological rigor adds credibility to their findings, though it's worth noting that many of the specific results come from internal assessments rather than independent verification.
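To make the quantitative side concrete: DORA's Lead Time for Change is typically computed as the elapsed time from a commit landing to that change running successfully in production. The sketch below shows that calculation over hypothetical commit and deploy timestamps; the data layout is an assumption for illustration, not eBay's actual tooling.

```python
from datetime import datetime
from statistics import median

# Hypothetical (commit_time, deploy_time) pairs pulled from Git and CI/CD
# logs; the shape of this data is an assumption, not eBay's schema.
changes = [
    (datetime(2023, 6, 1, 9), datetime(2023, 6, 2, 15)),
    (datetime(2023, 6, 3, 11), datetime(2023, 6, 3, 18)),
    (datetime(2023, 6, 5, 14), datetime(2023, 6, 8, 10)),
]

# DORA Lead Time for Change: time from code committed to code successfully
# running in production; the median is the usual summary statistic.
lead_times = [deploy - commit for commit, deploy in changes]
print("Median lead time:", median(lead_times))
```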
## Track 1: Commercial Offerings (GitHub Copilot)
eBay's first track expanded GitHub Copilot access to all developers, but only after a controlled experiment. The pilot program involved 300 developers split into two groups, one using Copilot and a control group without it, matched for similar assignments and abilities. This A/B test was conducted in summer 2023.
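The article does not describe the statistical analysis behind this comparison. Purely as an illustration of how two such cohorts might be compared, here is a minimal sketch using invented merge-time samples and a non-parametric test; neither the numbers nor the test choice come from eBay.

```python
from scipy import stats

# Illustrative PR creation-to-merge times in hours for each cohort;
# these values are invented, not eBay's data.
copilot_group = [20.1, 18.4, 25.0, 22.3, 19.8, 21.5]
control_group = [26.0, 24.2, 28.9, 23.5, 27.1, 25.4]

# Mann-Whitney U test: non-parametric, so it suits skewed duration data
# where a t-test's normality assumption is shaky.
stat, p_value = stats.mannwhitneyu(copilot_group, control_group,
                                   alternative="less")
print(f"U={stat}, p={p_value:.3f}")
```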
The quantitative results reported were:
- 27% code acceptance rate (as reported through Copilot telemetry)
- 70% accuracy for generated documentation
- 60% accuracy for generated code
- 17% decrease in the time from pull request creation to merge
- 12% decrease in Lead Time for Change
- Code quality (measured through Sonar) remained consistent across both groups
The use cases where Copilot excelled included converting comments to code, suggesting next lines, generating tests, and auto-filling repetitive code. However, eBay encountered significant limitations with context size. Copilot has inherent prompt size limits, which become problematic for a codebase of eBay's scale, with millions of lines of code. Certain tasks require knowledge of the entire codebase, which commercial tools simply cannot accommodate.
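To illustrate why a fixed prompt limit bites at this scale, here is a sketch of the kind of greedy heuristic an assistant must apply: score candidate files for relevance and pack only what fits a token budget, dropping everything else. The keyword scoring and the four-characters-per-token estimate are simplifying assumptions, not Copilot's actual mechanism.

```python
def pack_context(query_terms: set[str], files: dict[str, str],
                 budget_tokens: int) -> str:
    """Greedy heuristic: score files by keyword overlap with the query,
    then pack the highest-scoring ones into a fixed token budget."""
    def score(text: str) -> int:
        return sum(text.count(term) for term in query_terms)

    ranked = sorted(files.items(), key=lambda kv: score(kv[1]), reverse=True)
    picked, used = [], 0
    for path, text in ranked:
        cost = len(text) // 4  # crude estimate: ~4 characters per token
        if used + cost > budget_tokens:
            continue  # file doesn't fit; the model never sees it
        picked.append(f"# File: {path}\n{text}")
        used += cost
    return "\n\n".join(picked)
```

Whatever falls outside the budget is simply invisible to the model, which is exactly the failure mode for tasks that need whole-codebase awareness.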
It's important to note that acceptance rate metrics from commercial tools can be somewhat misleading, as they measure what developers accept rather than what ultimately proves useful. The decrease in merge time is a more meaningful operational metric, though the causal relationship to Copilot specifically would benefit from longer-term study.
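For concreteness, a sketch of how an acceptance rate might be aggregated from editor telemetry; the event schema here is hypothetical, not Copilot's actual payload format.

```python
from collections import Counter

# Hypothetical telemetry events; the schema is an assumption made for
# illustration, not Copilot's real payload.
events = [
    {"type": "suggestion_shown"}, {"type": "suggestion_accepted"},
    {"type": "suggestion_shown"}, {"type": "suggestion_shown"},
    {"type": "suggestion_accepted"},
]

counts = Counter(e["type"] for e in events)
acceptance_rate = counts["suggestion_accepted"] / counts["suggestion_shown"]
# Measures what was accepted at suggestion time, not whether the code
# survived review or remained in the codebase later.
print(f"Acceptance rate: {acceptance_rate:.0%}")
```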
## Track 2: Fine-Tuned LLMs (eBayCoder)
Recognizing the limitations of generic commercial offerings, eBay pursued fine-tuning an existing open-source LLM on their proprietary codebase. They chose Code Llama 13B as their base model, noting that the architecture allows for easy substitution if needed. The resulting system, called "eBayCoder," was trained on eBay's codebase and associated documentation.
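The article gives no details of the training setup. As one plausible shape for such a pipeline, the sketch below uses parameter-efficient LoRA fine-tuning via Hugging Face transformers and peft; the method, hyperparameters, and the `internal_code.jsonl` corpus path are all assumptions, not eBay's published configuration.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "codellama/CodeLlama-13b-hf"  # the base model the article names
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without one

model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
# LoRA trains small adapter matrices while the 13B base stays frozen; the
# article does not specify eBay's method, this is one common choice.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# "internal_code.jsonl" is a placeholder for a preprocessed corpus of
# internal code and documentation; eBay's actual pipeline is not described.
data = load_dataset("json", data_files="internal_code.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                     max_length=2048),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(output_dir="ebaycoder-adapter",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16))
trainer.train()
```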
The primary use case highlighted for eBayCoder is software upkeep and migration tasks. eBay, like many large technology organizations, maintains foundational libraries and frameworks built on top of open-source software for servers, message queues, batch jobs, iOS, and Android. These systems require periodic upgrades for security patches and developer ergonomics improvements (such as upgrading to newer versions of Spring Boot). The effort required for such migrations varies dramatically depending on the application stack's current version, and eBay reports spending significant engineering resources on these tasks despite having existing migration tools.
The fine-tuned approach also addresses the code duplication problem. With a massive, varied codebase, commercial LLMs typically only have access to immediately relevant context—surrounding files, the current repository, and dependent libraries. They may not be aware of existing internal services or non-dependent libraries maintained by other teams that offer similar functionality to what's being written. This frequently leads to code duplication. A fine-tuned LLM with access to broader context can potentially identify these overlaps and reduce redundancy.
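A hedged sketch of how embedding similarity could surface such overlaps, comparing code being written against functions indexed from other teams' repositories; the model choice, snippets, and threshold are illustrative assumptions rather than anything eBay describes.

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is illustrative; any code-aware embedding model works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

# A snippet a developer is writing vs. functions already living in other
# teams' repositories (hypothetical examples).
new_code = "def retry_with_backoff(fn, attempts=3): ..."
existing = {
    "platform/utils/resilience.py": "def call_with_retries(func, max_tries=3): ...",
    "checkout/helpers.py": "def format_price(cents): ...",
}

query_vec = model.encode(new_code)
for path, snippet in existing.items():
    sim = util.cos_sim(query_vec, model.encode(snippet)).item()
    if sim > 0.7:  # the threshold is a tunable assumption
        print(f"Possible existing implementation: {path} (similarity {sim:.2f})")
```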
From an LLMOps perspective, this track represents a more resource-intensive investment. Fine-tuning requires computational resources, data preprocessing pipelines, and ongoing maintenance as the codebase evolves. The article doesn't detail the infrastructure requirements, training time, or operational costs, which would be valuable for organizations considering similar approaches. The choice of Code Llama 13B suggests a balance between capability and resource requirements—larger models might offer better performance but at significantly higher operational costs.
## Track 3: Internal Knowledge Base (RAG-Based GPT)
The third track addresses a different productivity bottleneck: the time developers spend investigating internal knowledge. eBay describes common questions like "Which API should I call to add an item to the cart?" or "How do I create a pipeline to deploy my application to production?" that can require multiple meetings and significant time to answer due to fragmented internal documentation.
eBay built an internal GPT that ingests data from multiple primary sources including enterprise GitHub Markdowns, Google Docs, Jira, Slack, and Wikis. The analogy they use is instructive: every organization has that long-tenured employee who knows where everything is and whom to ask—this system aims to codify that institutional knowledge.
The technical implementation uses Retrieval Augmented Generation (RAG). The system creates embedding vectors for each piece of content from the various sources, which are then stored in a vector database. This embedding and indexing process is automated and recurring, suggesting they've built pipelines to keep the knowledge base current as documentation changes.
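A minimal sketch of that recurring embed-and-index step, using a generic sentence-embedding model and a flat numpy array in place of a real vector database; the documents, model, and storage are stand-ins for whatever eBay actually runs.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Documents gathered from the sources named above (Markdown, Docs, Jira,
# Slack, Wikis); the loader and model choice are illustrative.
docs = [
    {"id": "wiki/cart-api",
     "text": "To add an item to the cart, call POST /cart/items ..."},
    {"id": "gdoc/deploy-guide",
     "text": "Production deploys go through the pipeline service ..."},
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([d["text"] for d in docs],
                          normalize_embeddings=True)

# In a real system these vectors live in a vector database and this job
# reruns on a schedule as documentation changes; here, a numpy array.
np.save("kb_vectors.npy", embeddings)
```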
When a user enters a query, the system generates an embedding vector for that query and uses similarity mechanisms (such as cosine similarity) to compare against the stored content embeddings. This retrieval step identifies the most relevant content and links. The retrieved context is then passed to their private instances of commercial and open-source LLMs with a carefully constructed prompt that instructs the model to answer truthfully based on the provided context and to explicitly state "I don't know" when the answer isn't contained in the retrieved material.
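The query path might look like the following sketch, which pairs cosine-similarity retrieval with a prompt paraphrasing the "answer truthfully, say I don't know" instruction the article describes; the exact wording and the top-k choice are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
kb_vectors = np.load("kb_vectors.npy")  # built by the indexing job above
kb_texts = [  # same texts as indexed above, kept inline for brevity
    "To add an item to the cart, call POST /cart/items ...",
    "Production deploys go through the pipeline service ...",
]

query = "Which API should I call to add an item to the cart?"
q_vec = model.encode(query, normalize_embeddings=True)

# With normalized vectors, a dot product is cosine similarity.
scores = kb_vectors @ q_vec
top = scores.argsort()[::-1][:2]
context = "\n---\n".join(kb_texts[i] for i in top)

# Prompt wording paraphrases the behavior the article describes.
prompt = (f"Answer truthfully using ONLY the context below. If the answer "
          f"is not in the context, say \"I don't know\".\n\n"
          f"Context:\n{context}\n\nQuestion: {query}")
# `prompt` is then sent to the private LLM instance.
```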
This prompt engineering approach—instructing the model to acknowledge uncertainty—is a common and effective technique for reducing hallucinations in RAG systems. It shifts the model's behavior toward conservative responses when context is insufficient.
eBay reports rising daily usage and positive qualitative feedback from developers regarding efficiency and relevance. They're tracking task completion time and support/meeting requests as metrics, though they note the sample size isn't yet large enough to share definitive quantitative results.
The acknowledged limitation is that, like any automated chat system, the GPT occasionally delivers nonsensical answers. To address this, eBay implements Reinforcement Learning from Human Feedback (RLHF), allowing employees to provide feedback through the user interface that gets incorporated back into the system. This creates a continuous improvement loop, though the operational complexity of maintaining such a feedback pipeline shouldn't be underestimated.
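The article doesn't describe the feedback plumbing. As a sketch of the capture side only, the snippet below logs thumbs-up/down events alongside the query, answer, and retrieved documents so that a downstream job could assemble preference data for tuning or retrieval re-ranking; the schema and file-based storage are assumptions.

```python
import json
import time
from dataclasses import asdict, dataclass

@dataclass
class FeedbackEvent:
    query: str
    answer: str
    retrieved_ids: list[str]
    rating: int          # +1 thumbs up, -1 thumbs down
    timestamp: float

def record_feedback(event: FeedbackEvent,
                    path: str = "feedback.jsonl") -> None:
    """Append one feedback event; a downstream job can turn accumulated
    events into preference pairs for tuning or retrieval re-ranking."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")

record_feedback(FeedbackEvent(
    query="How do I deploy to production?",
    answer="Use the pipeline service ...",
    retrieved_ids=["gdoc/deploy-guide"],
    rating=1,
    timestamp=time.time(),
))
```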
## LLMOps Considerations and Critical Assessment
From an LLMOps perspective, this case study demonstrates several best practices:
**Multi-model strategy**: Rather than betting on a single approach, eBay's three-track strategy provides resilience and flexibility. If commercial offerings change pricing or terms, they have internal alternatives. If fine-tuning proves too resource-intensive for certain use cases, commercial options remain available.
**Controlled experimentation**: The A/B test methodology for evaluating Copilot represents a more rigorous approach than many organizations take when adopting AI tools. The inclusion of a control group with matched characteristics strengthens the validity of their findings.
**Hybrid infrastructure**: The mention of "private instances of commercial and open source LLMs" suggests eBay maintains flexibility in their infrastructure, potentially allowing them to balance cost, performance, and data privacy considerations across different use cases.
**Continuous improvement loops**: The RLHF implementation for the internal GPT demonstrates attention to ongoing system improvement rather than one-time deployment.
However, there are areas where the case study could be more comprehensive:
The article doesn't discuss infrastructure costs, training times, or the team size required to maintain these systems. For organizations considering similar approaches, operational costs are critical decision factors. The fine-tuning track in particular requires significant ongoing investment as codebases evolve.
The metrics provided, while encouraging, come from internal assessments. Independent verification or longer-term studies would strengthen confidence in the claimed productivity gains. The "perceived productivity" increase from surveys is subjective and could be influenced by novelty effects.
The article also doesn't address potential downsides like developer skill atrophy, over-reliance on generated code, or security implications of training models on proprietary code. These are legitimate concerns that mature LLMOps programs should consider.
Overall, eBay's case study provides a useful framework for thinking about generative AI deployment for developer productivity, with the three-track approach offering a pragmatic template that acknowledges the strengths and limitations of different approaches.