eBay implemented a three-track approach to enhance developer productivity using AI: deploying GitHub Copilot enterprise-wide, creating a custom-trained LLM called eBayCoder based on Code Llama, and developing an internal RAG-based knowledge base system. The Copilot implementation showed a 17% decrease in PR creation to merge time and 12% decrease in Lead Time for Change, while maintaining code quality. Their custom LLM helped with codebase-specific tasks and their internal knowledge base system leveraged RAG to make institutional knowledge more accessible.
eBay’s journey into implementing LLMs for developer productivity represents a comprehensive and pragmatic approach to adopting AI technologies in a large-scale enterprise environment. The company explored three distinct but complementary tracks for improving developer productivity through AI, offering valuable insights into the real-world challenges and benefits of deploying LLMs in production.
The case study is particularly noteworthy for its measured approach to evaluation and deployment, using both quantitative and qualitative metrics to assess the impact of these technologies. Instead of relying on a single solution, eBay recognized that different aspects of developer productivity could be better served by different approaches to LLM deployment.
The first track involved the enterprise-wide deployment of GitHub Copilot, preceded by a carefully designed A/B test with 300 developers. The evaluation combined quantitative and qualitative metrics.
The results showed significant improvements: a 17% decrease in PR creation-to-merge time and a 12% decrease in Lead Time for Change, with code quality maintained.
However, eBay was also transparent about the limitations, particularly noting Copilot’s context window constraints when dealing with their massive codebase. This highlights an important consideration for large enterprises implementing similar solutions.
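To make the A/B comparison concrete, the sketch below computes the relative change in a delivery metric between a control group and a Copilot group. The function name `relative_change` and the sample values are hypothetical; the numbers are illustrative placeholders chosen to mirror the reported 17% figure, not eBay's data.

```python
from statistics import mean

def relative_change(control, treatment):
    """Relative change of the treatment mean versus the control mean."""
    control_mean = mean(control)
    return (mean(treatment) - control_mean) / control_mean

# Hypothetical PR creation-to-merge times in hours (NOT eBay data).
control_hours = [40.0, 50.0, 60.0]
treatment_hours = [35.0, 40.0, 49.5]

change = relative_change(control_hours, treatment_hours)
print(f"PR creation-to-merge time changed by {change:.1%}")
```

A negative value means the treatment group merged faster. A real analysis would also test whether the difference is statistically significant for the sample size.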
The second track demonstrates a more specialized approach to handling company-specific code requirements. eBay created eBayCoder by fine-tuning Code Llama 13B on their internal codebase and documentation. This approach addressed several limitations of commercial solutions:
The implementation shows careful consideration of model selection (Code Llama 13B) and training strategy (post-training and fine-tuning on internal data). This represents a significant investment in MLOps infrastructure to support model training and deployment.
The third track focused on creating an intelligent knowledge retrieval system using RAG (Retrieval Augmented Generation). This system demonstrates several sophisticated LLMOps practices:
The system includes important production-ready features:
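As a rough illustration of the retrieve-then-generate pattern behind this third track, the sketch below ranks a few documents against a query and assembles a grounded prompt. Everything here is an assumption for illustration: the bag-of-words `embed` function stands in for a learned embedding model, the documents are invented, and no LLM is called. It does not describe eBay's actual pipeline.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, documents, k=2):
    """Assemble a prompt that asks the model to answer from retrieved context only."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# Hypothetical internal documents.
docs = [
    "To request a staging database, file a ticket with the platform team.",
    "The build pipeline runs unit tests before deployment.",
    "Vacation requests are approved by your manager.",
]
query = "How do I request a staging database?"
top = retrieve(query, docs, k=1)
prompt = build_prompt(query, docs, k=1)
```

The resulting `prompt` would then be sent to whichever LLM the system uses; that step is omitted because the source does not specify it.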
The case study reveals several important MLOps considerations:
eBay implemented comprehensive monitoring and evaluation strategies:
eBay describes itself as being at the beginning of an exponential curve in terms of productivity gains, and it maintains a pragmatic view of the technology while recognizing its transformative potential. The implementation of RLHF and continuous improvement mechanisms suggests a long-term commitment to evolving these systems.
This case study provides valuable insights into how large enterprises can systematically approach LLM deployment, balancing commercial solutions with custom development while maintaining a focus on practical productivity improvements. The multi-track approach demonstrates a sophisticated understanding of how different LLM implementations can complement each other in a production environment.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed wide performance variation across frontier models (from single-digit to roughly 80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models, which hallucinated non-existent insurance products 15-45% of the time.
Notion, a knowledge work platform serving enterprise customers, spent multiple years (2022-2026) iterating through four to five complete rebuilds of their agent infrastructure before shipping Custom Agents to production. The core problem was enabling users to automate complex workflows across their workspaces while maintaining enterprise-grade reliability, security, and cost efficiency. Their solution involved building a sophisticated agent harness with progressive tool disclosure, SQL-like database abstractions, markdown-based interfaces optimized for LLM consumption, and a comprehensive evaluation framework. The result was a production system handling over 100 tools, serving majority-agent traffic for search, and enabling workflows like automated bug triaging, email processing, and meeting notes capture that fundamentally changed how their company and customers operate.
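One way to picture progressive tool disclosure is a registry that shows the model only a cheap catalog of names and summaries up front, and reveals a tool's full schema once the model asks for it. The sketch below is a hypothetical design under that assumption; the class names, tool names, and schemas are invented and are not Notion's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    summary: str
    schema: dict  # full parameter schema, potentially large

@dataclass
class ToolRegistry:
    tools: dict = field(default_factory=dict)
    disclosed: set = field(default_factory=set)

    def register(self, tool):
        self.tools[tool.name] = tool

    def catalog(self):
        """Cheap listing shown up front: names and one-line summaries only."""
        return [f"{t.name}: {t.summary}" for t in self.tools.values()]

    def disclose(self, name):
        """Reveal one tool's full schema once the model requests it."""
        self.disclosed.add(name)
        return self.tools[name].schema

    def active_schemas(self):
        """Schemas currently included in the model's context."""
        return [self.tools[n].schema for n in sorted(self.disclosed)]

# Hypothetical tools for illustration.
registry = ToolRegistry()
registry.register(Tool("query_database", "Run a query over a workspace database",
                       {"name": "query_database", "parameters": {"sql": "string"}}))
registry.register(Tool("send_email", "Send an email on the user's behalf",
                       {"name": "send_email", "parameters": {"to": "string", "body": "string"}}))

catalog = registry.catalog()
before = registry.active_schemas()
schema = registry.disclose("query_database")
after = registry.active_schemas()
```

With over 100 tools, keeping most schemas out of the prompt until they are needed is what keeps the token cost of the tool list manageable.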
OpenDev is an open-source, command-line AI coding agent written in Rust that addresses the fundamental challenges of building production-ready autonomous software engineering systems. The agent tackles three critical problems: managing finite context windows over long sessions, preventing destructive operations while maintaining developer productivity, and extending capabilities without overwhelming token budgets. The solution employs a compound AI system architecture with per-workflow LLM binding, dual-agent separation of planning from execution, adaptive context compaction that progressively reduces older observations, lazy tool discovery via Model Context Protocol (MCP), and a defense-in-depth safety architecture. Results demonstrate approximately 54% reduction in peak context consumption, session lengths extending from 15-20 turns to 30-40 turns without emergency compaction, and a robust framework for terminal-first AI assistance that operates where developers manage source control, execute builds, and deploy environments.
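The idea of compaction that progressively reduces older observations can be sketched as follows. This is a minimal Python illustration, not OpenDev's Rust implementation: the function names, the whitespace-based token estimate, and the truncation policy are all assumptions.

```python
def count_tokens(text):
    """Crude token estimate (whitespace words); a real agent would use the model's tokenizer."""
    return len(text.split())

def compact(observations, budget, keep_recent=2, stub_words=5):
    """Truncate the oldest observations first until the total fits the budget.

    The most recent `keep_recent` observations are never touched, so the
    result may still exceed the budget (best effort).
    """
    compacted = list(observations)
    for i in range(max(0, len(compacted) - keep_recent)):
        if sum(count_tokens(o) for o in compacted) <= budget:
            break
        words = compacted[i].split()
        if len(words) > stub_words:
            compacted[i] = " ".join(words[:stub_words]) + " [truncated]"
    return compacted

# Four hypothetical tool observations of 20 words each (80 words total).
history = [" ".join(f"obs{n}_word{w}" for w in range(20)) for n in range(4)]
result = compact(history, budget=55)
total = sum(count_tokens(o) for o in result)
```

Here the two oldest observations are shortened to stubs while the two most recent ones stay intact, which brings the estimated total under the budget of 55.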