ZenML

Building and Deploying Production LLM Code Review Agents: Architecture and Best Practices

Ellipsis 2024

Ellipsis developed an AI-powered code review system that uses multiple specialized LLM agents to analyze pull requests and provide feedback. The system employs parallel comment generators, sophisticated filtering pipelines, and advanced code search capabilities backed by vector stores. Their approach emphasizes accuracy over latency, uses extensive evaluation frameworks including LLM-as-judge, and implements robust error handling. The system successfully processes GitHub webhooks and provides automated code reviews with high accuracy and low false positive rates.

Industry

Tech

Overview

Ellipsis is an AI software engineering tool that automates code review, bug fixing, and other developer tasks through integration with GitHub. This case study provides a detailed technical deep dive into how their production LLM system is architected, evaluated, and maintained. The company, which is Y Combinator-backed and SOC 2 Type I certified, has developed a sophisticated multi-agent architecture specifically designed to handle the challenges of automated code review at scale.

The case study is particularly valuable because it comes directly from the founder who has been working on coding agents since before ChatGPT, offering insights that span the evolution of LLM-based development tools. While the content is promotional in nature, it provides substantial technical detail that makes the architectural decisions and trade-offs tangible for practitioners.

System Architecture and Production Infrastructure

The production system begins when users install a GitHub App into their repositories. Webhook events from GitHub are routed through Hookdeck for reliability guarantees, then forwarded to a FastAPI web application. Events are immediately placed onto a workflow queue managed by Hatchet, enabling asynchronous processing.
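The ingestion flow described above can be sketched as an ack-fast webhook handler that defers all work to a queue. This is a minimal illustration, not Ellipsis's actual code: the payload shape and `workflow_queue` are assumptions, with an in-process `queue.Queue` standing in for the Hatchet workflow queue.

```python
import queue

# Hypothetical in-process stand-in for the Hatchet workflow queue.
workflow_queue: "queue.Queue[dict]" = queue.Queue()

def handle_github_webhook(event: str, payload: dict) -> dict:
    """Acknowledge the webhook immediately and defer all work to the queue.

    The event/action names mirror GitHub's webhook payloads; the queue
    item shape is illustrative only.
    """
    if event == "pull_request" and payload.get("action") in {"opened", "ready_for_review"}:
        workflow_queue.put({
            "workflow": "review_pr",
            "repo": payload["repository"]["full_name"],
            "pr_number": payload["pull_request"]["number"],
        })
    # Return 2xx right away so the delivery relay (Hookdeck) does not retry.
    return {"status": "queued"}
```

Returning quickly from the handler is what makes the asynchronous-workflow trade-off possible: the expensive review work happens later, off the request path.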

A critical design insight is that asynchronous workflows fundamentally change the optimization priorities: latency becomes less critical than accuracy. This is an important consideration for LLMOps practitioners, as it allows for more comprehensive processing, multiple agent passes, and sophisticated filtering without the real-time constraints that plague many LLM applications.

When a PR is opened or marked ready for review, the workflow clones the repository and runs the review agent. The system also responds to tags like “@ellipsis-dev review this”, allowing users to request specific actions such as making changes, answering questions, or logging issues to GitHub/Linear.

Multi-Agent Architecture for Code Review

The core architectural principle employed is decomposition: rather than building one large agent with a massive prompt, Ellipsis uses dozens of smaller agents that can be independently benchmarked and optimized. This follows the prompt engineering principle that “to increase performance, make the problem easier for the LLM to solve.”

The review system operates in two main phases. First, multiple Comment Generators run in parallel to identify different types of issues. One generator might look for violations of customer-defined rules, another searches for duplicated code, and so forth. This parallel architecture also enables model mixing - the system can leverage both GPT-4o and Claude Sonnet simultaneously, taking advantage of each model’s strengths without having to choose between them. Each generator can attach Evidence (links to code snippets) to their comments, which becomes useful in later filtering stages.
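The fan-out of generators can be sketched with `asyncio.gather`. The generator names, the `Evidence` shape, and the stubbed bodies below are hypothetical; in production each generator would wrap an LLM call (GPT-4o for one, Claude Sonnet for another).

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Evidence:
    path: str
    lines: Tuple[int, int]  # (start, end) of the cited snippet

@dataclass
class ReviewComment:
    generator: str
    body: str
    evidence: List[Evidence] = field(default_factory=list)

async def rule_violation_generator(diff: str) -> List[ReviewComment]:
    # Stand-in for an LLM-backed check of customer-defined rules.
    await asyncio.sleep(0)
    return [ReviewComment("rules", "Avoid bare except clauses.",
                          [Evidence("app.py", (10, 12))])]

async def duplication_generator(diff: str) -> List[ReviewComment]:
    # Stand-in for a generator (possibly a different model) hunting duplicated code.
    await asyncio.sleep(0)
    return []

async def generate_comments(diff: str) -> List[ReviewComment]:
    """Fan out all generators concurrently and flatten their results."""
    results = await asyncio.gather(
        rule_violation_generator(diff),
        duplication_generator(diff),
    )
    return [c for batch in results for c in batch]
```

Because each generator is independent, one can be benchmarked, re-prompted, or moved to a different model without touching the others.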

Second, a multistage Filtering Pipeline processes the raw comments to significantly reduce false positive rates, identified as developers’ most common complaint about AI code review tools. The pipeline applies several types of filters in sequence.

Notably, the system includes filtered comments and reasoning in its final output, providing transparency about what Ellipsis found suspicious and why certain comments weren’t posted. This is an excellent practice for production AI systems where explainability matters.
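A filtering pipeline that retains the drop reason for transparency can be sketched as follows. The two filters and their thresholds are hypothetical examples, not Ellipsis's actual filter set.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Comment:
    body: str
    confidence: float
    has_evidence: bool

# Each filter returns None to keep the comment, or a reason string to drop it.
Filter = Callable[[Comment], Optional[str]]

def confidence_filter(c: Comment) -> Optional[str]:
    return None if c.confidence >= 0.7 else "confidence below threshold"

def evidence_filter(c: Comment) -> Optional[str]:
    return None if c.has_evidence else "no supporting code evidence"

def first_reason(c: Comment, filters: List[Filter]) -> Optional[str]:
    for f in filters:
        reason = f(c)
        if reason is not None:
            return reason
    return None

def run_pipeline(comments: List[Comment], filters: List[Filter]):
    """Return (posted, filtered); filtered keeps the drop reason so it can
    be surfaced transparently in the final review output."""
    posted, filtered = [], []
    for c in comments:
        reason = first_reason(c, filters)
        if reason is None:
            posted.append(c)
        else:
            filtered.append((c, reason))
    return posted, filtered
```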

Feedback Integration and Continuous Improvement

The system incorporates user feedback directly into the filtering process rather than relying on per-customer fine-tuned models. Thumbs up/down reactions are incorporated via embedding search over similar past comments, allowing the system to learn from historical reactions. Users can also respond to comments with explanations, making it easier for the LLM to understand why a comment was inappropriate.

This approach has several advantages over fine-tuning: more consistent behavior, near-immediate reflection of feedback in agent behavior (no retraining delay), and easier maintenance. This is a pragmatic production decision that trades theoretical optimality for operational simplicity.
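The embedding-search feedback loop might look like the sketch below, assuming past comments are stored with their reactions. A toy bag-of-words "embedding" keeps the example runnable offline; production would use a real embedding model and vector store.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Historical comments with thumbs reactions: +1 for up, -1 for down.
history = [
    ("consider renaming this variable for clarity", -1),
    ("this loop re-reads the file on every iteration", +1),
]

def feedback_score(candidate: str, k: int = 2) -> float:
    """Average reaction of the k most similar past comments; a negative
    score suggests similar comments were disliked and should be filtered."""
    emb = embed(candidate)
    ranked = sorted(history, key=lambda h: cosine(emb, embed(h[0])), reverse=True)
    top = ranked[:k]
    return sum(r for _, r in top) / len(top)
```

Because the lookup runs at inference time, a fresh thumbs-down affects the very next review with no retraining delay, which is exactly the operational advantage the case study claims over fine-tuning.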

Code Search Subagent and RAG Implementation

A modular Code Search subagent underpins both comment generation and filtering, demonstrating good software engineering principles applied to LLM systems. This same subagent is reused across code review, code generation, and codebase chat features, allowing independent benchmarking and improvement.

The Code Search agent implements a multi-step RAG+ approach that combines two complementary code indexing methods: keyword search and vector search.

The team notes that GraphRAG is on their horizon but not yet implemented.

A significant insight relates to limiting context due to LLM performance degradation at large context sizes. Traditional RAG approaches use rerankers and cosine similarity thresholds to limit results, but for code search, relative ranking is less important than whether retrieved code is actually useful. To address this, they run an LLM-based binary classifier after vector search, using additional context from the agent trajectory to select truly relevant pieces. This keeps the top-level search agent’s context uncluttered across multiple passes.
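The post-retrieval binary classifier can be sketched as a yes/no relevance gate over the vector-search candidates. The keyword heuristic below is an offline stand-in for the LLM call; the `Chunk` shape and function names are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    path: str
    text: str
    score: float  # cosine similarity reported by the vector store

def is_relevant(chunk: Chunk, query: str, trajectory: List[str]) -> bool:
    """Stand-in for an LLM yes/no call that also sees the agent trajectory.
    Here: a keyword heuristic so the sketch stays runnable offline."""
    needles = query.lower().split() + [t.lower() for t in trajectory]
    return any(n in chunk.text.lower() for n in needles)

def select_chunks(candidates: List[Chunk], query: str,
                  trajectory: List[str], limit: int = 5) -> List[Chunk]:
    # Keep only chunks the classifier deems useful, regardless of how
    # they ranked by raw cosine similarity.
    kept = [c for c in candidates if is_relevant(c, query, trajectory)]
    return kept[:limit]
```

The point of the design is that a chunk with high cosine similarity but no actual utility gets rejected outright, rather than merely ranked lower, which keeps the top-level agent's context lean.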

Efficient Vector Database Management

Despite latency not being the primary concern, blocking PR review on slow repository indexing was unacceptable. Ellipsis uses Turbopuffer as their vector store and implements an efficient incremental indexing approach. The key insight is avoiding re-embedding the entire repository on every commit.

Their process for updating on new commits involves chunking the repo and using SHA hashes as chunk IDs, adding obfuscated metadata for later retrieval (no customer code stored), fetching existing IDs from the vector DB namespace, comparing new versus existing IDs, deleting obsolete IDs, and only embedding/upserting chunks with new IDs. In practice, since most commits affect only a small percentage of chunks, synchronization takes only a couple of seconds.
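The diff-based synchronization steps above can be sketched directly, using a content hash as the chunk ID so unchanged chunks need no work. Function names are illustrative; embedding and upserting are left as comments since they depend on the vector store client.

```python
import hashlib
from typing import List, Set, Tuple

def chunk_id(chunk_text: str) -> str:
    # A SHA of the chunk contents doubles as a stable vector-store ID:
    # identical code always maps to the same ID across commits.
    return hashlib.sha256(chunk_text.encode()).hexdigest()

def plan_sync(new_chunks: List[str], existing_ids: Set[str]) -> Tuple[List[str], Set[str]]:
    """Diff the repo's current chunk IDs against what the vector store
    already holds; only genuinely new chunks need embedding."""
    new_ids = {chunk_id(c): c for c in new_chunks}
    to_delete = existing_ids - new_ids.keys()          # obsolete chunks
    to_embed = [c for cid, c in new_ids.items()
                if cid not in existing_ids]            # new chunks only
    # Caller would then: delete to_delete, embed + upsert to_embed.
    return to_embed, to_delete
```

Since a typical commit touches only a small fraction of chunks, `to_embed` is usually tiny, which is why synchronization finishes in a couple of seconds.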

Language Server Integration

To provide IDE-like capabilities (go-to-definition, find-all-references) to agents, Ellipsis sidecars an Lsproxy container from Agentic Labs using Modal. This provides a high-level API for various language servers while maintaining separation of concerns.

The implementation accounts for LLM limitations: since LLMs are notoriously bad at correctly identifying column numbers and often off-by-one on line numbers, tools allow the agent to specify symbol names and then fuzzy match to the closest symbol of that name. This is a practical workaround for a well-known LLM weakness.
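The symbol-name fallback can be sketched with stdlib fuzzy matching. The function name and cutoff are assumptions; the idea is simply that the agent supplies a symbol name, and the tool absorbs small spelling slips instead of demanding exact line/column coordinates.

```python
import difflib
from typing import List, Optional

def resolve_symbol(requested: str, symbols_in_file: List[str]) -> Optional[str]:
    """Let the agent name a symbol instead of a line/column; fuzzy-match
    to the closest defined symbol to absorb small LLM spelling errors."""
    matches = difflib.get_close_matches(requested, symbols_in_file, n=1, cutoff=0.6)
    return matches[0] if matches else None
```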

Developer Workflow and Evaluation Systems

The team’s approach to development has evolved significantly. Their earlier philosophy emphasized extensive CI tests, request caching (custom-built with DynamoDB for speed, cost, and determinism), keeping prompts in code rather than third-party services, rolling custom code instead of using frameworks like LangChain, and heavy reliance on snapshot testing with manual human reviews.

The major evolution has been significant investment in automated evals, reducing reliance on snapshot testing. Their developer workflow for adding new agents follows an iterative pattern: writing initial prompts and tools, sanity checking examples, finding correctness measures, building mini-benchmarks with around 30 examples (often using LLM augmentation for “fuzz testing”), measuring accuracy, diagnosing failure classes with LLM auditors, updating agents through prompt tweaks and few-shot examples, generating more data, and repeating until plateau. They occasionally sample public production data to identify new edge cases.

Notably, they have not found fine-tuning very applicable due to slower iteration speed and lack of fine-tuning availability on the latest models.

LLM-as-Judge Implementation

For evaluation, deterministic assessment is preferred where possible (e.g., comparing file edits to expected outputs with fuzzy whitespace handling). For more subjective domains, LLM-as-judge compares results to ground truth, removing the need for manual human review. The case study includes an example prompt for judging Code Search results, demonstrating Chain of Thought reasoning and structured output with Pydantic models.

Even GPT-4o is reported to reliably handle simpler judgment tasks like codebase chat, though more complex evaluations like entire PR generation remain challenging.

Agent Trajectory Auditing

An Auditor agent runs on failed test cases to diagnose where in the trajectory things went wrong. This receives the agent trajectory and correct answer, identifying whether failures stem from poor vector search, hallucination, laziness, or other issues. The case study notes that o1 models excel at this kind of complicated analysis task.

Production Reliability Patterns

The case study emphasizes that LLMs’ probabilistic nature requires building in graceful error handling from the beginning, regardless of how well tasks perform in testing. Error handling operates at multiple levels of the system.

Managing Long Context

Despite longer context lengths, LLMs still suffer performance degradation as context increases. The general rule of thumb reported is that hallucinations become noticeably more frequent when more than half the context is filled, though with high variance.

Mitigation strategies include hard and soft token cutoffs, truncating early messages, priority-based message inclusion, self-summarization, and tool-specific summaries. For example, large shell command outputs are summarized using GPT-4o with predictive outputs.
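The priority-based inclusion strategy can be sketched as a token-budget pass that always keeps priority messages (e.g. the system prompt) and then admits history newest-first. The message shape and the whitespace token counter are assumptions.

```python
from typing import Callable, Dict, List

def fit_context(messages: List[Dict], budget: int,
                count: Callable[[Dict], int] = lambda m: len(m["text"].split())) -> List[Dict]:
    """Keep all priority messages, then admit the newest non-priority
    messages that still fit the budget, truncating the oldest history first."""
    used = sum(count(m) for m in messages if m.get("priority"))
    admitted = set()
    for i in range(len(messages) - 1, -1, -1):  # walk newest-first
        m = messages[i]
        if m.get("priority"):
            admitted.add(i)
        elif used + count(m) <= budget:
            admitted.add(i)
            used += count(m)
    return [messages[i] for i in sorted(admitted)]  # restore original order
```

A soft cutoff would trigger self-summarization before this hard truncation; both aim at the same rule of thumb, keeping the context well under half full.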

Model Selection and Performance Optimization

The team primarily uses Claude 3.5 Sonnet (specifically claude-3-5-sonnet-20241022, which they consider distinct enough to informally call “Sonnet-3.6”) and GPT-4o, finding Sonnet slightly better at most tasks. An interesting observation is that Claude and GPT models, which used to require very different prompting approaches, now seem to be converging - prompts tuned for 4o can often be swapped to Sonnet without changes while still seeing performance benefits.

For o1, while qualitatively better results are seen on harder problems, the model must be prompted very differently - it makes different types of mistakes and is not a drop-in replacement.

Two “big hammers” for performance plateaus are described: splitting complicated agents into simpler subagents (enabling independent benchmarking) and running agents in parallel with a judge selecting the best candidate (with variations from temperature, prompts, few-shot examples, or models). The trade-off is increased system complexity and maintenance burden.
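The second "big hammer" reduces to a best-of-n pattern. The sketch below assumes a `candidates_fn` that varies each attempt (by temperature, prompt, or model in production) and a `judge` that scores candidates; both are placeholders here.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def best_of_n(task: str,
              candidates_fn: Callable[[str, int], T],
              judge: Callable[[T], float],
              n: int = 3) -> T:
    """Run n varied agent attempts and let a judge pick the winner.
    The index i is where production would vary temperature, prompt,
    few-shot examples, or the underlying model."""
    candidates = [candidates_fn(task, i) for i in range(n)]
    return max(candidates, key=judge)
```

The maintenance cost the case study warns about is visible even here: every variant is one more prompt to benchmark and keep in sync.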

Critical Assessment

While this case study provides excellent technical depth, it’s important to note some limitations. No quantitative results are provided - claims about “significantly reduced false positive rates” and accuracy improvements are not backed by specific numbers. The case study is promotional in nature, originating from a company blog post, so some claims should be taken with appropriate skepticism. Additionally, customer-specific results or testimonials are not included, making it difficult to assess real-world performance across different codebases and use cases.

Nevertheless, the architectural patterns, evaluation frameworks, and production reliability strategies described represent solid engineering practices that are broadly applicable to LLMOps implementations beyond code review.
