
Building a Production-Ready Multi-Agent Coding Assistant

Replit 2024

Replit developed a coding agent system that helps users create software applications without writing code. The system uses a multi-agent architecture with specialized agents (manager, editor, verifier) and focuses on user engagement rather than full autonomy. The agent achieved hundreds of thousands of production runs and maintains around 90% success rate in tool invocations, using techniques like code-based tool calls, memory management, and state replay for debugging.


Overview

Replit, the cloud-based development platform, launched their “Replit Agent” in September 2024 after approximately 15 months of development. The agent is designed to lower the barrier to software creation by allowing users to build web applications from natural language prompts without writing code. This case study, derived from a talk by Michele Catasta (who leads Replit’s AI team), provides extensive insights into the LLMOps practices and architectural decisions behind building and operating a complex agentic system in production at scale.

The core philosophy behind Replit Agent differentiates it from other coding agents in the market. Rather than aiming for full autonomy or acting as a “replacement for a junior software engineer,” Replit positioned their agent as a product that keeps the user at the center. The agent shows users what it’s doing, asks questions, and requests feedback. This design choice directly addresses one of the fundamental challenges in agentic systems: the compounding error problem where agents make mistakes over long trajectories and eventually crash or fail to accomplish goals.

Multi-Agent Architecture

Replit Agent employs a multi-agent architecture, which the team arrived at iteratively rather than adopting from the outset. They began with the simplest possible architecture—a basic ReAct (Reasoning and Acting) loop—and scaled up complexity only when they encountered limitations, specifically when too many tools led to too many errors.

The multi-agent system includes several specialized sub-agents:

- A manager agent that orchestrates the overall workflow
- An editor agent that makes the actual code changes
- A verifier agent that checks the work and, when necessary, asks the user for input

The key insight behind this architecture is scope isolation. Each sub-agent has the minimum necessary tools and instructions visible to it. The rationale is straightforward: the more you expose to a sub-agent, the more opportunities it has to make incorrect choices. This principle of minimal scope per agent appears to be a critical success factor for maintaining reliability.
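In practice, scope isolation can be as simple as filtering a global tool registry per sub-agent. The tool names and scopes below are illustrative assumptions, not Replit's actual registry:

```python
# Sketch of per-sub-agent tool scoping (illustrative names, not Replit's API).
ALL_TOOLS = {
    "edit_file": "impl", "run_tests": "impl", "install_pkg": "impl",
    "ask_human": "impl", "deploy": "impl",
}

SUBAGENT_SCOPES = {
    "manager": {"ask_human"},                  # orchestrates, talks to the user
    "editor": {"edit_file", "install_pkg"},    # touches code, nothing else
    "verifier": {"run_tests", "ask_human"},    # checks work, may escalate
}

def tools_for(agent_name: str) -> dict:
    """Return only the tools this sub-agent is allowed to see."""
    allowed = SUBAGENT_SCOPES[agent_name]
    return {name: impl for name, impl in ALL_TOOLS.items() if name in allowed}
```

The point of the filter is that a tool a sub-agent cannot see is a tool it cannot misuse.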

Tool Calling Innovation: Code-Based DSL

One of the most technically interesting aspects of Replit Agent is their approach to tool calling. They started with standard function calling APIs from providers like OpenAI and Anthropic, but found these “fairly limited” despite being “elegant and easy to use.”

The team encountered numerous challenges with function calling reliability, experimenting with tricks like reordering arguments, renaming parameters to guide LLM reasoning, and other prompt engineering techniques. Eventually, they took a “complete detour” and decided to invoke tools through code generation instead.

Their approach works as follows: they provide all necessary information to the model in context, allow it to reason through a chain of thought, and then have the model generate a Python DSL (a stripped-down version of Python) that represents the tool invocation. This generated code is parsed on their backend, and if it’s compliant with their schema, they know it’s a valid tool invocation. If not, they retry.

This technique achieved approximately 90% success rate for valid tool calls, which is particularly impressive given that some of Replit’s tools are complex with many arguments and settings. The team credits LLMs’ strong code generation capabilities as the reason this approach works well—essentially leveraging the models’ strengths rather than fighting against function calling limitations.
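A minimal sketch of the parse-and-validate step, using Python's standard `ast` module; the tool schemas and the restriction to literal keyword arguments are assumptions for illustration, not Replit's actual DSL:

```python
import ast

# Hypothetical validator for model-generated tool calls expressed as a
# restricted Python DSL. Schemas and rules here are illustrative.
TOOL_SCHEMAS = {
    "write_file": {"path", "content"},
    "run_shell": {"cmd"},
}

def parse_tool_call(code: str):
    """Parse one DSL statement like run_shell(cmd="pytest -q").

    Returns (tool_name, kwargs) if it matches a known schema, else None.
    """
    try:
        tree = ast.parse(code.strip())
    except SyntaxError:
        return None
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.Expr):
        return None
    call = tree.body[0].value
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return None
    name = call.func.id
    if name not in TOOL_SCHEMAS or call.args:
        return None  # unknown tool, or positional args not allowed
    kwargs = {}
    for kw in call.keywords:
        if not isinstance(kw.value, ast.Constant):
            return None  # only literal arguments in this sketch
        kwargs[kw.arg] = kw.value.value
    if set(kwargs) != TOOL_SCHEMAS[name]:
        return None  # missing or extra arguments
    return name, kwargs
```

When `parse_tool_call` returns `None`, the caller would re-prompt the model with the failure reason—the retry path described above.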

Model Selection and Trade-offs

Replit primarily relies on Claude 3.5 Sonnet for their main agent operations, describing it as “a step function improvement compared to other models” for code generation and editing tasks. They experimented with fine-tuning frontier models early on but found that the out-of-the-box Sonnet performance was sufficient for their needs.

For auxiliary tasks, they use a “long tail” of calls to other models. GPT-4o mini is used for compression and other “watchdog” tasks. They also fine-tune embedding models for file system retrieval.

When asked about prioritizing the trade-off triangle of accuracy, cost, and latency, the answer was unequivocal: accuracy first, cost second, latency last. The rationale is that low accuracy is what frustrates users most, and there’s an assumption that frontier models will continue to become cheaper, making cost optimization less critical in the long run.

Error Recovery and Debugging Strategies

The agent employs multiple strategies for recovering from errors:

Retry with feedback: When a tool call fails, the agent retries with the error message passed back. The team notes that “even from that basic self-debugging, it already understands how to make progress.”

Reflection: Every five steps (configurable), the agent stops and reasons about whether it has actually made progress. This serves as a recovery strategy where the agent can roll back to a previous state and restart, adding “enough randomness to the process” to explore different solution paths.

Human-in-the-loop: As a last resort, the verifier agent can invoke a tool to ask the user for input, particularly when dealing with errors that require human judgment or testing.

The “ask human” tool is not just a simple fallback—it’s a deliberate design choice. The team found that by building something more engaging and asking users for feedback, they address the compounding error problem more effectively than trying to achieve full autonomy.
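Taken together, the three strategies can be sketched in one loop. Every interface here (the `agent` object, its actions, the progress check) is a hypothetical stand-in, not Replit's implementation:

```python
# Sketch of the recovery strategies described above: retry with feedback,
# periodic reflection with rollback, and human-in-the-loop as a last resort.

def run_with_recovery(agent, task, max_steps=50, reflect_every=5, max_retries=3):
    history = []
    checkpoint = agent.snapshot()               # state to roll back to
    for step in range(1, max_steps + 1):
        action = agent.next_action(task, history)
        for _ in range(max_retries):
            try:
                result = action.execute()
                break
            except Exception as err:
                # Retry with feedback: feed the error message back to the model.
                action = agent.fix_action(action, error=str(err))
        else:
            # Exhausted retries: last resort, ask the human for input.
            result = agent.ask_human(f"Stuck on step {step}: {action}")
        history.append((action, result))
        if step % reflect_every == 0:           # reflection every N steps
            if agent.made_progress(history):
                checkpoint = agent.snapshot()   # lock in progress
            else:
                agent.restore(checkpoint)       # roll back and re-explore
                history = []
    return history
```

The rollback branch is where the “enough randomness” comes in: restarting from a checkpoint with fresh sampling lets the agent explore a different solution path.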

Memory Management and Long Trajectories

Trajectories in Replit Agent frequently extend to tens of steps, with some users spending hours on single projects. This creates significant challenges for memory management.

The team employs several techniques to keep context manageable, including compressing older parts of the trajectory with smaller models.

Even with these optimizations, the compound probability problem remains: even with a high probability of success on individual steps, reliability drops quickly after around 50 steps. Their main insight is to “do as much progress as possible in the first few steps”—essentially front-loading the most important work to avoid the pitfalls of very long trajectories.
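The compounding math is easy to verify directly—per-step reliability multiplies across a trajectory:

```python
# Why long trajectories are fragile: per-step success compounds multiplicatively.
def trajectory_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps

# Even a very reliable step rate erodes quickly over 50 steps:
#   trajectory_success(0.99, 50) ≈ 0.605  — barely 60% of runs finish clean
#   trajectory_success(0.95, 50) ≈ 0.077
```

This is the quantitative case for front-loading progress: the fewer steps a goal needs, the less the error compounding bites.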

Observability with LangSmith

The team adopted LangSmith early in development for observability, emphasizing that observability is essential for any serious agent work. As Catasta noted: “Don’t start to do any serious agent work without any level of observability… if you don’t have a way to observe how your system behaves, you’re basically just building a prototype.”

A particularly innovative technique they use is trajectory replay. At every step of the agent, they store the complete state in LangSmith traces. When users report bugs, rather than manually reading through entire traces, they restore the stored state from the relevant step and replay the agent’s trajectory to reproduce the issue.

This fully automated replay system significantly accelerates debugging for complex agentic systems where traditional debugging approaches are “excruciating.”
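A minimal sketch of the snapshot-and-replay idea—the in-memory `TraceStore` below is a stand-in for their LangSmith traces, and the interfaces are assumptions for illustration:

```python
import copy

# Hypothetical state-snapshot trajectory replay. Replit stores per-step state
# in LangSmith traces; this in-memory store is an illustrative stand-in.
class TraceStore:
    def __init__(self):
        self._steps = []

    def record(self, step: int, state: dict):
        # Deep-copy so later mutation of live state can't corrupt the trace.
        self._steps.append({"step": step, "state": copy.deepcopy(state)})

    def state_at(self, step: int) -> dict:
        return copy.deepcopy(self._steps[step]["state"])

def replay_from(store: TraceStore, step: int, agent_fn):
    """Restore the stored state at `step` and re-run the agent from there,
    reproducing a reported bug without re-reading the whole trace."""
    state = store.state_at(step)
    return agent_fn(state)
```

The deep copies are the key detail: a snapshot is only useful for replay if the live agent cannot mutate it after the fact.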

Evaluation Challenges

The team acknowledges that evaluation remains a significant challenge. They experimented with SWE-bench for testing the core ReAct loop but found it insufficient for end-to-end testing of the complete Replit Agent experience.

The team candidly admits to launching “based on vibes”—using the product extensively internally and making clear to users that it was an “early access” product. They express hope that academia or industry will develop better public benchmarks for coding agents.

Their verifier sub-agent is positioned as a potential solution, with plans to make it more powerful and capable of mimicking user actions on the artifacts the agent creates, essentially building internal evaluation capabilities.

Production Scale and User Behavior

Since launching in September 2024, Replit Agent has processed “hundreds of thousands of runs in production.” The team was surprised by several user behavior patterns—for example, some users spending hours iterating on a single project.

The focus on zero-to-one software creation—helping people with ideas quickly see prototypes—has proven effective. The team deliberately scoped down for launch, noting that working on pre-existing projects is “not very good” yet despite having the underlying technology. This narrow initial scope allowed them to iterate based on real user feedback rather than launching something generic that would receive more criticism.

Key Lessons and Recommendations

The talk concluded with several insights for others building similar systems:

- Start with the simplest possible architecture (a basic ReAct loop) and add complexity only when you hit its limits
- Give each sub-agent the minimum scope of tools and instructions it needs
- Invest in observability from day one; without it you are only building a prototype
- Prioritize accuracy over cost, and cost over latency
- Keep the user in the loop rather than chasing full autonomy

The team noted that they’re looking forward to future model improvements that might enable techniques like self-consistency (generating multiple LLM calls to check agreement) or Monte Carlo Tree Search-style approaches with parallel agent sessions—techniques that are currently cost-prohibitive but may become viable as model costs decrease.
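Self-consistency itself is simple to sketch—sample several independent answers and keep the majority. The `sample_answer` callable below is a hypothetical stand-in for a temperature-sampled LLM call:

```python
from collections import Counter

# Sketch of self-consistency voting across independent LLM samples.
def self_consistent_answer(sample_answer, question, n=5):
    """Draw n independent samples and return the most common answer
    together with its agreement ratio."""
    votes = Counter(sample_answer(question) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n
```

The cost objection is visible in the signature: every call multiplies inference spend by `n`, which is why the team views this as viable only as model costs fall.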
