ZenML

Autonomous Coding Agent Evolution: From Short-Burst to Extended Runtime Operations

Replit 2025
View original source

Replit evolved their AI coding agent from V1 (running autonomously for only a couple of minutes) to V2 (running for 10-15 minutes of productive work) through significant rearchitecting and leveraging new frontier models. The company focuses on enabling non-technical users to build complete applications without writing code, emphasizing performance and cost optimization over latency while maintaining comprehensive observability through tools like Langsmith to manage the complexity of production AI agents at scale.

Industry

Tech

Technologies

Summary

Replit is a cloud-based development platform that has evolved from a collaborative online code editor to a full-fledged AI-powered application development platform. The company serves over 30 million developers and has recently launched V2 of their Replit Agent, which allows users—including those without coding experience—to build complete applications through natural language prompts. This case study, drawn from a fireside chat with Replit’s president, Michael (referred to as Mika in the transcript), provides valuable insights into the LLMOps practices behind building and operating autonomous coding agents at scale.

The core product evolution represents a significant shift in how software can be created. Rather than requiring users to write code manually, Replit Agent interprets user intent, generates code, tests it, and iterates until the application meets requirements. This represents one of the most ambitious deployments of agentic AI in production, with the company now generating approximately 1 million applications per month through their free tier alone.

Agent Architecture and Model Selection

Replit takes an opinionated approach to model selection, choosing not to expose model choices to end users. This decision stems from practical LLMOps considerations: maintaining multiple prompt sets optimized for different models would be prohibitively complex. As the speaker noted, going from “n=1 to n=3 prompt sets” would require substantial additional work.

The agent relies on a multi-model architecture. Claude 3.7 (specifically Sonnet) serves as the foundation model, described as providing the “IQ” of the agent and enabling new levels of autonomy for coding tasks. However, the system employs multiple models for different functions within a single agent run. Smaller and faster models (referred to as “flash models”) handle accessory functions where latency-performance trade-offs favor speed over capability.

The team has tested open-source models like DeepSeek but currently focuses primarily on frontier models from major labs. The reasoning is pragmatic: the pace of improvement from frontier labs (now every couple of months versus the previous six to nine months) makes it more efficient to leverage their advances rather than investing heavily in fine-tuning open-source alternatives. The speaker indicated openness to exploring fine-tuning when and if the pace of frontier model improvements slows down.

Autonomy as the Core Metric

The most significant advancement from V1 to V2 is autonomy. V1 agents could work autonomously for only a few minutes at most, while V2 regularly operates for 10-15 minutes of continuous, productive work. This isn’t just spinning cycles—it represents sustained useful progress toward user goals.

This increase in autonomy creates interesting tensions in the product experience. There’s a constant balance between keeping humans in the loop (to catch and correct sideways behavior) and allowing the agent to work uninterrupted (which users prefer when things are going well). The speaker noted that user expectations continuously rise: “The bar keeps raising basically on a monthly basis. The more we can get done, it maybe takes a week for the user to get used to that and then they just want more.”

The vision for V3 includes pushing autonomy from 10-15 minutes to potentially one hour through several mechanisms: incorporating computer use for automated testing, implementing more sophisticated software testing in the loop, and exploring test-time compute strategies including sampling and parallelism.

Evaluation and Testing Strategy

The speaker emphasized that investing early in evaluations is “extremely important” and identified it as one of two fundamental pillars for agent development. The reasoning is clear: as agents become more advanced, it becomes increasingly difficult to determine whether changes introduce regressions or represent genuine progress.

Building confidence to ship V2 involved extensive internal testing. The team had prototypes running since early 2024, well before the V1 launch. The progression followed a careful pattern: internal testing, launching to alpha users as an opt-in experience, gathering feedback, and then rushing to GA based on “exceedingly positive feedback.”

One interesting internal metric tracks how often users return to the code editor (which the product increasingly hides). This has improved from one in four users manually editing code to approximately one in ten, with a stated goal of approaching zero. This metric serves as a proxy for agent effectiveness—if users don’t need to touch the code, the agent is doing its job.

Observability Practices

Observability represents the second fundamental pillar in Replit’s agent development approach. The speaker was candid that this is a challenging and evolving area: “We are all learning as a field how to do observability on agents. It’s a complete different animal compared to how we built distributed systems in the last decades.”

Replit uses LangSmith extensively alongside other observability tools. The speaker described a two-tier observability need. First, there’s the traditional “Datadog-style” observability with aggregates and dashboards that can trigger alerts (e.g., “you’re failing to use this tool 50% of the time”). Second, and perhaps more critically for agents, there’s deep trace inspection capability.

The current state of agent debugging was compared to the “assembly era”—when something goes wrong, engineers often must read the entire input prompt (potentially 100,000 tokens) and generated output to understand why certain choices were made. Aggregates alone are insufficient; something like a step debugger is needed, but instead of showing memory state, it must expose the full context that led to each decision.

The speaker’s strong recommendation for anyone building agents: “Invest in observability from day one. Otherwise, you’re going to be lost immediately and you’re probably going to give up because you’re going to think it’s impossible to pull this off.”

Replit’s ownership of the entire stack—from container orchestration (execution plane) to agent orchestration (control plane, described as “LangGraph-style orchestration”) to the product surface—means debugging can require reading traces all the way down. Problems can occur anywhere, including in the binaries of tools the agent invokes.

Cost, Latency, and Performance Trade-offs

The speaker articulated a clear prioritization: performance and cost are roughly equivalent in importance, with latency as a distant third. This prioritization intensified from V1 to V2, with V2 accepting “almost one order of magnitude in terms of additional latency.”

The reasoning connects to their target user base (non-technical people) and the value proposition: it’s far more important for the agent to complete what users want than to do so quickly. Initial user reactions to increased latency were “fairly non-controversial”—after about a week of adjustment, users appreciated the improved outcomes.

This trade-off acceptance represents a significant bet. The speaker admitted to still having fear about costs: “I still have that fear. So, it doesn’t change much, trust me.” Running agents for 10-15 minutes (potentially moving toward one hour) consumes substantial compute resources.

Human-in-the-Loop Dynamics

The approach to human involvement has evolved significantly. Replit provides multiple communication channels: the agent is “fairly verbose” and users can expand every action to see tool outputs; a mobile app sends notifications when the agent wants feedback; and users can interrupt or redirect the agent via chat at any time.

The product accommodates different user profiles—some trust the agent fully and delegate extensively, while others prefer hands-on involvement. The vision for future versions involves reducing notifications further by incorporating computer use for testing. Currently, a bottleneck is relying solely on humans for testing, but many testing tasks are “fairly trivial” (data input, clicking around simple interfaces) and should be automatable.

User Patterns and Use Cases

Users span a wide range, from those testing agent capabilities (experiencing the “high” of seeing code run without writing it) to those building substantial business tools. The key differentiator is that Replit Agent supports “very long trajectories”—the speaker mentioned users spending hundreds of hours on single projects without writing any code.

Common use cases include internal tools for companies (the speaker referenced the concept of “unbundling SaaS”—rebuilding expensive features internally rather than paying for overpriced software) and personalized applications for professionals or hobbyists.

For collaborative scenarios, particularly in business and enterprise settings, multiple people can interact with the agent on shared projects. However, there’s currently a “giant lock” on parallel agent execution—not due to infrastructure limitations (Replit already runs at scale) but because merging patches from multiple concurrent agent runs involves solving merge conflicts, which remains difficult even for frontier AI models.

Planning and Specification Challenges

One ongoing challenge is reconciling user behavior (trained by products like ChatGPT to write brief prompts) with the needs of a coding agent (which ideally receives detailed PRD-like specifications). Users often write two-line prompts or generate lengthy PRDs through Claude and expect Replit to follow them exactly.

The team is actively redesigning the planning experience to accommodate both chatbot-style interactions (one task at a time, submitted sequentially) and more comprehensive specifications. They’ve even created a course with Andrew Ng to help users understand effective prompting patterns.

Future Roadmap

The speaker outlined several V3 priorities: incorporating computer use for automated testing, integrating software testing more deeply into the loop (leveraging code’s inherent observability), and exploring test-time compute strategies. The latter includes sampling multiple solutions in parallel and ranking them—initially targeting higher-spending users but potentially improving outcomes across the board. These advances represent the continuing evolution of LLMOps practices as the field matures and agents become increasingly capable.

More Like This

Context Engineering and Agent Development at Scale: Building Open Deep Research

LangChain 2025

Lance Martin from LangChain discusses the emerging discipline of "context engineering" through his experience building Open Deep Research, a deep research agent that evolved over a year to become the best-performing open-source solution on Deep Research Bench. The conversation explores how managing context in production agent systems—particularly across dozens to hundreds of tool calls—presents challenges distinct from simple prompt engineering, requiring techniques like context offloading, summarization, pruning, and multi-agent isolation. Martin's iterative development journey illustrates the "bitter lesson" for AI engineering: structured workflows that work well with current models can become bottlenecks as models improve, requiring engineers to continuously remove structure and embrace more general approaches to capture exponential model improvements.

code_generation summarization chatbot +39

Context Engineering for Production AI Agents at Scale

Manus 2025

Manus, a general AI agent platform, addresses the challenge of context explosion in long-running autonomous agents that can accumulate hundreds of tool calls during typical tasks. The company developed a comprehensive context engineering framework encompassing five key dimensions: context offloading (to file systems and sandbox environments), context reduction (through compaction and summarization), context retrieval (using file-based search tools), context isolation (via multi-agent architectures), and context caching (for KV cache optimization). This approach has been refined through five major refactors since launch in March, with the system supporting typical tasks requiring around 50 tool calls while maintaining model performance and managing token costs effectively through their layered action space architecture.

code_generation data_analysis visualization +34

Engineering Principles and Practices for Production LLM Systems

Langchain 2025

This case study captures insights from Lance Martin, ML engineer at Langchain, discussing the evolution from traditional ML to LLM-based systems and the emerging engineering discipline of building production GenAI applications. The discussion covers key challenges including the shift from model training to model orchestration, the need to continuously rearchitect systems as foundation models rapidly improve, and the critical importance of context engineering to manage token usage and prevent context degradation. Solutions explored include workflow versus agent architectures, the three-part context engineering playbook (reduce, offload, isolate), and evaluation strategies that emphasize user feedback and tracing over static benchmarks. Results demonstrate that teams like Manis have rearchitected their systems five times since March 2025, and that simpler approaches with proper observability often outperform complex architectures, with the understanding that today's solutions must be rebuilt as models improve.

code_generation question_answering summarization +35