Cursor experimented with running hundreds of concurrent LLM-based coding agents autonomously for weeks on large-scale software projects. The problem was that single agents work well for focused tasks but struggle with complex projects requiring months of work. Their solution evolved from flat peer-to-peer coordination (which failed due to locking bottlenecks and risk-averse behavior) to a hierarchical planner-worker architecture where planner agents create tasks and worker agents execute them independently. Results included agents successfully building a web browser from scratch (1M+ lines of code over a week), completing a 3-week React migration (266K additions/193K deletions), optimizing video rendering by 25x, and running multiple other ambitious projects with thousands of commits and millions of lines of code.
Cursor, a company building AI-assisted software development tools, published research in January 2026 describing their experiments with scaling autonomous coding agents to handle complex, long-running software projects. The research represents an ambitious exploration of multi-agent LLM systems operating in production-like scenarios, running continuously for weeks and generating millions of lines of code. This case study is particularly valuable for understanding the practical challenges and architectural decisions involved in deploying LLM agents at scale for extended autonomous operation.
The core motivation was to push beyond the current limitations where individual LLM agents work well for focused, short-duration tasks but struggle with projects that would typically take human teams months to complete. Rather than simply improving individual agent capabilities, Cursor’s approach focused on coordination mechanisms that would allow hundreds of agents to work together effectively on a single codebase.
Cursor’s first architectural approach was based on flat, peer-to-peer coordination where all agents had equal status. The system used a shared file as a coordination mechanism, with agents checking what others were doing, claiming tasks, and updating their status. To prevent race conditions where multiple agents would grab the same task, they implemented a locking mechanism.
This approach revealed several critical production challenges that are instructive for anyone building multi-agent LLM systems. The locking mechanism became a severe bottleneck—agents would hold locks for too long or forget to release them entirely. Even when the locking logic worked correctly, it fundamentally limited throughput. With twenty agents running, the effective throughput would drop to only two or three agents’ worth of work, with most computational time spent waiting for locks. The system also proved brittle in ways that highlight the challenges of building reliable LLM-based systems: agents could fail while holding locks, attempt to acquire locks they already held, or update the coordination file without properly acquiring locks first.
Cursor attempted to address these issues by replacing pessimistic locking with optimistic concurrency control, where agents could read state freely but writes would fail if the underlying state had changed since the last read. While this approach was simpler and more robust than explicit locking, it didn’t solve the deeper architectural problems.
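The optimistic scheme described above can be sketched as a versioned compare-and-write. This is a minimal illustration under stated assumptions, not Cursor's implementation: `TaskBoard`, `claim_task`, and `StaleWriteError` are hypothetical names, and an in-memory object stands in for the shared coordination file.

```python
import threading
from dataclasses import dataclass, field


class StaleWriteError(Exception):
    """Raised when the state changed between a read and the following write."""


@dataclass
class TaskBoard:
    """Hypothetical shared coordination state with a version counter."""
    version: int = 0
    claims: dict = field(default_factory=dict)  # task_id -> agent_id
    # Short internal mutex that only makes the version check atomic;
    # it is never held while an agent is doing work.
    _guard: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def read(self):
        # Reads take no long-lived lock; they return a snapshot plus its version.
        with self._guard:
            return self.version, dict(self.claims)

    def write(self, expected_version, new_claims):
        # The write succeeds only if nobody committed since `expected_version`.
        with self._guard:
            if self.version != expected_version:
                raise StaleWriteError(
                    f"expected v{expected_version}, found v{self.version}"
                )
            self.claims = dict(new_claims)
            self.version += 1
            return self.version


def claim_task(board, agent_id, task_id, max_retries=5):
    """Try to claim a task; re-read and retry when a write is rejected as stale."""
    for _ in range(max_retries):
        version, claims = board.read()
        if task_id in claims:
            return claims[task_id] == agent_id  # already claimed, possibly by us
        claims[task_id] = agent_id
        try:
            board.write(version, claims)
            return True
        except StaleWriteError:
            continue  # someone else committed first; start over from a fresh read
    return False
```

A crashed agent cannot leave a lock behind in this design, because no agent ever holds one. That matches the robustness gain the write-up attributes to the change, though it does nothing about the ownership problems of the flat structure.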
The flat organizational structure created emergent behavioral problems that are particularly interesting from an LLMOps perspective. Without hierarchy or clear ownership, agents became risk-averse. They gravitated toward small, safe changes and avoided difficult tasks. No single agent took responsibility for hard problems or end-to-end implementation. This led to work churning for long periods without meaningful progress—a form of “diffusion of responsibility” that parallels similar phenomena in human organizations but manifests differently with LLM agents.
The breakthrough came from introducing hierarchy and role separation. Instead of a flat structure, Cursor implemented a pipeline with distinct agent roles:
Planner agents continuously explore the codebase and create tasks. Critically, planners can spawn sub-planners for specific areas of the codebase, making the planning process itself parallel and recursive. This allows the system to scale planning capacity as the project grows in complexity.
Worker agents pick up tasks and focus entirely on execution. They don’t coordinate with other workers or concern themselves with the broader project context. Their job is simply to grind on their assigned task until completion, then push their changes. This narrow focus eliminates the coordination overhead that plagued the flat architecture.
At the end of each cycle, a judge agent determines whether to continue, and then the next iteration starts fresh. This periodic reset mechanism helps combat drift and tunnel vision—common problems in long-running autonomous systems.
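The planner, worker, and judge roles described above can be sketched as a simple loop. This is an illustrative toy under stated assumptions, not Cursor's harness: every name (`Planner`, `worker`, `judge`, `run`) is hypothetical, the agents are plain functions rather than LLM calls, and workers run sequentially here, whereas the write-up describes hundreds running concurrently.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    description: str
    done: bool = False


@dataclass
class Planner:
    """Emits tasks for an area; delegates sub-areas to sub-planners."""
    area: str
    sub_areas: list = field(default_factory=list)

    def plan(self):
        tasks = [Task(f"implement {self.area}")]
        for sub in self.sub_areas:
            # Recursive planning: each sub-planner covers its own slice.
            tasks.extend(Planner(sub).plan())
        return tasks


def worker(task):
    """A worker only sees its own task; it never consults other workers."""
    task.done = True  # stand-in for "work until done, then push changes"
    return task


def judge(cycle, max_cycles, tasks):
    """Decide whether another cycle is warranted (placeholder policy)."""
    return cycle + 1 < max_cycles and all(t.done for t in tasks)


def run(root_area, sub_areas, max_cycles=3):
    completed = []
    for cycle in range(max_cycles):
        # Each cycle starts fresh: a new planner and a new queue,
        # with no state carried over from the previous cycle.
        queue = deque(Planner(root_area, sub_areas).plan())
        cycle_tasks = []
        while queue:
            cycle_tasks.append(worker(queue.popleft()))
        completed.extend(cycle_tasks)
        if not judge(cycle, max_cycles, cycle_tasks):
            break
    return completed
```

The point of the sketch is the shape of the control flow: planning fans out recursively, workers share nothing with each other, and the only cross-cycle decision is the judge's continue-or-stop verdict.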
This architectural shift solved most of the coordination problems and enabled scaling to very large projects without individual agents getting stuck in unproductive loops. From an LLMOps perspective, this highlights an important principle: effective multi-agent systems often require carefully designed roles and responsibilities rather than trying to make every agent capable of everything.
The proof of this architecture came from several ambitious experiments that represent real production-scale deployments of LLM agents:
Web browser from scratch: Agents ran autonomously for close to a week, producing over 1 million lines of code across 1,000 files. Despite the massive codebase size, new agents could still understand it and make meaningful progress. Hundreds of workers ran concurrently, pushing to the same Git branch with minimal merge conflicts. The resulting code is available on GitHub, providing rare transparency into the output quality of long-running agent systems.
Solid to React migration: An in-place migration of Cursor’s own codebase from Solid to React took over 3 weeks and produced 266K additions and 193K deletions. Cursor says they believe the change could be merged after testing, which suggests the code quality approaches production standards, though they state this cautiously rather than claiming complete success.
Video rendering optimization: A long-running agent improved an upcoming Cursor product by rewriting video rendering in Rust, achieving 25x performance improvements. It also added smooth zoom and pan with spring transitions and motion blur. This code was actually merged and deployed to production, representing a concrete example of agent-generated code reaching end users.
Other ongoing experiments include building a Java LSP (7.4K commits, 550K lines of code), a Windows 7 emulator (14.6K commits, 1.2M lines of code), and an Excel implementation (12K commits, 1.6M lines of code). The sheer scale of these experiments—thousands of commits and millions of lines of code—is notable, though Cursor doesn’t claim these projects are complete or production-ready.
One of the most valuable LLMOps insights from this research concerns model selection for different roles and tasks. Cursor found that model choice matters significantly for extremely long-running autonomous tasks, and different models excel at different roles.
They found GPT-5.2 models substantially better for extended autonomous work, specifically noting improvements in following instructions, maintaining focus, avoiding drift, and implementing features precisely and completely. In contrast, they observed that Claude Opus 4.5 tends to stop earlier and take shortcuts when convenient, yielding back control quickly rather than pushing through to completion. While this might be desirable for interactive use cases where human oversight is expected, it proved less effective for long-running autonomous operation.
Interestingly, they also discovered that GPT-5.2 is a better planner than GPT-5.1-codex, even though the latter is specifically trained for coding tasks. This led them to adopt a heterogeneous model strategy, using the model best suited for each role rather than applying one universal model across all agent types. This represents a sophisticated approach to production LLM deployment that goes beyond simple model benchmarking to consider task-specific fitness.
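A heterogeneous model strategy of this kind amounts to a per-role lookup. The sketch below is a hypothetical configuration, not Cursor's: only the planner choice is stated in the write-up, the worker and judge entries are explicit placeholders, and the model strings are illustrative labels rather than verified API model IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleConfig:
    model: str      # illustrative label, not a verified API model ID
    rationale: str


# Only the planner choice comes from the write-up; the other entries are placeholders.
ROLE_MODELS = {
    "planner": RoleConfig("gpt-5.2", "reported as the stronger planner"),
    "worker": RoleConfig("worker-model-placeholder", "assumption: chosen per evaluation"),
    "judge": RoleConfig("judge-model-placeholder", "assumption: chosen per evaluation"),
}


def model_for(role: str) -> str:
    """Look up the model for a role, failing loudly on unknown roles."""
    try:
        return ROLE_MODELS[role].model
    except KeyError:
        raise ValueError(f"no model configured for role {role!r}") from None
```

Keeping the mapping in one place makes it cheap to re-evaluate a single role when a new model appears, without touching the rest of the harness.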
It’s worth noting that Cursor is describing future models (GPT-5.x, Opus 4.5) that don’t exist as of this analysis, suggesting either this is speculative research, they have early access to unreleased models, or they’re using placeholder names for competitive reasons. This introduces some uncertainty about the reproducibility and generalizability of their findings.
A particularly striking observation from Cursor’s research is that “a surprising amount of the system’s behavior comes down to how we prompt the agents.” They note that getting agents to coordinate well, avoid pathological behaviors, and maintain focus over long periods required extensive experimentation. Notably, they claim “the harness and models matter, but the prompts matter more.”
This is a critical LLMOps insight that’s often underappreciated. While much attention in AI research focuses on model capabilities and system architecture, the specific instructions given to agents through prompts can be the dominant factor in system behavior. Unfortunately, Cursor doesn’t share specific prompting strategies or examples, which limits the actionable value of this insight. The lack of prompt details also makes it difficult to assess whether their success comes primarily from clever engineering or from having access to more capable models.
The extensive experimentation required to develop effective prompts highlights a key challenge in LLMOps: unlike traditional software where behavior is deterministic and debugging is relatively straightforward, LLM-based systems require iterative refinement of natural language instructions, which is far less systematic and harder to version control effectively.
An important design principle that emerged from Cursor’s experimentation was that many improvements came from removing complexity rather than adding it. They initially built an “integrator” role for quality control and conflict resolution, drawing on concepts from distributed systems and organizational design. However, they found this created more bottlenecks than it solved—worker agents were already capable of handling conflicts themselves.
This finding challenges the intuition that more complex projects require more elaborate coordination mechanisms. Instead, Cursor found the optimal system architecture was simpler than expected, with the right amount of structure falling somewhere in the middle ground. Too little structure leads to conflicts, duplicated work, and drift. Too much structure creates fragility and bottlenecks. This represents a valuable lesson for LLMOps practitioners: not all patterns from traditional distributed computing or organizational design translate directly to multi-agent LLM systems.
Cursor is refreshingly candid about the limitations of their current system and the work that remains. They acknowledge that multi-agent coordination remains a hard problem and that they’re “nowhere near optimal.” The issues they list include the continued need for periodic resets.
The need for periodic resets is particularly revealing—it suggests that even with their architectural improvements, LLM agents still suffer from context drift and lose effectiveness over very long operation periods. This is a fundamental challenge in production LLM systems that extends beyond coordination to the core capabilities of current language models.
From a critical perspective, while Cursor presents impressive results in terms of code volume and project scale, they provide limited information about code quality, test coverage, maintainability, or whether these massive codebases actually work as intended. The web browser experiment produced a “simple screenshot,” and the Solid-to-React migration is described as “possible to merge” after testing—not as successfully merged. Only the video rendering optimization is confirmed as reaching production.
Cursor mentions deploying “billions of tokens across these agents toward a single goal” and acknowledges “the system isn’t perfectly efficient.” This raises important questions about the economics of their approach that aren’t fully addressed. Running hundreds of agents continuously for weeks and generating trillions of tokens (the figure given in the introduction of Cursor’s post) represents a substantial computational cost and carbon footprint.
While the results are impressive in terms of code volume, there’s no analysis of whether this approach is cost-effective compared to human developers or more targeted AI assistance. A million lines of AI-generated code isn’t necessarily more valuable than a thousand carefully crafted lines, and the lack of quality metrics makes it difficult to assess the true productivity gains.
From an LLMOps perspective, several aspects of production deployment remain unclear. These operational concerns are critical for anyone attempting to implement similar systems in production, but the blog post does not address them in detail.
This research represents one of the most ambitious publicly documented attempts to scale autonomous LLM agents for extended operation on complex tasks. Several themes emerge that are relevant to the broader field of LLMOps:
Architectural patterns matter more than individual agent intelligence: The shift from flat to hierarchical organization was more impactful than any improvements to individual agent capabilities. This suggests that as we build more complex LLM-based systems, software architecture and coordination mechanisms will be at least as important as model capabilities.
Role specialization enables scale: Just as in human organizations, having agents specialize in planning versus execution proved more effective than generalist agents. This suggests a future where production LLM systems involve orchestrating multiple specialized models and agents rather than trying to build one superintelligent generalist.
Prompt engineering remains a critical bottleneck: Despite advances in model capabilities, getting agents to behave correctly over long periods requires extensive manual refinement of prompts. This represents a significant operational challenge for production systems and highlights the need for better tools and methodologies for prompt development and testing.
Model selection is task-dependent: Different models excel at different aspects of autonomous work, and production systems benefit from using multiple models strategically rather than defaulting to the “best” model on standard benchmarks.
Long-running autonomous operation introduces new challenges: Problems like drift, tunnel vision, and loss of focus over extended periods are distinct from the challenges of short-duration AI assistance. Production LLMOps for autonomous agents requires different techniques than human-in-the-loop systems do.
Transparency and reproducibility remain limited: While Cursor’s research is more transparent than typical industry blog posts, critical details about prompts, quality metrics, failure rates, and cost economics are missing. This limits the ability of others to learn from and build on their work.
The research ultimately provides a cautiously optimistic answer to whether autonomous coding can be scaled by throwing more agents at a problem. Hundreds of agents can indeed work together on a single codebase for weeks and make real progress on ambitious projects. However, significant challenges remain, and the path from research experiment to reliable production system is still being charted. Cursor notes that “the techniques we’re developing here will eventually inform Cursor’s agent capabilities,” suggesting this work is still in the research phase rather than being deployed directly in their product.