ZenML

Engineering and Optimizing an Agent Harness for Production AI Coding Assistants

Cursor 2026

Cursor, an AI-powered code editor company, details their approach to building and continuously improving their "agent harness"—the production infrastructure layer that orchestrates LLM-based coding agents. The challenge was creating a robust, measurable system that could effectively manage context windows, support multiple LLM providers with different characteristics, and maintain high code quality at scale. Their solution involves a sophisticated evaluation framework combining offline benchmarks (including their proprietary CursorBench) with online A/B testing, custom metrics like "Keep Rate" for measuring code retention, LLM-based sentiment analysis of user satisfaction, and model-specific prompt engineering and tool customization. Results include a 10x reduction in unexpected tool call errors, optimized context management that shifted from static to dynamic retrieval, and a production system capable of seamlessly supporting multiple models from different providers while maintaining quality and performance.

Industry

Tech

Overview

Cursor’s case study provides an in-depth look at their production LLMOps infrastructure for their AI coding assistant, focusing specifically on what they call the “agent harness”—the layer that sits between raw language models and the end-user experience. This is a particularly valuable case study because it goes beyond surface-level product descriptions to detail the engineering challenges, measurement frameworks, and operational practices required to run LLM-based agents at scale in a production environment where code quality and reliability are paramount.

The company describes their approach as treating the harness like “any ambitious software product,” combining vision-driven development with rigorous experimentation, quantitative and qualitative evaluation, and continuous iteration. What makes this particularly notable from an LLMOps perspective is their emphasis on instrumentation and measurement as foundational requirements for improving their system over time.

Context Window Engineering and Evolution

At the core of Cursor’s agent architecture is sophisticated context window management. The context window structure follows a standard pattern: system prompt and tool descriptions first, followed by conversation state, and finally the user’s request. However, the way they populate and manage this window has evolved dramatically as underlying models have improved.
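
As a rough illustration of this layering (a minimal sketch, not Cursor's actual code; the class and message shapes are hypothetical), the assembly can be expressed as:

```python
# Minimal sketch of the context window layout described above.
# All names and message shapes are illustrative, not Cursor's implementation.
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    system_prompt: str
    tool_descriptions: list[str]
    conversation: list[dict] = field(default_factory=list)  # prior turns and tool results

    def build_messages(self, user_request: str) -> list[dict]:
        """System prompt and tools first, then conversation state, then the user's request."""
        system = self.system_prompt + "\n\n# Tools\n" + "\n".join(self.tool_descriptions)
        return (
            [{"role": "system", "content": system}]
            + self.conversation
            + [{"role": "user", "content": user_request}]
        )
```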

When they first developed their coding agent in late 2024, the approach was heavily guardrailed due to model limitations. They invested significant engineering effort into context engineering, including automatically surfacing lint and type errors after every edit, rewriting file read requests when the agent asked for too few lines, and even capping the number of tool calls the agent could make in a single turn. They also provided substantial static context upfront—folder layouts, semantically matched code snippets, and compressed versions of manually attached files.
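
Guardrails of this kind can be sketched roughly as follows; the constants and helper callables are assumptions for illustration, not Cursor's actual values or code:

```python
# Illustrative guardrails in the spirit of the 2024-era harness described above.
# Constants and helpers are assumptions, not Cursor's real values.
from typing import Callable

MAX_TOOL_CALLS_PER_TURN = 25   # assumed per-turn tool budget
MIN_READ_LINES = 200           # assumed minimum file-read window


def widen_read_request(start: int, end: int) -> tuple[int, int]:
    """Rewrite a file-read request that asks for too few lines."""
    if end - start < MIN_READ_LINES:
        end = start + MIN_READ_LINES
    return start, end


def run_edit_with_diagnostics(
    apply_edit: Callable[[], dict],
    run_linters: Callable[[], list[str]],
    calls_this_turn: int,
) -> dict:
    """Apply an edit, then surface lint/type errors back into the agent's context."""
    if calls_this_turn >= MAX_TOOL_CALLS_PER_TURN:
        return {"error": "tool budget for this turn exhausted"}
    result = apply_edit()
    result["diagnostics"] = run_linters()  # fed back to the model after every edit
    return result
```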

By 2026, this approach has shifted dramatically toward dynamic context retrieval. While they still include some useful static context (operating system, git status, current and recently viewed files), they’ve removed most guardrails and instead empower agents to fetch context dynamically as needed. This represents a significant architectural evolution driven by improving model capabilities, and reflects a broader LLMOps trend: as foundation models improve, the optimal harness design shifts from protective scaffolding toward enabling greater autonomy.

This evolution demonstrates an important LLMOps principle: production systems must be designed to adapt to rapidly changing model capabilities, requiring continuous re-evaluation of what should be handled by the harness versus what should be delegated to the model.

Multi-Layered Evaluation Strategy

Cursor has developed a sophisticated, multi-layered approach to evaluating harness changes that addresses a fundamental LLMOps challenge: determining whether a change actually improves real-world performance. Their strategy combines offline and online evaluation methods, each serving different purposes.

On the offline side, they maintain both public benchmarks and their own proprietary eval suite called CursorBench. These provide fast, standardized quality assessments and enable temporal comparisons. However, they explicitly acknowledge that “even the best benchmarks only approximate real usage,” recognizing a critical limitation that many LLMOps practitioners struggle with.

To address this gap, they’ve invested heavily in online experimentation infrastructure. They deploy multiple harness variants side-by-side and A/B test them on real user traffic. Their measurement framework operates at multiple levels of abstraction:

Basic operational metrics include latency, token efficiency, tool call count, and cache hit rate. While these don’t directly measure quality, they provide important signals about system health and efficiency.

Code retention metrics measure the “Keep Rate” of agent-generated code—tracking what fraction of proposed changes remain in the user’s codebase after fixed time intervals. This is an elegant proxy for quality: if users manually adjust or have the agent fix its own output, it indicates the initial response was suboptimal. This metric is particularly valuable because it’s objective, automatically measurable, and directly tied to user value.
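
Cursor doesn't publish the exact formula, but a retention metric in this spirit could be computed roughly as follows (a sketch assuming line-level retention):

```python
# Rough sketch of a "Keep Rate"-style retention metric.
# The exact definition Cursor uses is not published; this assumes line-level retention.
def keep_rate(agent_added_lines: list[str], file_lines_after_interval: list[str]) -> float:
    """Fraction of agent-added lines still present in the file after a fixed interval."""
    if not agent_added_lines:
        return 1.0
    remaining = set(file_lines_after_interval)
    kept = sum(1 for line in agent_added_lines if line in remaining)
    return kept / len(agent_added_lines)


# Example: 3 of 4 generated lines survive a day of edits -> keep rate 0.75
print(keep_rate(["a = 1", "b = 2", "c = 3", "d = 4"], ["a = 1", "b = 2", "x = 9", "c = 3"]))
```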

LLM-based satisfaction assessment uses a language model to read user responses to the agent’s output and semantically determine satisfaction levels. They provide concrete examples: a user moving to the next feature signals success, while pasting a stack trace signals failure. This represents a meta-application of LLMs—using language models to evaluate language model outputs at scale—and demonstrates sophisticated thinking about evaluation in production environments.
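
A minimal sketch of such an LLM-as-judge step, with an assumed prompt and label set (not Cursor's actual pipeline), might look like:

```python
# Sketch of LLM-based satisfaction classification over the user's follow-up message.
# Prompt wording, labels, and the completion callable are illustrative assumptions.
from typing import Callable

JUDGE_PROMPT = """You are grading whether a user was satisfied with an AI coding agent's last response.
User's follow-up message:
---
{followup}
---
Answer with exactly one word: SATISFIED (e.g. the user moves on to the next feature),
UNSATISFIED (e.g. the user pastes a stack trace), or UNCLEAR."""


def classify_satisfaction(followup: str, complete: Callable[[str], str]) -> str:
    """`complete` is any text-completion function (a model call) supplied by the caller."""
    label = complete(JUDGE_PROMPT.format(followup=followup)).strip().upper()
    return label if label in {"SATISFIED", "UNSATISFIED", "UNCLEAR"} else "UNCLEAR"
```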

The combination of these measurement layers allows Cursor to make data-driven decisions about harness changes. They provide a concrete example of this in action: when they experimented with using a more expensive model for context summarization, online testing revealed it made negligible difference in agent quality, leading them to shelve the idea despite its theoretical promise. This illustrates the value of production measurement over theoretical optimization.

Production Monitoring and Error Management

As the harness has grown more complex with additional models and capabilities, Cursor has developed sophisticated monitoring and error management practices that reflect mature LLMOps thinking. They treat the harness like production software, with particular attention to tool call errors, which can be “extremely harmful” to agent sessions.

Their error taxonomy distinguishes between “unknown” errors (which always represent harness bugs) and “expected” errors (which may represent either bugs or normal behavior). Expected errors are classified by cause: InvalidArguments and UnexpectedEnvironment capture model mistakes and context contradictions, while ProviderError captures vendor outages from tools like image generation or web search. Additional classifications include UserAborted and Timeout.
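
The taxonomy maps naturally onto a small enumeration; the category names below follow the text, while the surrounding structure is an assumption:

```python
# The error categories described above, expressed as a small taxonomy.
# Category names follow the text; everything else is an illustrative assumption.
from enum import Enum


class ToolErrorKind(Enum):
    UNKNOWN = "unknown"                                  # always a harness bug
    INVALID_ARGUMENTS = "invalid_arguments"              # model passed bad arguments
    UNEXPECTED_ENVIRONMENT = "unexpected_environment"    # context contradicts reality
    PROVIDER_ERROR = "provider_error"                    # outage in a backing vendor (e.g. web search)
    USER_ABORTED = "user_aborted"
    TIMEOUT = "timeout"


# "Expected" errors may be bugs or normal behavior; unknown ones are always bugs.
EXPECTED = {
    ToolErrorKind.INVALID_ARGUMENTS,
    ToolErrorKind.UNEXPECTED_ENVIRONMENT,
    ToolErrorKind.PROVIDER_ERROR,
    ToolErrorKind.USER_ABORTED,
    ToolErrorKind.TIMEOUT,
}
```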

They’ve implemented a sophisticated alerting system built on this taxonomy. Unknown errors trigger alerts whenever they exceed a fixed threshold for any tool, treating them as bugs requiring immediate attention. For expected errors, they use anomaly detection that fires when error rates significantly exceed baseline levels. Critically, they compute baselines per-tool and per-model, recognizing that different models exhibit different error characteristics—a nuanced approach that reflects deep operational experience.
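
In sketch form, the two alerting paths might look like the following; the fixed threshold and the mean-plus-k-standard-deviations baseline test are assumptions, since the text only says rates must significantly exceed baseline levels:

```python
# Sketch of the two alerting paths described above. The fixed threshold and the
# baseline statistic are assumptions; baselines are kept per (tool, model) pair.
import statistics

UNKNOWN_ERROR_THRESHOLD = 5  # assumed fixed per-tool threshold


def should_alert_unknown(unknown_errors_in_window: int) -> bool:
    """Unknown errors are always bugs: alert past a fixed per-tool threshold."""
    return unknown_errors_in_window > UNKNOWN_ERROR_THRESHOLD


def should_alert_expected(current_rate: float, baseline_rates: list[float], k: float = 3.0) -> bool:
    """Expected errors: alert when the rate significantly exceeds its per-tool, per-model baseline."""
    mean = statistics.mean(baseline_rates)
    std = statistics.pstdev(baseline_rates)
    return current_rate > mean + k * std


# Baselines would be keyed per tool and model, e.g. baselines[("read_file", "model-a")]
```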

An innovative aspect of their monitoring approach is the use of automation equipped with skills to search through logs, surface new or spiking issues, and automatically create or update tickets in their backlog. They “lean heavily on Cloud Agents to kick off fixes for many issues at once” and can trigger them directly from their project management tool (Linear). This represents what they call an automated “software factory” for their agent harness—using AI agents to maintain and improve the infrastructure that runs AI agents. Through a focused sprint, this approach helped them drive unexpected tool call errors down by an order of magnitude, demonstrating measurable impact.

The concept of “context rot”—where accumulated mistakes degrade subsequent model decisions—is particularly noteworthy. Even when agents can self-correct, errors remain in context and waste tokens while potentially degrading quality. This represents a production challenge specific to stateful LLM applications and highlights why error prevention matters beyond just the immediate failed operation.

Model-Specific Customization

A particularly sophisticated aspect of Cursor’s harness architecture is its deep customization for different models and providers. While their harness abstractions are model-agnostic in design, they heavily customize the actual implementation for every supported model. This reflects a mature understanding that optimizing LLM applications in production requires adapting to model-specific characteristics rather than treating all models as interchangeable.

Their customization operates at multiple levels. At the tool level, they provision different tools based on training format: OpenAI’s models receive patch-based file editing tools (matching their training), while Anthropic’s models receive string replacement tools. Using the “wrong” tool costs extra reasoning tokens and produces more mistakes, so this optimization has real performance and quality implications.
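
In sketch form (the tool names and shapes here are assumptions, not Cursor's actual tool set), provisioning by provider might look like:

```python
# Sketch of provisioning edit tools by provider, as described above.
# Tool names and the provider mapping are assumptions for illustration.
def edit_tools_for(provider: str) -> list[dict]:
    if provider == "openai":
        return [{"name": "apply_patch", "description": "Edit files by applying a unified diff/patch."}]
    if provider == "anthropic":
        return [{"name": "str_replace", "description": "Edit files by replacing an exact string."}]
    return [{"name": "write_file", "description": "Fallback: overwrite a file with new contents."}]
```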

Prompt engineering is also highly customized per provider and even per model version. They note that OpenAI’s models tend to be “more literal and precise in their instruction following,” whereas Claude is “more intuitive and more tolerant to imprecise instructions.” This suggests they maintain different prompt libraries tuned to each model’s characteristics.

When they receive early access to new models before launch, they follow an iterative tuning process: starting from the closest existing model’s harness configuration, running offline evals to identify confusion points, having team members use it and surface problems, and tweaking the harness in response. They iterate until they have a model-harness combination they’re confident shipping.

Sometimes this process reveals genuine model quirks that can be mitigated through harness engineering. They provide a fascinating example: one model developed what they termed “context anxiety”—as its context window filled up, it would start refusing work and claiming tasks seemed too big. They were able to reduce this behavior through prompt adjustments, demonstrating how harness engineering can work around model limitations.

This level of model-specific customization raises important questions about the sustainability of supporting multiple models. Each new model requires weeks of tuning work, and each model version potentially requires prompt adjustments. While Cursor doesn't explicitly discuss this tradeoff, the pattern suggests that significant ongoing engineering investment is required to maintain quality across multiple providers.

Mid-Chat Model Switching Challenges

Supporting users switching models mid-conversation presents unique technical challenges that Cursor has had to solve. Different models have different behaviors, prompts, and tool shapes, creating distribution shift when a model must operate on conversation history generated by a different model.

When a user switches models, Cursor automatically switches to the appropriate harness with that model’s customized prompts and tools. However, the new model must then apply its tools to a conversation history that’s out of distribution from its training data. To address this, they add custom instructions telling the model when it’s taking over from another model and steering it away from calling tools that appear in history but aren’t in its own tool set.
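
A rough sketch of that hand-off (message shapes and instruction wording are illustrative assumptions) might be:

```python
# Sketch of the model-switch hand-off described above: tell the new model it is taking
# over and steer it away from tools that appear in history but are not in its tool set.
# Message shapes and the instruction wording are illustrative assumptions.
def build_takeover_messages(history: list[dict], new_model_tools: set[str]) -> list[dict]:
    foreign_tools = {
        m["tool_name"]
        for m in history
        if m.get("role") == "tool" and m.get("tool_name") and m["tool_name"] not in new_model_tools
    }
    note = (
        "You are taking over a conversation started by a different model. "
        f"Tools seen earlier but unavailable to you (do not call them): {sorted(foreign_tools)}. "
        f"Your available tools are: {sorted(new_model_tools)}."
    )
    return history + [{"role": "system", "content": note}]
```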

A second challenge involves caching. Since caches are provider- and model-specific, switching models means a cache miss and a slower, more expensive first turn. They’ve experimented with conversation summarization at switch time to provide a clean summary that reduces the cache penalty, but acknowledge this can lose important details for complex tasks. Their recommendation is to stay with one model for the duration of a conversation unless there’s a specific reason to switch.

An alternative they’ve implemented is using subagents, which start from a fresh context window and sidestep the challenges of mid-conversation switching. They recently added the ability for users to directly request a subagent with a particular model, providing a cleaner handoff mechanism.

These implementation details reveal the complexity of building flexible LLM applications that support multiple providers while maintaining quality. The challenges aren’t just about API compatibility—they involve deep questions about context management, caching strategies, and ensuring models can operate effectively on unfamiliar input distributions.

Future Architecture: Multi-Agent Systems

Cursor articulates a clear vision for the future of AI-assisted software engineering: multi-agent systems where specialized agents handle different subtasks rather than running everything through a single agent. They envision delegation across agents optimized for planning, fast edits, debugging, and other specific capabilities.

Critically, they frame this as “fundamentally a harness challenge.” The system needs to know which agent to dispatch, how to frame tasks for each agent’s strengths, and how to stitch results into coherent workflows. This orchestration logic will live in the harness rather than any individual agent.
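
In the simplest sketch (a hypothetical routing layer, not anything Cursor has described concretely), that orchestration logic reduces to picking an agent and framing the task for it:

```python
# Very rough sketch of harness-level dispatch across specialized agents.
# The registry, the router, and the task framing are illustrative assumptions.
from typing import Callable

Agent = Callable[[str], str]


def dispatch(task: str, route: Callable[[str], str], agents: dict[str, Agent]) -> str:
    """The harness picks which specialized agent handles a subtask and frames it for that agent."""
    kind = route(task)                            # e.g. "planning", "fast_edit", "debugging"
    agent = agents.get(kind, agents["planning"])  # registry is assumed to include a general planner
    return agent(f"[{kind}] {task}")
```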

This perspective is important for LLMOps practitioners because it suggests that as AI systems become more capable, the infrastructure layer becomes more rather than less important. The harness evolves from a simple wrapper around a single model to a sophisticated orchestration system managing multiple specialized agents, each potentially using different models with different configurations.

Critical Assessment and Limitations

While this case study provides valuable technical details about production LLMOps practices, several important considerations should be noted:

The text is promotional in nature, published on Cursor’s own blog, and naturally emphasizes their successes. We don’t get visibility into failed experiments beyond the context summarization example, or into the operational costs of maintaining their complex multi-model infrastructure.

The claim that their approach drives a 10x reduction in tool call errors is impressive but lacks context about the baseline, timeframe, or whether this is sustained over time. Similarly, while they describe sophisticated evaluation methods, we don’t see concrete benchmark numbers or comparisons to alternative approaches.

Their heavy customization per model raises sustainability questions. As new models are released frequently, the engineering effort required to tune and maintain harnesses for each one could become prohibitive. They don’t discuss how they decide which models to support or when to deprecate support for older ones.

The reliance on online A/B testing is powerful but requires significant user volume. Smaller organizations may not have sufficient traffic to run meaningful experiments, limiting the generalizability of this approach.

Finally, while they describe using LLMs to evaluate LLM outputs (for sentiment analysis), they don’t discuss the potential circularity or bias this might introduce, or how they validate these automated assessments.

Despite these limitations, the case study provides genuinely useful insights into production LLMOps practices at a company operating at significant scale with a product where reliability and quality directly impact user productivity.
