Cursor developed Composer, a specialized coding agent model designed to balance speed and intelligence for real-world software engineering tasks. The challenge was creating a model that could perform at near-frontier levels while being four times more efficient at token generation than comparable models, moving away from the "airplane Wi-Fi" problem where agents were either too slow for synchronous work or required long async waits. The solution involved extensive reinforcement learning (RL) training in an environment that closely mimicked production, using custom kernels for low-precision training, parallel tool calling capabilities, semantic search with custom embeddings, and a fleet of cloud VMs to simulate the real Cursor IDE environment. The result was a model that performs close to frontier models like GPT-4.5 and Claude Sonnet 3.5 on coding benchmarks while maintaining significantly faster token generation, enabling developers to stay in flow state rather than context-switching during long agent runs.
Cursor, a company building an AI-powered IDE, developed Composer, their first custom agent model specifically designed for production software engineering workflows. This case study provides detailed insights into how Cursor approached the challenge of building and deploying a specialized LLM that needed to balance both intelligence and speed for real-world coding tasks. The presentation was given by a member of Cursor’s engineering team in New York, discussing the technical architecture, training methodology, and infrastructure challenges involved in bringing this model to production.
The fundamental problem Cursor identified was what they describe as the “airplane Wi-Fi problem” - existing coding agents were stuck in a “semi-async valley of death” where they were either too slow for synchronous work (taking 10-20 minutes for tasks) or needed to run for extended periods (30+ minutes to days) in the background. Developers wanted either very fast responses to stay in flow state, or extremely powerful models that could run autonomously for long periods. The middle ground was frustrating and broke developer workflow.
Composer was designed to outperform the best open source models available at the time and to compete with recent frontier models, while remaining slightly below the absolute latest frontier models like Claude Sonnet 3.5 and GPT-4.5/Codex. The critical differentiator is efficiency: Composer generates tokens approximately four times faster than models at similar intelligence levels. This speed-intelligence tradeoff was intentional and reflects a key insight into production LLM deployment: near-frontier performance with significantly better latency can be more valuable than the absolute best performance with slower response times.
The team measured performance against their own internal benchmarks that represented actual usage patterns on their own repositories and how they built software day-to-day. This is an important LLMOps practice - rather than optimizing solely for academic benchmarks, they created evaluation criteria based on real production workflows. The key success criterion was whether their own developers would choose to use the model checkpoint every single day to build their product.
The production system architecture consists of three interconnected server types that communicate extensively during both training and inference:
Training Server: Uses the standard ML stack with PyTorch to handle model parameter updates and gradient computations. This server receives advantages (reward signals) back from the inference server and updates model weights accordingly.
Inference Server: Manages the rollout process using Ray for distributed computing. This server handles the actual agent execution, making tool calls and managing the interaction between the model and the simulated environment. It also handles load balancing across different threads and processes to minimize idle time when rollouts complete at different rates.
Environment Servers: These simulate the actual Cursor IDE environment as closely as possible. Each environment server represents a sandboxed execution context where the agent can read files, edit code, run shell commands, and perform other development tasks.
The communication pattern between these servers is bidirectional - the inference server sends advantages back to the trainer to nudge parameters up or down based on rollout success, then receives updated model weights. The inference server constantly communicates with environment servers to make tool calls and receive results.
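The message flow between the three server types can be sketched in a few lines. This is a minimal illustration, not Cursor's implementation: the class names, the `call_tool` and `apply_advantages` methods, the single `read_file` tool, and the toy reward are all assumptions made for the example, and the "trainer" only bumps a version counter where a real one would take a gradient step.

```python
from dataclasses import dataclass, field


@dataclass
class EnvironmentServer:
    """Hypothetical sandbox holding a tiny in-memory file tree."""
    files: dict[str, str]

    def call_tool(self, name: str, arg: str) -> str:
        if name == "read_file":
            return self.files.get(arg, "<missing>")
        return f"<unsupported tool: {name}>"


@dataclass
class TrainingServer:
    """Stands in for the trainer; it only tracks a weights version."""
    weights_version: int = 0

    def apply_advantages(self, advantages: list[float]) -> int:
        # A real trainer would run a gradient update here.
        self.weights_version += 1
        return self.weights_version


@dataclass
class InferenceServer:
    env: EnvironmentServer
    trainer: TrainingServer
    weights_version: int = 0
    log: list[str] = field(default_factory=list)

    def rollout(self, path: str) -> float:
        """One toy rollout: a single tool call, rewarded if the file exists."""
        self.log.append(f"env.call_tool(read_file, {path})")
        result = self.env.call_tool("read_file", path)
        return 0.0 if result == "<missing>" else 1.0

    def training_step(self, paths: list[str]) -> None:
        rewards = [self.rollout(p) for p in paths]
        baseline = sum(rewards) / len(rewards)
        advantages = [r - baseline for r in rewards]  # placeholder estimate
        self.log.append("trainer.apply_advantages")
        self.weights_version = self.trainer.apply_advantages(advantages)
        self.log.append(f"weights -> v{self.weights_version}")


env = EnvironmentServer(files={"main.py": "print('hi')"})
server = InferenceServer(env=env, trainer=TrainingServer())
server.training_step(["main.py", "missing.py"])
print("\n".join(server.log))
```

The log shows the order described above: tool calls go to the environment, advantages go to the trainer, and a new weights version comes back to the inference server.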
Composer has access to approximately 10 tools, five of which were emphasized in the presentation as core tools.
A critical capability that emerged from the RL training process was parallel tool calling. Rather than reading files sequentially one by one, the model learned to read 10 files in parallel, dramatically improving the end-to-end user experience. This wasn’t just about token generation speed but about reducing wall-clock time for completing real development tasks.
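The wall-clock effect of parallel tool calls can be illustrated with a small `asyncio` sketch. The `read_file` coroutine and its fixed 50 ms latency are hypothetical stand-ins for a real tool call; the point is only that ten concurrent reads take roughly as long as one, while ten sequential reads take ten times as long.

```python
import asyncio
import time


async def read_file(path: str, latency: float = 0.05) -> str:
    """Hypothetical tool call; the sleep stands in for I/O and network latency."""
    await asyncio.sleep(latency)
    return f"contents of {path}"


async def read_sequential(paths: list[str]) -> list[str]:
    # Each read waits for the previous one to finish.
    return [await read_file(p) for p in paths]


async def read_parallel(paths: list[str]) -> list[str]:
    # All reads are in flight at once; results keep the input order.
    return await asyncio.gather(*(read_file(p) for p in paths))


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


paths = [f"src/module_{i}.py" for i in range(10)]
seq_result, seq_time = timed(read_sequential(paths))
par_result, par_time = timed(read_parallel(paths))
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Both variants return the same contents in the same order, so the speedup comes purely from overlapping the waits.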
The presentation emphasized one particularly valuable tool: semantic search powered by Cursor’s custom-trained embedding model. The system indexes the user’s codebase and lets the agent issue natural language queries to find relevant files. Research conducted by the team showed that semantic search improved performance for essentially every model they tested in the Cursor agent harness, and it was particularly effective with Composer. This makes intuitive sense from an LLMOps perspective: because Composer was trained in the same environment it runs in at inference time, the model effectively became a “power user” of the semantic search tool.
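The index-then-query shape of such a tool can be sketched as follows. This is a toy under loud assumptions: `embed` here is a bag-of-words count vector, which is lexical rather than semantic, and it stands in for Cursor's custom-trained embedding model, whose details are not given in the presentation. The file names and contents are invented for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(files: dict[str, str]) -> dict[str, Counter]:
    """Embed every file once, ahead of query time."""
    return {path: embed(text) for path, text in files.items()}


def semantic_search(index: dict[str, Counter], query: str, top_k: int = 2) -> list[str]:
    """Return the paths whose vectors are closest to the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda path: cosine(q, index[path]), reverse=True)
    return ranked[:top_k]


files = {
    "auth/login.py": "validate user password and create login session",
    "billing/invoice.py": "compute invoice total with tax and discount",
    "db/pool.py": "open database connection pool and retry on timeout",
}
index = build_index(files)
print(semantic_search(index, "where is the user password checked", top_k=1))
# → ['auth/login.py']
```

A learned embedding model would replace `embed` and let queries match files that share meaning but not vocabulary; the indexing and ranking steps would keep the same structure.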
The RL training methodology represents a sophisticated approach to creating domain-specific models. The core process involves:
Rollout Generation: Starting from an initial state (a user query), the model makes a series of tool calls autonomously, deciding whether to execute them serially or in parallel. Multiple rollouts are generated from the same starting point, exploring different tool call sequences and strategies.
Scoring and Comparison: Different rollouts are scored to determine which approaches are more successful at completing the task effectively.
Parameter Updates: The model’s parameters are updated based on which rollouts performed better, reinforcing successful strategies and discouraging less effective ones.
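The scoring and comparison step can be sketched as a group-relative advantage computation. The presentation does not say which estimator Cursor uses, so the mean-and-standard-deviation normalization below is an assumption chosen for illustration, as are the rollout names and the binary rewards.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each rollout relative to its siblings from the same starting point."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four hypothetical rollouts from the same user query, with toy rewards.
rollouts = {
    "read-then-edit": 1.0,
    "edit-immediately": 0.0,
    "search-read-edit": 1.0,
    "gave-up": 0.0,
}
advantages = dict(zip(rollouts, group_advantages(list(rollouts.values()))))

for name, adv in advantages.items():
    direction = "reinforce" if adv > 0 else "discourage"
    print(f"{name:18s} advantage={adv:+.2f} -> {direction}")
```

Rollouts that beat the group average receive a positive advantage and are reinforced; those below it receive a negative one. If all rollouts score the same, every advantage is zero and the group provides no learning signal.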
A key design principle was matching the training environment as closely as possible to the production inference environment. This environment fidelity is crucial in LLMOps - models trained in environments that don’t reflect production reality often exhibit performance degradation or unexpected behaviors when deployed. Cursor went to significant lengths to ensure training rollouts used exactly the same tool formats and tool responses that would be encountered in production.
Training rollouts were realistically complex: the model processed hundreds of thousands to millions of tokens per rollout and made hundreds of tool calls. This scale presents significant challenges for training infrastructure, as different rollouts can take vastly different amounts of time depending on the number and type of tool calls made (for example, rollouts that install packages or libraries take much longer).
The team identified three major challenges that manifested as infrastructure problems rather than pure ML problems:
Challenge 1: Training-Inference Environment Matching
Training a large mixture of experts (MoE) model parallelized across thousands of GPUs presents a speed challenge. The solution involved developing custom kernels that enabled very low precision training. These custom kernels provided approximately 3.5x speedup on Nvidia Blackwell chips specifically for the mixture of experts layers. This not only accelerated training but also made it easier to ship the model to the inference server. The team wrote detailed technical blog posts on these custom kernels for those interested in the implementation details.
Challenge 2: Variable Rollout Complexity and Load Balancing
Since rollouts complete at different times based on the number and type of tool calls made, a naive implementation would have significant idle time with processes waiting for the slowest rollout to complete. The inference server implements sophisticated load balancing across different threads and processes to shift work around dynamically. This ensures that when one rollout makes numerous tool calls or performs time-consuming operations like package installation, other processes aren’t sitting idle waiting.
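The cost of idle time can be shown with a small simulation comparing a fixed up-front assignment of rollouts to workers against a shared queue where each rollout goes to whichever worker frees up first. The durations and worker count are invented, and this greedy queue is only one simple balancing policy, not a description of Cursor's scheduler.

```python
import heapq


def static_makespan(durations: list[float], workers: int) -> float:
    """Each worker gets a fixed round-robin slice of rollouts up front."""
    return max(sum(durations[w::workers]) for w in range(workers))


def dynamic_makespan(durations: list[float], workers: int) -> float:
    """Each rollout goes to whichever worker becomes free first."""
    free_at = [0.0] * workers  # min-heap of times at which each worker is free
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Two slow rollouts (say, package installs) among six fast ones, in seconds.
durations = [30, 2, 2, 2, 30, 2, 2, 2]
workers = 4

print(f"static:  {static_makespan(durations, workers):.0f}s")
print(f"dynamic: {dynamic_makespan(durations, workers):.0f}s")
# → static:  60s
# → dynamic: 32s
```

With the fixed assignment, both slow rollouts land on the same worker while the other three sit idle for most of the run; the shared queue spreads them out and nearly halves the total time.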
Challenge 3: Bursty Compute Patterns and VM Orchestration
Training workloads are extremely bursty - intensive compute happens in concentrated bursts rather than the steady-state traffic patterns typical of production inference. Yet the training environment needed to closely match production behavior. The solution involved building extensive infrastructure to orchestrate a fleet of cloud VMs.
Interestingly, Cursor was simultaneously building their “cloud agents” product, which allows users to run Cursor agents offline from their phone, the web, or by kicking them off from Slack. This product spins up virtual machines in the cloud where each VM loads the user’s code and allows the agent to make file changes, run tools, and edit code in a secure sandbox. This infrastructure turned out to be the perfect foundation for RL training - providing a fleet of cloud VMs that closely match the production Cursor environment.
The team built internal dashboards (using Composer itself) to visualize the many clusters and hundreds of thousands of VMs in the fleet during training operations. This scale of orchestration represents a significant engineering investment beyond the core ML work.
One advantage Cursor emphasizes is having both the IDE product and the model research/training capabilities in-house. This allowed for co-design where the tools built for the product could inform and enhance the training process, and vice versa. The cloud agents product infrastructure directly enabled more effective RL training. The semantic search tool built for the product became a differentiating capability the model learned to leverage effectively during training.
This co-design approach is an important consideration for LLMOps teams: production infrastructure and training infrastructure offer synergies that can be exploited when the two are developed in coordination rather than in isolation.
The team knew RL was working when they observed continuous improvement as they ran more rollouts and applied more compute. The model started at roughly the same performance level as the best open source models and progressively improved toward frontier model performance levels.
Beyond just benchmark improvements, the RL process yielded interesting emergent behaviors and property changes:
Improved Tool Usage Patterns: Early in training, the model made too many edits, sometimes unnecessarily. As training progressed, the model learned better strategies - searching and reading files more thoroughly before attempting edits. This represents the model learning more effective agent behavior patterns rather than just improving at individual tasks.
Faster End-to-End Experience: The model learned to leverage parallel tool calling effectively, which improved the perceived speed of task completion beyond just token generation rates. This shows how RL can optimize for real-world user experience metrics rather than just model-centric metrics.
Tool Specialization: The model became particularly adept at using the semantic search tool since it was trained in the exact environment where that tool was available and relevant. This demonstrates how domain-specific training can create models that are “power users” of specialized tools.
Composer was released as part of Cursor 2.0. Multiple people in the presentation audience had already tried it, which suggests reasonable early adoption. The presenter describes the experience as bringing “joy back to coding with agents” by enabling developers to stay in flow state with synchronous, fast interactions rather than dealing with the frustration of long wait times or constant context switching.
The practical usage pattern that emerged was using the latest frontier models (like GPT-4.5 Codex) for high-level planning and then using Composer to execute those plans - taking the context engineering work and building out the implementation. This represents a tiered approach to production LLM usage where different models with different speed-intelligence tradeoffs serve different roles in the workflow.
The team shared several key insights from the project:
RL Effectiveness for Specialized Models: Reinforcement learning can work surprisingly well for training very specific models when provided with high-quality data and sufficient compute. Cursor isn’t trying to build AGI or general intelligence - they’re focused on building excellent coding models, and RL proved highly effective for this constrained domain.
AI Tools Accelerating AI Development: The team uses Cursor extensively to build Cursor itself, creating a compounding effect where improvements in the tool accelerate further development. They can try more ideas, ship products faster, and iterate on research more quickly because they’re using AI-powered development tools throughout their engineering process.
ML Problems Are Infrastructure Problems: Many challenges in ML training and deployment manifest as infrastructure problems. This parallels experiences in other domains like web frameworks where the “magic moments” often depend as much on the infrastructure and deployment environment as on the framework itself. The training and production environments need to be considered holistically rather than as separate concerns.
While the presentation provides valuable technical insights into building and deploying specialized coding agents, several considerations warrant balanced assessment:
Evaluation Methodology: The benchmarks used are primarily internal and based on Cursor’s own usage patterns. While this makes sense for their specific use case, it makes it difficult to independently verify the performance claims or to understand how the model performs on codebases with different characteristics or in different programming paradigms.
Generalization Concerns: Training a model so specifically on the Cursor environment with Cursor’s specific tools raises questions about generalization. Would this model perform as well in different development environments? Is it overfitted to Cursor’s specific workflows? The tight coupling between model and environment is both a strength (better performance in that environment) and potential limitation (less flexible for other contexts).
Resource Requirements: The infrastructure described - thousands of GPUs for training, hundreds of thousands of VMs for environment simulation, custom kernel development - represents a substantial resource investment that may not be accessible to many organizations looking to apply similar techniques. The case study doesn’t discuss the cost-effectiveness or ROI of this approach.
Incomplete Transparency: While the presentation discusses architecture and approach, many details remain proprietary. The exact model size, specific training data characteristics, detailed benchmark results, and comparative performance metrics against named competitors aren’t fully disclosed. This is understandable from a competitive standpoint but limits the ability to fully assess the claims.
User Experience Claims: Descriptions like bringing “joy back to coding” and solving the “airplane Wi-Fi problem” are subjective user experience claims. While the audience response suggests positive reception, the presentation doesn’t provide quantitative metrics on developer productivity, task completion rates, or user satisfaction scores.
Long-term Maintenance: The presentation doesn’t address how the model will be maintained and updated over time, how it handles new programming languages or frameworks, or what the ongoing operational costs look like.
Despite these considerations, the case study provides valuable insights into practical LLMOps challenges for specialized agent models, particularly around environment simulation, RL training infrastructure, and the importance of latency in user experience. The architectural patterns and infrastructure solutions described can inform similar efforts in other domains requiring agent-based LLM deployments.