Company
Cursor
Title
Online Reinforcement Learning for Code Completion at Scale
Industry
Tech
Year
2025
Summary (short)
Cursor developed a production LLM system called Cursor Tab that predicts developer actions and suggests code completions across codebases, handling over 400 million requests per day. To address the challenge of noisy suggestions that disrupt developer flow, they implemented an online reinforcement learning approach using policy gradient methods that directly optimizes the model to show suggestions only when acceptance probability exceeds a target threshold. This approach required building infrastructure for rapid model deployment and on-policy data collection with a 1.5-2 hour turnaround cycle. The resulting model achieved a 21% reduction in suggestions shown while simultaneously increasing the accept rate by 28%, demonstrating effective LLMOps practices for continuously improving production models using real-time user feedback.
## Overview

Cursor's case study describes their production deployment of Cursor Tab, an LLM-based code completion system that operates at substantial scale, handling over 400 million inference requests per day. This system represents a sophisticated implementation of LLMOps principles, particularly in how they continuously improve their production model through online reinforcement learning based on real user interactions. The case study is notable for its focus on the operational challenges of keeping a high-frequency inference system optimized for user experience rather than just raw accuracy.

The core technical insight is that improving a code completion system isn't solely about making the underlying model more intelligent, but rather about teaching the system when to show suggestions and when to remain silent. This reflects a mature understanding that in production LLM systems, user experience metrics like acceptance rate matter as much as model capabilities. The system runs on every user action (whenever a developer types a character or moves their cursor), making latency and inference efficiency critical operational concerns.

## The Problem: Managing Suggestion Quality at Scale

The fundamental challenge Cursor identified was managing "noisy suggestions": instances where the model suggests code that the user doesn't accept. From an LLMOps perspective, this is a common production challenge: the offline metrics used during model development (like perplexity or exact match accuracy on held-out test sets) don't necessarily translate to good user experience. A model might be technically capable of generating correct code but still frustrate users by suggesting at inappropriate times or with insufficient confidence.

Cursor explicitly wanted to maintain a high accept rate for suggestions, recognizing that low accept rates indicate the system is showing too many incorrect suggestions, which disrupts developer flow. This is a critical production insight: in real-world usage, false positives (bad suggestions) can be more costly than false negatives (missed opportunities to suggest), particularly in a tool used hundreds of times per session.

The company examined prior art, specifically GitHub Copilot's approach circa 2022, which used a separate contextual filter model. That filter employed logistic regression with 11 hand-engineered features (programming language, previous suggestion acceptance/rejection, cursor context, etc.) to predict whether a suggestion would be accepted, suppressing suggestions when this probability fell below 15%. While viable, Cursor sought a more integrated approach that could leverage the rich learned representations already present in their Tab model rather than relying on a separate filtering system with hand-crafted features.
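To make the filter-model baseline concrete, the sketch below shows what a contextual filter of that style could look like: a logistic regression over a few hand-engineered features that suppresses a suggestion whenever the predicted acceptance probability falls below 15%. The feature set, training data, and scikit-learn usage here are illustrative assumptions, not details from GitHub's or Cursor's systems.

```python
# Illustrative sketch of a Copilot-style contextual filter (not actual GitHub/Cursor code).
# Features, data, and the threshold are hypothetical stand-ins for the ~11 real features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per request:
# [is_python, prev_suggestion_accepted, seconds_since_last_accept, cursor_in_comment]
X = np.array([
    [1, 1,  10, 0],
    [1, 0, 300, 1],
    [0, 1,  45, 0],
    [0, 0, 800, 1],
    [1, 1,   5, 0],
    [0, 0, 600, 0],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = suggestion was accepted, 0 = rejected

filter_model = LogisticRegression().fit(X, y)

def should_show_suggestion(features, threshold=0.15):
    """Suppress the suggestion when predicted acceptance probability is below 15%."""
    p_accept = filter_model.predict_proba([features])[0, 1]
    return p_accept >= threshold

print(should_show_suggestion([1, 1, 8, 0]))
```

Cursor's objection to this style of system is visible even in the toy version: the filter sits outside the completion model and sees only coarse features, whereas the Tab model itself already holds a much richer representation of the context.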
## Technical Solution: Policy Gradient Methods

Cursor's solution was to frame the code completion problem as a reinforcement learning task and apply policy gradient methods. This is an architecturally significant decision from an LLMOps perspective because it fundamentally changes how the model is optimized. Rather than training on static datasets with supervised learning objectives, they treat the model as a policy that takes actions (showing suggestions or showing nothing) in states (the current codebase context), with the goal of maximizing expected reward.

The reward function is carefully designed to align with their product goals. In the simplified example they provide, accepted suggestions receive a reward of +0.75, rejected suggestions receive -0.25, and showing nothing receives 0. This reward structure mathematically ensures that the model will only show suggestions when it estimates the acceptance probability exceeds 25% (since 0.75p − 0.25(1−p) > 0 exactly when p > 0.25). The actual production reward function is more sophisticated, accounting for suggestion size and the possibility of suggesting at multiple code locations, but the core principle remains: the reward function encodes the desired behavior rather than requiring explicit modeling of acceptance probability.

The mathematical foundation relies on the Policy Gradient Theorem, which provides a way to compute gradients of expected reward with respect to model parameters. The key insight is that ∇θ J(θ) = E[∇θ log π(a|s,θ) · R(s,a)], where the expectation is taken over states s sampled from the real distribution of codebase contexts and actions a sampled from the current policy. This is computationally tractable because:

- States come naturally from user requests in production
- Actions (suggestions shown) come from the model's own inference
- Gradients ∇θ log π(a|s,θ) can be computed with standard deep learning frameworks like PyTorch
- Rewards R(s,a) are observed directly through user acceptance/rejection behavior

From an implementation perspective, this likely involves maintaining log probabilities during inference, storing the suggestion contexts and model states, and later computing gradient estimates once user feedback is available.

## The On-Policy Data Challenge

The critical LLMOps challenge that makes this approach unusual is the requirement for on-policy data. The Policy Gradient Theorem requires that actions be sampled from the current policy being optimized, not from some previous version. Once the model parameters are updated through gradient descent, all the previously collected data becomes "off-policy": it was generated by a different version of the model and can no longer provide unbiased gradient estimates.

This creates a fundamental operational requirement: to continue training, Cursor must deploy updated models to production, collect fresh interaction data from real users, and use that data for the next training iteration. This is drastically different from typical LLM development workflows, where models are trained on static datasets or with paid human labelers, then deployed only when major version releases occur (perhaps every few months).

Cursor's solution required building sophisticated infrastructure to support rapid iteration cycles. They report achieving a 1.5-2 hour turnaround time from deploying a model checkpoint to collecting sufficient on-policy data for the next training step. While they acknowledge this is "fast relative to what is typical in the AI industry," they also note room for improvement; indeed, for policy gradient methods to work optimally, faster cycles are better as they reduce the amount of wasted compute on near-off-policy data.
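As a concrete illustration of the reward scheme and the gradient estimate above, here is a minimal, self-contained sketch using PyTorch. It is a toy reconstruction based on the simplified rewards quoted in the case study; the tensor values, function names, and batch format are assumptions, not Cursor's implementation.

```python
# Toy REINFORCE-style update using the simplified rewards described above
# (+0.75 accepted, -0.25 rejected, 0 when nothing is shown). Not Cursor's code;
# names, values, and batch format are hypothetical.
import torch

def reward(action, accepted=None):
    if action == "show_nothing":
        return 0.0
    # Expected reward for showing is 0.75p - 0.25(1-p) = p - 0.25,
    # which is positive only when p(accept) > 0.25.
    return 0.75 if accepted else -0.25

def policy_gradient_loss(log_probs, rewards):
    # Monte Carlo estimate of -E[log pi(a|s) * R(s, a)]; its gradient matches
    # the Policy Gradient Theorem expression given in the text.
    return -(log_probs * rewards).mean()

# One hypothetical on-policy batch: the log-probability the deployed policy assigned
# to the action it actually took, paired with the user feedback observed later.
log_probs = torch.tensor([-0.4, -1.2, -0.1], requires_grad=True)
rewards = torch.tensor([
    reward("show_suggestion", accepted=True),    # +0.75
    reward("show_suggestion", accepted=False),   # -0.25
    reward("show_nothing"),                      #  0.0
])

loss = policy_gradient_loss(log_probs, rewards)
loss.backward()   # in production these gradients would flow into the Tab model's parameters
print(loss.item(), log_probs.grad)
```

The log probabilities must come from the same checkpoint that served the suggestion, which is exactly the on-policy constraint discussed above and the reason the rapid-iteration infrastructure described next matters so much.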
This rapid-iteration infrastructure likely encompasses several components:

- **Model deployment infrastructure**: Systems to push new model checkpoints to production serving rapidly, potentially with gradual rollouts or canary deployments to manage risk
- **Data collection pipelines**: Real-time or near-real-time systems to capture user interactions, including the full context needed to reconstruct the state, the suggestion shown, the model's internal representations or log probabilities, and the user's response
- **Data processing**: Systems to transform raw interaction logs into training-ready datasets, potentially including filtering for data quality, computing rewards, and preparing batches
- **Training orchestration**: Automated pipelines that trigger training jobs once sufficient on-policy data is collected, likely with monitoring to ensure training stability
- **Quality gates**: Evaluation systems to ensure new checkpoints meet quality standards before deployment

The operational complexity here is significant. Unlike traditional A/B testing, where you might compare two static models, this approach requires treating deployment and training as a continuous loop. The model is essentially learning from its own deployment in production, creating feedback loops that could amplify both improvements and problems.

## Production Scale and Operational Considerations

Operating at 400 million requests per day presents substantial infrastructure challenges. Assuming even distribution, this is roughly 4,600 requests per second, though real usage likely has significant peaks. Each request requires:

- Loading the relevant code context (potentially spanning multiple files in the codebase)
- Running inference on the Tab model
- Computing confidence estimates or log probabilities
- Deciding whether to show a suggestion
- Serving the response with low latency (since it must feel instantaneous to users)

The latency requirements are particularly stringent: users notice delays beyond roughly 100ms, and the system must run on every keystroke. This likely requires careful optimization of model serving infrastructure, potentially including:

- Model quantization or distillation to reduce inference costs
- Intelligent caching strategies for code context
- Distributed inference infrastructure with geographic distribution
- Efficient batching strategies that balance latency and throughput

From a monitoring perspective, Cursor must track numerous metrics:

- **User experience metrics**: Accept rate, suggestion frequency, user satisfaction signals
- **Model performance metrics**: Inference latency, throughput, resource utilization
- **Data quality metrics**: Volume of on-policy data collected, distribution shifts in user behavior
- **Training metrics**: Gradient norms, reward trends, training stability indicators
- **Deployment health**: Rollout progress, error rates, system availability

The acceptance rate metric is particularly important because it serves dual purposes: it is both the primary user experience metric and a core component of the training signal. This creates an interesting dynamic in which improving the metric through model updates also changes the data distribution for future training.
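Pulling the pieces above together (rapid deployment, on-policy data collection, training orchestration, and quality gates), the continuous loop might look something like the following skeleton. Every function here is a hypothetical stub standing in for real serving, logging, and training infrastructure; the case study does not describe Cursor's actual pipeline code.

```python
# Skeletal deploy -> collect -> train -> gate loop. All functions are hypothetical
# stubs standing in for real serving, logging, and training infrastructure.

def deploy(checkpoint):
    print(f"serving checkpoint {checkpoint}")

def collect_on_policy_data(checkpoint, min_examples):
    # In production this would block for roughly 1.5-2 hours until enough fresh
    # interactions (context, action, log-prob, accept/reject) have been logged.
    return [{"log_prob": -0.4, "reward": 0.75}] * min_examples

def train_policy_gradient_step(checkpoint, batch):
    return checkpoint + 1  # stand-in for one policy gradient update on the batch

def passes_quality_gate(checkpoint):
    return True  # stand-in for offline evals and health checks before promotion

def continuous_training_loop(checkpoint, iterations=3, min_examples=10):
    for _ in range(iterations):
        deploy(checkpoint)
        batch = collect_on_policy_data(checkpoint, min_examples)
        candidate = train_policy_gradient_step(checkpoint, batch)
        if passes_quality_gate(candidate):
            checkpoint = candidate  # promote and start the next cycle
        # otherwise keep serving the previous checkpoint and investigate
    return checkpoint

continuous_training_loop(checkpoint=0)
```

The interesting failure modes live in the gate: because the model trains on data produced by its own deployment, a bad promotion also changes the data it will learn from next.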
## Results and Balanced Assessment

Cursor reports that their new model makes 21% fewer suggestions while achieving a 28% higher accept rate. These are substantial improvements that suggest the approach is working as intended. However, it's worth considering what these metrics actually measure and their limitations.

The 21% reduction in suggestions shown indicates the model has learned to be more selective, which aligns with the goal of reducing noisy suggestions. The 28% increase in accept rate (not specified whether this is absolute or relative) suggests that when suggestions are shown, they're more likely to be accepted. If we assume the baseline accept rate was around 30% (a reasonable estimate for code completion systems), a 28% relative increase would bring it to about 38-40%, which is a meaningful improvement in user experience.

However, there are important caveats to consider:

**Selection effects**: By showing fewer suggestions, the model naturally increases its accept rate even if suggestion quality remains constant, since it's filtering out the lowest-confidence cases. The reported metrics don't tell us whether the suggestions themselves became better, or if the model simply became better at knowing when not to suggest. Both are valuable, but they represent different types of improvement.

**Counterfactual effects**: We don't know what would have happened if the model had shown suggestions in cases where it now remains silent. Perhaps users would have accepted some of those, or perhaps seeing them would have inspired users to write better code. The absence of suggestions isn't neutral; it might slow users down or cause them to miss opportunities.

**Long-term effects**: The case study doesn't address whether these improvements are stable over time or whether they might degrade as user behavior adapts to the new model. If users learn to rely more heavily on suggestions, they might become less tolerant of errors, creating a moving target for the accept rate metric.

**Comparison baseline**: We don't know what the previous model was or how it was trained. The improvements might be partly attributable to the online RL approach, but they could also reflect other changes like better base models, more training data, or improved features.

**Generalization**: The results are reported in aggregate, but code completion performance likely varies significantly across programming languages, project types, and user skill levels. The average metrics might hide important distributional effects.

From a methodological standpoint, the policy gradient approach is theoretically sound but comes with well-known challenges. The variance of policy gradient estimates can be high, potentially requiring variance reduction techniques like baselines or advantage estimation (though the case study doesn't mention whether these are used). The approach is also sample-inefficient compared to off-policy methods, requiring substantial amounts of fresh interaction data, though at Cursor's scale this may not be a limiting factor.

## Comparison to Alternative Approaches

It's instructive to consider what alternatives Cursor could have pursued:

**Contextual bandit methods**: Rather than full reinforcement learning, they could have used simpler contextual bandit algorithms that don't consider sequential decision-making. This would be computationally cheaper and potentially more sample-efficient, though it wouldn't capture the full problem structure if showing one suggestion affects future opportunities.

**Off-policy RL methods**: Techniques like importance sampling or doubly-robust estimation could potentially allow learning from historical data without requiring on-policy samples. This would reduce infrastructure complexity but might introduce bias or increase variance.
**Supervised learning with careful data curation**: They could have continued with supervised learning but carefully curated the training data to include only high-quality examples, potentially using the acceptance data to weight examples. This would avoid the complexity of online learning but might not converge to as good a policy.

**Learned confidence calibration**: Rather than changing the model itself, they could have trained the model to output well-calibrated confidence scores, then used simple thresholding. This separates the "what to suggest" and "when to suggest" problems, which might be easier to debug and maintain.

The choice of policy gradient methods suggests Cursor values the tight integration between suggestion generation and filtering, and has the operational maturity to handle continuous deployment and retraining. The approach is ambitious but appears justified by their scale and engineering capabilities.

## LLMOps Maturity Indicators

This case study reveals several indicators of LLMOps maturity:

**Tight integration of deployment and training**: The ability to close the loop from deployment to data collection to retraining to re-deployment within hours shows sophisticated MLOps infrastructure.

**User feedback as training signal**: Rather than relying solely on offline evaluation, they've instrumented their production system to collect high-quality behavioral feedback and use it directly for model improvement.

**Focus on operational metrics**: The emphasis on accept rate and user experience rather than just model accuracy shows product-oriented thinking about LLM deployment.

**Scale of deployment**: Handling 400+ million daily requests requires production-grade serving infrastructure with appropriate monitoring, alerting, and reliability engineering.

**Willingness to iterate publicly**: Publishing this detailed technical account, including specific metrics and turnaround times, suggests confidence in their approach and a culture of learning and sharing.

However, there are also areas where the case study leaves open questions:

- No discussion of A/B testing methodology or how improvements are validated before full rollout
- Limited detail on safety considerations or guardrails to prevent the RL process from optimizing into problematic local minima
- No mention of computational costs or resource requirements for the continuous training pipeline
- No discussion of model versioning, rollback procedures, or debugging degradations

## Broader Implications

This case study represents an interesting direction for LLM deployment that diverges from the prevailing paradigm of large, infrequently updated foundation models. Instead of deploying GPT-4 or Claude and leaving it static, Cursor is continuously updating their specialized model based on domain-specific feedback. This reflects a broader tension in the field between foundation model scaling and task-specific optimization.

The approach is likely only feasible at certain scales. With too little traffic, on-policy data collection would be too slow; with too much, the infrastructure costs might be prohibitive. Cursor seems to have found a sweet spot: enough users to generate substantial feedback quickly, but a focused enough use case (code completion) to keep the optimization problem tractable.
For other organizations considering similar approaches, key prerequisites would include:

- Sufficient scale of user interactions to generate meaningful training data quickly
- Low-latency deployment infrastructure to enable rapid iteration
- Strong observability and monitoring to detect issues early
- Clear, measurable user feedback signals (like accept/reject) that can serve as training signals
- An engineering culture comfortable with continuous deployment and experimentation

The case study ultimately demonstrates that LLMOps for production systems involves much more than just serving predictions: it requires sophisticated infrastructure for continuous learning, careful design of objective functions that align with user experience, and the operational maturity to safely iterate on models in production based on real-world feedback.
