Company
Cursor
Title
Building a Next-Generation AI-Enhanced Code Editor with Real-Time Inference
Industry
Tech
Year
2023
Summary (short)
Cursor built a modern AI-enhanced code editor by forking VS Code and incorporating advanced LLM capabilities. Their approach focused on creating a more responsive and predictive coding environment that goes beyond simple autocompletion, using techniques like mixture of experts (MoE) models, speculative decoding, and sophisticated caching strategies. The editor aims to eliminate low-entropy coding actions and predict developers' next actions, while maintaining high performance and low latency.
## Overview

Cursor is an AI-powered code editor built as a fork of VS Code, designed to provide significantly enhanced AI-assisted programming capabilities. The founding team—Michael Truell, Sualeh Asif, Arvid Lunnemark, and Aman Sanger—recognized early on that the scaling laws emerging from OpenAI's research in 2020 would lead to increasingly capable models, and that programming environments would need to fundamentally evolve to take advantage of these capabilities. Rather than building extensions on top of existing editors (which would limit their control over the user experience), they chose to fork VS Code and build a comprehensive AI-native editing experience.

The team's journey began with observing GitHub Copilot's success in 2021 as the first major LLM consumer product, but they felt frustrated that despite models getting significantly better (particularly with GPT-4 access in late 2022), the coding experience wasn't evolving to match. This motivated them to build Cursor with a philosophy of rapid iteration and deep integration between model capabilities and user experience.

## Custom Model Training and Specialization

A core principle of Cursor's approach is that they don't rely solely on frontier models. Instead, they train and deploy an ensemble of custom models specialized for specific tasks, combined with frontier models for reasoning-intensive operations.

### Cursor Tab Model

The Cursor Tab feature represents one of their most sophisticated custom models. Unlike traditional autocomplete, which predicts characters after the cursor, Cursor Tab aims to predict the next complete edit the user will make, including:

- Multi-line code changes
- The next location in the file to jump to
- Eventually, jumps to different files and terminal commands

The model is designed to eliminate "low-entropy actions"—keystrokes that are highly predictable given the current context. The team describes this as making programming feel like the AI is "reading your mind" for the zero-entropy bits of your work. Training this model involves:

- Using a Mixture of Experts (MoE) architecture, which is particularly well-suited because the input prompts are very long (seeing lots of code context) while the output is relatively short
- Reinforcement learning to predict which of many possible suggestions humans will prefer
- The model predicts around 100 different suggestions and uses RL to learn which suggestions are more amenable to human preferences

### Apply Model

A critical discovery was that while frontier models are good at sketching out code changes and generating rough plans, they struggle with the seemingly simple task of actually applying those changes to existing code. Tasks like counting line numbers accurately in large files trip up even the best models like Sonnet and o1. The Apply model is specifically trained to take a rough code sketch from a frontier model and accurately implement it as a diff to the existing file. This separation of concerns allows them to use fewer tokens with the most intelligent models (reducing latency and cost) while delegating the implementation details to specialized models.

## Inference Optimization Techniques

Speed is considered fundamental to the product experience—"fast is fun," as the team puts it. Several sophisticated techniques are employed to achieve low latency.

### KV Cache Management

The KV (Key-Value) cache is central to their inference optimization strategy. When processing prompts with transformers, reusing computed keys and values from previous tokens avoids redundant forward passes through the model. Cursor implements:

- **Cache warming**: As the user types, they proactively warm the cache with current file contents, so when the user presses Enter, very few tokens need to be computed before generation begins
- **Caching-aware prompt design**: Prompts are structured to maximize cache hit rates across requests
- **Speculative caching**: Predicting what the user might accept and pre-computing the next suggestions

### Speculative Edits

This is a variant of speculative decoding tailored for code editing. Traditional speculative decoding uses a small draft model to predict tokens that a larger model then verifies. For code edits, Cursor leverages a strong prior: most of the output will be the same as the existing code. The technique works by:

- Feeding chunks of the original code back into the model as the draft
- The model agrees most of the time that it should output the same code
- Processing these lines in parallel rather than generating one token at a time
- Only generating new tokens at points of disagreement with the original code
- This results in much faster streaming of diffs, allowing users to start reviewing before generation completes (a simplified version of the loop is sketched below)
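To make the loop concrete, here is a minimal Python sketch of the idea. The `verify` callable, the token-level granularity, and the chunk size are illustrative assumptions rather than details of Cursor's implementation, and a production system would also re-align the draft after insertions and deletions rather than using the naive advance shown here:

```python
from typing import Callable, Sequence

# Stand-in for one batched forward pass of the edit model: given the tokens
# generated so far plus a chunk of draft tokens, it returns the model's greedy
# prediction at every draft position (computed in parallel on real hardware).
VerifyFn = Callable[[Sequence[str], Sequence[str]], list[str]]


def speculative_edit(
    original: Sequence[str],   # tokens of the existing code, reused as the draft
    verify: VerifyFn,
    chunk_size: int = 16,
    max_new_tokens: int = 1024,
) -> list[str]:
    output: list[str] = []
    pos = 0  # position in the original-code draft
    while pos < len(original) and len(output) < max_new_tokens:
        draft = list(original[pos:pos + chunk_size])
        predictions = verify(output, draft)
        # Accept draft tokens for as long as the model agrees with the original code.
        accepted = 0
        for drafted, predicted in zip(draft, predictions):
            if drafted != predicted:
                break
            accepted += 1
        output.extend(draft[:accepted])
        if accepted == len(draft):
            pos += accepted            # whole chunk matched; keep going
        else:
            # First disagreement: the model wants a real edit here, so take its
            # token instead; a full implementation would then re-align the draft
            # pointer against the rest of the original code.
            output.append(predictions[accepted])
            pos += accepted + 1
    return output
```

The speedup comes from accepting long runs of agreement with the original file per forward pass instead of per generated token, which is what allows diffs to stream fast enough that users can start reviewing while generation is still running.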
### Efficient Attention Mechanisms

The team discusses various attention optimizations:

- **Multi-Query Attention (MQA)**: Reduces the number of key-value heads to just one while preserving query heads, dramatically shrinking KV cache size
- **Grouped Query Attention (GQA)**: A less aggressive variant that maintains some separate key-value heads
- **Multi-head Latent Attention (MLA)**: A technique from DeepSeek that compresses all keys and values into a single latent vector per token, expanding it only at computation time

These techniques are particularly important for handling large batch sizes and long contexts without degrading generation speed, as the bottleneck shifts from compute to memory bandwidth.

## Context and Retrieval Systems

### Code Base Indexing

Cursor builds a semantic index of entire codebases to enable context-aware assistance. The architecture involves:

- Chunking all code and sending it for embedding
- Storing embeddings in a vector database (without storing any actual code on their servers, for privacy)
- Using a Merkle tree structure for efficient synchronization between local and server state—only reconciling when root hashes disagree, then drilling down hierarchically (sketched below)
- Clever caching of embeddings by chunk hash, so when the nth person at a company indexes the same codebase, previously computed embeddings are reused

The team notes that embedding is the cost bottleneck, not storage, which influenced their caching strategy.
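The synchronization and caching ideas above can be sketched as follows. This is not Cursor's code: the `Node` layout, the `stale_paths` walk, and the `embed_fn` placeholder are assumptions for illustration, but they show why only changed subtrees are reconciled and why identical chunks are embedded only once:

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Node:
    """A Merkle-tree node over a codebase: files are leaves hashed over their
    chunk hashes; directories are hashed over their children's digests."""
    name: str
    chunk_hashes: list[str] = field(default_factory=list)      # set for files
    children: dict[str, "Node"] = field(default_factory=dict)  # set for directories

    def digest(self) -> str:
        if self.children:
            parts = [f"{name}:{child.digest()}" for name, child in sorted(self.children.items())]
        else:
            parts = self.chunk_hashes
        return sha("\n".join(parts).encode())


def stale_paths(local: Node, remote: Node, path: str = "") -> list[str]:
    """Compare the client's tree against the server's: descend only into
    subtrees whose digests disagree, and return the file paths to re-index."""
    if local.digest() == remote.digest():
        return []                                   # subtree already in sync
    if not local.children:                          # a file whose content changed
        return [path or local.name]
    stale: list[str] = []
    for name, child in local.children.items():
        counterpart = remote.children.get(name, Node(name))
        stale += stale_paths(child, counterpart, f"{path}/{name}".lstrip("/"))
    return stale


# Embeddings are cached by chunk hash, so identical chunks indexed earlier by
# someone else never hit the embedding endpoint again.
embedding_cache: dict[str, list[float]] = {}


def embed_chunk(chunk: str, embed_fn: Callable[[str], list[float]]) -> list[float]:
    key = sha(chunk.encode())
    if key not in embedding_cache:
        embedding_cache[key] = embed_fn(chunk)      # the expensive call
    return embedding_cache[key]
```

Because embedding, not storage, is the expensive step, keying the cache on chunk hashes means the nth engineer to index the same repository mostly reads from cache rather than re-embedding the codebase.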
### Prompt Engineering Framework: Preempt

Cursor developed an internal system called "Preempt," inspired by React's declarative approach. Prompts are written using JSX-like components where:

- Each component (e.g., a file component) can be assigned a priority
- The "renderer" automatically fits content to the available context window
- For example, lines near the cursor get the highest priority, with priority decaying with distance
- Retrieval scores from embeddings and rerankers can feed into component priorities

This approach separates the raw data from prompt rendering, making it easier to debug, iterate on prompt templates, and evaluate changes across historical data (a simplified illustration of the priority-based fitting follows below).
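Preempt itself is internal to Cursor, but the kind of priority-based fitting it performs can be illustrated with a small sketch. The `Component` fields, the crude whitespace token counter, and the greedy keep-highest-priority strategy below are illustrative assumptions, not the actual renderer:

```python
from dataclasses import dataclass


@dataclass
class Component:
    text: str
    priority: float   # e.g., decays with distance from the cursor, or a reranker score
    order: int        # position the component should occupy in the final prompt


def count_tokens(text: str) -> int:
    return len(text.split())   # crude stand-in for a real tokenizer


def render(components: list[Component], budget: int) -> str:
    """Keep the highest-priority components that fit the token budget, then
    emit the survivors in their original document order."""
    kept, used = [], 0
    for comp in sorted(components, key=lambda c: c.priority, reverse=True):
        cost = count_tokens(comp.text)
        if used + cost <= budget:
            kept.append(comp)
            used += cost
    return "\n".join(c.text for c in sorted(kept, key=lambda c: c.order))


# Lines near the cursor get the highest priority; a retrieved snippet competes
# for the remaining budget, and distant code is dropped when space runs out.
prompt = render(
    [
        Component("# retrieved snippet from another file (reranker score 0.7)", priority=7, order=0),
        Component("# distant helper code in the same file", priority=3, order=1),
        Component("def handler(request):  # line under the cursor", priority=10, order=2),
    ],
    budget=20,
)
```

Because priorities are data rather than hard-coded string templates, the same components can be re-rendered against different context windows or replayed over historical requests when evaluating prompt changes.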
## Model Selection and Evaluation

The team provides a candid assessment of different frontier models:

- **Claude Sonnet**: Currently considered the strongest model overall for coding—it maintains its capabilities well outside of benchmark distributions
- **GPT-4/4o**: Very capable, but may be more benchmark-optimized
- **o1**: Excellent at reasoning-intensive problems like competitive programming, but doesn't understand user intent as well as Sonnet and has limitations such as not streaming

They express skepticism about public benchmarks due to:

- Contamination in training data (e.g., SWE-Bench problems appearing in pre-training)
- Mismatch between well-specified, interview-style problems and real, messy programming tasks
- Models performing well on benchmarks but not generalizing to similar out-of-distribution tasks

Instead, the team relies heavily on "vibe checks"—qualitative human evaluation of model outputs—which they acknowledge is imperfect but often more reliable than benchmarks.

## UX Innovation for AI-Assisted Coding

### Diff Interfaces

The team has iterated extensively on how to display AI-suggested changes:

- Started with Google Docs-style strikethroughs (too distracting)
- Tried blue highlights with an Option-key reveal (not intuitive)
- Currently use side-by-side boxes for showing changes

They envision needing 4-5 different diff interfaces optimized for different contexts (autocomplete vs. large-block review vs. multi-file changes). Future ideas include:

- Highlighting important parts of diffs while graying out low-entropy changes
- Using models to detect likely bugs and mark them for review
- Intelligent ordering of file reviews based on logical dependencies

### The Verification Problem

As models propose larger and larger changes, human verification becomes increasingly burdensome. The team is actively researching ways to assist with this, including using AI to prioritize which parts of a diff actually need careful review.

## Infrastructure at Scale

Running Cursor at scale on AWS has presented numerous challenges:

- Database tables approaching overflow as usage grew
- The retrieval/indexing system being particularly tricky to scale for large enterprise codebases
- Handling companies with 20+ years of accumulated code across many programmers

The team emphasizes that predicting where systems will break under scale is extremely difficult—there's always something unexpected.

## Future Directions

### Agents and Background Processing

The team is excited about agents but notes they're "not yet super useful for many things." They envision:

- Agents handling well-specified bugs autonomously (find, reproduce, fix, verify)
- Background agents working on parts of a PR while the user works on other parts
- Shadow Workspaces: hidden editor windows where AI can modify code and get feedback from linters and language servers without affecting the user

### Test-Time Compute (o1-style)

The team discusses the potential of test-time compute scaling:

- It allows spending more inference compute only on the hardest problems
- It potentially involves process reward models for tree search over chains of thought
- It is still early days—they haven't found compelling integration patterns for o1 in Cursor yet

### Synthetic Data and Training Strategies

The team outlines three categories of synthetic data:

- **Distillation**: Training smaller models on outputs from larger models
- **Inverse problems**: Exploiting cases where one direction is easier than the other (e.g., introducing bugs is easier than finding them)
- **Verified generation**: Having models generate lots of outputs and training only on verified correct ones (works well for math and code, where verification is tractable)

They're bullish on distillation as a way to create capable, fast models for specific tasks without hitting the data wall.

## Philosophy on Human-AI Collaboration

The team strongly believes in keeping humans in the driver's seat rather than moving to pure chatbot-style interfaces. Their reasoning:

- Much of programming involves micro-decisions and tradeoffs that are hard to specify in natural language
- Sometimes showing an example is easier than explaining what you want
- The goal is to amplify human judgment and creativity, not replace it

They envision a future where programmers can fluidly move up and down abstraction levels—editing pseudocode that gets compiled to real code, or diving into implementation details when needed—while maintaining control over all decisions.
