## Overview
Cursor, a company building AI-powered coding tools, developed and deployed a production system that enhances their AI coding agent with custom semantic search capabilities. This case study from November 2025 provides insights into how they approached a core LLMOps challenge: improving agent performance in production through better retrieval mechanisms. The agent, which assists developers in understanding and modifying codebases, needed to answer natural language queries like "where do we handle authentication?" by retrieving relevant code segments across potentially large and complex codebases.
The case study is notable for its comprehensive approach to both offline and online evaluation, its investment in custom retrieval models trained on agent behavior, and its transparent reporting of incremental but meaningful improvements. While the source is a research blog post from Cursor and naturally emphasizes positive results, it provides concrete metrics and methodological detail that offer valuable insight into production LLMOps practices for agentic systems.
## The Production Problem
At the core of this case study is a fundamental challenge in deploying coding agents: how to help an LLM-based system efficiently navigate and understand codebases to provide accurate assistance. Traditional command-line tools like grep provide regex-based searching, which works well for exact pattern matching but struggles with semantic understanding. When a developer asks conceptual questions or needs to find functionality described in natural language rather than exact code patterns, traditional search falls short.
This becomes particularly acute in large codebases where understanding the broader context and locating relevant code sections quickly is essential for agent effectiveness. The problem manifests in several ways: agents may provide incorrect answers due to missing context, require multiple iterations to arrive at correct solutions, or generate code changes that developers later discard because they don't properly integrate with the existing codebase.
## Technical Solution Architecture
Cursor's solution centers on building a custom semantic search system specifically optimized for their coding agent's needs. The architecture consists of three main components: a custom-trained embedding model, fast indexing pipelines, and integration into the agent's tool ecosystem alongside traditional search methods.
### Custom Embedding Model
Rather than using off-the-shelf embedding models, Cursor trained their own specialized embeddings. The distinctive aspect of their approach lies in how they generated training data. They leveraged agent session traces—records of how agents actually work through coding tasks in production. When an agent tackles a task, it performs multiple searches, opens various files, and eventually locates the right code. By analyzing these traces retrospectively, the team could identify what content should have been retrieved earlier in the conversation to accelerate the agent's work.
The training methodology involves a feedback loop with an LLM acting as a ranking mechanism. They provide the agent session traces to an LLM, which evaluates and ranks which content would have been most helpful at each step of the agent's problem-solving process. The embedding model is then trained to align its similarity scores with these LLM-generated rankings. This creates a system where the embeddings learn from actual agent behavior patterns rather than relying solely on generic code similarity metrics.
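Cursor doesn't publish the training objective, but the described setup, where embedding similarities are trained to match LLM-generated rankings of trace content, resembles ranking distillation. A minimal sketch under that assumption is shown below in PyTorch; the KL-style loss, temperature, and score format are illustrative guesses rather than Cursor's actual implementation.

```python
import torch
import torch.nn.functional as F

def ranking_alignment_loss(query_emb, doc_embs, llm_scores, temperature=0.05):
    """Align embedding similarities with an LLM judge's helpfulness ranking
    for one step of an agent trace.

    query_emb:  (d,)   embedding of the agent's search intent at this step
    doc_embs:   (n, d) embeddings of candidate code chunks from the trace
    llm_scores: (n,)   retrospective helpfulness scores from the judge LLM
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), doc_embs) / temperature
    target = F.softmax(llm_scores, dim=0)          # judge ranking as a distribution
    # KL divergence pushes the model's similarity distribution toward the target.
    return F.kl_div(F.log_softmax(sims, dim=0), target, reduction="sum")

# Toy usage with random tensors standing in for real model outputs.
q = torch.randn(768, requires_grad=True)
docs = torch.randn(16, 768)
scores = torch.rand(16)
ranking_alignment_loss(q, docs, scores).backward()  # gradients flow to the embedder
```

In a real training loop, `query_emb` and `doc_embs` would come from the embedding model being trained, and an optimizer step would follow the backward pass.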
This approach is pragmatic from an LLMOps perspective—it uses the production system's own behavior as a source of training signal, creating a virtuous cycle where improved retrieval leads to better agent traces, which in turn can improve future retrieval model iterations. However, it's worth noting that the case study doesn't detail potential challenges like handling distribution shift as agent capabilities evolve, or how they avoid reinforcing suboptimal patterns that might exist in early agent traces.
### Indexing and Retrieval Pipeline
To support fast semantic search at scale, Cursor built indexing pipelines that can handle large codebases efficiently. While the case study doesn't provide extensive architectural details, the emphasis on "fast retrieval" suggests they've invested in production infrastructure for vector similarity search—likely involving approximate nearest neighbor search algorithms, though specific technologies aren't mentioned.
The integration into production requires maintaining synchronized indexes as codebases change, handling potentially millions of code segments across customer projects, and serving low-latency retrieval results to keep the agent responsive. These are non-trivial LLMOps challenges that require careful attention to infrastructure, monitoring, and reliability.
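The post doesn't describe the index itself, but the retrieval step it implies is straightforward to sketch. The toy class below uses exact cosine search over normalized vectors; a production deployment would swap in an approximate-nearest-neighbor library and handle persistence, sharding, and incremental updates.

```python
import numpy as np

class CodeChunkIndex:
    """Toy in-memory vector index over code chunks.

    Cursor's infrastructure isn't described in detail; a production system
    would use an approximate-nearest-neighbor library (FAISS, HNSW, etc.)
    and keep the index synchronized with the codebase. This exact-search
    version only illustrates the retrieval step itself.
    """

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.chunk_ids: list[str] = []

    def add(self, chunk_id: str, embedding: np.ndarray) -> None:
        vec = embedding.astype(np.float32)
        vec = vec / (np.linalg.norm(vec) + 1e-8)   # normalize for cosine scoring
        self.vectors = np.vstack([self.vectors, vec])
        self.chunk_ids.append(chunk_id)

    def search(self, query_embedding: np.ndarray, k: int = 5) -> list[tuple[str, float]]:
        q = query_embedding.astype(np.float32)
        q = q / (np.linalg.norm(q) + 1e-8)
        scores = self.vectors @ q                  # cosine similarity via dot product
        top = np.argsort(-scores)[:k]
        return [(self.chunk_ids[i], float(scores[i])) for i in top]
```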
### Hybrid Search Strategy
An important aspect of their production deployment is that semantic search doesn't replace traditional tools but complements them. The agent has access to both grep-based searching and semantic search simultaneously. The conclusion explicitly states that "our agent makes heavy use of grep as well as semantic search, and the combination of these two leads to the best outcomes."
This hybrid approach reflects a mature LLMOps perspective—recognizing that different search modalities have different strengths and that production systems often benefit from multiple complementary capabilities rather than single "silver bullet" solutions. Traditional search excels at exact pattern matching and is deterministic, while semantic search handles conceptual queries better but may be less precise for specific syntax patterns.
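The tool names and schemas Cursor exposes to its agent aren't published; the sketch below shows one plausible way two complementary search tools could sit side by side in an agent's tool set, with `grep_search`, `semantic_search`, and the manifest format all hypothetical.

```python
import subprocess

def grep_search(pattern: str, path: str = ".") -> str:
    """Deterministic, exact pattern matching (here via ripgrep, assumed installed)."""
    result = subprocess.run(["rg", "--line-number", pattern, path],
                            capture_output=True, text=True)
    return result.stdout

def semantic_search(query: str, index, embed_fn, k: int = 5) -> list[str]:
    """Conceptual lookup: embed a natural-language query and search a vector
    index of code chunks (see the index sketch earlier in this write-up)."""
    return [chunk_id for chunk_id, _ in index.search(embed_fn(query), k)]

# Hypothetical tool manifest handed to the agent. The model picks the tool that
# fits the request: exact identifiers suit grep, conceptual questions suit
# semantic search, and many tasks benefit from both.
AGENT_TOOLS = {
    "grep_search": {
        "description": "Regex search for exact symbols, strings, or syntax patterns.",
        "parameters": {"pattern": "regex to match", "path": "directory to search"},
    },
    "semantic_search": {
        "description": "Find code by meaning, e.g. 'where do we handle authentication?'",
        "parameters": {"query": "natural-language description", "k": "number of results"},
    },
}
```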
## Evaluation Framework
One of the most instructive aspects of this case study from an LLMOps perspective is Cursor's multi-faceted approach to evaluation, combining both offline benchmarks and online A/B testing to understand the impact of semantic search.
### Offline Evaluation
Cursor maintains "Cursor Context Bench," a custom evaluation dataset of codebase information-retrieval tasks with known correct answers. This benchmark-driven approach to evaluation is essential LLMOps practice—having reliable offline metrics allows for rapid iteration and testing before deploying changes to production users.
They run this evaluation across all major models used in their product, including frontier coding models from various providers as well as their own Composer model. The comparison methodology is clean: they test each model configuration with two tool sets—one including semantic search and one without it, relying only on traditional search tools.
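The benchmark harness isn't released, but the comparison it describes, running each model with and without semantic search on the same tasks, is easy to outline. In the sketch below, `run_agent`, the task format, and the baseline tool names are assumptions rather than details from Cursor Context Bench.

```python
# Hypothetical harness mirroring the described comparison: the same retrieval
# tasks are run per model with and without semantic search in the tool set.
BASELINE_TOOLS = ["grep_search", "read_file", "list_directory"]
SEMANTIC_TOOLS = BASELINE_TOOLS + ["semantic_search"]

def accuracy(model: str, tasks: list[dict], tools: list[str], run_agent) -> float:
    """Fraction of tasks where the agent surfaces the known-correct location."""
    hits = sum(
        task["expected_file"] in run_agent(model=model, prompt=task["question"], tools=tools)
        for task in tasks
    )
    return hits / len(tasks)

def semantic_search_lift(models: list[str], tasks: list[dict], run_agent) -> dict[str, float]:
    """Per-model accuracy gain (percentage points) from adding semantic search."""
    return {
        m: 100 * (accuracy(m, tasks, SEMANTIC_TOOLS, run_agent)
                  - accuracy(m, tasks, BASELINE_TOOLS, run_agent))
        for m in models
    }
```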
The results showed semantic search improved accuracy across all tested models, with an average improvement of 12.5% and a range of 6.5% to 23.5% depending on the specific model. This model-agnostic improvement is significant from an LLMOps standpoint—it suggests that better retrieval helps across different model capabilities and architectures, making it a robust investment even as the underlying LLMs evolve.
However, it's important to approach these numbers with appropriate context. The case study doesn't provide baseline accuracy figures, so we don't know whether this represents, for example, improving from 50% to 62.5% accuracy or from 80% to 92.5%. The reported improvement is impressive, but the absolute performance levels matter greatly for production viability. Additionally, as with any benchmark, there's always the question of how well Cursor Context Bench represents real-world usage patterns and edge cases.
### Online A/B Testing
Complementing their offline evaluation, Cursor conducted A/B tests with actual production traffic. One group's agent had access to semantic search while the control group relied solely on traditional search tools, with both groups using the same underlying model. This is exemplary LLMOps practice—validating offline improvements with real user impact metrics.
They measured two key outcomes that reflect actual business and user value:
**Code Retention** serves as a proxy for code quality and usefulness. The hypothesis is that code written by more effective agents is more likely to remain in codebases over time, while code from less effective agents gets modified or removed. They observed a 0.3% increase in code retention with semantic search available. While this might seem modest, at a scale of potentially millions of code suggestions it represents meaningful impact. More notably, the effect scaled with codebase size—for large codebases with 1,000+ files, the improvement increased to 2.6%, suggesting semantic search provides the most value in complex contexts where it's most needed.
**Dissatisfied User Requests** captures whether users needed to make follow-up corrections or clarifications. When semantic search wasn't available, they observed a 2.2% increase in dissatisfied follow-up requests. This indicates that better retrieval helps agents get things right the first time more often, reducing user friction.
The case study transparently notes that effect sizes in the A/B test are lower than in offline evaluation because "the A/B test is on all agent queries and not all requests require search." This is an honest acknowledgment that not every use case benefits equally from the improvement—some queries might be answerable without any search, while others are heavily dependent on retrieval quality. This heterogeneity is typical in production systems and underscores the value of measuring real user impact rather than relying solely on aggregate benchmarks.
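The post doesn't define how code retention is computed, so any concrete formula is speculative. The sketch below shows one deliberately simplified, hypothetical formulation, counting agent-written lines still present after a fixed measurement window and comparing arm means, purely to make the metric discussion concrete.

```python
def retention_rate(suggested_lines: list[str], repo_lines_after_window: set[str]) -> float:
    """Share of agent-written lines still present after the measurement window.
    The window length and line-level matching are assumptions for illustration."""
    if not suggested_lines:
        return 0.0
    kept = sum(1 for line in suggested_lines if line in repo_lines_after_window)
    return kept / len(suggested_lines)

def relative_lift(treatment: list[float], control: list[float]) -> float:
    """Relative difference in mean retention between A/B arms, in percent."""
    mean_t = sum(treatment) / len(treatment)
    mean_c = sum(control) / len(control)
    return 100 * (mean_t - mean_c) / mean_c
```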
### Critical Evaluation Considerations
While Cursor's evaluation approach is quite thorough, there are some areas where additional detail would strengthen the case study from an LLMOps perspective:
The offline metrics focus on accuracy and retrieval quality, but production systems also need to consider latency, cost, and reliability. Semantic search with embeddings likely adds computational overhead compared to grep. The case study doesn't discuss latency impacts or the infrastructure costs of maintaining embedding models and vector indexes at scale.
The A/B test methodology isn't fully detailed—we don't know the test duration, sample sizes, statistical significance levels, or whether they accounted for potential novelty effects or user learning over time. For code retention metrics in particular, the measurement window matters significantly—code retained for a week might have different implications than code retained for months.
There's also no discussion of failure modes or edge cases. Every production system has scenarios where it underperforms, and understanding where semantic search might hurt rather than help would provide a more complete picture.
## Training Data and Feedback Loops
The approach of using agent session traces to train retrieval models represents an interesting LLMOps pattern that's worth examining more closely. By having agents work through tasks, recording what they search for and what ultimately proves useful, Cursor creates a self-improving system where production usage directly informs model training.
This approach has several advantages. First, it grounds the training in actual use cases rather than synthetic data or generic code similarity. Second, it can potentially adapt to the specific patterns and needs of the coding domains their users work in. Third, it creates alignment between the retrieval system and the agent's behavior patterns.
However, this approach also introduces interesting challenges and considerations:
**Bootstrapping**: How did they train the initial version before having high-quality agent traces? The case study doesn't address this cold-start problem.
**Bias amplification**: If early agent behavior includes suboptimal patterns or biases, training on those traces could reinforce them. The use of an LLM to rank content helpfulness might mitigate this to some extent, but that LLM itself could introduce biases.
**Distribution shift**: As agents improve (either from better retrieval or better underlying models), the distribution of agent traces changes. How do they handle this non-stationarity in their training data?
**LLM-as-judge concerns**: Using an LLM to rank what content would have been helpful introduces another layer of model dependency. The quality of the ranking LLM directly impacts the training signal for the embedding model. There's no discussion of how they validate the LLM's judgments or handle cases where the ranker might be wrong.
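To make the LLM-as-judge concern concrete, the sketch below shows a generic ranking prompt plus a spot-check against human labels; the prompt, scoring scale, and `call_llm` placeholder are illustrative assumptions, not Cursor's pipeline.

```python
import json

RANKING_PROMPT = """You are reviewing a coding agent's session.
Task: {task}
Candidate code chunks (JSON): {chunks}
Score each chunk from 0 to 10 for how much it would have helped the agent
at this step. Respond with a JSON list of numbers, one per chunk."""

def rank_chunks(task: str, chunks: list[str], call_llm) -> list[float]:
    """Ask a judge LLM to score candidates. `call_llm` stands in for whatever
    completion API is used; the prompt and scale are illustrative guesses."""
    raw = call_llm(RANKING_PROMPT.format(task=task, chunks=json.dumps(chunks)))
    return [float(s) for s in json.loads(raw)]

def judge_top1_agreement(llm_scores: list[float], human_relevant: list[bool]) -> bool:
    """Spot-check on a labeled example: does the judge's top-ranked chunk
    match a human-labeled relevant chunk?"""
    top = max(range(len(llm_scores)), key=lambda i: llm_scores[i])
    return human_relevant[top]
```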
Despite these considerations, the overall approach of learning from production usage is sophisticated and represents mature thinking about LLMOps—recognizing that the best training signal often comes from how systems are actually used in practice.
## Production Integration and Tool Design
An often underappreciated aspect of LLMOps is how individual components integrate into broader agentic systems. Cursor's approach of providing semantic search as one tool among several available to the agent represents thoughtful system design.
The agent can choose when to use semantic search versus grep versus other tools based on the task at hand. This requires the agent (or the underlying model) to develop effective tool-use strategies. The case study doesn't detail how they handle tool selection—whether it's purely left to the model's judgment, whether they have prompting strategies to encourage appropriate tool use, or whether they've implemented more sophisticated agent architectures with explicit planning or reflection capabilities.
The finding that combining semantic search with traditional tools yields the best results is pragmatically valuable. It suggests that even as new AI-powered capabilities emerge, traditional deterministic tools retain value in production systems. This is important guidance for LLMOps practitioners who might be tempted to replace all traditional tooling with AI-powered alternatives.
## Scalability and Infrastructure
While the case study doesn't provide extensive infrastructure details, building and operating semantic search at scale for a production coding tool involves significant LLMOps challenges:
**Embedding generation**: For each customer's codebase, potentially containing thousands or millions of code segments, embeddings need to be generated and indexed. This requires computational resources and needs to happen efficiently to support onboarding and codebase updates.
**Index maintenance**: As code changes, indexes need to be updated incrementally without requiring full reindexing, and the system needs to handle high-velocity codebases where files are constantly being modified (see the sketch after this list).
**Query latency**: When an agent needs to retrieve information, the search needs to complete quickly enough not to disrupt the user experience. This likely requires careful optimization of the vector search infrastructure.
**Model serving**: The custom embedding model needs to be served reliably with appropriate redundancy, monitoring, and potentially multiple versions during rollout of improvements.
**Cost management**: Running embeddings and vector search at scale has real infrastructure costs that need to be balanced against the value provided.
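As one example of the index-maintenance point above, the sketch below re-embeds only files whose content hash has changed; the helper names and the hashing strategy are assumptions about how such a pipeline could work, not details from the case study.

```python
import hashlib
from pathlib import Path

def incremental_update(repo_root: str, index, embed_fn, chunker, seen_hashes: dict[str, str]):
    """Re-embed only files whose contents changed since the last pass.

    `index`, `embed_fn`, and `chunker` are placeholders for the vector index,
    embedding model, and chunking logic; content hashing is one plausible way
    to detect changes, not necessarily what Cursor's pipeline does. A real
    system would also delete stale chunks for modified or removed files.
    """
    for path in Path(repo_root).rglob("*.py"):           # illustrative: one language only
        text = path.read_text(errors="ignore")
        digest = hashlib.sha256(text.encode()).hexdigest()
        if seen_hashes.get(str(path)) == digest:
            continue                                     # unchanged file: skip re-embedding
        seen_hashes[str(path)] = digest
        for i, chunk in enumerate(chunker(text)):
            index.add(f"{path}#{i}", embed_fn(chunk))
```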
The case study's emphasis on "fast retrieval" and "indexing pipelines" suggests Cursor has invested meaningfully in these infrastructure concerns, but the lack of detail makes it difficult to extract specific lessons about their approach.
## Model Agnosticism and Evolution
An encouraging finding is that semantic search improved performance across all tested models, including frontier models from various providers. This suggests the improvement is relatively model-agnostic and doesn't depend on specific model architectures or capabilities.
From an LLMOps perspective, this is valuable because it means the investment in better retrieval remains relevant even as underlying models evolve. As new and better coding models are released, Cursor can integrate them while continuing to benefit from their semantic search infrastructure.
However, this also raises questions about the nature of the improvement. If even the most capable frontier models benefit from better retrieval, it suggests that context window limitations, attention patterns, or reasoning capabilities still constrain performance in ways that better input selection can address. As models continue to improve—with longer context windows, better reasoning, or more sophisticated planning capabilities—the relative value of enhanced retrieval might change.
The conclusion notes "we're continuing to test and evaluate all tools we give to the agent harness as models improve," which reflects appropriate awareness that LLMOps in the rapidly evolving AI landscape requires continuous reevaluation of what components provide value.
## Transparency and Reproducibility
From a scientific and engineering perspective, the case study provides reasonable transparency about their approach and results, though there are areas where more detail would be valuable. They clearly describe their training methodology, evaluation framework, and key results with specific numbers.
However, some important details are missing:
- No information about the embedding model architecture, size, or training data scale
- Limited details about the indexing and retrieval infrastructure
- No discussion of latency, costs, or other operational metrics
- Incomplete methodology details for the A/B tests
- No exploration of failure modes or limitations
Some of this is understandable given competitive considerations—Cursor is building a commercial product and may not want to reveal all technical details. However, from an LLMOps learning perspective, understanding the tradeoffs, challenges, and limitations would make this a more complete case study.
## Business and User Impact
The ultimate question for any LLMOps initiative is whether it delivers meaningful value. Cursor's results suggest semantic search provides measurable improvements in user experience and code quality, even if the effect sizes are modest.
The 12.5% average accuracy improvement in offline evaluation is substantial. The online metrics—a 0.3% to 2.6% improvement in code retention and a 2.2% increase in dissatisfied requests when semantic search was withheld—are more modest but still meaningful at scale. For a product likely serving many thousands of developers making millions of coding requests, these improvements translate to significant cumulative value.
The scaling effect with codebase size (stronger improvements for larger codebases) is particularly important because those are often the scenarios where developer productivity tools provide the most value. Complex, large codebases are exactly where developers most need assistance, and where traditional search tools most struggle.
However, it's worth noting that the case study comes from Cursor itself, so there's natural incentive to emphasize positive results. The lack of discussion of limitations, costs, or tradeoffs should be viewed with appropriate skepticism. Real production systems always involve compromises, and the absence of any discussion of downsides or challenges may reflect selective reporting rather than a truly challenge-free implementation.
## Lessons for LLMOps Practitioners
This case study offers several valuable lessons for practitioners building and deploying LLM-based systems:
**Invest in custom components where they provide leverage**: Rather than relying solely on off-the-shelf embeddings, Cursor built custom models optimized for their specific use case. This requires more investment but can provide differentiated value.
**Use production data as training signal**: The approach of learning from agent session traces exemplifies using production systems to generate training data that's grounded in actual use cases.
**Combine AI and traditional tools**: The finding that semantic search plus grep outperforms either alone suggests hybrid approaches often work best in production.
**Implement comprehensive evaluation**: Using both offline benchmarks and online A/B tests provides complementary insights—offline metrics for rapid iteration, online metrics for validating real impact.
**Measure business-relevant outcomes**: Code retention and user satisfaction are more meaningful than pure accuracy metrics for understanding whether an LLMOps improvement actually delivers value.
**Expect modest but meaningful improvements**: The online metrics show relatively small percentage improvements, but in aggregate these provide real value. Not every LLMOps initiative needs to be transformative to be worthwhile.
That said, practitioners should also note what this case study doesn't address: infrastructure costs, operational complexity, failure modes, latency impacts, and the engineering effort required to build and maintain these systems. These factors are critical for making informed decisions about whether similar investments make sense for different organizations and use cases.