Company: Cursor
Title: Building AI-Assisted Development at Scale with Cursor
Industry: Tech
Year: 2025

Summary (short):
This case study explores how Cursor's solutions team has observed enterprise companies successfully deploying AI-assisted coding in production environments. The problem addressed is helping developers leverage LLMs effectively for coding tasks while avoiding common pitfalls like context window bloat, over-reliance on AI, and hallucinations. The solution involves teaching developers to break down problems into appropriately sized tasks, maintain clean context windows, use semantic search for brownfield codebases, and build deterministic harnesses around non-deterministic LLM outputs. Results include significant productivity gains when developers learn proper prompt engineering and context management and take responsibility for AI-generated code, with specific improvements such as a benchmark score jumping from 45% to 65% through harness optimization.
## Overview

This case study provides insights from Cursor's solutions team and forward-deployed engineers who work with hundreds to thousands of enterprise companies implementing AI-assisted coding in production. The discussion features Randy from Cursor's solutions team sharing patterns, anti-patterns, and best practices observed from real-world deployments of LLM-powered development tools. The conversation reveals the operational challenges of deploying LLMs for code generation at scale and the engineering practices that separate successful implementations from failed ones.

## The Fundamental Shift: AI as a New Skill Adjacent to Engineering

The case study emphasizes that working with AI represents a fundamentally new skill that is adjacent to, but distinct from, traditional software engineering. Just as great engineers learn how to think correctly before memorizing syntax, developers working with LLMs must develop new mental models for collaborating with AI systems. The most successful users demonstrate proficiency in breaking down large problems into appropriately sized tasks that LLMs can reliably accomplish, rather than going too small (inefficient) or too large (prone to hallucinations). This skill of task decomposition is highlighted as the primary differentiator between power users and those who struggle with AI-assisted development. The challenge involves understanding how to manage context and build effective prompts, which requires intimate knowledge of both the codebase and the capabilities of different LLM models.

## Model Selection and Understanding Model Personalities

A critical operational insight is that different LLM models have distinct "personalities" and capabilities. The case study references the rapid evolution of models throughout the year, using Anthropic's Claude as an example: Claude 3.5 evolved to 3.7, then Claude 4, then 4.5, with variants like Sonnet and Opus. Each model release changes what problems can be tackled and at what scale. Engineers working with these models daily report that models have different strengths, weaknesses, and behavioral characteristics. For instance, a recently released Codex Mini model can solve problems that GPT-4, or even GPT-5, couldn't handle just months earlier. This rapid evolution means that developers must continuously calibrate their internal understanding of what each model can accomplish and where it tends to hallucinate. There is no easy framework for determining optimal problem size; it requires ongoing familiarity with the tools and experimentation.

## Context Management: The Core LLMOps Challenge

Context window management emerges as one of the most critical operational challenges in production LLM deployments. A common myth that the case study debunks is that "more context is always better." Research shows that even models with million-token context windows experience steep performance degradation and increased hallucination rates once context usage exceeds roughly 50-60% of capacity. Sending 800,000 tokens to models like Gemini or Claude Sonnet 4.5 can actually harm performance. The recommended approach is to maintain pristine context windows containing only information relevant to the current task. A key operational practice is starting a new chat window for each new task, even if the tasks seem related. Engineers often make the mistake of continuing in an existing chat window because previous context seems relevant, leading to bloated context that confuses the AI and triggers arbitrary decisions based on outdated information.
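To make the context-budget idea concrete, the sketch below shows one way a team could guard against overfilled context windows before sending a prompt. It is a minimal illustration, not Cursor's implementation: the 50% usable-fraction threshold and the 200,000-token window are assumed values, and `tiktoken` is used only as a rough token estimator.

```python
# Minimal sketch of a context-budget guard, assuming a 200k-token model window
# and the ~50-60% usable-capacity rule of thumb described above.
import tiktoken

ENCODER = tiktoken.get_encoding("cl100k_base")  # rough token estimator


def within_context_budget(chunks: list[str],
                          context_window_tokens: int = 200_000,
                          usable_fraction: float = 0.5) -> bool:
    """Return True if the assembled context stays under the usable budget."""
    total_tokens = sum(len(ENCODER.encode(chunk)) for chunk in chunks)
    budget = int(context_window_tokens * usable_fraction)
    if total_tokens > budget:
        print(f"Context too large: {total_tokens} tokens exceeds budget of {budget}. "
              "Trim files, summarize, or start a fresh chat for this task.")
        return False
    return True


if __name__ == "__main__":
    # Illustrative inputs only: a system rule, a task, and two code excerpts.
    system_rules = "Follow the team's API conventions."
    task_description = "Add authentication to the settings page."
    relevant_files = ["def settings_page(): ...", "def auth_middleware(): ..."]
    if within_context_budget([system_rules, task_description, *relevant_files]):
        print("Context fits; safe to send the prompt.")
```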
The principle articulated is "just enough context as needed": providing sufficient context without superfluous information. This requires developers to learn the discipline of context hygiene, which is described as a skill that must be developed over time.

## Debugging Challenges and Context Contamination

A particularly insightful operational finding concerns debugging with LLMs. When a bug appears in code that was generated by an AI within a specific context window, attempting to debug within that same context window often leads to frustration. The AI has accumulated context suggesting that its previous decisions were correct, making it harder for the model to identify the bug. This "context contamination" effect can cause the AI to rabbit-hole down the same incorrect solution repeatedly. The recommended practice is to start a fresh context window when debugging AI-generated code, allowing the model to approach the problem without the weight of the previous decisions that created the bug in the first place. This insight has important implications for how teams structure their development workflows and agent interactions.

## Semantic Search and Brownfield Codebase Handling

The case study challenges another common myth: that AI is far better at greenfield development than brownfield work. While this may have been true when tools could only grep through codebases during prompt generation, modern implementations like Cursor have achieved significant success by indexing entire codebases before any questions are asked. Cursor creates semantic search capabilities that can quickly find relevant code across massive codebases of 500,000 files or millions of lines of code. When a developer prompts "I want to add authentication to this page," the system can determine which areas of the codebase are most relevant and pull that context in automatically. This is complemented by lightning-fast parallel grepping capabilities.

The key to brownfield success is giving the AI examples to follow. When building new APIs, refactoring code, or fixing bugs, developers should point the AI to well-built existing patterns in the codebase. By providing concrete examples of the desired architectural style, design system, or API framework, engineers can minimize the arbitrary decisions the LLM must make. An arbitrary decision might be correct 80% of the time, but a task requiring 100 arbitrary decisions will almost certainly contain errors, so reducing these decisions dramatically improves output quality. For frontend work specifically, developers can instruct the AI to follow specific design systems, ensuring it uses established components, colors, and frameworks rather than inventing new patterns. This approach can make brownfield development easier than greenfield, since there are concrete patterns to match against.

## Context Preservation Techniques: Plan Mode and Markdown Artifacts

The case study describes operational techniques for preserving and transferring context between chat sessions when needed. One approach involves creating markdown files that summarize key decisions and requirements, which can then be included in new chat windows. Cursor has formalized this with "Plan Mode," a feature that helps developers build out plans for larger projects or tasks. Plan Mode generates a markdown file listing all the tasks that need to be completed and presents multiple-choice questions to the engineer about different decision trees. This interactive planning process helps the engineer make explicit architectural decisions upfront, and the resulting plan document serves as a clean context artifact for subsequent implementation work. The approach separates strategic planning from tactical execution while maintaining context across different work sessions.
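As a rough illustration of carrying a plan artifact into a fresh session, the sketch below assembles a new prompt from a saved plan file plus one focused task. This is a generic, hypothetical workflow sketch rather than Cursor's Plan Mode implementation; the `plan.md` filename, the plan contents, and the prompt layout are assumptions.

```python
# Hypothetical sketch: reuse a saved plan artifact as clean context for a new chat.
from pathlib import Path


def build_fresh_chat_prompt(plan_path: str, current_task: str) -> str:
    """Combine the saved plan (key decisions, task list) with one focused task."""
    plan = Path(plan_path).read_text(encoding="utf-8")
    return (
        "Project plan and decisions already made:\n"
        f"{plan}\n\n"
        "Current task (implement only this step, following the plan):\n"
        f"{current_task}\n"
    )


if __name__ == "__main__":
    # Write an example plan artifact, then build a prompt for a fresh chat window.
    Path("plan.md").write_text(
        "# Auth feature plan\n"
        "- Decision: reuse the existing session middleware\n"
        "- Task 1: add login route\n"
        "- Task 2: protect the settings page\n",
        encoding="utf-8",
    )
    print(build_fresh_chat_prompt("plan.md", "Task 2: protect the settings page"))
```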
## Developer Responsibility and the Anti-Pattern of Over-Reliance

A critical anti-pattern identified is over-reliance on AI that leads developers to abdicate strategic thinking and architectural decision-making. The case study distinguishes between "vibe coding on the weekend" (building something fun without full understanding) and building sustainable software in the enterprise (code that serves millions or billions of users).

The fundamental principle articulated is that engineers must understand any code that goes into production, regardless of whether they wrote it or the AI wrote it. When bugs surface in production, developers cannot blame the AI; they are responsible for the pull requests they ship. The best organizational cultures around AI enforce this responsibility: there is no acceptable excuse of "I don't know, the AI wrote it."

This doesn't mean developers must always choose their own solution over the AI's suggestion. Rather, when the AI proposes an unfamiliar approach, developers should use the AI itself as a learning tool, asking questions like "I would have done it this way. What are the pros and cons of my way versus your way?" This creates an informed decision-making process where AI augments rather than replaces developer judgment. The danger is that AI's speed can lull developers into a false sense of security, causing them to give increasingly vague prompts and push strategic decisions off to the model. While future models might handle this level of delegation, current systems require developers to stay "in the driver's seat," making architectural decisions and actively steering the agent in the right direction.

## Rules, Recommendations, and the Challenge of Steerability

The case study addresses a nuanced operational challenge: how to distinguish between hard rules (must follow) and soft recommendations (generally follow) when instructing LLMs. Developers often have guidelines like "don't output functions longer than 21 lines of code" that are recommendations rather than absolute rules; if a function needs to be 23 lines, that's acceptable. The honest assessment is that this represents an unsolved problem in AI research. Engineers are working with fundamentally non-deterministic systems, and even the most emphatically stated rules (like "NEVER start a new dev server") are sometimes violated by the AI. Research is ongoing into how to build models that are both thoughtful and creative while also closely following instructions and being easily steerable.

The current practical approach is to write rules in plain English with explicit priority indicators, similar to instructing a junior engineer. Developers can say "this is a rule to always follow; highest priority" versus "I want functions to stay generally under 21 lines of code; if one goes over it's not a big deal, but please try to prioritize staying under 21 lines." This natural-language prioritization does create loose versus strong adherence, though different models have different steering characteristics and results will vary.
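The sketch below shows one hypothetical way to encode that distinction in a small rules structure that is rendered into plain-English instructions with explicit priority markers. The rule texts, the `priority` labels, and the output format are illustrative assumptions, not a Cursor rules format.

```python
# Hypothetical sketch: render hard rules and soft recommendations into
# plain-English instructions with explicit priority markers.
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    priority: str  # "always" for hard rules, "prefer" for soft recommendations


RULES = [
    Rule("Never start a new dev server; reuse the running one.", "always"),
    Rule("Keep functions under 21 lines of code.", "prefer"),
]


def render_rules(rules: list[Rule]) -> str:
    """Build a rules block suitable for including at the top of a prompt."""
    lines = []
    for rule in rules:
        if rule.priority == "always":
            lines.append(f"- HIGHEST PRIORITY, always follow: {rule.text}")
        else:
            lines.append(f"- Preference, not a hard rule: {rule.text} "
                         "Going slightly over is acceptable when needed.")
    return "Project rules:\n" + "\n".join(lines)


if __name__ == "__main__":
    print(render_rules(RULES))
```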
## Deterministic Hooks in Non-Deterministic Agent Flows

To address the limitations of non-deterministic LLM behavior, Cursor has implemented "hooks": deterministic control points that can be inserted into agent workflows. These hooks can execute at specific moments, such as right before a prompt is sent, right after a response is received, before code is generated, or after code is generated. This allows organizations to implement deterministic checks and processes even while working with non-deterministic AI systems. For example, some companies use hooks to check for copyleft-licensed code in generated output, something that must be checked every time and cannot be left to probabilistic rule-following by the AI. This hybrid approach combines the creativity and power of LLMs with the reliability of deterministic processes where needed. The distinction is clear: if you have a rule that must always be followed without exception, implement it as a deterministic hook rather than as an instruction to the LLM. A sketch of this pattern appears below, after the harness and temperature discussion.

## Agent Harness Optimization and Benchmarking

The case study provides insights into building and optimizing agent harnesses: the scaffolding and instructions that guide how an LLM interacts with tools and environments. Cursor's own experience building its agent harness reveals the importance of this component and the difficulty of getting it right. The harness requirements vary significantly based on the agent's purpose. A simple chatbot might need relatively loose guidelines (with safety measures already built into the underlying LLM), but a system like Cursor that can perform almost any action within a codebase requires extensive, prescriptive harness engineering. This includes detailed instructions for working with various tools: edit tools, search tools, web tools, planning tools, and MCP (Model Context Protocol) tools.

Cursor discovered through testing that a bug in its GPT-5 harness was preventing optimal performance. When it was fixed, the benchmark score jumped from 45% to 65%, a dramatic improvement from harness optimization alone. This illustrates that the harness is not a one-time configuration but requires iterative testing, measurement, and refinement. The case study emphasizes the importance of measuring outcomes and using data to drive harness improvements. Benchmark scores, while imperfect and gameable, provide useful signals when used appropriately. The key is to test harness variations against real-world use cases and continuously optimize based on observed performance.

## Temperature Settings and the Creativity-Determinism Tradeoff

The discussion touches on the classic temperature parameter as an example of controlling the creativity-determinism tradeoff. High temperature settings produce novel, creative ideas and make the AI an interesting partner, while very low temperature produces more deterministic, point-solution answers. There is a Goldilocks zone to find: if an AI only ever suggests one solution, even when that solution is suboptimal, the value proposition of working with AI diminishes. The challenge is finding the right balance that allows the AI to be creative and suggest novel approaches while still being sufficiently steerable for production use. This represents an ongoing area of research and a fundamental tension in LLMOps: the very characteristics that make LLMs powerful (creativity, novel solutions) can also make them unreliable in production contexts that demand consistency.
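Returning to the deterministic-hook pattern referenced above, here is a minimal, hypothetical sketch of a post-generation hook: a check that scans generated code for copyleft license markers before it is accepted. The hook interface, the marker list, and the function names are all assumptions for illustration and do not represent Cursor's actual hooks API.

```python
# Hypothetical sketch of a deterministic post-generation hook that rejects
# output containing copyleft license markers. Not Cursor's hooks API.
COPYLEFT_MARKERS = (
    "GNU General Public License",
    "GPL-2.0",
    "GPL-3.0",
    "GNU Affero",
)


def copyleft_hook(generated_code: str) -> tuple[bool, str]:
    """Run after code generation; return (accepted, reason)."""
    for marker in COPYLEFT_MARKERS:
        if marker.lower() in generated_code.lower():
            return False, f"Rejected: found copyleft marker '{marker}'."
    return True, "Accepted: no copyleft markers found."


def apply_post_generation_hooks(generated_code: str, hooks) -> bool:
    """Deterministically apply every hook; all must accept the output."""
    for hook in hooks:
        accepted, reason = hook(generated_code)
        print(reason)
        if not accepted:
            return False
    return True


if __name__ == "__main__":
    snippet = "# Licensed under the GNU General Public License v3\nprint('hi')\n"
    apply_post_generation_hooks(snippet, hooks=[copyleft_hook])
```

Because the check runs on every generation regardless of what the model was instructed to do, it provides the always-enforced guarantee that plain-English rules cannot.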
## Operational Implications and Cultural Practices

Throughout the case study, several operational practices emerge as critical for successful enterprise deployment:

- **Start new chat windows for new tasks** to maintain context hygiene
- **Use Ask mode** (question-only, no code generation) to deeply understand codebases before making changes
- **Provide concrete examples** from existing code when working with brownfield applications
- **Explicitly map architectural decisions** by pointing to well-built patterns
- **Minimize arbitrary decisions** the AI must make through clear constraints and examples
- **Take responsibility for all AI-generated code** as if you wrote it yourself
- **Use AI as a learning tool** when it suggests unfamiliar approaches
- **Implement deterministic hooks** for requirements that must never be violated
- **Measure and optimize agent harnesses** based on real-world performance
- **Build Plan Mode workflows** that separate strategic planning from tactical execution
- **Understand model capabilities and personalities** through hands-on experience

The cultural dimension is equally important: organizations must establish norms where developers cannot blame the AI for bugs and must maintain full understanding of code going to production. This requires slowing down at key decision points to inform oneself, learn, dive in, ask questions, and understand before proceeding, even when the AI's speed makes it tempting to move faster.

## Model Evolution and Continuous Adaptation

The rapid pace of model evolution represents both an opportunity and a challenge for LLMOps. What couldn't be solved two months ago might now be easily accomplished with newer models. This requires organizations to maintain awareness of model releases, test new models against their use cases, and potentially adjust harnesses and workflows as capabilities improve. The case study mentions several model families: Claude (Anthropic), GPT (OpenAI), and Gemini (Google), each with different versions and capabilities. Organizations must develop processes for evaluating new models, migrating to better-performing options, and understanding the tradeoffs of different model choices for different tasks.

## Conclusion

This case study provides a rare glimpse into the operational realities of deploying LLM-assisted development at enterprise scale. Rather than focusing on idealized scenarios or vendor promises, it presents hard-won lessons from hundreds of enterprise deployments. The insights span technical challenges (context management, semantic search, harness optimization), process challenges (when to start new chats, how to structure workflows), and cultural challenges (maintaining developer responsibility, avoiding over-reliance).

The overarching message is that successful LLMOps for code generation requires treating AI collaboration as a distinct skill that must be learned and practiced. It demands careful attention to context management, explicit architectural guidance, continuous measurement and optimization, and, most importantly, maintaining human responsibility and understanding for all code deployed to production. Organizations that master these practices can achieve significant productivity gains, while those that simply "turn on" AI coding assistants without these disciplines will struggle with quality, reliability, and maintainability issues.
