Company
Cursor
Title
Building an AI-Powered IDE at Scale: Architectural Deep Dive
Industry
Tech
Year
2025
Summary (short)
Cursor, an AI-powered IDE built by Anysphere, faced the challenge of scaling from zero to serving billions of code completions daily while handling 1M+ queries per second and 100x growth in load within 12 months. The solution involved building a sophisticated architecture using TypeScript and Rust, implementing a low-latency sync engine for autocomplete suggestions, utilizing Merkle trees and embeddings for semantic code search without storing source code on servers, and developing Anyrun, a Rust-based orchestrator service. The results include reaching $500M+ in annual revenue, serving more than half of the Fortune 500's largest tech companies, and processing hundreds of millions of lines of enterprise code written daily, all while maintaining privacy through encryption and secure indexing practices.
## Overview and Business Context

Cursor is an AI-powered integrated development environment (IDE) built by Anysphere, a startup founded in 2022 that launched its first product in March 2023. This case study provides a rare technical deep dive into how a rapidly scaling GenAI product operates in production, moving from zero to over $500M in annual revenue within approximately two years. The company raised a $900M Series C round in 2025 at a $9.9B valuation and serves more than half of the Fortune 500's largest tech companies, including NVIDIA, Uber, Stripe, Instacart, Shopify, Ramp, and Datadog. The engineering challenges described here are particularly relevant because Cursor experienced 100x growth in load within just 12 months, at times doubling month-on-month, while maintaining sub-second latency for code completions and processing over 1 million transactions per second at peak.

The scale is substantial: Cursor processes billions of code completions daily, with enterprise clients alone writing 100M+ lines of code per day using the tool. The company manages hundreds of terabytes of indexes (embeddings, not raw code) and operates tens of thousands of NVIDIA H100 GPUs, primarily for inference workloads. This case study is valuable because it shows real-world production LLMOps at massive scale with actual business results, not theoretical architectures.

## Technical Architecture and Stack

Cursor's technical foundation is a fork of Visual Studio Code, a strategic decision that let the team focus on changing how developers program rather than building a stable editor from scratch. The editor uses TypeScript for business logic and Electron as the application framework. Cofounder Sualeh Asif explained that forking allowed them to build features like the "tab model" (autocomplete) incrementally, which would have been very difficult had they started from scratch. This decision reflects pragmatic engineering: leverage existing stable infrastructure to focus on core value propositions.

The backend is primarily a TypeScript monolith with performance-critical components written in Rust. This is an important LLMOps lesson: even a company processing billions of inferences daily and handling 1M+ QPS operates successfully with a monolithic architecture, challenging the assumption that microservices are necessary for scale. The team uses a Node API to bridge between TypeScript business logic and Rust performance code, with the indexing logic being a primary example of this pattern.

For data storage, Cursor uses Turbopuffer as a multi-tenant database for storing encrypted files and Merkle trees of workspaces, and Pinecone as a vector database for documentation embeddings. The article notes that Cursor previously used Yugabyte (a database marketed as infinitely scalable) but migrated to PostgreSQL, suggesting that marketing claims about databases don't always match production realities. The team also experienced an "epic effort" of migrating to Turbopuffer in hours during a large indexing outage, demonstrating the operational challenges of LLMOps at scale.

The infrastructure is entirely cloud-based, running primarily on AWS for CPU workloads and Azure for GPU inference workloads. Cursor also uses several newer GPU clouds and manages infrastructure with Terraform. The observability stack includes Datadog for logging and monitoring (described as having a vastly superior developer experience compared to alternatives), PagerDuty for on-call management, and Sentry for error monitoring. Model training and fine-tuning leverage Voltage Park, Databricks MosaicML, and Foundry.
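Returning to the TypeScript-to-Rust bridge mentioned above: the article shows no code for it, but the general pattern is a compiled Rust addon exposed to the TypeScript monolith through Node's native addon interface. The sketch below is illustrative only; the module path and function names (chunkFile, hashContents) are hypothetical stand-ins, not Cursor's actual API.

```typescript
// Illustrative sketch: TypeScript owns the business logic and delegates CPU-heavy work
// (here, chunking and hashing a file for indexing) to a compiled Rust addon loaded as a
// Node native module. Module path and exported function names are assumptions.
import { createRequire } from "node:module";

// Type declaration for the functions the Rust addon is assumed to export.
interface NativeIndexer {
  /** Split a source file into chunks suitable for embedding (CPU-bound, implemented in Rust). */
  chunkFile(path: string, contents: string): { start: number; end: number; text: string }[];
  /** Hash file contents, e.g. for Merkle-tree comparison (also CPU-bound). */
  hashContents(contents: string): string;
}

const require = createRequire(import.meta.url);
// Native addons compiled from Rust are loaded like any other Node module (.node binary).
const native = require("./native/indexer.node") as NativeIndexer;

export function indexFile(path: string, contents: string) {
  const hash = native.hashContents(contents);       // hot path runs in Rust
  const chunks = native.chunkFile(path, contents);  // hot path runs in Rust
  // Orchestration (batching, retries, what to do with chunks) stays in TypeScript.
  return { path, hash, chunkCount: chunks.length, chunks };
}
```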
## Low-Latency Autocomplete: The "Tab Model"

One of Cursor's core features is the autocomplete suggestion system, which the team calls the "tab model." This system must generate suggestions in under a second to maintain developer flow, presenting a significant LLMOps challenge given the need for relevant context and quick inference. The architecture demonstrates how production LLM systems balance multiple competing constraints: latency, context quality, bandwidth, and security.

The workflow operates as follows: when a developer types code, a small portion of the current context window is collected locally by the client, encrypted, and sent to the backend. The backend decrypts the code, generates a suggestion using Cursor's in-house LLM model, and sends the suggestion back to be displayed in the IDE. The developer can accept the suggestion by hitting Tab, and the process repeats continuously. This is a classic low-latency sync engine architecture pattern in LLMOps.

The engineering challenge centers on the tradeoff between context window size and latency. Sending more context improves suggestion quality because the model has more information about the codebase, coding style, and intent. However, larger context windows increase both network transfer time and inference time, degrading the user experience. The Cursor team must constantly optimize this balance, which is a common challenge in production LLM systems where user experience expectations are measured in milliseconds, not seconds.

The encryption-at-rest and encryption-in-transit approach reflects Cursor's privacy-first architecture, where sensitive code never persists unencrypted on their servers. This security posture is critical for enterprise adoption, as demonstrated by their success with over 50% of Fortune 1000 companies. The technical implementation of this privacy model adds complexity to the LLMOps pipeline but is non-negotiable for the business model.

## Semantic Code Search Without Storing Code

Cursor's Chat mode allows developers to ask questions about their codebase, request refactoring, or have agents add functionality. The architectural constraint is that no source code can be stored on Cursor's backend servers, yet all LLM operations must occur there (for compute efficiency and to leverage GPUs). This creates a fascinating LLMOps challenge: how do you enable semantic search over code without storing the code itself?

The solution involves a sophisticated indexing and retrieval system built on embeddings and clever use of cryptographic primitives. When a developer asks a question in Chat mode (for example, about a createTodo() method), the prompt is sent to the Cursor server, which interprets it and determines a codebase search is needed. The search operates on previously created embeddings stored in Turbopuffer, using vector search to locate the embeddings that best match the query context. Importantly, even filenames are obfuscated on the server side to protect confidentiality.

Once vector search identifies potentially relevant code locations, the server requests the actual source code from the client for only those specific files. This is a critical architectural decision: the index lives on the server (as embeddings), but the source of truth (raw code) always remains on the client. The server never persists the source code it receives; it only uses it temporarily in memory to analyze and generate responses.
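The following TypeScript sketch illustrates that query-time division of responsibility. All names here (SearchDeps, vectorSearch, fetchChunksFromClient, and so on) are hypothetical stand-ins rather than Cursor's actual APIs; the point is that the server searches embeddings keyed by obfuscated handles and fetches plaintext from the client per request, without ever persisting it.

```typescript
// Minimal sketch of the retrieval flow, under the assumptions stated above.

interface SearchHit {
  obfuscatedPath: string; // opaque handle; the server never sees the real filename
  chunkId: string;
  score: number;
}

// External services the flow depends on, injected so the sketch stays self-contained.
interface SearchDeps {
  embedQuery(prompt: string): Promise<number[]>;                      // embedding model call
  vectorSearch(vector: number[], topK: number): Promise<SearchHit[]>; // vector DB query (Turbopuffer-style)
  fetchChunksFromClient(hits: SearchHit[]): Promise<string[]>;        // client resolves handles to plaintext code
  generateAnswer(prompt: string, context: string[]): Promise<string>; // LLM call with retrieved context
}

export async function answerCodebaseQuestion(deps: SearchDeps, prompt: string): Promise<string> {
  // 1. Embed the question and search the server-side index of code-chunk embeddings.
  const queryVector = await deps.embedQuery(prompt);
  const hits = await deps.vectorSearch(queryVector, 20);

  // 2. Ask the client for the plaintext of only the matching chunks; hold it in memory only.
  const chunks = await deps.fetchChunksFromClient(hits);

  // 3. Generate the answer using the retrieved code as context. Nothing is written to server storage.
  return deps.generateAnswer(prompt, chunks);
}
```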
This architecture pattern could be valuable for other LLMOps use cases where data sensitivity prohibits server-side storage but semantic search capabilities are required.

## Embeddings Creation and Management

To enable vector search, Cursor must first break code into chunks, create embeddings, and store them server-side. The process begins with slicing file contents into smaller parts, each of which becomes an embedding. The client sends obfuscated filenames and encrypted code chunks to the server. The server decrypts the code, creates embeddings using OpenAI's embedding models or Cursor's own models, and stores the embeddings in Turbopuffer.

This embedding creation is computationally expensive, which is why it's performed on Cursor's GPU-equipped backend rather than client-side. Indexing typically takes less than a minute for mid-sized codebases but can extend to minutes or longer for large codebases. The computational cost and bandwidth requirements of indexing present real LLMOps challenges: if done too frequently, it wastes resources and money; if done too infrequently, the index becomes stale and search quality degrades.

For very large codebases (often monorepos with tens of millions of lines of code), indexing the entire codebase becomes impractical and unnecessary. Cursor provides a .cursorignore file mechanism to exclude directories and files from indexing, similar to how .gitignore works for version control. This reflects a pragmatic approach to LLMOps: not all data needs to be indexed, and letting users control scope can significantly improve system performance and cost.

## Keeping Indexes Fresh with Merkle Trees

As developers edit code in Cursor or another IDE, the server-side embeddings index becomes stale. A naive solution would be continuous re-indexing every few minutes, but given the computational expense and bandwidth requirements of creating embeddings, this is wasteful. Instead, Cursor employs a clever use of Merkle trees and a high-latency sync engine that runs every three minutes to efficiently determine which files need re-indexing.

A Merkle tree is a cryptographic data structure where each leaf node is a hash of a file's contents and each parent node is a hash derived from its children's hashes. Cursor maintains Merkle trees both on the client (representing the current state of the codebase) and on the server (representing the state of indexed files). Every three minutes, Cursor compares these two Merkle trees to identify differences. Where hashes match, no action is needed; where they differ, tree traversal efficiently identifies exactly which files changed and need re-indexing.

This approach minimizes sync operations to only files that have actually changed, which is particularly effective given real-world usage patterns. For example, when a developer pulls updates from a git repository in the morning, many files might change, but the Merkle tree structure allows Cursor to quickly identify exactly which ones need re-indexing without examining every file individually. This is an elegant LLMOps pattern for keeping embeddings fresh while minimizing compute costs, and it demonstrates how classical computer science concepts (Merkle trees, tree traversal) remain valuable in modern AI systems.

The three-minute sync interval represents another engineering tradeoff. More frequent syncing would keep indexes fresher but consume more bandwidth and compute; less frequent syncing would save resources but degrade search quality when code changes rapidly. The three-minute interval appears to be an empirically derived balance point that works for Cursor's user base, though this likely varies by use case and could be user-configurable in future iterations.
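The sketch below illustrates the hash-guided diffing idea. Node shapes, the hashing scheme, and function names are assumptions for illustration, not Cursor's implementation; the key property is that identical root hashes let the sync engine skip entire unchanged subtrees and descend only where hashes differ.

```typescript
// Illustrative Merkle-diff sketch: leaves hash file contents, parents hash their children's
// hashes, and a parallel walk over the client and server trees yields only the changed files.
import { createHash } from "node:crypto";

interface MerkleNode {
  hash: string;
  path: string;            // file path for leaves, directory path for internal nodes
  children?: MerkleNode[]; // undefined for leaf nodes (files)
}

const sha256 = (data: string) => createHash("sha256").update(data).digest("hex");

// Leaves hash file contents; parents hash the concatenation of their children's hashes.
export function buildFileNode(path: string, contents: string): MerkleNode {
  return { path, hash: sha256(contents) };
}
export function buildDirNode(path: string, children: MerkleNode[]): MerkleNode {
  return { path, children, hash: sha256(children.map((c) => c.hash).join("")) };
}

// Walk client and server trees in parallel; collect file paths whose hashes differ.
export function changedFiles(client: MerkleNode, server?: MerkleNode): string[] {
  if (server && client.hash === server.hash) return []; // identical subtree: skip entirely
  if (!client.children) return [client.path];           // changed (or new) file: re-embed it
  const serverByPath = new Map(
    (server?.children ?? []).map((c) => [c.path, c] as [string, MerkleNode])
  );
  return client.children.flatMap((child) => changedFiles(child, serverByPath.get(child.path)));
}
```

A production sync engine would also handle deleted files and batch the re-embedding requests, but the hash comparison that prunes unchanged subtrees is the core cost saver.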
## Security and Privacy in Indexing

Even with encryption and obfuscation, certain parts of a codebase should never leave the client machine: secrets, API keys, passwords, and other sensitive credentials. Cursor's approach to this challenge combines multiple defense layers.

First, Cursor respects .gitignore files and will not index or send contents of files listed there. Since best practices dictate that secrets should be stored in local environment files (like .env) that are gitignored, this catches most cases automatically. Second, Cursor provides a .cursorignore file for additional control over what gets indexed, allowing developers to explicitly exclude sensitive files even if they aren't in .gitignore. Third, before uploading code chunks for indexing, Cursor scans them for possible secrets or sensitive data and filters these out. This multi-layered approach reflects mature security thinking in LLMOps: no single mechanism is perfect, so defense in depth is essential.

This security model has implications for LLMOps more broadly. As LLM systems increasingly operate on sensitive enterprise data, the architecture must be designed from the ground up with privacy and security as first-class constraints, not afterthoughts. Cursor's success with large enterprises suggests that this approach works in practice, not just in theory. However, the article doesn't detail what happens if secrets accidentally make it through these filters, which would be an important consideration for any organization evaluating Cursor for use with highly sensitive codebases.
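A minimal sketch of this layered filtering follows, assuming simple prefix-style ignore matching and a handful of regex heuristics. These rules are illustrative assumptions, not Cursor's actual ones; real secret scanners use far larger pattern sets, proper gitignore glob semantics, and entropy checks.

```typescript
// Layered pre-upload filtering: honor ignore files first, then drop chunks that look like
// credentials. All patterns and matching rules here are simplified for illustration.
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Very rough secret heuristics (illustrative only).
const SECRET_PATTERNS: RegExp[] = [
  /AKIA[0-9A-Z]{16}/,                         // AWS access key id shape
  /-----BEGIN (?:RSA |EC )?PRIVATE KEY-----/, // PEM private keys
  /(?:api[_-]?key|secret|password)\s*[:=]\s*['"][^'"]{8,}['"]/i,
];

// Layers 1 and 2: collect patterns from .gitignore plus a Cursor-specific ignore file.
function loadIgnorePatterns(root: string): string[] {
  return [".gitignore", ".cursorignore"]
    .map((name) => join(root, name))
    .filter(existsSync)
    .flatMap((file) => readFileSync(file, "utf8").split("\n"))
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}

// Naive prefix/suffix matching stands in for real gitignore glob semantics.
function isIgnored(relativePath: string, patterns: string[]): boolean {
  return patterns.some(
    (p) => relativePath.startsWith(p.replace(/\/$/, "")) || relativePath.endsWith(p.replace(/^\*/, ""))
  );
}

// Layer 3: scan chunk contents and drop anything that looks like a credential before upload.
export function chunksSafeToUpload(root: string, relativePath: string, chunks: string[]): string[] {
  if (isIgnored(relativePath, loadIgnorePatterns(root))) return [];
  return chunks.filter((chunk) => !SECRET_PATTERNS.some((pattern) => pattern.test(chunk)));
}
```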
## Anyrun: The Orchestrator Service

Anyrun is Cursor's orchestrator component, written entirely in Rust for performance. While the article notes that Anyrun is "responsible for this" and then cuts off (the text appears incomplete), the context suggests that Anyrun handles launching and managing agents in the cloud environment. Given that Cursor 1.0 introduced "background agents" that can perform complex tasks like refactoring or adding features, Anyrun likely manages the lifecycle of these agent processes.

The choice of Rust for the orchestrator reflects a common pattern in LLMOps: using systems programming languages for components that require high performance, fine-grained resource control, and strong safety guarantees. Orchestrating many concurrent agent processes, especially at Cursor's scale (billions of completions daily), requires careful management of CPU, memory, and process isolation to prevent one agent from affecting others or consuming excessive resources.

Based on the earlier mention of Amazon EC2 and AWS Firecracker in the description, Anyrun likely uses Firecracker (AWS's lightweight virtualization technology) to provide secure isolation between agent instances. Firecracker is designed for exactly this use case: running many lightweight virtual machines with minimal overhead and strong security boundaries. This would allow Cursor to safely execute agent code in the cloud while preventing malicious or buggy code in one agent from affecting others or accessing unauthorized resources.

## Scaling Challenges and Database Migrations

The article mentions several engineering challenges that emerged from Cursor's rapid 100x growth, though specific details are behind a paywall. However, we learn about two significant database migrations that reveal important LLMOps lessons.

First, Cursor migrated from Yugabyte, a database marketed as infinitely scalable, to PostgreSQL. This is a striking reversal: moving from a distributed database designed for horizontal scaling to a traditional relational database typically indicates that the distributed system's operational complexity outweighed its scaling benefits, at least at Cursor's current scale. This migration suggests several possible issues: Yugabyte may have introduced too much latency, operational complexity, or cost compared to PostgreSQL; the team may have lacked the expertise to operate Yugabyte effectively; or PostgreSQL's maturity and ecosystem may have provided better tools for Cursor's specific workload. The lesson for LLMOps practitioners is that "infinitely scalable" marketing claims should be evaluated skeptically, and sometimes boring, proven technology like PostgreSQL works better than newer distributed systems.

The second migration involved moving to Turbopuffer "in hours, during a large indexing outage." This emergency migration indicates both the criticality of the indexing system to Cursor's operations and the team's ability to execute under pressure. However, it also suggests that the original database choice (likely Yugabyte, based on the timeline) was not meeting their needs, particularly for the high-volume, high-throughput workload of storing and retrieving embeddings. The fact that this migration happened during an outage rather than as a planned transition points to the kinds of operational challenges that arise when scaling LLMOps systems 100x in a year.

## Model Training and Inference at Scale

Cursor operates its own LLM models rather than relying solely on third-party APIs like OpenAI's GPT-4. This is evidenced by their use of multiple training providers (Voltage Park, Databricks MosaicML, Foundry) and the mention of fine-tuning existing models. Operating tens of thousands of NVIDIA H100 GPUs, with GPU infrastructure split across AWS, Azure, and newer GPU clouds, reveals the massive computational requirements of LLMOps at Cursor's scale.

The architectural decision to use Azure GPUs solely for inference while using other providers for training and fine-tuning suggests cost or availability optimization strategies. Inference is by far Cursor's biggest GPU use case given the billions of completions served daily, so having dedicated, optimized infrastructure for inference makes sense. The separation also allows different optimization strategies: inference requires low latency and high throughput, while training can tolerate higher latency but benefits from high bandwidth between GPUs.

The choice to build and fine-tune custom models rather than rely entirely on third-party APIs reflects the maturity of Cursor's LLMOps practice. Custom models allow optimization for specific use cases (code completion, code understanding), better cost control at scale, reduced dependency on external providers, and potential competitive advantages in model quality. However, this also represents a massive investment in ML engineering talent, infrastructure, and ongoing operational costs that many organizations couldn't justify.

## Engineering Culture and Development Practices

The article mentions that Cursor ships releases every two to four weeks, uses "unusually conservative feature flagging," maintains a dedicated infrastructure team, and fosters an experimentation culture.
The 50-person engineering team managing this scale and complexity suggests high productivity, though it's worth noting that with $500M+ in revenue, the company has the resources to hire top talent and invest heavily in infrastructure.

The "unusually conservative feature flagging" is interesting from an LLMOps perspective. Feature flags allow gradual rollout of new features, A/B testing, and quick rollback if issues arise. Being "unusually conservative" suggests that Cursor is very careful about changes to production systems, likely because even small degradations in autocomplete latency or suggestion quality directly impact user experience and, at this scale, could affect millions of lines of code being written. This conservative approach to deployment contrasts with the rapid growth and suggests mature operational discipline.

Using Cursor to build Cursor (dogfooding) is notable because it means the engineering team experiences the same latency, quality, and reliability issues as customers, providing immediate feedback on problems. However, as the article notes, "every engineer is responsible for their own checked-in code, whether they wrote it by hand, or had Cursor generate it." This reflects a broader question in LLMOps: as AI coding assistants become more capable, how do we maintain code quality and ensure engineers understand the code being checked in? Cursor's approach puts responsibility squarely on the human engineer, not the AI tool.

## Critical Assessment and Lessons for LLMOps Practitioners

This case study provides valuable insights into production LLMOps, but readers should consider several caveats. First, Cursor has raised nearly $1 billion in funding and generates $500M+ in annual revenue, giving them resources that most organizations lack. The architectural choices (operating tens of thousands of H100 GPUs, building custom models, maintaining a 50-person engineering team) may not be replicable or necessary for smaller-scale LLMOps deployments.

Second, the article is written by an industry publication and based on an interview with a company cofounder, which may present an overly positive view. The mention of database migrations and outages suggests there have been significant operational challenges, but details about failures, incorrect architectural choices, or ongoing problems are not extensively covered. The reality of scaling 100x in a year is likely messier than presented.

Third, some architectural choices may be specific to Cursor's use case (code completion and understanding) and may not generalize to other LLMOps domains. The low-latency requirements for autocomplete (sub-second) are more stringent than those of many LLM applications; the privacy constraints (not storing code) are domain-specific; and the scale of inference operations (billions daily) is unusual even among successful LLM products.
That said, several lessons appear broadly applicable to LLMOps:

- Monolithic architectures can work at significant scale when properly designed, challenging the microservices orthodoxy
- Privacy and security must be architectural concerns from day one for enterprise adoption
- Clever use of classical computer science concepts (Merkle trees, tree traversal) can solve modern LLMOps challenges efficiently
- There are fundamental tradeoffs between context quality, latency, bandwidth, and cost that require continuous optimization
- Database and infrastructure choices should be based on actual operational experience, not marketing claims
- Custom models may be justified at sufficient scale but represent a massive investment
- Conservative deployment practices and feature flags become more important as scale increases

The Cursor case study demonstrates that production LLMOps at massive scale is feasible but requires sophisticated architecture, significant computational resources, strong engineering talent, and continuous optimization across multiple competing constraints. The technical approaches described, particularly the privacy-preserving semantic search architecture and efficient index synchronization, offer patterns that other LLMOps practitioners can adapt to their own domains.
