ZenML

Scaling AI-Assisted Coding Infrastructure: From Auto-Complete to Global Deployment

Cursor 2023

Cursor, an AI-assisted coding platform, scaled their infrastructure from handling basic code completion to processing 100 million model calls per day across a global deployment. They faced and overcame significant challenges in database management, model inference scaling, and indexing systems. The case study details their journey through major incidents, including a database crisis that led to a complete infrastructure refactor, and their innovative solutions for handling high-scale AI model inference across multiple providers while maintaining service reliability.

Industry

Tech

Overview

This case study comes from a Stanford CS153 lecture featuring Sualeh Asif (referred to as “Swal” or “Swali”), the CTO and co-founder of Cursor, an AI-powered code editor that has become one of the most popular tools for AI-assisted coding. The discussion provides an unusually candid and technically detailed look at the infrastructure challenges of running LLM-based services at massive scale. Cursor has scaled by a factor of 100 or more in the past year, processing approximately 100 million custom model calls per day on their self-hosted models alone, in addition to being a substantial portion of frontier model (Anthropic, OpenAI) API traffic globally.

Three Pillars of Cursor’s Infrastructure

Cursor’s production infrastructure is built around three core pillars:

Indexing Systems

The indexing infrastructure is responsible for understanding users' codebases, which range from small personal projects to enterprise-scale repositories comparable to companies like Instacart.

The indexing system uses a Merkle Tree structure for efficient synchronization between client and server. Each file gets hashed, and folder hashes are computed from their children up to a root hash. This allows efficient detection of changes when users return to their editor - instead of re-scanning everything, the system descends the tree to find only what has changed, enabling automatic re-indexing without user intervention.
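The synchronization mechanism described above can be sketched as follows. This is a minimal two-level illustration, not Cursor's actual implementation; the dict-based file layout and hash choices are assumptions.

```python
import hashlib

def _h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_tree(files: dict) -> dict:
    """Build a Merkle tree over {folder: {filename: content}}.

    Each file is hashed; each folder hash is derived from its children;
    the root hash is derived from all folder hashes.
    """
    tree = {"folders": {}, "root": ""}
    for folder, contents in sorted(files.items()):
        file_hashes = {name: _h(data) for name, data in contents.items()}
        folder_hash = _h("".join(file_hashes[n] for n in sorted(file_hashes)).encode())
        tree["folders"][folder] = {"hash": folder_hash, "files": file_hashes}
    tree["root"] = _h("".join(tree["folders"][f]["hash"] for f in sorted(tree["folders"])).encode())
    return tree

def changed_files(old: dict, new: dict) -> list:
    """Descend only into subtrees whose hashes differ; skip everything else."""
    if old["root"] == new["root"]:
        return []  # identical root hash: nothing changed anywhere
    changed = []
    for folder, node in new["folders"].items():
        old_node = old["folders"].get(folder)
        if old_node and old_node["hash"] == node["hash"]:
            continue  # entire folder subtree unchanged -- no per-file work
        for name, h in node["files"].items():
            if not old_node or old_node["files"].get(name) != h:
                changed.append(f"{folder}/{name}")
    return changed
```

When a user returns to the editor, comparing the stored tree against a freshly built one finds only the files that need re-indexing, without re-scanning the whole repository.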

Model Inference

Cursor runs both self-hosted custom models and frontier models from external providers. Autocomplete alone - which runs on every keystroke - generates approximately 20,000 model calls per second, served by a fleet of around 1,000-2,000 H100 GPUs distributed globally across multiple regions.

They attempted a Frankfurt deployment but encountered stability issues. The geographic distribution is critical for latency-sensitive features like autocomplete, where users in Japan still need a responsive experience.

Beyond autocomplete, they have an “apply model” that handles applying generated code changes to codebases - a particularly challenging inference problem that may involve 100,000 to 200,000 tokens while still needing to feel instantaneous to users.

Streaming Infrastructure

The third pillar encompasses data pipelines for storing incoming data, running continuous training/improvement processes, and background analytics. This isn’t user-facing in real-time but powers the continuous improvement of Cursor’s models and product.

Architectural Philosophy

Cursor operates on a monolithic architecture - "everything is one big monolith that we deploy." However, they've learned hard lessons about compartmentalization. Early on, an infinite loop in experimental code could take down critical services like chat or even login. The solution has been strict isolation to contain the blast radius: experimental code paths are segregated from core services that must remain available.
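One way to implement that kind of containment inside a monolith is a bounded pool plus a hard deadline around experimental paths - a sketch of the principle, not Cursor's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

# Experimental features share one bounded pool: runaway code can exhaust
# these four workers, but it cannot starve the core request path.
_EXPERIMENTAL_POOL = ThreadPoolExecutor(max_workers=4)

def run_experimental(fn, fallback, timeout_s=0.5):
    """Run an experimental code path with a deadline and a fallback.

    If fn hangs (e.g. an infinite loop) or raises, core features keep
    serving the fallback value instead of going down with it.
    """
    future = _EXPERIMENTAL_POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except Exception:  # includes TimeoutError from the deadline
        future.cancel()
        return fallback
```

A hypothetical call site: `run_experimental(lambda: new_reranker(query), fallback=default_results)` - chat and login never wait on the experiment.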

They’ve also adopted a philosophy of simplicity: “there’s a strict rule on the server because if it’s too complicated you don’t understand it, you can’t run it.” This pragmatic approach to managing complexity has become essential as the system has grown.

Database Lessons and Incident Response

The YugaByte to PostgreSQL Migration

The talk includes a detailed incident story from around September 2023 involving their indexing system. Initially, they chose YugaByte, a distributed database descended from Google’s Spanner, expecting infinite scalability. The reality was that they couldn’t get YugaByte to run reliably despite significant investment.

The lesson was stark: “Don’t choose a complicated database. Go with hyperscalers - they know what they’re doing, their tooling is really good. Use Postgres, don’t do anything complicated.”

The 22TB PostgreSQL Crisis

After moving to RDS PostgreSQL, the system ran beautifully for about eight months. Then disaster struck. The database had grown to 22TB of data (RDS has a 64TB limit), but the problem wasn’t storage space - it was PostgreSQL’s internal mechanisms.

The issue stemmed from their workload pattern: constant file updates. In PostgreSQL, an UPDATE is actually a DELETE plus an INSERT - it doesn’t modify records in place. With users constantly typing and updating files, the database accumulated massive amounts of dead tuples (tombstoned records). The VACUUM process responsible for cleaning up this dead data couldn’t keep pace, and the database began grinding to a halt.
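The failure mode can be illustrated with a toy MVCC model - a deliberate simplification, since real PostgreSQL tracks per-transaction tuple visibility, but it shows how update-heavy workloads outrun a budgeted vacuum:

```python
class MVCCTable:
    """Toy model of PostgreSQL's MVCC storage behavior."""

    def __init__(self):
        self.live = {}        # key -> current row version
        self.dead_tuples = 0  # old versions awaiting VACUUM

    def update(self, key, value):
        # Postgres-style MVCC: an UPDATE writes a new row version and
        # leaves the old one behind as a dead tuple until VACUUM runs.
        if key in self.live:
            self.dead_tuples += 1
        self.live[key] = value

    def vacuum(self, budget):
        # VACUUM reclaims at most `budget` dead tuples per pass; when the
        # update rate outpaces the budget, bloat grows without bound.
        reclaimed = min(budget, self.dead_tuples)
        self.dead_tuples -= reclaimed
        return reclaimed
```

With users constantly re-saving the same files, nearly every write is an update of an existing row - exactly the pattern that generates dead tuples fastest.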

AWS support couldn’t help. Even the RDS architects who built the database said they were “in big trouble.” The database eventually stopped booting entirely.

Real-Time Migration Under Pressure

The incident response involved parallel workstreams: one set of engineers attempted to repair and restore the failing database, while another rewrote the indexing store from scratch on top of object storage.

Remarkably, the object storage rewrite succeeded faster than any of the repair attempts. This led to a live migration to a completely new architecture during an active outage. The new system was deployed with essentially zero testing - “no tests needed, no nothing, I just wrote it two hours ago” - because the alternative was indefinite downtime.

The winning solution leveraged object storage (S3, R2, etc.), which the speaker describes as the most reliable infrastructure available. They use Turbopuffer for vector database capabilities on top of object storage. The lesson: “the best way to scale a database is to just not have a database.”
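The "no database" approach can be sketched as content-addressed writes to an object store. The `ObjectStore` stand-in and the key scheme below are assumptions for illustration; the talk does not describe Cursor's exact layout:

```python
import hashlib

class ObjectStore:
    """In-memory stand-in for S3/R2: put/get immutable blobs by key."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]

def put_chunk(store, data: bytes) -> str:
    # Content-addressed: the key is the hash of the data, so re-uploading
    # the same chunk is an idempotent overwrite. There are no dead row
    # versions to vacuum, and retries during an outage are always safe.
    key = "chunks/" + hashlib.sha256(data).hexdigest()
    store.put(key, data)
    return key
```

Because writes are idempotent and objects are immutable, the update-heavy workload that broke PostgreSQL's VACUUM becomes a stream of cheap, safe overwrites.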

Working with Frontier Model Providers

Cursor is one of the largest consumers of frontier model APIs globally. The speaker describes the relationship with providers as ongoing “live negotiations” reminiscent of early cloud computing.

Key challenges include ongoing capacity and rate-limit negotiations, balancing traffic across multiple providers, and the cold-start problem during incident recovery.

The cold-start problem is particularly challenging during incident recovery. If all nodes go down and you bring up 10 of your 1,000 nodes, every user’s requests will immediately overwhelm those 10 nodes before they can become healthy. Various mitigation strategies include prioritizing certain users, killing traffic entirely, or (like WhatsApp) bringing up specific prefixes first.
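A minimal sketch of admission control during that kind of recovery - the priority tiers and the proportional shed policy are illustrative assumptions, not Cursor's documented approach:

```python
import random

def admit(priority: str, healthy_nodes: int, total_nodes: int,
          rng=random.random) -> bool:
    """Decide whether to admit a request while capacity is recovering.

    Shed load in proportion to missing capacity so the few healthy
    nodes are not overwhelmed before they can warm up.
    """
    capacity = healthy_nodes / total_nodes
    if priority == "high":
        return True  # assumption: priority users are always admitted
    # Low-priority traffic is admitted with probability equal to the
    # fraction of the fleet that is currently healthy.
    return rng() < capacity
```

With 10 of 1,000 nodes up, 99% of low-priority traffic is shed instead of crushing the survivors; as nodes recover, admission rises automatically.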

Security Considerations

Given that Cursor processes proprietary code for many users, security is a significant concern. They implement encryption at the vector database level - embeddings are encrypted with keys that live on user devices. Even if someone compromised the vector database (which would require breaching Google Cloud security), they couldn’t decode the vectors without the encryption keys.

The speaker notes this is belt-and-suspenders security: “I’m 99.99% sure there’s no way to go from a vector to code, but because it’s not some provable fact about the world, it’s better to just have this encryption key.”
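The client-held-key scheme can be sketched as follows. The hash-counter keystream here is purely a stand-in for a real cipher (production code should use a vetted AEAD such as AES-GCM); the point is that the key never leaves the user's device, so the server only ever stores ciphertext:

```python
import hashlib

def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Hash-counter keystream: an educational stand-in for a real stream
    # cipher (e.g. AES-CTR). Do NOT use this construction in production.
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def encrypt_vector(key: bytes, nonce: bytes, vec: bytes) -> bytes:
    # XOR the serialized embedding with the keystream. The key lives
    # only on the client; a compromised vector DB yields ciphertext.
    return bytes(a ^ b for a, b in zip(vec, _keystream(key, nonce, len(vec))))

decrypt_vector = encrypt_vector  # XOR is its own inverse
```

Each stored vector would use a fresh nonce so keystreams are never reused across embeddings.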

Abuse Prevention

Free tier abuse is an ongoing challenge. Recent attempts include someone creating hundreds of thousands of Hotmail accounts, distributing requests across 10,000 IP addresses to evade rate limiting. Significant engineering effort goes into analyzing traffic patterns to block such abuse while maintaining service for legitimate users.
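Per-IP limits fail against that pattern, since 10,000 addresses dilute any single counter. A sketch of a sliding-window limiter keyed by a stable identity instead - the keying choice and limits are illustrative assumptions:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Rate limiter keyed by a stable identity (account, signup
    fingerprint) rather than source IP, since abusers rotate addresses."""

    def __init__(self, limit: int, window_s: float, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._hits = defaultdict(deque)  # identity -> timestamps in window

    def allow(self, identity: str) -> bool:
        now = self.clock()
        hits = self._hits[identity]
        # Drop hits that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

The injectable `clock` makes the policy testable; in production the identity key would come from account analysis rather than the request's IP.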

Lessons for LLMOps at Scale

Several key takeaways emerge from Cursor's experience: prefer simple, well-understood databases over exotic distributed ones; strictly isolate experimental code paths from core services; lean on object storage rather than databases where possible; and plan for cold-start behavior during incident recovery.

The speaker also notes that LLM-assisted development has actually enabled them to cover more infrastructure surface area with a small team - rather than eliminating engineering jobs, the tools enable more ambitious system designs by handling routine aspects of implementation.
