ZenML

Building Enterprise-Ready AI Development Infrastructure from Day One

Windsurf 2024

Codeium's journey in building their AI-powered development tools showcases how investing early in enterprise-ready infrastructure, including containerization, security, and comprehensive deployment options, enabled them to scale from individual developers to large enterprise customers. Their "go slow to go fast" approach in building proprietary infrastructure for code completion, retrieval, and agent-based development culminated in Windsurf IDE, demonstrating how thoughtful early architectural decisions can create a more robust foundation for AI tools in production.

Industry: Tech

Overview

Windsurf (formerly known as Codeium) represents an interesting case study in building and scaling LLM-powered developer tools for both individual developers and large enterprises. The company evolved from a GPU virtualization company into a maker of AI-powered coding assistants, ultimately launching its own IDE, Windsurf, with an agentic code-assistance feature called Cascade. Their approach demonstrates several key LLMOps principles around model selection, evaluation infrastructure, retrieval systems, and balancing first-party versus third-party model usage.

Technical Architecture and Model Strategy

Windsurf employs a hybrid model strategy that leverages both proprietary and third-party models based on the specific requirements of each feature:

For autocomplete and supercomplete features that run on every keystroke, Windsurf uses entirely proprietary models. This decision stems from the observation that fill-in-the-middle (FIM) capabilities in major commercial models like Claude and GPT-4 are still quite poor. The team notes that large model providers have focused primarily on chat-like assistant APIs optimized for multi-turn conversations, making them suboptimal for the specific requirements of inline code completion where precise, context-aware insertions are critical.
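
The mismatch between chat APIs and inline completion comes down to prompt shape. A minimal sketch of fill-in-the-middle prompt construction, using the token convention popularized by open FIM-trained models such as StarCoder (Windsurf's proprietary format is not public):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out cursor context in the prefix/suffix/middle shape expected
    by FIM-trained completion models. Token names follow the open
    StarCoder convention; a proprietary model defines its own."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# On every keystroke the editor splits the buffer at the cursor:
before_cursor = "def mean(xs):\n    return "
after_cursor = "\n\nprint(mean([1, 2, 3]))\n"

prompt = build_fim_prompt(before_cursor, after_cursor)
# The model generates only the missing middle span (e.g. a single
# expression), a precise inline insertion that multi-turn chat APIs
# are not optimized to produce.
```

A chat-tuned model given the same context tends to reply with explanations or a full rewritten function, which is exactly the wrong shape for an insertion at the cursor.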

For high-level planning and reasoning tasks within their agentic Cascade system, Windsurf relies on third-party models from Anthropic (Claude) and OpenAI. The team acknowledges that these providers currently have the best products for planning capabilities, and that Cascade would not have been possible without the rapid improvements in models like GPT-4 and Claude 3.5.

For retrieval, Windsurf has built custom models and distributed systems rather than relying purely on embeddings. Their insight is that while embeddings work for many retrieval use cases, complex queries require more sophisticated approaches. They give the example of finding “all quadratic time algorithms in a codebase” - a query that embeddings cannot encapsulate effectively since they cannot capture the semantic meaning that a particular function has quadratic time complexity. To address this, they’ve built large distributed systems that run custom LLMs at scale across large codebases to perform higher-quality retrieval.
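
A minimal sketch of what embedding-free retrieval could look like: instead of ranking chunks by vector similarity, a model renders a verdict on each chunk. The `llm_says_quadratic` stub below is a hypothetical stand-in (a crude nested-loop heuristic) for the custom LLM call, which Windsurf has not published:

```python
def llm_says_quadratic(chunk: str) -> bool:
    # Hypothetical stand-in for a custom LLM judging one chunk; a crude
    # proxy (one for-loop nested inside another) mimics its verdict.
    loop_indents = [len(line) - len(line.lstrip())
                    for line in chunk.splitlines()
                    if line.lstrip().startswith("for ")]
    return len(loop_indents) >= 2 and max(loop_indents) > min(loop_indents)

def retrieve_quadratic(chunks: dict[str, str]) -> list[str]:
    # Fan the judgment out over every chunk; at scale this becomes the
    # distributed map over a codebase that the case study describes.
    return [name for name, code in chunks.items() if llm_says_quadratic(code)]

codebase = {
    "bubble_sort": ("def bubble(xs):\n"
                    "    for i in range(len(xs)):\n"
                    "        for j in range(len(xs) - 1 - i):\n"
                    "            if xs[j] > xs[j + 1]:\n"
                    "                xs[j], xs[j + 1] = xs[j + 1], xs[j]\n"),
    "linear_scan": ("def scan(xs, target):\n"
                    "    for x in xs:\n"
                    "        if x == target:\n"
                    "            return True\n"
                    "    return False\n"),
}
print(retrieve_quadratic(codebase))  # only the nested-loop function matches
```

The point of the sketch is the architecture, not the heuristic: a per-chunk semantic judgment can answer queries that no single embedding vector can encode.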

Evaluation Infrastructure

Windsurf has invested heavily in building proprietary evaluation systems, reflecting a philosophy that most existing benchmarks for software development are inadequate for their use cases. Their critique of existing benchmarks like SWE-Bench and HumanEval is that they don’t reflect actual professional software development work.

Their evaluation approach converts what would be a discontinuous, discrete problem (a PR either works or it doesn’t) into a continuous optimization problem that can be systematically improved.
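
The exact scoring scheme is not public, but one illustrative way to make "does the PR work" continuous is to grade an attempt by the fraction of a task's tests that pass rather than all-or-nothing:

```python
def continuous_pr_score(test_results: list[bool]) -> float:
    # Hypothetical scoring: grade an attempted change by the fraction of
    # the task's tests that pass, instead of all-or-nothing, so small
    # model improvements move the metric before a full pass is reached.
    return sum(test_results) / len(test_results) if test_results else 0.0

# Under a binary "PR works" metric both attempts below simply fail;
# the continuous score separates a near-miss from a total miss.
near_miss = continuous_pr_score([True, True, True, False])
total_miss = continuous_pr_score([False, False, False, False])
assert near_miss > total_miss
```

A continuous signal like this gives the team a gradient to climb: model changes that turn a total miss into a near-miss register as progress long before any PR fully lands.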

For retrieval specifically, they’ve built custom evaluations that look at “retrieval at 50” rather than “retrieval at 1” - recognizing that code is a distributed knowledge store where you need to pull in snippets from many different parts of a codebase to accomplish tasks. They create golden sets by looking at old commits to identify which files were edited together, creating semantic groupings that might not be apparent from code graph analysis alone.
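
The commit-based golden sets and the "retrieval at 50" metric can be sketched as follows (the function names and toy history are illustrative, not Windsurf's actual pipeline):

```python
from collections import defaultdict

def golden_sets_from_commits(commits: list[list[str]]) -> dict[str, set[str]]:
    # Files edited in the same commit form a semantic grouping: for each
    # file, its golden set is every file that historically co-changed
    # with it, even when no import edge links them in the code graph.
    related: dict[str, set[str]] = defaultdict(set)
    for files in commits:
        for f in files:
            related[f] |= set(files) - {f}
    return dict(related)

def recall_at_k(retrieved: list[str], golden: set[str], k: int = 50) -> float:
    # "Retrieval at 50": what fraction of the golden set shows up in the
    # top-k results, since a task may need snippets from many files.
    return len(set(retrieved[:k]) & golden) / len(golden) if golden else 0.0

history = [
    ["api.py", "schema.py", "tests/test_api.py"],
    ["api.py", "client.py"],
]
golden = golden_sets_from_commits(history)["api.py"]
score = recall_at_k(["schema.py", "utils.py", "client.py"], golden)
# golden covers three co-edited files; two of them were retrieved
```

Measuring at k=50 instead of k=1 rewards a retriever for surfacing the whole distributed context a task needs, not just the single most similar snippet.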

Preference Learning and Feedback Loops

A key advantage of being embedded in the IDE is access to rich preference data for model improvement. Unlike chat-based interfaces, where the only feedback is explicit, Windsurf can observe how developers actually interact with each suggestion: whether it is accepted, rejected, or edited immediately afterward.

This allows them to train on preference data that goes beyond simple acceptance metrics. If a developer accepts a suggestion but then deletes parts of it, that is valuable signal for improving future suggestions. The team combines synthetic data with this preference data to improve their supercomplete product over time.
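
One simple way to quantify the accept-then-edit signal is to measure how much of an accepted suggestion survives the developer's follow-up edits; the `retention_score` helper below is a hypothetical illustration, not Windsurf's training pipeline:

```python
import difflib

def retention_score(accepted: str, final: str) -> float:
    # Fraction of an accepted suggestion's characters that survive the
    # developer's follow-up edits. 1.0 means full retention; heavy
    # post-acceptance deletion yields a low score, a richer training
    # signal than a binary accept/reject bit.
    matcher = difflib.SequenceMatcher(None, accepted, final)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(accepted) if accepted else 0.0

suggestion = "result = compute(x)\nlog(result)\nreturn result\n"
after_edits = "result = compute(x)\nreturn result\n"  # log line deleted
score = retention_score(suggestion, after_edits)
# score falls below 1.0: the deletion becomes preference data for
# ranking future suggestions, even though the suggestion was accepted.
```

Pairs of (suggestion, retention) like this can then feed a preference-learning objective, rewarding completions that developers keep rather than merely accept.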

The Case for Building a Custom IDE

Windsurf’s decision to build their own IDE rather than continuing as a VS Code extension reflects specific technical limitations they encountered while operating within the extension model.

Enterprise Considerations

The company has deliberately built infrastructure that serves both individual developers and large enterprises with the same systems, spanning containerization, security, and a range of deployment options.

Their enterprise product achieved $10M ARR in under a year, which they attribute partly to having a product that developers genuinely love using - enterprise software decisions often involve asking developers directly about their experience.

Agentic Capabilities and Future Direction

The Cascade system represents their agentic approach, combining third-party planning models with their proprietary retrieval and code-generation infrastructure.

The team acknowledges that the system still has limitations, which they are actively working to address.

Lessons on Build vs. Buy

The team has revised their earlier thinking on first-party versus third-party models. In 2022, they believed building first-party was essential. Now they take a more nuanced view: build proprietary models where commercial offerings fall short (inline completion, retrieval) and lean on third-party models where providers lead (high-level planning and reasoning).

Platform and Language Diversity

A notable aspect of their infrastructure work is supporting the actual diversity of environments in which enterprise developers work, across platforms and programming languages.

This reality-grounded approach to supporting where developers actually work (rather than just Silicon Valley preferences) has been important for enterprise adoption.

Skeptical Engineering Culture

The company intentionally maintains engineers who are skeptical of AI claims - many from autonomous vehicle backgrounds where the gap between demos and production reliability is well understood. This creates healthy tension between enthusiastic adoption and realistic assessment of what actually works, helping them avoid optimizing for benchmarks that don’t reflect real-world value and kill bad ideas before they waste resources.
