Windsurf began as a GPU virtualization company but pivoted in 2022 when they recognized the transformative potential of large language models. They developed an AI-powered development environment that evolved from a VS Code extension to a full-fledged IDE, incorporating advanced code understanding and generation capabilities. The product now serves hundreds of thousands of daily active users, including major enterprises, and has achieved significant success in automating software development tasks while maintaining high precision through sophisticated evaluation systems.
Windsurf represents a fascinating case study in LLMOps evolution, having undergone a dramatic pivot from GPU virtualization infrastructure to becoming one of the leading AI-powered code editors. The company, founded by Varun Mohan and his co-founder, began as Exafunction in 2020, managing GPU infrastructure for deep learning workloads. By mid-2022, they were managing over 10,000 GPUs and generating several million dollars in revenue. However, recognizing that the rise of transformer models would commoditize their infrastructure business, they made a bold “bet the company” decision to pivot entirely to AI coding tools within a single weekend.
The company’s journey illustrates several key LLMOps principles: the importance of owning inference infrastructure, the value of custom model training for specific use cases, sophisticated context retrieval beyond simple RAG, and rigorous evaluation systems for continuous improvement.
One of Windsurf’s key competitive advantages came directly from their GPU virtualization background. When they initially launched their VS Code extension (Codeium), they were able to offer the product for free because they had built their own inference runtime. This infrastructure heritage meant they could run models efficiently without relying on expensive third-party API calls, giving them both cost advantages and performance control.
This is a critical LLMOps lesson: companies with deep infrastructure expertise can leverage that knowledge when pivoting to AI applications. The ability to run inference efficiently at scale is a significant operational advantage, particularly for latency-sensitive applications like code autocomplete where suggestions need to appear in real-time as developers type.
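To make the latency constraint concrete, here is a minimal sketch of the debounce-and-cancel pattern a real-time autocomplete client typically needs, so that only the latest keystroke’s request reaches the inference backend. The `fetch_completion` and `render_ghost_text` calls are hypothetical placeholders for illustration, not Windsurf’s actual API.

```python
import asyncio

class AutocompleteClient:
    """Debounce keystrokes and cancel stale inference requests.

    fetch_completion / render_ghost_text are hypothetical placeholders;
    the debounce-and-cancel pattern is the point, not the endpoints.
    """

    def __init__(self, debounce_ms: int = 75):
        self._debounce_s = debounce_ms / 1000
        self._inflight = None  # task for the most recent keystroke

    def on_keystroke(self, prefix: str, suffix: str) -> None:
        # A newer keystroke makes any in-flight request stale: cancel it.
        if self._inflight and not self._inflight.done():
            self._inflight.cancel()
        self._inflight = asyncio.create_task(self._request(prefix, suffix))

    async def _request(self, prefix: str, suffix: str) -> None:
        try:
            await asyncio.sleep(self._debounce_s)  # wait out the typing burst
            text = await fetch_completion(prefix, suffix)  # hypothetical backend call
            render_ghost_text(text)  # hypothetical editor callback
        except asyncio.CancelledError:
            pass  # superseded by a newer keystroke; drop silently
```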
Rather than relying solely on off-the-shelf models, Windsurf invested heavily in training their own specialized code models. The first product they shipped used an open-source model that was “materially worse than GitHub Copilot,” but within two months they had trained custom models that surpassed Copilot in key capabilities.
A notable example was their fill-in-the-middle (FIM) capability. Traditional code models were trained primarily on complete code, but developers frequently need to insert code between existing lines where the surrounding context looks nothing like typical training data. Windsurf trained models specifically for this use case, achieving state-of-the-art performance in autocomplete suggestions for incomplete code contexts. This demonstrates a key LLMOps principle: general-purpose foundation models often need task-specific fine-tuning to excel in production applications.
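As an illustration, FIM training typically relies on sentinel tokens that mark the prefix, the suffix, and the missing middle. The tokens below follow the convention used by open code models such as StarCoder; Windsurf’s internal format is not public, so treat this as a generic sketch.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt.

    Sentinel tokens follow the StarCoder-style convention; a FIM-trained
    model learns to emit the missing middle after <fim_middle>.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


# The cursor sits inside an existing function body:
prefix = "def mean(xs):\n    total = "
suffix = "\n    return total / len(xs)\n"
print(build_fim_prompt(prefix, suffix))
# A FIM-trained model should complete something like: sum(xs)
```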
Perhaps the most technically interesting aspect of Windsurf’s architecture is their approach to context retrieval for large codebases. The company explicitly rejected the standard “vector database RAG” approach that became prevalent in the industry, instead building a multi-layered retrieval system.
Their retrieval pipeline combines multiple techniques rather than a single vector index: embedding-based semantic search for fuzzy, intent-level queries; keyword and exact-match search for exhaustive operations; and structural parsing of the code itself to respect symbol and file boundaries.
The rationale for this complexity is precision and recall. As Varun explained, if a developer asks to “upgrade all versions of this API,” embedding search might find 5 out of 10 instances, which is not useful; such operations demand near-perfect recall. The multi-pronged retrieval approach ensures the system can handle both semantic queries and exhaustive searches.
This is a crucial lesson for LLMOps practitioners: RAG is not a monolithic solution. The best retrieval systems often combine multiple approaches tailored to the specific domain and query types expected.
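A minimal sketch of what such a hybrid retriever can look like, assuming a pre-chunked codebase and an `embed` function that maps text to a vector; the weights and scoring here are illustrative assumptions, not Windsurf’s actual pipeline. For exhaustive operations like the API-upgrade example, the keyword pass would run over every chunk rather than stopping at top-k.

```python
import math
import re

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_retrieve(query, chunks, embed, top_k=20):
    """Blend semantic similarity with exact keyword overlap.

    chunks: pre-split code snippets; embed: text -> vector.
    The 0.6 / 0.4 weights are illustrative assumptions.
    """
    q_vec = embed(query)
    keywords = set(re.findall(r"\w+", query.lower()))

    scored = []
    for chunk in chunks:
        semantic = cosine(q_vec, embed(chunk))  # fuzzy, intent-level match
        tokens = set(re.findall(r"\w+", chunk.lower()))
        exact = len(keywords & tokens) / max(len(keywords), 1)  # recall-oriented
        scored.append((0.6 * semantic + 0.4 * exact, chunk))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```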
Windsurf places enormous emphasis on evaluation systems, which they attribute partly to their founders’ background in autonomous vehicles—systems where you cannot simply “YOLO” and let software run without rigorous testing.
Their evaluation approach leverages a unique property of code: it can be executed. They developed eval systems using open-source projects with commits that include tests. The evaluation pipeline, in essence, rewinds the repository to just before such a commit, lets the system attempt the change, and then runs the commit’s tests to grade the generated code.
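A minimal harness for this kind of commit-replay eval might look like the sketch below. The `agent` callable, the `tests/` layout, and the use of pytest are assumptions for illustration; the real pipeline is not public.

```python
import subprocess

def eval_commit(repo_dir: str, commit: str, agent) -> bool:
    """Replay one commit as an eval case: rewind the repo, let the
    system attempt the change, and let the commit's own tests grade it.

    `agent` is a hypothetical callable that edits files in repo_dir.
    """
    # Rewind to the state just before the change under test.
    subprocess.run(["git", "checkout", f"{commit}~1"],
                   cwd=repo_dir, check=True)

    # Use the commit message as the task description for the agent.
    task = subprocess.run(["git", "log", "-1", "--format=%B", commit],
                          cwd=repo_dir, capture_output=True,
                          text=True, check=True).stdout
    agent(task=task, workdir=repo_dir)

    # Restore the commit's tests (but not its implementation) and run them.
    subprocess.run(["git", "checkout", commit, "--", "tests/"],
                   cwd=repo_dir, check=True)
    return subprocess.run(["pytest", "-q"], cwd=repo_dir).returncode == 0
```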
They also test “masked” scenarios similar to Google’s approach—providing partial changes and testing whether the system can infer intent and complete the work.
This granular evaluation allows them to measure each stage of the system in isolation: whether retrieval surfaced the right context, and whether the generated change actually passes the tests.
Having these metrics creates a “hill to climb”—a clear optimization target before adding system complexity. This disciplined approach prevents the common trap of adding complexity without measurable benefit.
The company acknowledges that not everything can be captured in formal evaluations. Some improvements start as “vibes”—intuitions about what would help users, like leveraging open files in a codebase. User behavior data helps validate these intuitions, and formal evals can be built afterward. This hybrid approach of quantitative evaluation plus qualitative product sense reflects mature LLMOps practice.
Windsurf serves enterprise customers with extremely large codebases—some exceeding 100 million lines of code. Companies like Dell and JP Morgan Chase have tens of thousands of developers using the product. This required significant investment in security and compliance, containerized and self-hosted deployment options, and indexing infrastructure that scales to codebases of that size.
The enterprise focus meant building a go-to-market team alongside engineering, which is unusual for early-stage startups but necessary for Fortune 500 sales cycles.
A key architectural decision was building cross-IDE support early (VS Code, JetBrains, Eclipse, Vim). Rather than building separate products, they architected shared infrastructure with thin per-editor layers, a choice that prevented significant technical debt as they scaled; a sketch of the pattern follows.
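One plausible shape for that architecture, sketched under assumptions (class and method names here are illustrative, not Codeium’s actual interfaces): all heavy lifting lives in a shared core, and each editor plug-in is a thin adapter that translates editor events into core calls.

```python
from dataclasses import dataclass

@dataclass
class CompletionRequest:
    file_path: str
    prefix: str  # text before the cursor
    suffix: str  # text after the cursor

class SharedCore:
    """Editor-agnostic engine: context retrieval, inference, ranking."""

    def complete(self, req: CompletionRequest) -> str:
        context = self._retrieve_context(req)
        return self._call_model(req, context)

    def _retrieve_context(self, req: CompletionRequest) -> str:
        return ""  # retrieval pipeline stubbed for the sketch

    def _call_model(self, req: CompletionRequest, context: str) -> str:
        return "<completion>"  # inference call stubbed for the sketch

class VSCodeAdapter:
    """Thin per-editor layer: translate events, delegate everything else."""

    def __init__(self, core: SharedCore):
        self.core = core

    def on_type(self, path: str, text: str, cursor: int) -> str:
        req = CompletionRequest(path, text[:cursor], text[cursor:])
        return self.core.complete(req)  # rendering stays editor-specific
```

A JetBrains or Vim adapter would differ only in how it reads the buffer and renders the result; the core never changes.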
In 2024, the company forked VS Code to create their own IDE (Windsurf), shipping it within three months. The motivation was that VS Code’s extension model limited the agentic capabilities they wanted to deliver. They believed developers would increasingly spend time reviewing AI-generated code rather than writing it, requiring a fundamentally different interface.
Windsurf was reportedly the first “agentic editor”—moving beyond chat-based interfaces to autonomous code modification. Their philosophical approach differed from competitors: rather than running the agent in a separate chat silo, they maintain a single shared timeline of human and AI actions, so the agent always works from the developer’s latest state and vice versa.
Looking forward, they’re thinking about scenarios with multiple agents operating on a codebase simultaneously. This introduces challenges around conflicting edits, shared state, and merging parallel streams of work back together.
They’re exploring git worktrees and other mechanisms to support this future architecture.
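`git worktree` is a real Git mechanism here: it lets one repository check out several branches into separate directories simultaneously. A minimal sketch of handing each agent an isolated working copy (the agent runner itself is hypothetical):

```python
import subprocess

def spawn_agent_worktree(repo_dir: str, agent_id: str) -> str:
    """Create an isolated checkout for one agent via `git worktree`.

    Each agent gets its own branch and directory, so parallel edits
    cannot clobber each other; merging back is an explicit later step.
    """
    branch = f"agent/{agent_id}"
    path = f"{repo_dir}-wt-{agent_id}"
    subprocess.run(["git", "worktree", "add", "-b", branch, path],
                   cwd=repo_dir, check=True)
    return path  # hand this directory to the agent process

# Three agents, three isolated working copies of the same repository:
# for i in range(3):
#     workdir = spawn_agent_worktree("/repos/app", f"task-{i}")
```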
Remarkably, the engineering team remained under 25 people even as they shipped major products; the company deliberately kept this structure lean.
They did maintain a larger go-to-market team due to enterprise sales requirements.
Their hiring approach, too, evolved alongside AI capabilities.
A distinctive aspect of Windsurf’s culture is accepting—even celebrating—that many bets won’t work. Varun explicitly states he’s happy when only 50% of initiatives succeed, viewing 100% success as a warning sign of insufficient ambition, hubris, or failure to test hypotheses at the frontier.
This connects to their view that competitive advantages are “depreciating insights”—what works today won’t work tomorrow. Continuous innovation is required, and pivots should be treated as “a badge of honor.”
The Windsurf case study offers several generalizable lessons for LLMOps practitioners:
Infrastructure ownership matters: Having their own inference runtime gave cost and performance advantages that enabled rapid iteration and competitive pricing.
Custom training for specialized tasks: Generic foundation models often need domain-specific fine-tuning. The fill-in-the-middle capability exemplifies how task-specific training can create differentiation.
RAG is not enough: Complex retrieval often requires combining multiple approaches (vector search, keyword search, structural parsing) tailored to domain requirements.
Evaluation systems enable complexity: Without rigorous evals, it’s impossible to know whether architectural additions improve outcomes. The autonomous vehicle background instilled this discipline.
Balance evals with user data: Not everything can be pre-evaluated. Production user behavior provides essential feedback for improving AI systems.
Agent architectures require unified context: Maintaining a single timeline of human and AI actions enables coherent agentic behavior.
The product surface matters: Sometimes delivering AI capabilities requires owning more of the stack (hence forking VS Code).
Depreciation of insights: Competitive advantages in AI are temporary. Continuous innovation and willingness to pivot are essential for survival.