ZenML

Building Enterprise-Ready AI Development Infrastructure from Day One

Windsurf 2024

Codeium's journey in building their AI-powered development tools showcases how investing early in enterprise-ready infrastructure, including containerization, security, and comprehensive deployment options, enabled them to scale from individual developers to large enterprise customers. Their "go slow to go fast" approach in building proprietary infrastructure for code completion, retrieval, and agent-based development culminated in Windsurf IDE, demonstrating how thoughtful early architectural decisions can create a more robust foundation for AI tools in production.

Industry: Tech

Overview

Windsurf (formerly known as Codeium) represents an interesting case study in building and scaling LLM-powered developer tools for both individual developers and large enterprises. The company evolved from a GPU virtualization company into a maker of AI-powered coding assistants, ultimately launching its own IDE, Windsurf, with an agentic code-assistance feature called Cascade. Their approach demonstrates several key LLMOps principles around model selection, evaluation infrastructure, retrieval systems, and balancing first-party versus third-party model usage.

Technical Architecture and Model Strategy

Windsurf employs a hybrid model strategy that leverages both proprietary and third-party models based on the specific requirements of each feature:

For autocomplete and supercomplete features that run on every keystroke, Windsurf uses entirely proprietary models. This decision stems from the observation that fill-in-the-middle (FIM) capabilities in major commercial models like Claude and GPT-4 are still quite poor. The team notes that large model providers have focused primarily on chat-like assistant APIs optimized for multi-turn conversations, making them suboptimal for the specific requirements of inline code completion where precise, context-aware insertions are critical.
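
The mismatch between chat APIs and inline completion comes down to prompt shape. A minimal sketch of fill-in-the-middle prompt construction, using the token convention popularized by open FIM-trained models such as StarCoder (Windsurf's proprietary format is not public):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out cursor context in the prefix/suffix/middle shape expected
    by FIM-trained completion models. Token names follow the open
    StarCoder convention; a proprietary model defines its own."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# On every keystroke the editor splits the buffer at the cursor:
before_cursor = "def mean(xs):\n    return "
after_cursor = "\n\nprint(mean([1, 2, 3]))\n"

prompt = build_fim_prompt(before_cursor, after_cursor)
# The model generates only the missing middle span (e.g. a single
# expression), a precise inline insertion that multi-turn chat APIs
# are not optimized to produce.
```

A chat-tuned model given the same context tends to reply with explanations or a full rewritten function, which is exactly the wrong shape for an insertion at the cursor.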

For high-level planning and reasoning tasks within their agentic Cascade system, Windsurf relies on third-party models from Anthropic (Claude) and OpenAI. The team acknowledges that these providers currently have the best products for planning capabilities, and that Cascade would not have been possible without the rapid improvements in models like GPT-4 and Claude 3.5.

For retrieval, Windsurf has built custom models and distributed systems rather than relying purely on embeddings. Their insight is that while embeddings work for many retrieval use cases, complex queries require more sophisticated approaches. They give the example of finding “all quadratic time algorithms in a codebase” - a query that embeddings cannot encapsulate effectively since they cannot capture the semantic meaning that a particular function has quadratic time complexity. To address this, they’ve built large distributed systems that run custom LLMs at scale across large codebases to perform higher-quality retrieval.
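
A minimal sketch of what embedding-free retrieval could look like: instead of ranking chunks by vector similarity, a model renders a verdict on each chunk. The `llm_says_quadratic` stub below is a hypothetical stand-in (a crude nested-loop heuristic) for the custom LLM call, which Windsurf has not published:

```python
def llm_says_quadratic(chunk: str) -> bool:
    # Hypothetical stand-in for a custom LLM judging one chunk; a crude
    # proxy (one for-loop nested inside another) mimics its verdict.
    loop_indents = [len(line) - len(line.lstrip())
                    for line in chunk.splitlines()
                    if line.lstrip().startswith("for ")]
    return len(loop_indents) >= 2 and max(loop_indents) > min(loop_indents)

def retrieve_quadratic(chunks: dict[str, str]) -> list[str]:
    # Fan the judgment out over every chunk; at scale this becomes the
    # distributed map over a codebase that the case study describes.
    return [name for name, code in chunks.items() if llm_says_quadratic(code)]

codebase = {
    "bubble_sort": ("def bubble(xs):\n"
                    "    for i in range(len(xs)):\n"
                    "        for j in range(len(xs) - 1 - i):\n"
                    "            if xs[j] > xs[j + 1]:\n"
                    "                xs[j], xs[j + 1] = xs[j + 1], xs[j]\n"),
    "linear_scan": ("def scan(xs, target):\n"
                    "    for x in xs:\n"
                    "        if x == target:\n"
                    "            return True\n"
                    "    return False\n"),
}
print(retrieve_quadratic(codebase))  # only the nested-loop function matches
```

The point of the sketch is the architecture, not the heuristic: a per-chunk semantic judgment can answer queries that no single embedding vector can encode.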

Evaluation Infrastructure

Windsurf has invested heavily in building proprietary evaluation systems, reflecting a philosophy that most existing benchmarks for software development are inadequate for their use cases. Their critique of existing benchmarks like SWE-Bench and HumanEval is that they don’t reflect actual professional software development work.

Their evaluation approach converts what would be a discontinuous, discrete problem (a PR either works or it doesn’t) into a continuous optimization problem that can be systematically improved.
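
The exact scoring scheme is not public, but one illustrative way to make "does the PR work" continuous is to grade an attempt by the fraction of a task's tests that pass rather than all-or-nothing:

```python
def continuous_pr_score(test_results: list[bool]) -> float:
    # Hypothetical scoring: grade an attempted change by the fraction of
    # the task's tests that pass, instead of all-or-nothing, so small
    # model improvements move the metric before a full pass is reached.
    return sum(test_results) / len(test_results) if test_results else 0.0

# Under a binary "PR works" metric both attempts below simply fail;
# the continuous score separates a near-miss from a total miss.
near_miss = continuous_pr_score([True, True, True, False])
total_miss = continuous_pr_score([False, False, False, False])
assert near_miss > total_miss
```

A continuous signal like this gives the team a gradient to climb: model changes that turn a total miss into a near-miss register as progress long before any PR fully lands.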

For retrieval specifically, they’ve built custom evaluations that look at “retrieval at 50” rather than “retrieval at 1” - recognizing that code is a distributed knowledge store where you need to pull in snippets from many different parts of a codebase to accomplish tasks. They create golden sets by looking at old commits to identify which files were edited together, creating semantic groupings that might not be apparent from code graph analysis alone.
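
The commit-based golden sets and the "retrieval at 50" metric can be sketched as follows (the function names and toy history are illustrative, not Windsurf's actual pipeline):

```python
from collections import defaultdict

def golden_sets_from_commits(commits: list[list[str]]) -> dict[str, set[str]]:
    # Files edited in the same commit form a semantic grouping: for each
    # file, its golden set is every file that historically co-changed
    # with it, even when no import edge links them in the code graph.
    related: dict[str, set[str]] = defaultdict(set)
    for files in commits:
        for f in files:
            related[f] |= set(files) - {f}
    return dict(related)

def recall_at_k(retrieved: list[str], golden: set[str], k: int = 50) -> float:
    # "Retrieval at 50": what fraction of the golden set shows up in the
    # top-k results, since a task may need snippets from many files.
    return len(set(retrieved[:k]) & golden) / len(golden) if golden else 0.0

history = [
    ["api.py", "schema.py", "tests/test_api.py"],
    ["api.py", "client.py"],
]
golden = golden_sets_from_commits(history)["api.py"]
score = recall_at_k(["schema.py", "utils.py", "client.py"], golden)
# golden covers three co-edited files; two of them were retrieved
```

Measuring at k=50 instead of k=1 rewards a retriever for surfacing the whole distributed context a task needs, not just the single most similar snippet.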

Preference Learning and Feedback Loops

A key advantage of being embedded in the IDE is access to rich preference data for model improvement. Unlike chat-based interfaces, where the only feedback is explicit, Windsurf can observe how developers actually interact with each suggestion: whether it is accepted, rejected, or edited immediately afterward.

This allows them to train on preference data that goes beyond simple acceptance metrics. If a developer accepts a suggestion but then deletes parts of it, that is valuable signal for improving future suggestions. The team combines synthetic data with this preference data to improve their supercomplete product over time.
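
One simple way to quantify the accept-then-edit signal is to measure how much of an accepted suggestion survives the developer's follow-up edits; the `retention_score` helper below is a hypothetical illustration, not Windsurf's training pipeline:

```python
import difflib

def retention_score(accepted: str, final: str) -> float:
    # Fraction of an accepted suggestion's characters that survive the
    # developer's follow-up edits. 1.0 means full retention; heavy
    # post-acceptance deletion yields a low score, a richer training
    # signal than a binary accept/reject bit.
    matcher = difflib.SequenceMatcher(None, accepted, final)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(accepted) if accepted else 0.0

suggestion = "result = compute(x)\nlog(result)\nreturn result\n"
after_edits = "result = compute(x)\nreturn result\n"  # log line deleted
score = retention_score(suggestion, after_edits)
# score falls below 1.0: the deletion becomes preference data for
# ranking future suggestions, even though the suggestion was accepted.
```

Pairs of (suggestion, retention) like this can then feed a preference-learning objective, rewarding completions that developers keep rather than merely accept.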

The Case for Building a Custom IDE

Windsurf’s decision to build their own IDE rather than continuing as a VS Code extension reflects specific technical limitations they encountered while operating within the extension model.

Enterprise Considerations

The company has deliberately built infrastructure that serves both individual developers and large enterprises with the same systems, spanning containerization, security, and a range of deployment options.

Their enterprise product achieved $10M ARR in under a year, which they attribute partly to having a product that developers genuinely love using - enterprise software decisions often involve asking developers directly about their experience.

Agentic Capabilities and Future Direction

The Cascade system represents their agentic approach, combining third-party planning models with their proprietary retrieval and code-generation infrastructure.

The team acknowledges that the system still has limitations, which they are actively working to address.

Lessons on Build vs. Buy

The team has revised their earlier thinking on first-party versus third-party models. In 2022, they believed building first-party was essential. Now they take a more nuanced view: build proprietary models where commercial offerings fall short (inline completion, retrieval) and lean on third-party models where providers lead (high-level planning and reasoning).

Platform and Language Diversity

A notable aspect of their infrastructure work is supporting the actual diversity of environments in which enterprise developers work, across platforms and programming languages.

This reality-grounded approach to supporting where developers actually work (rather than just Silicon Valley preferences) has been important for enterprise adoption.

Skeptical Engineering Culture

The company intentionally maintains engineers who are skeptical of AI claims - many from autonomous vehicle backgrounds where the gap between demos and production reliability is well understood. This creates healthy tension between enthusiastic adoption and realistic assessment of what actually works, helping them avoid optimizing for benchmarks that don’t reflect real-world value and kill bad ideas before they waste resources.
