ZenML

Eval-Driven Development for AI Applications

Vercel 2024

Vercel presents their approach to building and deploying AI applications through eval-driven development, moving beyond traditional testing methods to handle AI's probabilistic nature. They implement a comprehensive evaluation system combining code-based grading, human feedback, and LLM-based assessments to maintain quality in their v0 product, an AI-powered UI generation tool. This approach creates a positive feedback loop they call the "AI-native flywheel," which continuously improves their AI systems through data collection, model optimization, and user feedback.

Overview

Vercel, a cloud platform company known for hosting Next.js applications and frontend infrastructure, has developed a philosophy they call “eval-driven development” for building AI-native applications. This case study focuses on their approach to managing the inherent unpredictability of LLM-based systems in production, with their product v0 serving as the primary example. v0 is an AI-powered tool that generates user interfaces from text descriptions, producing React components styled with Tailwind CSS.

The core premise is that traditional software testing methodologies—unit tests, integration tests, and even end-to-end tests—are insufficient for AI systems due to their probabilistic nature. Unlike deterministic code where “two plus two always equals four,” LLM outputs can vary significantly even with identical inputs. Vercel’s response to this challenge is a comprehensive evaluation framework that assesses AI output quality against defined criteria using multiple grading methods.

The Problem with Traditional Testing for LLMs

The article effectively articulates why standard testing approaches fall short for AI systems. In conventional software development, developers can write tests that expect specific outputs for given inputs. However, LLMs operate as “black boxes” with responses that are difficult to predict. This unpredictability makes exhaustive, hard-coded testing impractical. Vercel draws an interesting parallel to the challenges faced by web search engineers over the past two decades, who dealt with the web’s vastness and unpredictable user queries by adopting eval-centric processes that acknowledged any change could bring both improvements and regressions.

This framing is valuable because it normalizes the idea that AI development requires accepting a certain level of variability and measuring overall performance rather than individual code paths. However, it’s worth noting that this is not a completely novel insight—the machine learning community has long used evaluation metrics and test sets, though the specific application to LLM-based code generation does present unique challenges.

Three Types of Evaluations

Vercel outlines three primary evaluation approaches that complement traditional test suites:

Code-based grading involves automated checks using code for objective criteria. These provide fast feedback and are ideal for verifiable properties. Examples include checking if AI output contains specific keywords, matches regular expressions, validates as syntactically correct code, or uses expected patterns (like Tailwind CSS classes versus inline styles). Vercel notes this is the most cost-effective approach.
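A minimal sketch of such code-based graders, written here in Python for illustration — the specific checks and regexes are assumptions, not Vercel's actual graders:

```python
import re

def grade_output(code: str) -> dict:
    """Fast, objective checks over a generated React snippet (illustrative)."""
    return {
        # Pattern usage: Tailwind utility classes rather than inline styles
        "uses_tailwind": bool(re.search(r'className="[^"]*(?:font-bold|text-|bg-)', code)),
        "no_inline_styles": "style={{" not in code,
        # Keyword presence: the snippet should be exportable
        "has_export": "export" in code,
    }

sample = 'export const Badge = () => <span className="font-bold">Hi</span>;'
```

Because these checks are plain string and regex operations, they run in microseconds and can gate every generation in CI, which is what makes this the cheapest tier of evaluation.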

Human grading leverages human judgment for subjective evaluations. This is essential for nuanced assessments of quality, creativity, clarity, coherence, and overall effectiveness. The approach is particularly useful when evaluating generated text or code that may be functionally correct but stylistically poor. Vercel mentions both end-user feedback (via thumbs up/down, ratings, or forms) and internal “dogfooding” as sources of human evaluation.

LLM-based grading uses other AI models to assess output, offering scalability for complex judgments. While acknowledged as potentially less reliable than human grading, this approach is more cost-effective for large-scale evaluations. The article notes that LLM evals still cost 1.5x to 2x more than code-based grading, which is a useful practical data point for teams budgeting for evaluation infrastructure.
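The LLM-as-judge pattern typically reduces to building a rubric prompt and robustly parsing the judge's reply. A sketch under assumptions — the prompt format and 1–5 score range are invented, and the actual model call is omitted:

```python
def build_judge_prompt(criterion: str, output: str) -> str:
    """Construct a rubric prompt for a judge model (hypothetical format)."""
    return (
        f"Rate the following output on '{criterion}' from 1 to 5.\n"
        "Reply with only the number.\n\n"
        f"Output:\n{output}"
    )

def parse_judge_score(reply: str) -> int:
    """Extract the first digit as the score; fail closed on malformed replies."""
    digits = [int(ch) for ch in reply if ch.isdigit()]
    score = digits[0] if digits else 0
    return score if 1 <= score <= 5 else 0
```

Parsing defensively matters in practice: judges often ignore "reply with only the number" and wrap the score in prose, so the grader must tolerate replies like "Score: 4".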

Practical Example: React Component Generation

The article provides a concrete example of evaluating AI-generated React components. Given a prompt to generate an ItemList component that displays items in bold using Tailwind CSS, the expected output uses arrow function syntax, Tailwind classes, and includes proper key props for list items. The actual LLM output, while functionally correct, deviated by using traditional function syntax, inline styles instead of Tailwind, and omitting the key prop.
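All three deviations can be caught mechanically. A hypothetical grader for this one prompt — the `expected`/`actual` snippets below are reconstructions of the scenario described, not the article's verbatim code:

```python
import re

def grade_item_list(code: str) -> dict:
    """Score the three deviations described above (illustrative checks)."""
    return {
        "arrow_function": "=>" in code and "function " not in code,
        "tailwind_not_inline": 'className="' in code and "style={{" not in code,
        "has_key_prop": bool(re.search(r"<li\s+[^>]*key=", code)),
    }

expected = '''const ItemList = ({ items }) => (
  <ul>
    {items.map((item) => (
      <li key={item} className="font-bold">{item}</li>
    ))}
  </ul>
);'''

actual = '''function ItemList({ items }) {
  return (
    <ul>
      {items.map((item) => (
        <li style={{ fontWeight: "bold" }}>{item}</li>
      ))}
    </ul>
  );
}'''
```

Run on the reconstructed outputs, the expected component passes all three checks while the deviating one fails all three, turning a stylistic judgment into a reproducible score.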

This example illustrates several important LLMOps considerations.

The key insight is that evals should produce clear scores across multiple domains, making it easy to track whether the AI is improving or regressing on specific criteria over time.
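One way to make "improving or regressing" concrete is to diff per-criterion pass rates between eval runs. A minimal sketch with invented numbers:

```python
def detect_regressions(baseline: dict, current: dict) -> list:
    """Return the criteria whose pass rate dropped between two eval runs."""
    return [k for k in baseline if current.get(k, 0.0) < baseline[k]]

# Hypothetical per-criterion pass rates from two successive runs
baseline = {"tailwind": 0.92, "key_props": 0.88, "syntax_valid": 0.99}
current  = {"tailwind": 0.95, "key_props": 0.81, "syntax_valid": 0.99}
```

Here a prompt change improved Tailwind adherence but regressed key-prop usage — exactly the per-domain tradeoff that a single aggregate score would hide.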

The AI-Native Flywheel

Vercel introduces the concept of an “AI-native flywheel,” a continuous feedback loop that accelerates development and drives consistent improvement. This flywheel consists of four interconnected components:

Evals form the foundation, replacing gut feelings with data-driven decisions. The article acknowledges that managing evals at scale is challenging and mentions Braintrust as a tool for automated evaluations, logging, and comparative analysis. Integration of production logs and user interactions into evaluation data creates a connection between real-world performance and development decisions.

Data quality remains critical—“garbage in, garbage out” still applies. Robust evals help pinpoint data gaps and target collection efforts. The article notes that every new data source requires appropriate evals to prevent hallucinations or other issues, which is an important operational consideration for teams expanding their AI systems.

Models and strategies evolve rapidly in the AI landscape. Evals enable rapid testing of different models, data augmentation techniques, and prompting strategies against defined criteria. Vercel promotes their AI SDK as simplifying this process through a unified abstraction layer for switching between providers with minimal code changes.
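The AI SDK itself is TypeScript; to keep examples in one language, here is a Python sketch of the same idea — a registry that lets an eval harness swap models behind a single `generate` call. The model IDs and stub provider are invented:

```python
from typing import Callable

# Hypothetical registry mapping model IDs to provider call functions
PROVIDERS: dict[str, Callable[[str], str]] = {}

def register(model_id: str):
    """Decorator that adds a provider function under a model ID."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        PROVIDERS[model_id] = fn
        return fn
    return wrap

@register("stub/echo")
def echo_model(prompt: str) -> str:
    # Stand-in for a real provider call
    return f"echo: {prompt}"

def generate(model_id: str, prompt: str) -> str:
    """One call site; an eval harness can loop over PROVIDERS and compare scores."""
    return PROVIDERS[model_id](prompt)
```

The payoff is that rerunning the full eval suite against a new model is a one-line change to the model ID, which is the property the article credits for fast iteration.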

Feedback from users closes the loop. The article distinguishes between explicit feedback (ratings, reviews), implicit feedback (user behavior like rephrasing or abandoning queries), and error reporting. Combining these feedback types provides comprehensive understanding of AI performance from the user’s perspective.
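A heuristic sketch of turning implicit signals into labels — the event names and rules are assumptions, not Vercel's telemetry:

```python
def classify_implicit_feedback(events: list[str]) -> str:
    """Label a session from its event stream (hypothetical event vocabulary)."""
    # Rephrasing or abandoning right after a response signals dissatisfaction
    for prev, nxt in zip(events, events[1:]):
        if prev == "response_shown" and nxt in ("query_rephrased", "session_abandoned"):
            return "negative"
    # Copying the generated code or an explicit thumbs-up signals success
    if "copy_code" in events or "thumbs_up" in events:
        return "positive"
    return "neutral"
```

Labels like these can be folded back into the eval dataset alongside explicit ratings, closing the flywheel's feedback loop.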

v0 in Practice

Vercel’s v0 product demonstrates eval-driven development in production. Their multi-faceted evaluation strategy includes fast code checks, human feedback from both end users and internal teams, and LLM-based grading for complex judgments at scale. Key operational details include:

Specific code-based checks mentioned include validating code blocks, ensuring correct imports, confirming multi-file usage, and verifying the balance of code comments versus actual code (addressing “LLM laziness”).
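The comment-versus-code balance check could look like the following sketch — the threshold interpretation and comment-syntax handling are assumptions:

```python
def comment_ratio(code: str) -> float:
    """Fraction of non-blank lines that are comments (a 'laziness' signal)."""
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    comments = [ln for ln in lines
                if ln.startswith("//") or ln.startswith("/*") or ln.startswith("*")]
    return len(comments) / len(lines)

# A "lazy" output: mostly comments promising code instead of delivering it
lazy = "// TODO: implement the rest\nfunction App() {}\n// rest omitted by the model\n"
```

A grader would flag outputs whose ratio exceeds some threshold, catching responses where the model narrates what the code would do rather than writing it.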

Critical Assessment

While the article presents a compelling philosophy for AI development, it’s important to note several caveats:

The content is primarily promotional, originating from Vercel’s blog and serving to promote their products (v0, AI SDK) and partners (Braintrust). The specific metrics around improvement and iteration speed are qualitative rather than quantitative—claims like “build better AI faster” and “iterate on prompts almost daily” lack concrete benchmarks.

The acknowledgment that “not all evals currently pass” and that “maintaining these evals presents an ongoing challenge” is refreshingly honest and reflects real-world LLMOps complexity. The admission that they “continue to look for ways to make managing the eval suite less time-consuming and more scalable, without totally giving up the human-in-the-loop” suggests this is an area of active development rather than a solved problem.

The article does not provide specific numbers on evaluation pass rates (except for safety evals), improvement percentages, or cost comparisons between different evaluation approaches beyond relative estimates. Teams looking to implement similar systems would need to conduct their own benchmarking.

Tools and Technologies

The case study references several specific tools and technologies: Vercel's v0 product and AI SDK, Braintrust for automated evaluation, logging, and comparative analysis, and React with Tailwind CSS as the UI generation targets.

Implications for LLMOps Practitioners

The eval-driven development philosophy has several practical implications for teams building LLM-powered applications.

The comparison to search engine quality evaluation is particularly apt, as it frames AI development within a longer historical context of dealing with probabilistic, user-facing systems at scale.
