**Company:** Coval

**Title:** Agent Testing and Evaluation Using Autonomous Vehicle Simulation Principles

**Industry:** Tech

**Year:** 2023

**Summary (short):** Coval addresses the challenge of testing and evaluating autonomous AI agents by applying lessons learned from self-driving car testing. The company proposes moving away from static, manual testing towards probabilistic evaluation with dynamic scenarios, drawing parallels between autonomous vehicles and AI agents in terms of system architecture, error handling, and reliability requirements. Their solution enables systematic testing of agents through simulation at different layers, measuring performance against human benchmarks, and implementing robust fallback mechanisms.
## Overview

This case study is drawn from a conference talk by Brooke Hopkins, founder of Coval, who previously led evaluation infrastructure at Waymo (the self-driving car company). The presentation explores how lessons learned from testing autonomous vehicles can be applied to testing and evaluating LLM-powered autonomous agents, particularly conversational AI systems like voice agents. Coval's thesis is that the agent testing paradigm needs to fundamentally shift from static, manually created test cases toward dynamic, probabilistic simulation, mirroring the evolution that occurred in self-driving car evaluation over the past decade.

## The Problem: Manual Agent Testing is Slow and Fragile

Hopkins identifies a core challenge facing teams building autonomous agents: testing is painfully slow and manual. Engineers currently spend hours manually testing their systems, particularly multi-step conversational agents. For example, testing a voice agent might involve calling it repeatedly, with each call taking anywhere from 30 seconds to 10 minutes to test end-to-end.

The workflow becomes a frustrating cycle: make a change to the agent, spend hours doing "vibe checks" by chatting with the system, find ad hoc cases needing improvement, make another change, and repeat. This creates a "whack-a-mole" dynamic where each change potentially breaks something else, and comprehensively testing every possible case is simply infeasible given time constraints.

This problem is structurally similar to what self-driving car companies faced: how do you test an autonomous system that must handle an essentially infinite space of possible inputs, while still being able to systematically demonstrate improvement over time?

## The Self-Driving Testing Paradigm Evolution

Hopkins describes how self-driving car testing evolved through several phases that are instructive for agent evaluation:

- **Traditional Software Testing Layers**: Software engineering best practice involves layers of tests: unit tests on every commit, integration tests run nightly, and release tests with clear pass/fail metrics.
- **Traditional ML Metrics**: Machine learning evaluation focused on aggregate statistics like F1 score, precision, and recall.
- **The Foundation Model Challenge**: With foundation models (both in self-driving and generative AI), teams care about both aggregate statistics *and* individual examples. You want to know your overall collision rate, but you also care deeply about whether the car stopped at a specific stop sign.
- **From Static to Probabilistic Evaluation**: Self-driving companies initially created rigid, manually maintained test cases for specific scenarios. This approach proved expensive to maintain and didn't scale. The industry shifted toward probabilistic evaluation: running thousands of test cases generated from logs or synthetic data, simulating all of them, and looking at aggregate events rather than individual pass/fail results. The key metrics became questions like: how often do we reach our destination, how often is there a hard brake, how often is there a collision?

Hopkins argues the agent evaluation space is currently stuck in the early, static test case phase and needs to make the same transition. Instead of testing that one specific prompt produces one specific expected output, teams should run thousands of iterations of a scenario (like "book an appointment") and measure how often the goal is achieved using LLM-as-judge or heuristics.
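To make the shift concrete, here is a minimal sketch (not Coval's implementation) of what probabilistic scenario evaluation can look like. The agent harness and the judge are passed in as callables; `run_conversation` and `goal_achieved` are hypothetical placeholders you would supply yourself.

```python
from typing import Callable, List


def evaluate_scenario(
    scenario: str,
    run_conversation: Callable[[str, int], List[str]],  # agent under test: (scenario, seed) -> transcript
    goal_achieved: Callable[[str, List[str]], bool],     # LLM-as-judge or heuristic: did this run meet the goal?
    runs: int = 1000,
) -> dict:
    """Run one natural-language scenario many times and report aggregates,
    not individual pass/fail verdicts tied to exact outputs."""
    outcomes = [
        goal_achieved(scenario, run_conversation(scenario, seed))
        for seed in range(runs)
    ]
    return {
        "scenario": scenario,
        "runs": runs,
        "success_rate": sum(outcomes) / runs,
    }


# Usage with your own agent harness and judge:
# report = evaluate_scenario("book an appointment", run_conversation, goal_achieved)
```

The important part is the shape of the output: a success rate aggregated over many runs of the same high-level scenario, rather than a single pass/fail verdict tied to one exact transcript.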
## Simulating Different Layers

A key insight from robotics and self-driving is the ability to test different system layers independently. Hopkins provides a simplified overview of the self-driving architecture: sensor inputs (lidar, cameras, radar, audio) feed into perception (identifying objects and entities), which combines with localization and other modules, flowing into planning (behavior prediction and path planning), and finally control (actual steering and acceleration commands). In simulation, you don't need to exercise every piece of this stack for every test; you can isolate and test specific layers independently.

This applies directly to conversational agents:

- You might test only the text-based logical consistency of a voice agent without simulating actual speech-to-text and text-to-speech.
- For an AI SDR (sales development representative), you might evaluate the conversational logic while mocking out all the web agents and external APIs it interacts with.

This modular approach helps optimize the fundamental tradeoff in simulation between cost, latency, and signal. Maximum signal requires running many tests (high cost), while low cost means running few tests (low signal). You can run tests slowly on cheap compute (low cost, high latency) or quickly on expensive compute.

## Comparing to Human Ground Truth

One of the harder problems in agent evaluation is comparison to human performance when there are many valid ways to accomplish a goal. If you're driving from point A to B, you might take a left then a right, or a right then a left, or pass a car, or wait for it to pass first: all potentially valid approaches. The same applies to conversational agents: booking an appointment can unfold in many ways, as long as the goal is achieved (getting the email address, the phone number, confirming the booking, and so on).

Coval's approach involves using LLM-as-judge to identify what steps the agent took to achieve a goal and how long it took. This is compared against ground-truth human labels showing how long an average person takes and what abstract steps they follow. Dramatic slowdowns might indicate the agent is confused, going in circles, or repeating itself. Hopkins emphasizes that these metrics shouldn't be rigid pass/fail criteria but rather a spectrum that supports human review of large regression tests, helping teams understand tradeoffs, for example whether it's acceptable for an agent to take longer if it has higher accuracy.

## The Compounding Error Problem and Reliability

Autonomous agents face a "butterfly effect" where an error early in a conversation can cascade into subsequent errors. Critics sometimes argue this makes reliable agents fundamentally difficult to build. Hopkins counters this with examples from traditional software infrastructure: achieving "six nines" of reliability means building on top of many potentially failing components (servers, networks, packet loss, application errors). The solution is building reliability through redundancy and fallback mechanisms. Self-driving took the same approach: despite non-deterministic models, redundancy and fallback mechanisms ensure the system degrades gracefully. Hopkins argues agents should be designed the same way: not just trying to get the prompt right on the first try, but building in fallback mechanisms, redundancy, and the ability to call multiple models.
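The fallback idea can be sketched in a few lines, assuming each agent turn can be routed through an ordered list of model callers; the function and its parameters below are illustrative, not a specific library API.

```python
from typing import Callable, List, Optional


def respond_with_fallback(
    prompt: str,
    model_callers: List[Callable[[str], str]],  # ordered: primary model first, backups after
    max_attempts_per_model: int = 2,
) -> Optional[str]:
    """Try each model in order, retrying transient failures, and return the
    first usable reply. Returns None if every model fails."""
    for call_model in model_callers:
        for _ in range(max_attempts_per_model):
            try:
                reply = call_model(prompt)
                if reply and reply.strip():  # basic sanity check before trusting the output
                    return reply
            except Exception:
                continue  # transient failure: retry this model, then move to the next one
    return None  # caller decides the graceful-degradation path
```

Returning `None` leaves the graceful-degradation decision (a scripted reply, escalation, or human handoff) to the caller, which mirrors how the talk describes self-driving systems degrading gracefully rather than simply failing.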
## The Case for Level 5 Autonomy from Day One

In an interesting counterpoint to typical advice about keeping humans in the loop, Hopkins argues teams should target "Level 5 autonomy" (full autonomy with no human intervention) from the beginning. She draws on the levels of autonomous driving, where the lower levels correspond to various driver-assist modes and Level 5 to full autonomy. The lesson from self-driving is that it's very difficult to progress linearly from Level 2 to Level 5. Companies targeting driver assist with plans to systematically improve toward full autonomy have not succeeded with this approach. Waymo, by contrast, targeted Level 5 from the start, which forced it to build redundancy, fallback mechanisms, and reliability into the core system architecture, things that are hard to add after the fact.

For agent builders, this means that if human-in-the-loop review is baked in from the start, it becomes difficult to move toward full autonomy over time. Hopkins uses email suggestions as an example: it's hard to transition from "human reviews every suggestion" to "agent sends autonomously" because the infrastructure for monitoring, flagging potentially incorrect emails, and secondary review wasn't built in from the beginning.

## Coval's Approach: Dynamic Simulation

Coval's solution involves simulating entire conversations dynamically rather than creating static test cases. The system simulates back-and-forth conversations (and is expanding to web agents and other agent types), providing dynamic input as the agent makes decisions. When agent behavior changes, test cases don't break because the simulation adapts. The platform also provides visibility into which conversation paths are being exercised most frequently, which helps teams understand test coverage and identify unexplored paths that may need more testing.

Interestingly, Hopkins notes that this testing approach blurs the lines between engineering, product, and sales. Demonstrating how well an agent works is critical for gaining user trust, and Coval customers sometimes use test results and dashboards to build user confidence that the agent is safe and performing as expected.

## Generating Test Scenarios

In the Q&A, Hopkins explains that Coval can generate test scenarios from context such as workflows, RAG documents, or other content that informs the agent. Crucially, test cases are defined in natural language (e.g., "create an account, give the details, and ask for a professional plan"), which makes them easy to create and maintain. The same scenario can be run hundreds of times to explore different pathways toward that high-level objective.

## Handling Erratic Human Behavior

When asked about simulating the erratic, multi-thematic nature of real human conversations (especially voice), Hopkins describes re-simulating from logs. You can't simply replay exact transcripts, because changes to the agent mean those logged responses no longer make sense. Instead, the simulation tries to follow the original trajectory but responds dynamically when the agent's behavior diverges. This approach captures realistic edge cases (unexpected topics, specific names, background noise like a barking dog) and enables a development cycle where production failures can be re-simulated during development iteration.
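A minimal sketch of that re-simulation loop, under the assumptions just described: replay the logged user turns while the current agent stays close to its logged replies, then hand the user side over to a dynamically simulated user once it diverges. The log format and the `agent_reply`, `simulated_user_reply`, and `similar` callables are hypothetical placeholders, not a documented Coval interface.

```python
from typing import Callable, List, Tuple


def resimulate_from_log(
    logged_turns: List[Tuple[str, str]],                # [(user_utterance, logged_agent_reply), ...]
    agent_reply: Callable[[List[str]], str],            # current agent, given the conversation so far
    simulated_user_reply: Callable[[List[str]], str],   # LLM-simulated user for the divergent branch
    similar: Callable[[str, str], bool],                # rough similarity check between two replies
    max_extra_turns: int = 10,
) -> List[str]:
    """Replay a logged conversation against the current agent, switching to a
    dynamic simulated user once the agent's replies diverge from the log."""
    history: List[str] = []
    diverged = False
    for user_turn, logged_reply in logged_turns:
        history.append(user_turn)              # replay the original user turn
        new_reply = agent_reply(history)       # ask the current agent, not the log
        history.append(new_reply)
        if not similar(new_reply, logged_reply):
            diverged = True                    # the logged user turns no longer make sense from here
            break
    if diverged:
        for _ in range(max_extra_turns):       # continue the conversation dynamically instead of replaying it
            history.append(simulated_user_reply(history))
            history.append(agent_reply(history))
    return history
```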
## Critical Assessment

While Coval's approach is conceptually compelling, a few caveats are worth noting. The self-driving analogy, while instructive, has limits: self-driving cars operate in physical space with physics constraints and road rules, while language agents navigate an even more open-ended possibility space. Hopkins acknowledges this complexity will grow as agents take on more complex tasks like data entry across multiple APIs.

Additionally, the Level 5 autonomy recommendation is somewhat provocative and may not apply universally: many agent applications genuinely benefit from human oversight, and the comparison to Waymo's success in targeting Level 5 may not generalize to all agent domains. The approach also requires significant investment in simulation infrastructure that smaller teams may find challenging to build or adopt.
