## Overview of Booking.com's Agent Evaluation Framework
Booking.com has developed and deployed LLM-based agents for customer-facing applications including their AI Trip Planner, which assists users with complex travel planning tasks such as destination recommendations, property searches, booking inquiries, and cancellation policy questions. This case study represents the second installment in their GenAI evaluation best practices series and focuses specifically on the operational challenges of evaluating autonomous agents in production environments.
The fundamental challenge that Booking.com addresses is that LLM agents are qualitatively different from single-prompt LLM systems. While traditional LLMs respond to direct prompts with text generation, agents can autonomously break down complex problems, determine when to use external tools, iterate through reasoning cycles, and engage in multi-turn conversations. This autonomy creates significant evaluation complexity because the agent's decision-making process involves multiple internal steps that may not be visible in the final output. The company has developed a dual evaluation methodology that examines both the final outcomes (black box) and the internal reasoning processes (glass box) to gain comprehensive insight into agent performance.
## Black Box Evaluation: Task Completion Metrics
The cornerstone of Booking.com's black box evaluation is the **task completion** metric: the number of tasks the agent correctly completes divided by the total number of tasks the user requests within a chat session. This metric is computed at the conversation level, recognizing that users typically engage in multi-turn dialogues where they request multiple related tasks.
The evaluation leverages an LLM-as-a-judge approach following established methodologies from prior research. The judge LLM receives as input the complete dialogue between user and agent, along with two critical pieces of context: a pre-defined list of tasks that the agent is expected to handle, and a set of product-specific constraints that define acceptable behavior. For the AI Trip Planner specifically, the task list includes capabilities such as providing hotel recommendations based on user preferences, answering questions about specific properties, retrieving booking and cancellation policies, and other travel-related information retrieval tasks.
The mathematical formulation they employ calculates task completion (τᵢ) for the i-th conversation as the average of binary evaluations across all Nᵢ tasks in that conversation. Each task tₖ receives a binary score from the judge LLM function f, which considers the entire dialogue dᵢ, the task list T, and the product constraints θ. A task is considered completed when the agent provides valid options that fulfill user requests while adhering to product-specific constraints and instructions.
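Written out, that description corresponds to the following formulation (a reconstruction using the notation above, not a quotation of the original formula):

$$\tau_i = \frac{1}{N_i} \sum_{k=1}^{N_i} f(t_k, d_i, T, \theta), \qquad f(t_k, d_i, T, \theta) \in \{0, 1\}$$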
The practical implementation uses a single judge LLM that performs dual functions: identifying which tasks were requested within the conversation and evaluating whether each identified task was successfully completed. While this could theoretically be separated into two distinct judge LLMs, Booking.com chose the unified approach for cost-effectiveness and simplicity, noting that state-of-the-art LLMs perform excellently on their benchmark datasets. This design is intentionally general and can be adapted to any agentic use case simply by updating the task list and product constraints.
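As a concrete illustration of how such a unified judge could be wired up, here is a minimal sketch (not Booking.com's implementation): the prompt wording, the JSON output format, and the injected `call_llm` function are all assumptions.

```python
import json
from statistics import mean

JUDGE_PROMPT = """You are evaluating a travel-assistant agent.
Task list the agent is expected to handle:
{task_list}

Product constraints the agent must respect:
{constraints}

Conversation between user and agent:
{dialogue}

1. Identify every task the user requested in this conversation.
2. For each identified task, decide whether the agent completed it
   (1 = completed while respecting the constraints, 0 = not completed).
Return JSON: {{"tasks": [{{"task": "...", "completed": 0 or 1}}]}}"""


def task_completion(dialogue: str, task_list: list[str],
                    constraints: list[str], call_llm) -> float:
    """Conversation-level task completion: mean of binary judgments."""
    prompt = JUDGE_PROMPT.format(
        task_list="\n".join(f"- {t}" for t in task_list),
        constraints="\n".join(f"- {c}" for c in constraints),
        dialogue=dialogue,
    )
    verdict = json.loads(call_llm(prompt))       # judge LLM returns JSON (assumed)
    scores = [int(t["completed"]) for t in verdict["tasks"]]
    return mean(scores) if scores else 0.0       # tau_i for this conversation
```

The single judge call both identifies the requested tasks and scores them, mirroring the unified design described above; splitting identification and scoring into two judges would only require a second prompt and call.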
## Glass Box Evaluation: Tool Proficiency and Reliability
Glass box evaluation provides visibility into the agent's internal decision-making processes, with particular focus on tool usage patterns. Booking.com has developed a comprehensive suite of metrics that assess whether agents are using tools correctly, choosing appropriate tools for tasks, and working with well-designed tool specifications.
### Tool Call Validity
Tool call validity addresses the fundamental question of whether tools are being called in executable ways. The evaluation method draws inspiration from the Berkeley Function Calling Leaderboard and employs Abstract Syntax Tree (AST) parsing techniques. When an agent generates a tool call, the system validates multiple aspects: whether the tool name exists in the available tool inventory, whether all required parameters are present, whether any unexpected parameters have been included, whether parameter types match the tool signature specifications, and whether argument values follow the specified format.
In practice, this validation leverages JSON schema validation rather than running individual checks sequentially. Each tool accessible by the agent must have a valid JSON schema specifying the names and types of required arguments. The evaluation pipeline validates the JSON instance containing tool call arguments against the tool's JSON schema, providing clear error messages when validation fails. This approach ensures that tool calls are syntactically correct and executable before the agent attempts to use them.
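A minimal sketch of this style of schema-based validation using the open-source `jsonschema` package; the tool specification and the failing call below are illustrative, not Booking.com's actual tools.

```python
from jsonschema import Draft202012Validator  # pip install jsonschema

# Illustrative tool specification: names and types are assumptions.
SEARCH_HOTELS_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "check_in": {"type": "string", "format": "date"},
        "nights": {"type": "integer", "minimum": 1},
    },
    "required": ["city", "check_in"],
    "additionalProperties": False,   # flags unexpected (hallucinated) parameters
}

TOOL_SCHEMAS = {"search_hotels": SEARCH_HOTELS_SCHEMA}


def validate_tool_call(tool_name: str, arguments: dict) -> list[str]:
    """Return human-readable validation errors (empty list if the call is valid)."""
    if tool_name not in TOOL_SCHEMAS:
        return [f"Unknown tool: {tool_name!r}"]
    validator = Draft202012Validator(TOOL_SCHEMAS[tool_name])
    return [e.message for e in validator.iter_errors(arguments)]


# Example: wrong type for 'nights' and an unexpected 'stars' parameter.
print(validate_tool_call("search_hotels",
                         {"city": "Paris", "check_in": "2025-07-01",
                          "nights": "two", "stars": 5}))
```

A single schema validation pass covers the existence, required-parameter, unexpected-parameter, and type checks listed above, and the validator's error messages can be surfaced directly in evaluation reports.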
### Tool Correctness
Beyond validating that tool calls are syntactically correct, Booking.com evaluates whether agents select the most appropriate tools for solving specific tasks. This assessment requires pre-defined test datasets containing tasks, the complete list of available tools, and the expected set of tools that should be used for each task. These datasets can be created manually or through LLM-assisted synthetic data generation processes.
The evaluation submits tasks to the agent and compares the set of tools actually used against the expected tool set. Matches indicate correct tool selection, while mismatches trigger detailed error reporting that identifies redundant tools (used by the agent but not expected) and missing tools (expected but not used). Additionally, the framework can assess whether agents correctly decline to answer tasks for which no relevant tools are available by providing unanswerable tasks and verifying that the agent properly rejects them.
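A minimal sketch of this comparison; the function, dataset fields, and tool names are illustrative assumptions rather than the actual evaluation code.

```python
def tool_correctness(expected: set[str], used: set[str]) -> dict:
    """Compare the tools an agent actually called against the expected set."""
    return {
        "correct": used == expected,
        "redundant_tools": sorted(used - expected),   # called but not expected
        "missing_tools": sorted(expected - used),     # expected but not called
    }


# Illustrative test case: the expected tool set comes from a labeled dataset.
report = tool_correctness(
    expected={"search_hotels", "get_cancellation_policy"},
    used={"search_hotels", "get_weather"},
)
print(report)
# {'correct': False, 'redundant_tools': ['get_weather'],
#  'missing_tools': ['get_cancellation_policy']}
```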
### Tool Argument Hallucination
Even when agents make valid calls to correct tools, they can still generate incorrect parameter values—for instance, confusing Paris, France with Paris, Texas when processing accommodation requests. Booking.com notes that empirical observations with state-of-the-art LLMs show this type of error is extremely rare in practice. However, for more comprehensive evaluation, they can employ specialized factual accuracy judge LLMs that assess generated parameter values by considering both the user's input and the agent's reasoning context.
### Tool Reliability Checks
A particularly innovative aspect of Booking.com's framework is their focus on tool specification quality. They recognize that agents depend heavily on clear tool names and descriptions to make effective calling decisions, but human-written documentation is often inconsistent, ambiguous, or redundant because it's optimized for human rather than machine interpretation.
To address this, they've developed judge LLMs that assess tool specification quality across multiple dimensions. For **tool names**, they verify adherence to best practices drawn from prior research, frontier lab guidance, and internal experience. Each tool name should be clear, descriptive, self-explanatory, and intent-focused rather than implementation-focused; it should use snake_case formatting, avoid acronyms and abbreviations, and exclude references to internal team names or codebases. Internal experiments demonstrated that general or obscure tool names (like "search_social_network" or "fetch_property_rsi_with_cml") significantly degrade both task completion and tool proficiency performance, while clear, concise, intent-focused names improve outcomes.
For **tool descriptions**, the framework implements checks ensuring descriptions are clear and unambiguous, complete but not excessively verbose (to avoid overwhelming agents when 10+ tools are available), contain well-defined input argument descriptions, avoid redundancy with tool names, limit optional input arguments to a maximum of three (based on research showing LLMs struggle with too many optional parameters), and include all mandatory arguments that lack default values. They found that for simple tools, the combination of a good tool name with detailed input argument descriptions is often sufficient without requiring verbose additional explanation.
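Some of these checks, particularly the naming conventions, can be approximated with cheap deterministic rules before involving a judge LLM. The sketch below is illustrative: the rule set and the acronym deny-list are assumptions, not Booking.com's actual checks.

```python
import re

KNOWN_ACRONYMS = {"rsi", "cml", "api", "id"}   # illustrative deny-list


def check_tool_name(name: str) -> list[str]:
    """Cheap deterministic checks on a tool name; judge LLMs cover the rest
    (clarity, intent focus, references to internal teams, etc.)."""
    issues = []
    if not re.fullmatch(r"[a-z][a-z0-9]*(_[a-z0-9]+)*", name):
        issues.append("not snake_case")
    if any(part in KNOWN_ACRONYMS for part in name.split("_")):
        issues.append("contains acronym/abbreviation")
    if len(name) > 40:
        issues.append("overly long name")
    return issues


print(check_tool_name("fetch_property_rsi_with_cml"))  # flags the acronyms
print(check_tool_name("get_cancellation_policy"))      # [] -- passes the rule checks
```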
## Benchmarking Agents Against Baselines
A critical aspect of Booking.com's LLMOps approach is their insistence on justifying agent complexity through rigorous benchmarking against simpler baselines. They recognize that agents are complex, high-maintenance systems whose deployment must demonstrate clear performance advantages. Their baseline hierarchy progresses from zero-shot LLMs (prompt-based with task instructions only), to few-shot LLMs (adding examples), to LLM-based deterministic flows (chains of LLMs with usage rules).
The decision to deploy an agent is explicitly data-driven, comparing agents against baseline(s) across three dimensions: **performance** (measured via task completion or other judge-based metrics translatable to business impact), **cost** (total monetary impact of running the system), and **latency** (total time to fetch final responses). The expectation is that agents should show increased performance with only acceptable increases in cost and latency. Setting clear cost and latency constraints up front makes it possible to immediately rule out agents with prohibitively high response times.
A crucial consideration is determining what performance gain justifies increased cost and latency. Booking.com emphasizes translating performance improvements into business impact—for example, a 1% improvement on a use case with 1 million users likely delivers more value than a 20% improvement for 500 users. Computing business metric uplift helps inform decisions about whether extra cost and latency are justified. If comparisons don't show clear gains, simpler approaches are preferable due to lower cost, latency, and maintenance effort.
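As an illustration of how such a go/no-go comparison might be encoded, here is a minimal sketch under assumed metrics, units, and thresholds; none of the figures below come from Booking.com.

```python
from dataclasses import dataclass


@dataclass
class SystemReport:                  # illustrative fields and units
    name: str
    task_completion: float           # judge-based performance metric (0-1)
    cost_per_1k_chats: float         # monetary cost, e.g. USD per 1,000 chats
    p95_latency_s: float             # time to final response, seconds


def business_uplift(candidate: SystemReport, baseline: SystemReport,
                    monthly_users: int) -> float:
    """Translate a performance delta into additional users helped per month."""
    return (candidate.task_completion - baseline.task_completion) * monthly_users


def deploy_agent(agent: SystemReport, baseline: SystemReport,
                 monthly_users: int, min_uplift_users: float,
                 max_extra_cost: float, max_latency_s: float) -> bool:
    """Go/no-go: a clear gain in business terms within cost and latency budgets."""
    if agent.p95_latency_s > max_latency_s:
        return False                  # prohibitive latency rules the agent out
    if agent.cost_per_1k_chats - baseline.cost_per_1k_chats > max_extra_cost:
        return False
    return business_uplift(agent, baseline, monthly_users) >= min_uplift_users


few_shot = SystemReport("few-shot LLM", 0.71, 4.0, 2.5)
agent = SystemReport("agent", 0.82, 9.0, 6.0)
print(deploy_agent(agent, few_shot, monthly_users=1_000_000,
                   min_uplift_users=10_000, max_extra_cost=6.0,
                   max_latency_s=8.0))   # True under these assumed thresholds
```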
To increase agent debuggability, they also monitor supplementary aspects including token usage (input, output, and reasoning tokens separately) and requests per second, which help identify which agentic components drive changes in cost or latency.
## Advanced Evaluation Topics
### Consistency
Booking.com highlights research from the tau-bench paper and subsequent work showing that agents exhibit extremely low consistency—responding differently to similarly formulated requests. This presents critical problems for real-world applications, as agents should provide similar responses whether users ask for the "distance between New York and Los Angeles" or "distance between the Big Apple and the City of Angels."
Their practical approach involves paraphrasing input queries k times and computing average task completion scores across k attempts, along with pass^k (the probability that all k trials succeed) and pass@k (probability that at least one trial succeeds). Research on the tau-retail dataset shows pass^k drops below 20% after just 8 trials, highlighting significant inconsistency challenges with current LLM agents.
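For the resampling-based scores, the standard combinatorial estimators can be used: pass@k as popularized by code-generation benchmarks and pass^k as introduced in the tau-bench work. The sketch below assumes n paraphrased trials per task with c successes; the text does not specify which estimators Booking.com uses.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled trials succeeds), given c successes in n trials."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k sampled trials succeed), given c successes in n trials."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)


# Illustrative numbers: 10 paraphrased attempts, 7 of which completed the task.
n, c = 10, 7
for k in (1, 4, 8):
    print(k, round(pass_at_k(n, c, k), 3), round(pass_hat_k(n, c, k), 3))
```

Because pass^k requires every one of the k trials to succeed, it falls quickly as k grows, which is exactly the inconsistency effect reported on the tau-retail dataset.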
### Reasoning
While definitions of reasoning vary, Booking.com recognizes its foundational role in LLM system effectiveness. Nearly every contemporary module—from Chain-of-Thought prompting and augmented generation to tool calling, ReAct, and Reflexion—depends critically on robust reasoning capabilities. Recent literature trends evaluate reasoning by examining how rapidly performance deteriorates as task complexity increases. The GAIA benchmark, for example, finds that the number of steps and tools used increases with task complexity. Therefore, one approach to evaluating agent reasoning ability is assessing the efficiency with which tasks of increasing complexity are solved.
### Memory
Memory evaluation encompasses both short-term and long-term memory components that persist key information for agent performance. The evaluation is two-fold: **memory quality** (evaluating memory content using judge LLM-based metrics like factual accuracy, relevance, and correctness) and **memory impact** (evaluating the added value gained by using memory). Impact evaluation is particularly complex for conversational systems and requires estimating performance metric gains when using versus not using the memory component. Task completion serves as a suitable metric since it typically proxies business impact well.
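A minimal sketch of the impact side of this evaluation, assuming the same set of evaluation conversations is run once with the memory component enabled and once without; the scores below are illustrative.

```python
from statistics import mean


def memory_impact(tc_with_memory: list[float],
                  tc_without_memory: list[float]) -> float:
    """Added value of the memory component, measured as the difference in
    average conversation-level task completion with vs. without memory."""
    return mean(tc_with_memory) - mean(tc_without_memory)


# Illustrative per-conversation task completion scores for the two runs.
print(round(memory_impact([0.9, 0.8, 1.0, 0.7],
                          [0.7, 0.6, 0.9, 0.7]), 3))  # 0.125
```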
### Trajectory Optimality
LLM agents can follow different reasoning and action paths to achieve results. For example, when verifying hotel availability, an agent might query the availability service for all hotels in a region and then filter, or it could directly query for the specific hotel—clearly the latter is preferable for its simplicity and efficiency. Similarly, agents might call tools in different sequences, with varying degrees of optimality.
Booking.com outlines three approaches for evaluating trajectory optimality: counting the number of tool calls for successful trajectories as a proxy for optimality, using reference-based judge LLMs with human-labeled datasets containing "golden" reference action sequences, or employing preference-based judge LLMs with labeled datasets containing pairs of optimal and suboptimal trajectories. The biggest challenge is defining ground truth for optimal trajectories, leading them to recommend preference-based evaluation for this particular metric.
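A minimal sketch of the first and simplest approach, counting tool calls over successful trajectories as an optimality proxy; the trajectory format and tool names are illustrative assumptions.

```python
def trajectory_efficiency(trajectories: list[dict]) -> float | None:
    """Proxy for trajectory optimality: average number of tool calls made
    in trajectories that completed the task successfully (lower is better)."""
    successful = [t for t in trajectories if t["task_completed"]]
    if not successful:
        return None
    return sum(len(t["tool_calls"]) for t in successful) / len(successful)


# Illustrative runs: the second queries availability for the hotel directly.
runs = [
    {"task_completed": True,
     "tool_calls": ["search_hotels_in_region", "filter_by_name",
                    "check_availability"]},
    {"task_completed": True,
     "tool_calls": ["check_availability"]},
    {"task_completed": False, "tool_calls": ["search_hotels_in_region"]},
]
print(trajectory_efficiency(runs))  # 2.0 tool calls per successful trajectory
```

The reference-based and preference-based judge approaches replace this count with a comparison against golden or pairwise-labeled trajectories, at the cost of building those labeled datasets.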
## Production Deployment Considerations
Throughout the case study, Booking.com demonstrates a sophisticated understanding of the operational tradeoffs involved in deploying LLM agents to production. Their framework is designed to be fully general and adaptable—the judge LLM can be customized to any agentic use case simply by updating the task list and product constraints. This modularity is crucial for an organization operating multiple agent-based systems across different domains.
The emphasis on cost and latency monitoring alongside performance metrics reflects real-world production constraints. By tracking token usage granularly (separating input, output, and reasoning tokens) and measuring requests per second, they can attribute cost and latency changes to specific agent components and make informed optimization decisions.
The tool reliability framework represents a particularly mature approach to LLMOps, recognizing that agent performance depends not just on the LLM's capabilities but on the quality of the infrastructure surrounding it. By systematically evaluating and improving tool specifications—essentially the API that agents interact with—they're addressing an often-overlooked aspect of agent deployment that significantly impacts production performance.
Finally, their insistence on benchmarking against simpler baselines before deployment demonstrates engineering discipline that prevents over-engineering solutions. In environments where prompt-based LLM solutions can be simpler to implement than rule-based systems, the complexity of full agents must be rigorously justified through demonstrated performance gains that translate to business value.
This case study represents a mature, production-focused approach to LLM agent evaluation that balances the excitement of autonomous AI capabilities with the pragmatic requirements of reliable, cost-effective, and maintainable production systems.