Supervised Fine-Tuning for AI-Powered Travel Recommendations

Booking.com 2026
Booking.com built an AI Trip Planner to handle unstructured, natural language queries from travelers seeking personalized recommendations. The challenge was combining LLMs' ability to understand conversational requests with years of structured behavioral data (searches, clicks, bookings). Instead of relying solely on prompt engineering with external APIs, they used supervised fine-tuning on open-source LLMs with parameter-efficient methods. This approach delivered superior recommendation metrics and 3x faster inference compared to prompt-based solutions, while maintaining data privacy and security by keeping all processing internal.

Industry

E-commerce

Overview

Booking.com developed an AI Trip Planner, a conversational assistant integrated into their mobile app that helps travelers find destinations and accommodations through natural language chat. This case study focuses on their production deployment of supervised fine-tuning (SFT) techniques to build a recommendation system that bridges the gap between modern large language models and traditional machine learning approaches. The system is now available across multiple countries and languages, representing a significant production deployment of fine-tuned LLMs for e-commerce recommendations.

The core challenge Booking.com faced was the evolution of how travelers express their needs. Modern travelers increasingly use unstructured, nuanced natural language to describe their vacation preferences rather than interacting with traditional structured interfaces like filters and dropdown menus. These requests vary widely in specificity and clarity, ranging from vague inspirational queries to highly detailed requirements, all while remaining deeply personal. Traditional machine learning models, which excel with structured data and clear behavioral signals, struggle to interpret this level of expressiveness and conversational nuance.

Simultaneously, Booking.com possessed years of valuable structured behavioral data from their platform—searches, clicks, bookings, and other user interactions that capture actual traveler behavior and preferences. The strategic question became how to leverage both the conversational understanding capabilities of LLMs and the proprietary behavioral insights accumulated over years of operation. This case study illustrates their solution and provides important insights into the tradeoffs between different approaches to deploying LLMs in production.

Architectural Approach and Technical Decisions

Booking.com evaluated three primary approaches for building their recommendation system, each with distinct advantages and drawbacks from an LLMOps perspective.

Prompt-based solutions with external LLMs represented the first option. These systems excel at handling conversational data and can be implemented quickly with minimal infrastructure investment. They naturally adapt to the way travelers describe their needs in natural language, taking full advantage of state-of-the-art LLM capabilities. However, this approach presented significant concerns for production deployment. Sending sensitive user interaction data and proprietary business information to external LLM providers raises substantial privacy and security issues. Additionally, these systems rely entirely on prompt engineering, meaning there’s no ability to adjust the model’s internal weights to incorporate proprietary data or behavioral signals. Response times can be slower due to the network latency involved in external API calls, and the organization becomes dependent on third-party services with potential availability, pricing, and policy changes.

Traditional machine learning systems represented the other end of the spectrum. These systems keep all data internal, addressing privacy concerns by avoiding external data transmission. They effectively leverage behavioral signals like clicks and bookings, and can deliver fast recommendations with optimized inference pipelines. However, they require extensive preprocessing and feature engineering to handle unstructured natural language queries. More critically, they lack the deep semantic understanding and conversational capabilities that modern LLMs provide, making them less suitable for the increasingly conversational nature of user interactions.

Fine-tuning open-source LLMs emerged as Booking.com’s chosen solution, combining advantages from both approaches while mitigating many of their weaknesses. This approach enables a secure, context-aware recommender that understands natural language while leveraging proprietary behavioral data. By using open-source models, they eliminate external API dependencies and the associated privacy concerns. Critically, the case study reports achieving 3x faster inference compared to prompt-based solutions—a significant performance improvement for production systems where latency directly impacts user experience.

Parameter-Efficient Fine-Tuning Implementation

Given the computational and memory requirements of fine-tuning very large language models, Booking.com adopted parameter-efficient fine-tuning (PEFT) methods. While the provided text doesn’t detail the specific PEFT technique used (such as LoRA, prefix tuning, or adapter layers), the explicit mention of parameter efficiency indicates a conscious decision to balance model customization with resource constraints—a key consideration in production LLMOps.

Parameter-efficient fine-tuning updates only a small subset of model parameters or adds small trainable modules to a frozen base model, rather than updating all weights. This approach reduces the computational resources, memory footprint, and training time required while still adapting the model to domain-specific tasks. For a production system at Booking.com’s scale, these efficiency gains translate directly to reduced infrastructure costs and faster iteration cycles for model improvements.
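The case study does not name the specific PEFT technique used, but LoRA (low-rank adaptation) is the most widely adopted example of this pattern. The following from-scratch sketch is purely illustrative (not Booking.com's implementation) and shows why the approach is parameter-efficient: the large pretrained weight matrix stays frozen, and only two small low-rank matrices are trained.

```python
import numpy as np

class LoRALinear:
    """A frozen dense layer augmented with a trainable low-rank update.

    Instead of fine-tuning the full d_out x d_in weight matrix W, LoRA
    trains two small matrices A (rank x d_in) and B (d_out x rank), so
    the effective weight is W + B @ A. Only A and B receive gradients.
    """

    def __init__(self, d_in, d_out, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
        self.A = rng.normal(size=(rank, d_in)) * 0.01  # trainable
        self.B = np.zeros((d_out, rank))               # trainable, zero-init

    def forward(self, x):
        # Base output plus the low-rank correction; because B is zero at
        # init, the adapted model starts out identical to the pretrained one.
        return x @ (self.W + self.B @ self.A).T

    def trainable_fraction(self):
        full = self.W.size
        lora = self.A.size + self.B.size
        return lora / (full + lora)

layer = LoRALinear(d_in=4096, d_out=4096, rank=8)
print(f"trainable parameters: {layer.trainable_fraction():.2%} of total")
```

For a 4096x4096 layer at rank 8, the trainable matrices hold well under 1% of the parameters, which is where the savings in memory, compute, and training time come from.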

LLMOps Considerations and Production Tradeoffs

This case study illuminates several critical LLMOps considerations that organizations face when deploying LLMs in production environments.

Data Privacy and Security: By choosing to fine-tune open-source models rather than relying on external APIs, Booking.com maintained complete control over sensitive user data. This is particularly important given GDPR and other privacy regulations affecting the travel industry. All user interactions, booking history, and behavioral signals remain within their infrastructure, eliminating the compliance and trust issues associated with third-party data processing.

Inference Performance: The reported 3x speedup in inference time compared to prompt-based solutions represents a substantial operational improvement. In conversational AI applications, response latency directly impacts user experience and engagement. Faster inference also means higher throughput capacity, allowing the system to serve more users with the same infrastructure. This performance gain likely results from eliminating network round-trips to external APIs and potentially from model optimization techniques applied during or after fine-tuning.
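The arithmetic behind such a speedup is simple to illustrate. The numbers below are hypothetical (the case study reports no absolute latencies); they merely show how removing a network round-trip and running a smaller fine-tuned model can compound into a multiple-fold end-to-end gain.

```python
# Illustrative latency budget in milliseconds -- hypothetical figures,
# not taken from the case study.
external_api = {
    "network_round_trip_ms": 150,   # call to a third-party LLM provider
    "provider_inference_ms": 900,   # large general-purpose model
}
in_house = {
    "network_round_trip_ms": 0,     # model served inside own infrastructure
    "local_inference_ms": 350,      # smaller fine-tuned open-source model
}

ext_total = sum(external_api.values())   # 1050 ms
local_total = sum(in_house.values())     # 350 ms
speedup = ext_total / local_total
print(f"end-to-end speedup: {speedup:.1f}x")  # → 3.0x
```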

Model Customization and Control: Fine-tuning enables Booking.com to encode their proprietary behavioral data and domain knowledge directly into the model weights, going beyond what’s possible through prompt engineering alone. This represents a fundamental architectural advantage—the model learns patterns from actual booking behavior, user preferences, and successful recommendations rather than relying solely on the general knowledge encoded in the pre-trained model. This also provides greater control over model behavior, biases, and outputs compared to black-box external APIs.

Infrastructure and Operational Complexity: While not explicitly discussed in the text, the fine-tuning approach does introduce operational complexity. The organization must manage the entire model lifecycle: data preparation and curation for fine-tuning, training infrastructure, model versioning, deployment pipelines, and ongoing monitoring. This contrasts with prompt-based solutions where much of this complexity is abstracted away by the API provider. However, Booking.com appears to have determined that this tradeoff favors greater control and performance.

Multilingual and Multi-Market Deployment: The case study mentions the AI Trip Planner is available in multiple countries and languages, suggesting either the use of multilingual base models or multiple fine-tuned variants. This represents a significant production consideration—maintaining and operating potentially multiple model variants while ensuring consistent quality across languages and markets. The choice of open-source models and efficient fine-tuning methods likely makes this multi-market deployment more feasible than it would be with external APIs, where costs scale linearly with API calls across all markets.

Critical Assessment and Balanced Perspective

While the case study presents impressive results, it’s important to note that it comes from Booking.com and naturally emphasizes the positive aspects of their approach. The “3x faster inference” claim is notable but lacks contextual details—we don’t know the absolute latency numbers, what specific prompt-based solution was used as the baseline, or what model sizes are being compared. External API calls naturally incur network latency, so some speedup is expected, but the magnitude could vary significantly based on implementation details.

The text doesn’t discuss challenges encountered during fine-tuning, such as data quality issues, overfitting to historical patterns, or difficulties in model evaluation. Traditional recommendation metrics (clicks, bookings, conversion rates) may not fully capture the quality of conversational interactions, and the case study doesn’t detail how they evaluate the conversational aspects of the system.

Additionally, the article is incomplete (cutting off mid-discussion of parameter-efficient fine-tuning), so we lack details about the specific techniques used, training dataset construction, evaluation methodology, and ongoing maintenance procedures. These are critical components of any production LLMOps system.

The comparison between approaches is somewhat simplified. Real-world prompt engineering can be quite sophisticated, incorporating few-shot learning, retrieval-augmented generation, and other techniques that might narrow the performance gap. Similarly, modern traditional ML systems can incorporate transformer-based models and embeddings to handle unstructured text more effectively than implied.

Production Deployment Implications

From an LLMOps perspective, this case study illustrates several important principles for deploying LLMs in production:

The architectural choice between prompt engineering and fine-tuning isn’t binary. Many organizations successfully combine both approaches, using fine-tuned models for core functionality while supplementing with prompt engineering for flexibility and rapid iteration. Booking.com’s choice reflects their specific context: high user volume requiring low latency, sensitive proprietary data, and substantial historical behavioral data to leverage.

Infrastructure investment matters. The fine-tuning approach requires significant ML infrastructure and expertise. Organizations must evaluate whether they have the resources and technical capability to manage the entire model lifecycle. For smaller organizations or those with less ML maturity, prompt-based solutions might remain more practical despite their limitations.

Domain-specific adaptation drives value. The key advantage of fine-tuning in this case appears to be incorporating Booking.com’s specific domain knowledge and behavioral data—what travelers actually book and enjoy, not just what sounds plausible to a general-purpose LLM. This suggests that for recommendation systems and other tasks where proprietary data provides competitive advantage, fine-tuning or other forms of model customization offer substantial benefits.

Performance optimization is critical at scale. The 3x inference speedup matters significantly at Booking.com’s scale, potentially saving substantial infrastructure costs and improving user experience. Organizations should carefully profile and optimize their LLM deployments, considering model size, quantization, batching, caching, and other optimization techniques.
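Of the techniques listed, response caching is the easiest to sketch. The following minimal LRU cache with time-based expiry is illustrative only (nothing in the case study describes Booking.com's caching layer); it shows how repeated popular queries could skip inference entirely while stale recommendations age out.

```python
import time
from collections import OrderedDict

class ResponseCache:
    """Minimal LRU cache with time-based expiry for LLM responses.

    Identical queries (e.g. frequent destination questions) are answered
    from the cache; entries older than the TTL are discarded so that
    recommendations stay fresh.
    """

    def __init__(self, max_entries=10_000, ttl_seconds=3600, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self.clock = clock                # injectable for testing
        self._store = OrderedDict()       # query -> (response, inserted_at)

    def get(self, query):
        entry = self._store.get(query)
        if entry is None:
            return None
        response, inserted_at = entry
        if self.clock() - inserted_at > self.ttl:
            del self._store[query]        # expired entry
            return None
        self._store.move_to_end(query)    # mark as recently used
        return response

    def put(self, query, response):
        self._store[query] = (response, self.clock())
        self._store.move_to_end(query)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

In practice a production cache would also need normalization of semantically equivalent queries and per-user scoping, both of which are out of scope for this sketch.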

This case study represents a mature approach to production LLM deployment, moving beyond the initial experimentation phase to build a system that balances multiple competing concerns: conversational capability, recommendation quality, inference performance, data privacy, and operational control. While we should view the specific claims with appropriate skepticism given the promotional nature of the content, the overall architectural approach and tradeoff analysis provide valuable insights for organizations deploying similar systems.
