Supervised Fine-Tuning for AI-Powered Travel Recommendations

Booking.com 2026
Booking.com built an AI Trip Planner to handle unstructured, natural language queries from travelers seeking personalized recommendations. The challenge was combining LLMs' ability to understand conversational requests with years of structured behavioral data (searches, clicks, bookings). Instead of relying solely on prompt engineering with external APIs, they used supervised fine-tuning on open-source LLMs with parameter-efficient methods. This approach delivered superior recommendation metrics and 3x faster inference compared to prompt-based solutions, while maintaining data privacy and security by keeping all processing internal.

Industry

E-commerce

Overview

Booking.com developed an AI Trip Planner, a conversational assistant integrated into their mobile app that helps travelers find destinations and accommodations through natural language chat. This case study focuses on their production deployment of supervised fine-tuning (SFT) techniques to build a recommendation system that bridges the gap between modern large language models and traditional machine learning approaches. The system is now available across multiple countries and languages, representing a significant production deployment of fine-tuned LLMs for e-commerce recommendations.

The core challenge Booking.com faced was the evolution of how travelers express their needs. Modern travelers increasingly use unstructured, nuanced natural language to describe their vacation preferences rather than interacting with traditional structured interfaces like filters and dropdown menus. These requests vary widely in specificity and clarity, ranging from vague inspirational queries to highly detailed requirements, all while remaining deeply personal. Traditional machine learning models, which excel with structured data and clear behavioral signals, struggle to interpret this level of expressiveness and conversational nuance.

Simultaneously, Booking.com possessed years of valuable structured behavioral data from their platform—searches, clicks, bookings, and other user interactions that capture actual traveler behavior and preferences. The strategic question became how to leverage both the conversational understanding capabilities of LLMs and the proprietary behavioral insights accumulated over years of operation. This case study illustrates their solution and provides important insights into the tradeoffs between different approaches to deploying LLMs in production.

Architectural Approach and Technical Decisions

Booking.com evaluated three primary approaches for building their recommendation system, each with distinct advantages and drawbacks from an LLMOps perspective.

Prompt-based solutions with external LLMs represented the first option. These systems excel at handling conversational data and can be implemented quickly with minimal infrastructure investment. They naturally adapt to the way travelers describe their needs in natural language, taking full advantage of state-of-the-art LLM capabilities. However, this approach presented significant concerns for production deployment. Sending sensitive user interaction data and proprietary business information to external LLM providers raises substantial privacy and security issues. Additionally, these systems rely entirely on prompt engineering, meaning there’s no ability to adjust the model’s internal weights to incorporate proprietary data or behavioral signals. Response times can be slower due to the network latency involved in external API calls, and the organization becomes dependent on third-party services with potential availability, pricing, and policy changes.

Traditional machine learning systems represented the other end of the spectrum. These systems keep all data internal, addressing privacy concerns by avoiding external data transmission. They effectively leverage behavioral signals like clicks and bookings, and can deliver fast recommendations with optimized inference pipelines. However, they require extensive preprocessing and feature engineering to handle unstructured natural language queries. More critically, they lack the deep semantic understanding and conversational capabilities that modern LLMs provide, making them less suitable for the increasingly conversational nature of user interactions.

Fine-tuning open-source LLMs emerged as Booking.com’s chosen solution, combining advantages from both approaches while mitigating many of their weaknesses. This approach enables a secure, context-aware recommender that understands natural language while leveraging proprietary behavioral data. By using open-source models, they eliminate external API dependencies and the associated privacy concerns. Critically, the case study reports achieving 3x faster inference compared to prompt-based solutions—a significant performance improvement for production systems where latency directly impacts user experience.

Parameter-Efficient Fine-Tuning Implementation

Given the computational and memory requirements of fine-tuning very large language models, Booking.com adopted parameter-efficient fine-tuning (PEFT) methods. While the provided text doesn’t detail the specific PEFT technique used (such as LoRA, prefix tuning, or adapter layers), the explicit mention of parameter efficiency indicates a conscious decision to balance model customization with resource constraints—a key consideration in production LLMOps.

Parameter-efficient fine-tuning updates only a small subset of model parameters or adds small trainable modules to a frozen base model, rather than updating all weights. This approach reduces the computational resources, memory footprint, and training time required while still adapting the model to domain-specific tasks. For a production system at Booking.com’s scale, these efficiency gains translate directly to reduced infrastructure costs and faster iteration cycles for model improvements.
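The case study does not name the specific PEFT technique used, but LoRA (low-rank adaptation) is the most widely adopted example of this pattern. The following from-scratch sketch is purely illustrative (not Booking.com's implementation) and shows why the approach is parameter-efficient: the large pretrained weight matrix stays frozen, and only two small low-rank matrices are trained.

```python
import numpy as np

class LoRALinear:
    """A frozen dense layer augmented with a trainable low-rank update.

    Instead of fine-tuning the full d_out x d_in weight matrix W, LoRA
    trains two small matrices A (rank x d_in) and B (d_out x rank), so
    the effective weight is W + B @ A. Only A and B receive gradients.
    """

    def __init__(self, d_in, d_out, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
        self.A = rng.normal(size=(rank, d_in)) * 0.01  # trainable
        self.B = np.zeros((d_out, rank))               # trainable, zero-init

    def forward(self, x):
        # Base output plus the low-rank correction; because B is zero at
        # init, the adapted model starts out identical to the pretrained one.
        return x @ (self.W + self.B @ self.A).T

    def trainable_fraction(self):
        full = self.W.size
        lora = self.A.size + self.B.size
        return lora / (full + lora)

layer = LoRALinear(d_in=4096, d_out=4096, rank=8)
print(f"trainable parameters: {layer.trainable_fraction():.2%} of total")
```

For a 4096x4096 layer at rank 8, the trainable matrices hold well under 1% of the parameters, which is where the savings in memory, compute, and training time come from.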

LLMOps Considerations and Production Tradeoffs

This case study illuminates several critical LLMOps considerations that organizations face when deploying LLMs in production environments.

Data Privacy and Security: By choosing to fine-tune open-source models rather than relying on external APIs, Booking.com maintained complete control over sensitive user data. This is particularly important given GDPR and other privacy regulations affecting the travel industry. All user interactions, booking history, and behavioral signals remain within their infrastructure, eliminating the compliance and trust issues associated with third-party data processing.

Inference Performance: The reported 3x speedup in inference time compared to prompt-based solutions represents a substantial operational improvement. In conversational AI applications, response latency directly impacts user experience and engagement. Faster inference also means higher throughput capacity, allowing the system to serve more users with the same infrastructure. This performance gain likely results from eliminating network round-trips to external APIs and potentially from model optimization techniques applied during or after fine-tuning.
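The arithmetic behind such a speedup is simple to illustrate. The numbers below are hypothetical (the case study reports no absolute latencies); they merely show how removing a network round-trip and running a smaller fine-tuned model can compound into a multiple-fold end-to-end gain.

```python
# Illustrative latency budget in milliseconds -- hypothetical figures,
# not taken from the case study.
external_api = {
    "network_round_trip_ms": 150,   # call to a third-party LLM provider
    "provider_inference_ms": 900,   # large general-purpose model
}
in_house = {
    "network_round_trip_ms": 0,     # model served inside own infrastructure
    "local_inference_ms": 350,      # smaller fine-tuned open-source model
}

ext_total = sum(external_api.values())   # 1050 ms
local_total = sum(in_house.values())     # 350 ms
speedup = ext_total / local_total
print(f"end-to-end speedup: {speedup:.1f}x")  # → 3.0x
```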

Model Customization and Control: Fine-tuning enables Booking.com to encode their proprietary behavioral data and domain knowledge directly into the model weights, going beyond what’s possible through prompt engineering alone. This represents a fundamental architectural advantage—the model learns patterns from actual booking behavior, user preferences, and successful recommendations rather than relying solely on the general knowledge encoded in the pre-trained model. This also provides greater control over model behavior, biases, and outputs compared to black-box external APIs.

Infrastructure and Operational Complexity: While not explicitly discussed in the text, the fine-tuning approach does introduce operational complexity. The organization must manage the entire model lifecycle: data preparation and curation for fine-tuning, training infrastructure, model versioning, deployment pipelines, and ongoing monitoring. This contrasts with prompt-based solutions where much of this complexity is abstracted away by the API provider. However, Booking.com appears to have determined that this tradeoff favors greater control and performance.

Multilingual and Multi-Market Deployment: The case study mentions the AI Trip Planner is available in multiple countries and languages, suggesting either the use of multilingual base models or multiple fine-tuned variants. This represents a significant production consideration—maintaining and operating potentially multiple model variants while ensuring consistent quality across languages and markets. The choice of open-source models and efficient fine-tuning methods likely makes this multi-market deployment more feasible than it would be with external APIs, where costs scale linearly with API calls across all markets.

Critical Assessment and Balanced Perspective

While the case study presents impressive results, it’s important to note that it comes from Booking.com and naturally emphasizes the positive aspects of their approach. The “3x faster inference” claim is notable but lacks contextual details—we don’t know the absolute latency numbers, what specific prompt-based solution was used as the baseline, or what model sizes are being compared. External API calls naturally incur network latency, so some speedup is expected, but the magnitude could vary significantly based on implementation details.

The text doesn’t discuss challenges encountered during fine-tuning, such as data quality issues, overfitting to historical patterns, or difficulties in model evaluation. Traditional recommendation metrics (clicks, bookings, conversion rates) may not fully capture the quality of conversational interactions, and the case study doesn’t detail how they evaluate the conversational aspects of the system.

Additionally, the article is incomplete (cutting off mid-discussion of parameter-efficient fine-tuning), so we lack details about the specific techniques used, training dataset construction, evaluation methodology, and ongoing maintenance procedures. These are critical components of any production LLMOps system.

The comparison between approaches is somewhat simplified. Real-world prompt engineering can be quite sophisticated, incorporating few-shot learning, retrieval-augmented generation, and other techniques that might narrow the performance gap. Similarly, modern traditional ML systems can incorporate transformer-based models and embeddings to handle unstructured text more effectively than implied.

Production Deployment Implications

From an LLMOps perspective, this case study illustrates several important principles for deploying LLMs in production:

The architectural choice between prompt engineering and fine-tuning isn’t binary. Many organizations successfully combine both approaches, using fine-tuned models for core functionality while supplementing with prompt engineering for flexibility and rapid iteration. Booking.com’s choice reflects their specific context: high user volume requiring low latency, sensitive proprietary data, and substantial historical behavioral data to leverage.

Infrastructure investment matters. The fine-tuning approach requires significant ML infrastructure and expertise. Organizations must evaluate whether they have the resources and technical capability to manage the entire model lifecycle. For smaller organizations or those with less ML maturity, prompt-based solutions might remain more practical despite their limitations.

Domain-specific adaptation drives value. The key advantage of fine-tuning in this case appears to be incorporating Booking.com’s specific domain knowledge and behavioral data—what travelers actually book and enjoy, not just what sounds plausible to a general-purpose LLM. This suggests that for recommendation systems and other tasks where proprietary data provides competitive advantage, fine-tuning or other forms of model customization offer substantial benefits.

Performance optimization is critical at scale. The 3x inference speedup matters significantly at Booking.com’s scale, potentially saving substantial infrastructure costs and improving user experience. Organizations should carefully profile and optimize their LLM deployments, considering model size, quantization, batching, caching, and other optimization techniques.
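Of the techniques listed, response caching is the easiest to sketch. The following minimal LRU cache with time-based expiry is illustrative only (nothing in the case study describes Booking.com's caching layer); it shows how repeated popular queries could skip inference entirely while stale recommendations age out.

```python
import time
from collections import OrderedDict

class ResponseCache:
    """Minimal LRU cache with time-based expiry for LLM responses.

    Identical queries (e.g. frequent destination questions) are answered
    from the cache; entries older than the TTL are discarded so that
    recommendations stay fresh.
    """

    def __init__(self, max_entries=10_000, ttl_seconds=3600, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self.clock = clock                # injectable for testing
        self._store = OrderedDict()       # query -> (response, inserted_at)

    def get(self, query):
        entry = self._store.get(query)
        if entry is None:
            return None
        response, inserted_at = entry
        if self.clock() - inserted_at > self.ttl:
            del self._store[query]        # expired entry
            return None
        self._store.move_to_end(query)    # mark as recently used
        return response

    def put(self, query, response):
        self._store[query] = (response, self.clock())
        self._store.move_to_end(query)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

In practice a production cache would also need normalization of semantically equivalent queries and per-user scoping, both of which are out of scope for this sketch.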

This case study represents a mature approach to production LLM deployment, moving beyond the initial experimentation phase to build a system that balances multiple competing concerns: conversational capability, recommendation quality, inference performance, data privacy, and operational control. While we should view the specific claims with appropriate skepticism given the promotional nature of the content, the overall architectural approach and tradeoff analysis provide valuable insights for organizations deploying similar systems.
