AirBnB evolved their Automation Platform from a static workflow-based conversational AI system to a comprehensive LLM-powered platform. The new version (v2) combines traditional workflows with LLM capabilities, introducing features like Chain of Thought reasoning, robust context management, and a guardrails framework. This hybrid approach allows them to leverage LLM benefits while maintaining control over sensitive operations, ultimately enabling customer support agents to work more efficiently while ensuring safe and reliable AI interactions.
AirBnB’s engineering team published details about their evolution of the Automation Platform, their internal conversational AI infrastructure, from a traditional workflow-based system (v1) to an LLM-native platform (v2). This case study provides insight into how a large-scale technology company is transitioning from rigid, predefined conversational AI workflows to more flexible LLM-powered applications while maintaining production-level reliability and safety.
The platform serves as the backbone for customer support automation at AirBnB, powering both customer-facing chatbots and tools for customer support agents. The evolution represents a significant architectural shift from manually designed conversation flows to LLM-driven reasoning systems, while maintaining backward compatibility with existing traditional workflows.
The original Automation Platform v1 was built to support traditional conversational AI products like chatbots using predefined step-by-step workflows. While functional, this approach had notable limitations that motivated the transition to v2:
The system lacked flexibility because AI products followed rigid, predefined processes that couldn’t adapt to the natural flow of customer conversations. Additionally, the platform was difficult to scale since product creators needed to manually create workflows and tasks for every scenario, repeating this process for each new use case. This manual approach was time-consuming and error-prone, creating a bottleneck for expanding automation capabilities.
Early experiments at AirBnB demonstrated that LLM-powered conversations could provide more natural and intelligent experiences compared to human-designed workflows. Customers could engage in natural dialogue, ask open-ended questions, and explain issues in detail, while LLMs could more accurately interpret queries and capture nuanced information from ongoing conversations.
However, the team acknowledges that LLM-powered applications are still maturing for production use. They explicitly note concerns about latency and hallucination as areas where the community is still improving. For sensitive operations like claim processing that require strict data validation, traditional workflows may still be more appropriate. This balanced view is refreshing compared to many case studies that present LLMs as a universal solution.
The Automation Platform v2 operates through a request-response cycle that orchestrates between user inputs, context management, LLM inference, and tool execution. When a user inquiry arrives, the platform collects relevant contextual information including previous chat history, user ID, user role, and other relevant data. It then loads and assembles prompts using the inquiry and context before sending requests to the LLM.
A key architectural decision is that LLM responses can request tool execution. For example, when a user asks “where is my next reservation?”, the LLM can request a service call to fetch reservation data. The platform handles this tool execution, saves results to the current context, and sends updated context back to the LLM for final response generation.
The platform implements Chain of Thought reasoning as a core workflow pattern. This AI agent framework enables LLMs to reason about issues by using the LLM as a reasoning engine to determine which tools to use and in which order. Tools serve as the mechanism for LLMs to interact with external systems and solve real problems, such as checking reservation status or listing availability.
An important design choice is that the existing actions and workflows from the v1 platform work well as tools in the Chain of Thought system because of their unified interface and managed execution environment. This allows for backward compatibility and reuse of existing automation building blocks.
The Chain of Thought workflow consists of several main steps: preparing context for the LLM (including prompt, contextual data, and historical conversations), then entering a reasoning loop where the system asks the LLM for reasoning, executes any LLM-requested tools, and processes outcomes. The workflow continues until a final result is generated.
Three high-level components power this Chain of Thought implementation:
The CoT IO handler assembles prompts, prepares contextual data, collects user input, and handles general data processing before sending requests to the LLM. The Tool Manager prepares tool payloads with LLM input and output, manages tool execution, and provides quality-of-life features like retry logic and rate limiting. The LLM Adapter allows developers to add customized logic for integrating with different types of LLMs, providing flexibility for model selection and management.
Context management is identified as a key component ensuring LLMs have access to all necessary information for decision-making. The system provides historical interactions, customer support inquiry intent, current trip information, and other relevant data.
An interesting capability mentioned is point-in-time data retrieval for offline evaluation use cases, which is configurable. This suggests the team is building evaluation capabilities that can replay historical scenarios with consistent context.
Developers can either statically declare needed context (like customer name) or specify dynamic context retrievers (such as relevant help articles for customer questions). This flexibility allows for both simple and complex context requirements.
The context management architecture includes a Context Loader that connects to different sources and fetches relevant context based on developers’ customizable fetching logic, and a Runtime Context Manager that maintains runtime context, processes context for each LLM call, and interacts with context storage systems.
The platform includes a Guardrails Framework as a safeguarding mechanism that monitors communications with the LLM to ensure responses are helpful, relevant, and ethical. This addresses common LLM concerns including hallucinations and jailbreak attempts.
The architecture allows engineers from different teams to create reusable guardrails. During runtime, guardrails can execute in parallel and leverage different downstream technology stacks. Examples mentioned include content moderation guardrails that call various LLMs to detect violations in communication content, and tool guardrails that use rules to prevent bad execution (such as updating listings with invalid configurations).
The case study mentions several supporting components for LLM development and operations:
A Playground feature bridges the gap between development and production tech stacks by allowing prompt writers to freely iterate on their prompts. This is crucial for the iterative nature of prompt engineering.
LLM-oriented observability provides detailed insights into each LLM interaction, including latency and token usage metrics. This monitoring capability is essential for understanding production performance and costs.
Enhanced Tool Management handles tool registration, publishing processes, execution, and observability. This systematizes the integration between LLM reasoning and external service calls.
A notable aspect of this case study is the team’s explicit acknowledgment that a hybrid approach combining traditional workflows with LLM-powered systems is currently the best strategy. They state that “it is too early to fully rely on [LLM-powered applications] for large scale and diverse experience for millions of customers.”
This pragmatic stance recognizes that certain use cases, particularly those involving sensitive data and strict validations (like claim processing), may be better served by traditional workflow systems. The platform design supports this hybrid model by allowing LLM-powered Chain of Thought workflows to coexist with and leverage existing traditional automation building blocks.
The team indicates they will continue evolving the platform with transformative technologies, exploring other AI agent frameworks beyond Chain of Thought, expanding tool capabilities, and investigating LLM application simulation. This suggests ongoing investment in agent architectures and testing methodologies.
While this case study provides valuable architectural insights, it lacks quantitative metrics on performance improvements, cost considerations, or specific success rates. The absence of concrete results data makes it difficult to assess the actual production impact. The case study reads somewhat as an engineering blog post showcasing technical capabilities rather than a comprehensive results-oriented case study.
The platform appears well-architected for managing the complexity of production LLM applications, with thoughtful attention to context management, safety guardrails, and developer experience. The hybrid approach acknowledging LLM limitations demonstrates engineering maturity. However, the real-world effectiveness at scale remains somewhat unclear from the available information.
The focus on customer support use cases is sensible given the conversational nature of LLMs and the potential for significant efficiency gains in this domain. The integration with existing automation infrastructure shows pragmatic thinking about organizational adoption and migration paths.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.
Nubank, one of Brazil's largest banks serving 120 million users, implemented large-scale LLM systems to create an AI private banker for their customers. They deployed two main applications: a customer service chatbot handling 8.5 million monthly contacts with 60% first-contact resolution through LLMs, and an agentic money transfer system that reduced transaction time from 70 seconds across nine screens to under 30 seconds with over 90% accuracy and less than 0.5% error rate. The implementation leveraged LangChain, LangGraph, and LangSmith for development and evaluation, with a comprehensive four-layer ecosystem including core engines, testing tools, and developer experience platforms. Their evaluation strategy combined offline and online testing with LLM-as-a-judge systems that achieved 79% F1 score compared to 80% human accuracy through iterative prompt engineering and fine-tuning.