Company
Otter
Title
LLM-Powered Customer Support Agent Handling 50% of Inbound Requests
Industry
Tech
Year
2024
Summary (short)
Otter, a delivery-native restaurant hardware and software provider, built an in-house LLM-powered support agent called Otter Assistant to handle the high volume of customer support requests generated by their broad feature set and integrations. The company chose to build rather than buy after determining that existing vendors in Q1 2024 relied on hard-coded decision trees and lacked the deep integration flexibility required. Through an agentic architecture using function calling, runbooks, API integrations, confirmation widgets, and RAG-based research capabilities, Otter Assistant now autonomously resolves approximately 50% of inbound customer support requests while maintaining customer satisfaction and seamless escalation to human agents when needed.
## Overview

Otter Assistant represents a comprehensive production implementation of an LLM-powered customer support agent built by Otter, a company providing delivery-native restaurant hardware and software solutions. Published in July 2025, this case study documents their journey over approximately one year (starting Q1 2024) building and scaling a chatbot that currently handles roughly 50% of inbound customer requests autonomously. The case is particularly interesting from an LLMOps perspective because it showcases a build-versus-buy decision that favored in-house development, a sophisticated agentic architecture, and custom tooling for testing, evaluation, and management of LLM systems in production.

The business context is important: Otter offers a broad suite of products with numerous features, integrations, and customization options for restaurant operations. This breadth naturally creates significant demand for customer support. The company recognized that while customers appreciate speed and reliability, they also value having the option to escalate to 24/7 human agents when needed. This human-in-the-loop consideration shaped many of their design decisions.

## Build Versus Buy Decision

The case study provides valuable insight into the vendor landscape as of Q1 2024. Otter's analysis revealed that resolving their customer tickets required deep integration with internal systems—support agents needed tightly controlled permissions to review menus, update accounts, and modify configurations like remote printer settings. At that time, no vendors offered the required integration flexibility without relying on hard-coded decision trees. The appendix of the case study details their vendor comparison, noting that established vendors like Zendesk primarily featured hard-coded decision trees and were still determining their LLM product strategy, while the LLM-native startups they evaluated weren't capable of managing the complexity required for their top issues.

Critically, the team observed that LLMs significantly reduced the value proposition of traditional vendor infrastructure. Features like Zendesk's workflow configuration UIs and NLP-based intent matching became less necessary with LLM capabilities, allowing Otter to focus on domain-specific problem solving. They identified four key requirements: LLM-native operation (no hard-coded trees), the ability to choose models and control prompts, the ability to update user accounts via API calls while maintaining access controls, and seamless bot-to-human escalation within a single chat window. These requirements drove their decision to build in-house while initially leveraging Zendesk's Sunco Web SDK for the front end (later replaced with a custom solution).

## Architecture: Agentic Approach and Function Calling

The architecture spans online conversation flows and offline management flows. Interestingly, the team notes that when they began implementation in Q2 2024, the term "agentic" hadn't yet caught on, but by emulating how human support agents diagnose and resolve issues, they naturally arrived at an agentic approach. Their design philosophy was to make the bot mimic the human agent workflow: identify the corresponding predefined procedure for a customer request, follow its steps if one exists, conduct research in the knowledge base if not, and escalate when encountering issues or missing information.
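The case study does not publish source code, so as a rough illustration only, the sketch below shows how this kind of workflow could be wired up as a function-calling loop with the OpenAI Python SDK. The tool names, schemas, model choice, and `dispatch` stub are hypothetical stand-ins for the function types described in the next section (only three of them are registered here for brevity; API-call and widget tools would be added the same way).

```python
# Illustrative sketch only: tool names, schemas, and the dispatch stub are
# assumptions mirroring the described workflow, not Otter's published code.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {"type": "function", "function": {
        "name": "get_runbook",
        "description": "Find the predefined procedure matching the customer's issue.",
        "parameters": {"type": "object",
                       "properties": {"issue_description": {"type": "string"}},
                       "required": ["issue_description"]}}},
    {"type": "function", "function": {
        "name": "research",
        "description": "Search help-center articles when no runbook matches.",
        "parameters": {"type": "object",
                       "properties": {"question": {"type": "string"}},
                       "required": ["question"]}}},
    {"type": "function", "function": {
        "name": "escalate_to_human",
        "description": "Hand the conversation over to a live support agent.",
        "parameters": {"type": "object", "properties": {}}}},
]

def dispatch(name: str, args: dict) -> dict:
    """Placeholder router; a real system would invoke the runbook retriever,
    knowledge-base search, internal APIs, or the human-escalation path."""
    return {"status": "not_implemented", "tool": name, "args": args}

def run_turn(messages: list) -> list:
    """One agent turn: the model either answers directly or calls a tool."""
    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=TOOLS)
    msg = response.choices[0].message
    messages.append(msg)  # keep the assistant turn (including tool calls) in context
    for call in msg.tool_calls or []:
        result = dispatch(call.function.name, json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})
    return messages
```

In a real deployment the loop would repeat until the model produces a user-facing message, which is the property that lets plain-text runbooks replace hard-coded decision trees.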
This philosophy manifests in four main function types that form the core of their system:

**GetRunbook Function**: This function serves as the bot's primary routing and orchestration mechanism. After analyzing support issues by volume and resolution complexity, Otter translated high-volume, low-to-medium complexity issues into "runbooks"—plain text instructions detailing the diagnostic and resolution steps the bot should take. This is a key differentiator from prior-generation bot technology: while runbooks conceptually function like decision trees, being written in plain text makes them significantly easier to implement and maintain, more modular, and more traversable during runtime diagnosis.

The mechanics of GetRunbook are sophisticated. It takes the user's issue description as input and attempts to find a corresponding runbook. Under the hood, this involves embedding-based retrieval from a vector database containing all runbooks, using semantic similarity to identify relevant candidates. A separate LLM call then selects the correct runbook from the candidates or returns "Not Found" if no good match exists. Once a runbook is matched, the LLM works through the listed steps, gathering follow-up information from users and executing API calls as needed until reaching the end. This represents a practical implementation of RAG (Retrieval Augmented Generation) patterns combined with agentic execution.
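As a rough sketch of that retrieval-and-selection pattern (the embedding model, the `vector_store` client and its `similarity_search` method, and the prompt wording are assumptions, not details disclosed in the post):

```python
# Hypothetical sketch of GetRunbook-style retrieval and selection; the
# embedding model, vector-store API, and prompt text are assumptions.
from openai import OpenAI

client = OpenAI()
NOT_FOUND = "Not Found"

def embed(text: str) -> list[float]:
    """Embed text with an off-the-shelf embedding model (choice assumed)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def get_runbook(issue_description: str, vector_store, top_k: int = 5) -> str:
    """Retrieve candidate runbooks by semantic similarity, then let a
    separate LLM call pick the right one or declare that none matches."""
    # 1) Embedding-based retrieval of candidates; the vector store is assumed
    #    to return a list of (title, body) tuples.
    candidates = vector_store.similarity_search(embed(issue_description), k=top_k)

    # 2) A separate LLM call selects the correct runbook or returns "Not Found".
    numbered = "\n\n".join(
        f"[{i}] {title}\n{body}" for i, (title, body) in enumerate(candidates))
    selection = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Pick the single runbook that matches the customer's "
                        "issue. Reply with its index, or 'Not Found' if none fits."},
            {"role": "user",
             "content": f"Issue: {issue_description}\n\nRunbooks:\n{numbered}"},
        ],
    ).choices[0].message.content.strip()

    if selection == NOT_FOUND or not selection.isdigit():
        return NOT_FOUND
    return candidates[int(selection)][1]  # plain-text steps for the agent to follow
```

The returned plain-text steps would then be fed back into the conversation as the tool result, which is what lets the same agent loop "follow" a runbook without any hard-coded branching.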
**API Call Functions**: As the bot executes runbook steps, it can choose from API wrapper functions to gather information (like fetching store status) or modify user accounts. The team was able to reuse pre-existing APIs within the Otter ecosystem, which is a significant advantage of building in-house. A critical security consideration is implemented here: for internal APIs, the backend is called with the user's token passed as part of each Otter Assistant service request. This approach maintains and reuses existing permission control models and authentication infrastructure, ensuring the bot cannot access data the user shouldn't have access to. This is an excellent example of applying the principle of least privilege and integrating with existing security infrastructure rather than building parallel systems.

**Widget Functions**: After identifying a root cause, the bot takes appropriate action, and for most write operations (simple account modifications being the exception), the action is presented through a "widget"—an embedded UI module. The example given is a store pause/unpause widget. Widgets provide several LLMOps benefits: encapsulation and reuse across different conversation flows, distributed ownership (different teams can own different widgets), information density in the UI, and critically, easy user confirmation that eliminates hallucination risk. For any critical write operation, explicit user review and click confirmation is required before execution. This represents a thoughtful approach to managing LLM hallucination risks in production—rather than trying to eliminate hallucination through prompting alone, they architect the system so that hallucinations cannot cause harm. The bot calls the widget function (informing the LLM that a widget is being displayed) and simultaneously emits a notification to the external chat UI, which renders the widget within the message.

**Research Function**: This function handles user questions that don't match a runbook and is designed to mimic how humans find answers in help articles online. The implementation follows a multi-step RAG pattern: help articles from Otter's knowledge base are converted to embeddings offline and stored in a vector database. When a request arrives, the user question is converted to embeddings and semantic similarity retrieves the top relevant articles. The system then issues a separate LLM request for each top article to extract a relevant answer, stopping once it has found n answers or worked through m results (both configurable parameters). Finally, a separate LLM call combines the answers into a final response. This multi-stage approach with configurable parameters shows mature thinking about RAG system design—they're not just doing naive retrieval and generation but implementing controllable, iterative search.
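A minimal sketch of that multi-stage research pattern, with the stopping parameters exposed explicitly; the function names, prompts, and the `retrieve` callable are assumptions, since the post only describes the flow:

```python
# Hypothetical sketch of the multi-step research flow: per-article answer
# extraction with configurable stopping, followed by a synthesis call.
from openai import OpenAI

client = OpenAI()

def ask_article(question: str, article: str) -> str | None:
    """Ask the LLM whether a single help article answers the question."""
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer the question using only this article. "
                        "Reply 'NO ANSWER' if the article does not cover it."},
            {"role": "user", "content": f"Article:\n{article}\n\nQuestion: {question}"},
        ],
    ).choices[0].message.content.strip()
    return None if reply == "NO ANSWER" else reply

def research(question: str, retrieve, n_answers: int = 2, m_results: int = 6) -> str:
    """Retrieve top articles, extract answers one by one, stop after finding
    n_answers or scanning m_results, then combine them into a final response."""
    answers = []
    for article in retrieve(question, k=m_results):  # retrieval API assumed
        answer = ask_article(question, article)
        if answer:
            answers.append(answer)
        if len(answers) >= n_answers:
            break
    if not answers:
        return "I couldn't find this in our help articles."
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Combine these findings into one concise reply "
                              "for the customer:\n\n" + "\n\n".join(answers)}],
    ).choices[0].message.content
```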
**EscalateToHuman Function**: This gives the LLM the capability to signal that a conversation should be escalated to a human agent. When the LLM detects user intent to escalate, the system calls Zendesk to connect the customer to a live agent and the chat interface passes conversation control to the assigned human. This seamless escalation path is crucial for maintaining customer satisfaction and represents good product thinking—the bot knows its limits.

## Testing and Evaluation Framework

One of the most valuable aspects of this case study from an LLMOps perspective is the detailed discussion of testing and management infrastructure. The team recognized that the inherent randomness and unpredictability in LLM-powered conversational flows required bespoke tooling beyond traditional software testing approaches.

**Local Development and Playground**: Given the stochastic nature of LLMs and the multi-modal nature of conversations (encompassing both text and bot actions/widgets), developers need effective debugging tools. Otter built a Streamlit-based library providing a web UI where developers can interact with the bot while viewing the input and output arguments for each function call. This allows verification of end-to-end flow correctness. The choice of Streamlit is pragmatic—it's quick to develop with and provides adequate functionality for internal tooling.

**Bot Validation Testing**: This is where their approach gets particularly innovative. They recognized that traditional software testing frameworks rely on deterministic execution and structured output, but LLM systems are inherently stochastic, requiring multiple conversation iterations to expose and verify specific behaviors. Additionally, changing prompt logic in one place could cause unanticipated behavior changes elsewhere that are difficult to detect. Their solution was to develop a custom test and evaluation framework with four components: predefined test scenarios (e.g., "customer's store is paused"), expected behaviors for each scenario (e.g., "confirm which store, check status, then launch widget"), a chatbot that uses an LLM to play the customer role and chat with their bot, and an LLM acting as a judge to assert on expected behaviors based on conversation transcripts. This "LLM as judge" approach is increasingly common in LLMOps but was less established in mid-2024 when they were building this. The framework allows them to evaluate chatbots through a mechanism similar to traditional unit tests—defining inputs and asserting on expected outputs—while accommodating the non-deterministic nature of LLM systems.
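A stripped-down sketch of that simulate-and-judge pattern is shown below; the scenario text, prompts, and the `support_bot_reply` hook are hypothetical, since Otter's framework is built around their own bot interface and is not published.

```python
# Hypothetical sketch of an LLM-vs-LLM validation test with an LLM judge;
# the support_bot_reply() hook and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()

def simulate_customer(scenario: str, transcript: list[str]) -> str:
    """An LLM plays the customer described by the test scenario."""
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system",
                   "content": f"You are a customer. Scenario: {scenario}. "
                              "Keep messages short and realistic."},
                  {"role": "user", "content": "\n".join(transcript) or "Start the chat."}],
    ).choices[0].message.content

def judge(transcript: list[str], expected_behaviors: list[str]) -> bool:
    """An LLM judge asserts whether the bot exhibited the expected behaviors."""
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system",
                   "content": "Given the transcript, did the support bot perform all "
                              "of the expected behaviors? Answer PASS or FAIL only."},
                  {"role": "user",
                   "content": "Expected:\n- " + "\n- ".join(expected_behaviors)
                              + "\n\nTranscript:\n" + "\n".join(transcript)}],
    ).choices[0].message.content.strip()
    return verdict.upper().startswith("PASS")

def test_store_paused(support_bot_reply, max_turns: int = 6) -> bool:
    """Unit-test-style check: a scenario goes in, a judged assertion comes out."""
    scenario = "The customer's store is paused and they want it back online."
    expected = ["confirm which store", "check store status",
                "launch the pause/unpause widget"]
    transcript: list[str] = []
    for _ in range(max_turns):
        transcript.append("Customer: " + simulate_customer(scenario, transcript))
        transcript.append("Bot: " + support_bot_reply(transcript))  # system under test
    return judge(transcript, expected)
```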
**Bot Conversation Review and Analytics**: Post-deployment, they needed to understand performance. They defined and instrumented a "resolution" metric informing overall bot performance and business impact, helping identify issues and improvement opportunities. However, bot issue analysis presents challenges compared to traditional software—bots can err in many ways at both the software layer and the model layer, and manual inspection is often required to determine which. To streamline conversation review, they built a conversation inspector tool in Streamlit allowing reviewers to load past conversations and visualize chat history and action logs similarly to the local testing app. Importantly, this tool is available to both developers and non-developers, which has helped scale their investigation efforts. Making evaluation tools accessible to non-technical stakeholders is excellent practice for LLMOps—it democratizes the ability to understand and improve the system.

## Lessons Learned and Production Considerations

The team notes that when they began implementing Otter Assistant in 2024, there were no established bot guidelines or frameworks. While frameworks have begun to emerge (they mention OpenAI's Agents SDK as an example), they still feel building in-house was the right decision for them. They recommend other organizations weigh build-versus-buy according to their abilities and the degree of control and customization required for their use cases. This is balanced advice—they're not claiming building is always right, but rather that it was right for their specific requirements.

The most important takeaway they emphasize is the importance of defensible, actionable success metrics. These metrics proved instrumental in persuading themselves of the bot's value and establishing a feedback loop for improvement over time. This is mature LLMOps thinking—without clear metrics tied to business outcomes, it's difficult to justify investment or know where to improve.

An interesting secondary benefit they discovered: Otter Assistant exposed multiple product and platform issues previously undetected in their systems. The high-fidelity conversational feedback generated by the bot has been incorporated into their product strategy alongside traditional sources like user interviews and competitive analysis. This demonstrates how LLM applications can serve as diagnostic tools for broader system health.

## Current State and Future Directions

After approximately one year of development (Q1 2024 through publication in July 2025), Otter Assistant resolves roughly half of support requests autonomously without compromising customer satisfaction. The team indicates they will share more about prompt engineering lessons and best practices for designing and structuring functions in future posts.

Importantly, they acknowledge hitting limitations: "in certain scenarios, we have started to hit limitations on how much we can improve without more fundamental improvements on the LLMs." This is an honest acknowledgment of the constraints of current LLM capabilities. They're exploring more efficient feedback loop mechanisms so the bot can self-sufficiently become smarter over time, which suggests interest in fine-tuning or reinforcement learning from human feedback approaches, though they don't specify details. Looking ahead, they view this as just the beginning of a new era for product design and development, believing agentic chatbots can hugely elevate customer experience, with support requests being just a starting point.

## Critical Assessment

While this case study provides valuable technical detail, there are areas where its claims should be evaluated critically. The "~50% of inbound customer requests" metric is impressive, but we don't know the full context: What percentage of these are simple queries that traditional FAQ systems might have handled? What's the distribution of issue complexity in the 50% that's automated versus the 50% that isn't? The claim that this is achieved "without compromising customer satisfaction" is not substantiated with specific satisfaction metrics or comparisons to pre-bot baselines.

The build-versus-buy analysis, while thorough for Q1 2024, may not reflect the rapidly evolving vendor landscape. By late 2025, many of the vendors they dismissed might have significantly improved their LLM-native offerings. However, their point about deep integration requirements remains valid—custom business logic and permission models are difficult for external vendors to accommodate without significant customization.

The testing framework using LLM-as-judge is innovative but has known limitations not discussed in the case study. LLM judges can be inconsistent, may not catch subtle issues, and can be expensive to run at scale. The team doesn't discuss how they validate the judge's assessments or handle cases where the judge's evaluation is questionable.

The emphasis on widgets for confirmation is excellent risk management, but it does raise questions about fully autonomous operation. If most write operations require user confirmation, is this truly autonomous support or assisted support? The distinction matters for understanding the actual level of automation achieved.

Despite these caveats, this case study represents solid LLMOps practice: thoughtful architecture decisions, custom tooling for the unique challenges of LLM systems, integration with existing infrastructure, careful attention to security and permissions, and realistic acknowledgment of limitations. The progression from prototype to production over approximately one year with measurable business impact demonstrates successful LLM deployment at scale.
