The case study demonstrates how to build production-ready multilingual AI agents that serve users speaking different languages. The core problem is that when AI pipelines are designed primarily in English with extensive prompts, tool definitions, and business logic, they tend to produce English responses even when users interact in other languages. The solution involves building a translation pipeline that normalizes user input to English, processes it through a well-evaluated English pipeline, and then translates the response back to the user's original language while matching their tone. This approach is demonstrated through a live-coded travel booking agent, showing that even the smartest models fail to respond reliably in non-English languages without proper pipeline architecture, but succeed when proper translation boundaries are implemented.
This case study from Boundary presents a comprehensive approach to building multilingual AI agents that work reliably in production environments. The team demonstrates through both architectural discussion and live coding how to properly handle multiple languages in AI systems. The fundamental insight is that most large language models, despite having multilingual capabilities, cannot reliably respond in non-English languages when the majority of the steering logic in an AI pipeline is in English.
The central challenge in multilingual AI applications stems from what the team calls the “burden” or “steering” ratio. In a production AI application, unlike a general-purpose chatbot like ChatGPT, the majority of the context consists of system prompts, instructions, tool definitions, and business logic written by developers rather than user input. When building bespoke AI applications, developers intentionally shift the burden from the user to the system to guarantee better outcomes. This means extensive English-language prompts, tool definitions, and instructions dominate the context window.
The practical consequence is that even if a user submits input in French, Hindi, or any other language, the overwhelming presence of English in the prompt makes it statistically likely that the model will respond in English. The team demonstrates this empirically by building a travel booking agent where all the tooling and instructions are in English, showing that even when a user writes in Hindi mixed with English, the system responds in English despite using a very capable model.
The team explores several approaches to solving this problem and explains the trade-offs of each:
One solution is to build completely separate pipelines for each language. This involves creating a classifier at the beginning that routes to language-specific pipelines where all prompts, tool definitions, and instructions are translated into the target language. For a French user, they would hit a French pipeline where everything is in French.
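The problems with this approach are severe from an LLMOps perspective: every prompt, tool definition, and instruction has to be maintained once per language, which creates an unsustainable maintenance burden as languages are added.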
Another naive approach is to add a simple instruction like “respond in the user’s preferred language” to your existing English pipeline. The problem is that this relies on both the model’s ability to understand and respond coherently in the target language and its ability to follow instructions. Instruction following is a more recently developed capability and is inherently less reliable than the model’s core language generation abilities. Requiring both to succeed lowers the overall success rate, from potentially 99% to something lower. At production volume these gaps matter: the difference between 99% and 99.99% accuracy is the difference between software that feels reliable and software that feels constantly broken.
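To make the reliability argument concrete, here is a small arithmetic sketch. It assumes the two capabilities fail independently, which is a simplification of ours rather than something the team states, and the function names are hypothetical.

```python
def combined_success(language_rate: float, instruction_rate: float) -> float:
    """Success rate when both capabilities must succeed.

    Assumes the two failure modes are independent (an illustrative simplification).
    """
    return language_rate * instruction_rate


def expected_failures(success_rate: float, requests: int) -> float:
    """Expected number of failed responses out of `requests`."""
    return requests * (1.0 - success_rate)
```

Under these assumptions, 10,000 requests at 99% accuracy yield roughly 100 bad responses, while 99.99% yields roughly 1. If instruction following succeeded, say, 95% of the time (an invented figure), `combined_success(0.99, 0.95)` would drop the overall rate to about 94%.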
The team’s recommended approach draws a direct parallel to how production voice agents are architected. In voice systems, the industry standard is not to use direct audio-to-audio models but rather to use a pipeline: audio to speech-to-text, then to an LLM, then text-to-speech back to audio. This works far better than end-to-end audio models despite being more complex.
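The same principle applies to multilingual systems. The architecture consists of three stages: an input boundary that translates the user's message into English, the existing well-evaluated English pipeline, and an output boundary that translates the response back into the user's original language while matching their tone.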
The team emphasizes using small models like Anthropic’s Haiku for these translation tasks because they’re performing a narrow, well-defined function that doesn’t require the reasoning capabilities of larger models.
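The translation boundary can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: `BoundaryPipeline`, the `SmallModel` signature, and the prompt wording are all hypothetical, and the model call is injected as a plain callable so that the sketch does not assume any particular SDK.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signature for any small-model call: (instruction, text) -> text.
SmallModel = Callable[[str, str], str]


@dataclass
class BoundaryPipeline:
    small_model: SmallModel                 # e.g. a Haiku-class model behind a thin wrapper
    english_pipeline: Callable[[str], str]  # the existing, well-evaluated English agent

    def run(self, user_message: str) -> str:
        # Input boundary: normalize whatever the user wrote into English.
        english_input = self.small_model(
            "Translate the following message into English. Return only the translation.",
            user_message,
        )
        # Core: the English pipeline only ever sees English text.
        english_reply = self.english_pipeline(english_input)
        # Output boundary: translate back, passing the original message as a
        # reference for language and tone.
        return self.small_model(
            "Rewrite the reply in the language and tone of this user message:\n"
            + user_message,
            english_reply,
        )
```

Because the core agent is passed in as `english_pipeline`, its prompts, tools, and evaluations stay untouched when a language is added; only the two boundary calls change.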
The team discusses several important optimizations for production deployment:
For applications where most users speak English, adding a translation layer to every request introduces unnecessary latency. The solution is to implement a fast heuristic classifier that checks if the input contains common English words above a certain threshold. If it does, skip the translation layers entirely and go straight to the English pipeline. This is a simple string-matching operation that’s much faster than any model call.
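A fast-path check along these lines might look like the sketch below. The word list and the 0.5 threshold are illustrative assumptions, not values from the demonstration, and would need tuning against real traffic.

```python
# Illustrative word list and threshold; both are assumptions to tune on real traffic.
COMMON_ENGLISH_WORDS = frozenset({
    "the", "a", "an", "and", "or", "to", "of", "in", "on", "for", "is", "are",
    "i", "you", "we", "it", "this", "that", "my", "me", "want", "need", "can",
    "please", "from", "with", "what", "when", "how", "book", "flight",
})
ENGLISH_THRESHOLD = 0.5


def looks_english(text: str, threshold: float = ENGLISH_THRESHOLD) -> bool:
    """Return True when enough tokens are common English words to skip translation."""
    # Split on whitespace and trim punctuation so non-Latin tokens still count
    # toward the total and pull the ratio down for mixed-language input.
    tokens = [token.strip(".,!?;:\"'()") for token in text.lower().split()]
    tokens = [token for token in tokens if token]
    if not tokens:
        return False  # nothing to judge, so take the translation path
    hits = sum(1 for token in tokens if token in COMMON_ENGLISH_WORDS)
    return hits / len(tokens) >= threshold
```

A request would go straight to the English pipeline when `looks_english(message)` is true and through the translation boundary otherwise. Erring toward the translation path on ambiguous input costs latency but not correctness.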
The translation and intent capture steps at the input boundary can run in parallel since they’re independent operations. This reduces the latency overhead of the multilingual pipeline.
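One way to run the two input-boundary steps concurrently is with `asyncio.gather`. In this sketch the step functions are hypothetical placeholders for whatever async model calls the application uses.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical shape of an async boundary step: takes the raw message, returns text.
AsyncStep = Callable[[str], Awaitable[str]]


async def run_input_boundary(
    user_message: str,
    translate_to_english: AsyncStep,
    capture_intent: AsyncStep,
) -> tuple[str, str]:
    """Run both input-boundary steps concurrently and return (english_text, intent)."""
    # Both steps read only the raw user message, so neither waits on the other.
    english_text, intent = await asyncio.gather(
        translate_to_english(user_message),
        capture_intent(user_message),
    )
    return english_text, intent
```

With this shape the added latency is roughly that of the slower of the two calls rather than their sum.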
The team advocates strongly for using the smartest, most capable models initially and only optimizing for cost when token spend becomes significant. The advice is to throw tokens at the problem until it becomes more expensive than engineering time, then invest in optimization. This is a common theme in their LLMOps philosophy.
The team performs a live coding demonstration where they use Claude (Anthropic’s AI assistant) to generate a complete multilingual travel booking agent. The demonstration is particularly valuable because it shows empirical proof of the concepts:
They first test the agent with the translation pipeline disabled, forcing user input directly into the English pipeline. Even with a very capable model and a Hindi/English mixed input, the system responds in English. This shows that a model's multilingual capabilities alone are insufficient when English dominates the context.
With the translation pipeline enabled, the same input produces a response in the appropriate language, matching the user’s original linguistic style.
The live coding reveals several important implementation details:
The case study is refreshingly honest about the limitations of the approach:
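For languages the team doesn’t speak, there are only two paths: invest in comprehensive multilingual evaluation, or trust capable models to handle the narrow translation task.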
The team notes that for high-stakes domains like medical applications, the investment in comprehensive multilingual evaluation may be necessary. For other applications, trusting capable models on the narrow translation tasks may be acceptable.
Several production-oriented patterns emerge from the discussion:
By isolating translation to specific boundary layers, the core business logic remains unchanged. This makes the system more maintainable and allows language support to be added or modified without touching the core agent logic.
The team suggests tracking which languages appear in the miscellaneous pipeline (the catch-all for languages without dedicated fast-paths) and automatically creating optimized paths for languages that reach certain usage thresholds.
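A usage tracker for the catch-all pipeline could be as simple as the sketch below. The class name, the language-code keys, and the default threshold are assumptions for illustration; how a promoted language actually gets its optimized path is left to the application.

```python
from collections import Counter


class LanguageUsageTracker:
    """Counts catch-all requests per language and flags languages worth promoting."""

    def __init__(self, promotion_threshold: int = 1000) -> None:
        self.promotion_threshold = promotion_threshold  # assumed value; tune per product
        self.counts = Counter()
        self.promoted = set()

    def record(self, language_code: str) -> bool:
        """Record one request; return True the first time a language crosses the threshold."""
        self.counts[language_code] += 1
        if (
            language_code not in self.promoted
            and self.counts[language_code] >= self.promotion_threshold
        ):
            self.promoted.add(language_code)
            return True
        return False
```

The boolean return lets the caller trigger whatever follow-up it wants (a ticket, an alert, or an automated job) exactly once per language.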
The architecture allows for latency optimizations such as skipping translation entirely for English input and running the input-boundary steps in parallel.
The team strongly advocates for automated testing being part of the default workflow. When Claude generates the code, it also generates tests and runs them automatically. This creates internal pressure to maintain test coverage and catch regressions quickly.
The demonstration uses several specific technologies:
The team draws an extended analogy to voice agents, which face a similar architectural decision. The industry has settled on speech-to-text → LLM → text-to-speech rather than end-to-end speech models, despite the latter seeming simpler. The reason is reliability and accuracy. The same principle applies here: translation → English pipeline → translation produces more reliable results than hoping a multilingual model will handle everything end-to-end.
Kwindla Hultman Kramer from Daily is cited as someone who regularly discusses this pattern in the voice agent space, noting that while they wish end-to-end speech models worked better, the pipeline approach is simply more reliable in production.
The team discusses why English and Chinese likely produce the best results: these represent the two largest sources of training data on the internet. This has practical implications for which languages might work best in the miscellaneous pipeline without dedicated optimization. They note that French might be improving due to Mistral’s focus on that language, but it likely doesn’t have the training data volume to match English or Chinese.
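A recurring theme is the balance between token costs and engineering time. The team advocates for starting with the most capable models and investing in optimization only once token spend becomes more expensive than the engineering time needed to reduce it.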
This pragmatic approach acknowledges that premature optimization often wastes more engineering time than it saves in compute costs, especially early in a product’s lifecycle.
The case study reinforces several key principles for production LLM operations:
The translation pipeline approach is more complex than just adding a multilingual instruction, but it produces dramatically more reliable results. In production, reliability trumps architectural simplicity.
The amount of developer-provided context versus user-provided context fundamentally affects how models behave. Applications with high steering ratios require different architectures than general-purpose chatbots.
Translation and classification are narrow, well-defined tasks that don’t require the largest models. Using smaller models for these boundary operations reduces cost and latency.
Maintaining separate pipelines for each language creates an unsustainable maintenance burden. The translation boundary approach keeps all core logic in one place.
The team doesn’t just theorize about the approach; they build it live and demonstrate empirically that the naive approach fails while their solution works. This kind of validation is crucial in LLMOps.
The case study provides a comprehensive, practical guide to building multilingual AI systems that actually work in production, backed by both architectural reasoning and empirical demonstration.