Boundary: Building Multilingual AI Agents with Translation Pipelines

Overview

This case study from Boundary presents a comprehensive approach to building multilingual AI agents that work reliably in production environments. The team demonstrates through both architectural discussion and live coding how to properly handle multiple languages in AI systems. The fundamental insight is that most large language models, despite having multilingual capabilities, cannot reliably respond in non-English languages when the majority of the steering logic in an AI pipeline is in English.

The Core Problem

The central challenge in multilingual AI applications stems from what the team calls the “burden” or “steering” ratio. In a production AI application, unlike a general-purpose chatbot like ChatGPT, the majority of the context consists of system prompts, instructions, tool definitions, and business logic written by developers rather than user input. When building bespoke AI applications, developers intentionally shift the burden from the user to the system to guarantee better outcomes. This means extensive English-language prompts, tool definitions, and instructions dominate the context window.

The practical consequence is that even if a user submits input in French, Hindi, or any other language, the overwhelming presence of English in the prompt makes it statistically likely that the model will respond in English. The team demonstrates this empirically by building a travel booking agent where all the tooling and instructions are in English, showing that even when a user writes in Hindi mixed with English, the system responds in English despite using a very capable model.
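The steering ratio can be made concrete with a rough measurement. The sketch below is illustrative only: the function name `steering_ratio`, the sample prompt strings, and the use of character counts as a stand-in for tokens are all assumptions, not something the team showed.

```python
def steering_ratio(developer_context: list[str], user_input: str) -> float:
    """Fraction of the context (by character count) written by developers.

    Character count is a crude proxy for tokens; a real measurement would
    use the model's tokenizer.
    """
    dev_chars = sum(len(part) for part in developer_context)
    total = dev_chars + len(user_input)
    return dev_chars / total if total else 0.0


# Hypothetical context for a travel booking agent.
system_prompt = "You are a travel booking assistant. " * 40
tool_definitions = "search_flights(origin, destination, date): ... " * 20
user_message = "mujhe kal Delhi se Mumbai ki flight chahiye"

ratio = steering_ratio([system_prompt, tool_definitions], user_message)
print(f"{ratio:.1%} of the context is developer-written English")
```

Even with a non-English user message, nearly all of the context in this made-up example is English, which is the imbalance the team describes.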

The Naive Approaches and Their Problems

The team explores several approaches to solving this problem and explains the trade-offs of each:

Approach 1: Parallel Language-Specific Pipelines

One solution is to build completely separate pipelines for each language. This involves creating a classifier at the beginning that routes to language-specific pipelines where all prompts, tool definitions, and instructions are translated into the target language. A French user, for example, would hit a French pipeline where everything is in French.

The problems with this approach are severe from an LLMOps perspective:

The evaluation matrix grows with every language added, because each language needs its own comprehensive evals
You must maintain prompts and tool definitions in every language you support
Translation agencies or multilingual teams are required to maintain these translation files
Teams typically only speak one or two languages, making it nearly impossible to improve pipelines in languages they don’t understand
Updates to the English pipeline don’t automatically translate to improvements in other language pipelines, creating a lag where non-English users receive an inferior experience
It creates a situation analogous to software that has poor Windows support because the team primarily uses macOS

Approach 2: Generic Multilingual Instruction

Another naive approach is to add a simple instruction like “respond in the user’s preferred language” to your existing English pipeline. The problem is that this relies on two things at once: the model’s ability to understand and respond coherently in the target language, and its ability to follow the instruction. Instruction following is a more recently developed capability and is inherently less reliable than the model’s core language generation abilities. Because both must succeed, the overall success rate drops from potentially 99% to something lower. At production volume that matters: the difference between 99% and 99.99% accuracy is the difference between software that feels reliable and software that feels constantly broken.
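A small arithmetic sketch shows why stacking two requirements hurts. The numbers are hypothetical, and treating the two abilities as independent is a simplifying assumption; the function names are invented for illustration.

```python
def combined_success(p_language: float, p_instruction: float) -> float:
    """Success rate when both steps must succeed (assumes independence)."""
    return p_language * p_instruction


def expected_failures(n_requests: int, success_rate: float) -> float:
    """Expected number of failed requests out of n_requests."""
    return n_requests * (1 - success_rate)


# Hypothetical rates: 99% for language generation, 95% for instruction following.
both = combined_success(0.99, 0.95)
print(f"combined success rate: {both:.2%}")

# The same volume at two accuracy levels: about 100 failures versus about 1.
for rate in (0.99, 0.9999):
    print(f"{rate:.2%} accurate: {expected_failures(10_000, rate):.0f} failures per 10,000 requests")
```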

The Recommended Solution: Translation Boundaries

The team’s recommended approach draws a direct parallel to how production voice agents are architected. In voice systems, the industry standard is not to use direct audio-to-audio models but rather to use a pipeline: audio to speech-to-text, then to an LLM, then text-to-speech back to audio. This works far better than end-to-end audio models despite being more complex.

The same principle applies to multilingual systems. The architecture consists of:

Input Translation Layer

User input comes in whatever language they prefer
A small, fast model performs translation to pure English
Optionally, an intent capture or metadata extraction step runs in parallel
The original message is preserved alongside the translation
The detected source language is stored for later use

The team emphasizes using small models like Anthropic’s Haiku for these translation tasks because they’re performing a narrow, well-defined function that doesn’t require the reasoning capabilities of larger models.
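The input layer described above can be sketched as a small data structure plus one function. This is a minimal illustration, not the team's BAML code: `TranslatedInput`, `translate_input`, and `fake_translator` are hypothetical names, and the translator is injected as a callable so that any small model could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable

# A translator takes the raw message and returns (english_text, detected_language).
Translator = Callable[[str], tuple[str, str]]


@dataclass(frozen=True)
class TranslatedInput:
    original: str         # preserved alongside the translation
    english: str          # what the English pipeline will consume
    source_language: str  # stored for the output translation step


def translate_input(message: str, translator: Translator) -> TranslatedInput:
    english, language = translator(message)
    return TranslatedInput(original=message, english=english, source_language=language)


def fake_translator(text: str) -> tuple[str, str]:
    """Stand-in for a small-model call; returns a canned translation."""
    return ("I need a flight from Delhi to Mumbai tomorrow", "Hindi/English mix")


result = translate_input("mujhe kal Delhi se Mumbai ki flight chahiye", fake_translator)
print(result)
```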

Core English Pipeline

All business logic, tool definitions, and prompts remain in English
This is where comprehensive evaluation and testing occur
The team can iterate quickly because they fully understand the language
Both the translated message and optionally the original message are provided as context
Tool calls and complex reasoning happen here with confidence

Output Translation Layer

The English response from the agent is translated back to the user’s language
The translation can match the user’s original tone and style
If the user was mixing languages (like Hindi and English together, common in India), the system can mimic that conversational flow
The translation has access to both the original user message and the English response
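Putting the three layers together, the flow might look like the sketch below. All function names are hypothetical and the three stages are stubs; in a real system each would wrap a model call. The point is only the shape: the English agent sees both the translation and the original, and the output step receives the English reply, the original message, and the detected language.

```python
from typing import Callable


def handle_message(
    message: str,
    translate_in: Callable[[str], tuple[str, str]],
    english_agent: Callable[[str, str], str],
    translate_out: Callable[[str, str, str], str],
) -> str:
    # Input boundary: user language -> English, remembering the source language.
    english, language = translate_in(message)
    # Core pipeline: all prompts and tools stay in English.
    english_reply = english_agent(english, message)
    # Output boundary: English -> user language, with the original for tone matching.
    return translate_out(english_reply, message, language)


# Stubs standing in for model calls.
def stub_translate_in(message: str) -> tuple[str, str]:
    return ("I need a flight from Delhi to Mumbai tomorrow", "Hinglish")


def stub_agent(english: str, original: str) -> str:
    return "I found 3 flights from Delhi to Mumbai tomorrow."


def stub_translate_out(english_reply: str, original: str, language: str) -> str:
    return f"[{language}] {english_reply}"


reply = handle_message(
    "mujhe kal Delhi se Mumbai ki flight chahiye",
    stub_translate_in,
    stub_agent,
    stub_translate_out,
)
print(reply)
```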

Performance Optimizations

The team discusses several important optimizations for production deployment:

Fast-Path Classification

For applications where most users speak English, adding a translation layer to every request introduces unnecessary latency. The solution is to implement a fast heuristic classifier that checks if the input contains common English words above a certain threshold. If it does, skip the translation layers entirely and go straight to the English pipeline. This is a simple string-matching operation that’s much faster than any model call.
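A minimal sketch of such a heuristic follows. The word list, the 0.5 threshold, and the name `looks_english` are assumptions chosen for illustration; a production list and threshold would need tuning against real traffic.

```python
import re

# Small illustrative list; a real one would be larger and tuned on real traffic.
COMMON_ENGLISH_WORDS = frozenset({
    "the", "a", "an", "is", "are", "to", "of", "and", "in", "for", "i", "you",
    "it", "on", "from", "need", "want", "please", "my", "me", "can", "with",
    "this", "that",
})


def looks_english(text: str, threshold: float = 0.5) -> bool:
    """True when the share of common English words meets the threshold."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        # No Latin-script words at all (e.g. Devanagari input): not English.
        return False
    hits = sum(1 for word in words if word in COMMON_ENGLISH_WORDS)
    return hits / len(words) >= threshold


for sample in (
    "I need a flight from Delhi to Mumbai for my trip",
    "mujhe kal Delhi se Mumbai ki flight chahiye",
):
    route = "english fast path" if looks_english(sample) else "translation pipeline"
    print(f"{route}: {sample}")
```

Romanized Hindi shares the Latin alphabet with English, which is why the check counts known words rather than inspecting the script alone.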

Parallel Processing

The translation and intent capture steps at the input boundary can run in parallel since they’re independent operations. This reduces the latency overhead of the multilingual pipeline.
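One way to run the two steps concurrently in Python is `asyncio.gather`. In this sketch the two coroutines are hypothetical stand-ins that sleep instead of calling a model; only the concurrency pattern is the point.

```python
import asyncio


async def translate_to_english(message: str) -> str:
    await asyncio.sleep(0.05)  # stands in for a translation model call
    return "I need a flight from Delhi to Mumbai tomorrow"


async def capture_intent(message: str) -> str:
    await asyncio.sleep(0.05)  # stands in for an intent-extraction model call
    return "book_flight"


async def process_input(message: str) -> tuple[str, str]:
    # Both calls depend only on the raw message, so they can overlap.
    english, intent = await asyncio.gather(
        translate_to_english(message),
        capture_intent(message),
    )
    return english, intent


result = asyncio.run(process_input("mujhe kal Delhi se Mumbai ki flight chahiye"))
print(result)
```

With this pattern the added latency is roughly that of the slower call rather than the sum of both.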

Model Selection Strategy

The team advocates strongly for using the smartest, most capable models initially and only optimizing for cost when token spend becomes significant. The advice is to throw tokens at the problem until it becomes more expensive than engineering time, then invest in optimization. This is a common theme in their LLMOps philosophy.

Live Demonstration and Empirical Validation

The team performs a live coding demonstration where they use Claude (Anthropic’s AI assistant) to generate a complete multilingual travel booking agent. The demonstration is particularly valuable because it shows empirical proof of the concepts:

The Baseline Failure

They first test the agent with the translation pipeline disabled, forcing user input directly into the English pipeline. Even with a very capable model and a Hindi/English mixed input, the system responds in English. This demonstrates that a model’s multilingual capabilities alone are not enough when the surrounding context is overwhelmingly English.

The Working Solution

With the translation pipeline enabled, the same input produces a response in the appropriate language, matching the user’s original linguistic style.

Architectural Decisions Revealed

The live coding reveals several important implementation details:

Output field naming matters significantly: naming a field “reply_english” will bias the model to respond in English regardless of instructions
The schema itself acts as part of the prompt and can leak unintended steering signals
Testing frameworks should be built into the development workflow from the start
The team uses BAML (their domain-specific language for agents) which has built-in testing support
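Because field names reach the model as part of the schema, a simple lint over them can catch accidental language hints. The check below is a hypothetical sketch, not a BAML feature: `LANGUAGE_TOKENS` and `biased_field_names` are invented names, and the token list is deliberately short.

```python
# Language names that should not appear in output field names (illustrative list).
LANGUAGE_TOKENS = {"english", "french", "hindi", "spanish", "chinese"}


def biased_field_names(field_names: list[str]) -> list[str]:
    """Return field names containing a language name as an underscore-separated part."""
    flagged = []
    for name in field_names:
        parts = set(name.lower().split("_"))
        if parts & LANGUAGE_TOKENS:
            flagged.append(name)
    return flagged


print(biased_field_names(["reply_english", "tool_calls"]))   # flags reply_english
print(biased_field_names(["reply", "detected_language"]))    # nothing flagged
```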

Evaluation Strategy and Trade-offs

The case study is refreshingly honest about the limitations of the approach:

What Can Be Evaluated

The English pipeline can have comprehensive evals because the team understands English
Translation quality from user language to English can be evaluated if you hire native speakers or translation agencies
Translation quality from English back to user language can similarly be evaluated with proper resources

What Requires Trust or Investment

For languages the team doesn’t speak, there are only two paths:

Accept the translation models at face value and trust they work within acceptable error tolerances for the narrow task of translation
Invest in hiring multilingual teams and building comprehensive evaluation datasets for every supported language

The team notes that for high-stakes domains like medical applications, the investment in comprehensive multilingual evaluation may be necessary. For other applications, trusting capable models on the narrow translation tasks may be acceptable.

Production Patterns and Best Practices

Several production-oriented patterns emerge from the discussion:

Separation of Concerns

By isolating translation to specific boundary layers, the core business logic remains unchanged. This makes the system more maintainable and allows language support to be added or modified without touching the core agent logic.

Progressive Enhancement

The team suggests tracking which languages appear in the miscellaneous pipeline (the catch-all for languages without dedicated fast-paths) and automatically creating optimized paths for languages that reach certain usage thresholds.
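A usage tracker of this kind could be as small as a counter with a threshold. The class name, method names, and threshold below are assumptions for illustration; the team did not specify how promotion would be triggered.

```python
from collections import Counter


class LanguageUsageTracker:
    """Counts languages seen in the catch-all pipeline and flags heavy hitters."""

    def __init__(self, promotion_threshold: int) -> None:
        self.promotion_threshold = promotion_threshold
        self.counts: Counter[str] = Counter()

    def record(self, language: str) -> None:
        self.counts[language] += 1

    def candidates(self) -> list[str]:
        """Languages seen often enough to justify a dedicated fast path."""
        return sorted(
            language
            for language, count in self.counts.items()
            if count >= self.promotion_threshold
        )


tracker = LanguageUsageTracker(promotion_threshold=3)
for language in ["hi", "fr", "hi", "de", "hi", "fr"]:
    tracker.record(language)
print(tracker.candidates())
```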

Latency Optimization

The architecture allows for sophisticated latency optimizations:

Fast-path classification for dominant languages
Parallel execution of independent steps
Caching of common translations
Model selection per pipeline component based on task complexity

Testing Philosophy

The team strongly advocates for automated testing being part of the default workflow. When Claude generates the code, it also generates tests and runs them automatically. This creates internal pressure to maintain test coverage and catch regressions quickly.

Technical Stack and Tools

The demonstration uses several specific technologies:

Anthropic’s Claude (Sonnet and Haiku models) for the agent logic and translations
BAML as the programming language for defining agents
Python with uv for package management
Structured outputs for reliable parsing
FastAPI or similar for the web service layer

Comparison to Voice Agents

The team draws an extended analogy to voice agents, which face a similar architectural decision. The industry has settled on speech-to-text → LLM → text-to-speech rather than end-to-end speech models, despite the latter seeming simpler. The reason is reliability and accuracy. The same principle applies here: translation → English pipeline → translation produces more reliable results than hoping a multilingual model will handle everything end-to-end.

Kwindla Hultman Kramer from Daily is cited as someone who regularly discusses this pattern in the voice agent space, noting that while they wish end-to-end speech models worked better, the pipeline approach is simply more reliable in production.

Language Model Training Data Implications

The team discusses why English and Chinese likely produce the best results: these represent the two largest sources of training data on the internet. This has practical implications for which languages might work best in the miscellaneous pipeline without dedicated optimization. They note that French might be improving due to Mistral’s focus on that language, but it likely doesn’t have the training data volume to match English or Chinese.

Cost and Engineering Trade-offs

A recurring theme is the balance between token costs and engineering time. The team advocates for:

Using the most capable models initially, even if expensive
Only optimizing when token costs exceed engineering costs
Measuring actual usage patterns before building optimizations
Building evals that give confidence to use smaller models where appropriate

This pragmatic approach acknowledges that premature optimization often wastes more engineering time than it saves in compute costs, especially early in a product’s lifecycle.
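The trade-off reduces to a break-even calculation. The figures below are hypothetical and the function name is invented; the sketch only shows how the same optimization can be a poor investment at low token spend and a good one at high spend.

```python
def months_to_break_even(
    engineering_cost: float,
    monthly_spend_now: float,
    monthly_spend_after: float,
) -> float:
    """Months until token savings repay the engineering effort."""
    monthly_savings = monthly_spend_now - monthly_spend_after
    if monthly_savings <= 0:
        return float("inf")  # the optimization never pays for itself
    return engineering_cost / monthly_savings


# Hypothetical: the optimization costs 20,000 in engineering time and cuts spend by 80%.
early = months_to_break_even(20_000, monthly_spend_now=500, monthly_spend_after=100)
later = months_to_break_even(20_000, monthly_spend_now=5_000, monthly_spend_after=1_000)
print(f"early-stage product: {early:.0f} months to break even")
print(f"at 10x the traffic:  {later:.0f} months to break even")
```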

Lessons for Production LLM Systems

The case study reinforces several key principles for production LLM operations:

Reliability Over Simplicity

The translation pipeline approach is more complex than just adding a multilingual instruction, but it produces dramatically more reliable results. In production, reliability trumps architectural simplicity.

Understand Your Steering Ratio

The amount of developer-provided context versus user-provided context fundamentally affects how models behave. Applications with high steering ratios require different architectures than general-purpose chatbots.

Small Models for Narrow Tasks

Translation and classification are narrow, well-defined tasks that don’t require the largest models. Using smaller models for these boundary operations reduces cost and latency.

Maintainability Matters

Maintaining separate pipelines for each language creates an unsustainable maintenance burden. The translation boundary approach keeps all core logic in one place.

Empirical Validation Is Essential

The team doesn’t just theorize about the approach: they build it live and demonstrate empirically that the naive approach fails while their solution works. This kind of validation is crucial in LLMOps.

The case study provides a comprehensive, practical guide to building multilingual AI systems that actually work in production, backed by both architectural reasoning and empirical demonstration.
