Company: Roblox
Title: Real-Time Multilingual Chat Translation at Scale
Industry: Media & Entertainment
Year: 2024
Summary (short):
Roblox deployed a unified transformer-based translation LLM to enable real-time chat translation across all combinations of 16 supported languages for over 70 million daily active users. The company built a custom ~1 billion parameter model using pretraining on open source and proprietary data, then distilled it down to fewer than 650 million parameters to achieve approximately 100 millisecond latency while handling over 5,000 chats per second. The solution leverages a mixture-of-experts architecture, custom translation quality estimation models, back translation techniques for low-resource language pairs, and comprehensive integration with trust and safety systems to deliver contextually appropriate translations that understand Roblox-specific slang and terminology.
## Overview

Roblox, a global gaming and social platform with over 70 million daily active users across more than 15 million active experiences, faced the challenge of enabling seamless communication between users speaking different languages. The company deployed a production-scale, real-time multilingual translation system that supports all combinations of 16 languages with approximately 100-millisecond latency while handling over 5,000 chats per second. This represents a sophisticated LLMOps implementation that required careful balancing of model accuracy, inference speed, resource efficiency, and safety considerations at massive scale.

The case study provides insight into Roblox's approach to building and deploying a custom translation LLM that outperforms commercial translation APIs on Roblox-specific content. The company emphasizes that their goal was to go beyond translating static content to automatically translating real-time user interactions, which presented unique technical challenges around latency, context awareness, and platform-specific language understanding.

## Technical Architecture and Model Design

At the core of Roblox's solution is a unified, transformer-based translation LLM that handles all language pairs in a single model rather than building 256 separate models (16 x 16 language pairs). The architecture employs a mixture-of-experts approach where different "experts" specialize in groups of similar languages, activated dynamically based on the source sentence and target language. This architectural choice provides several key advantages for production deployment: better resource utilization since each expert has different specialties, more efficient training and inference without sacrificing translation quality, and the ability to leverage linguistic similarities between related languages like Spanish and Portuguese during training.

The initial model was trained with approximately 1 billion parameters, which the team recognized would be prohibitively resource-intensive for real-time serving at their required scale. To address this operational constraint, Roblox applied a student-teacher distillation approach combined with quantization, model compilation, and other serving optimizations to reduce the model to fewer than 650 million parameters while improving serving efficiency. This represents a critical LLMOps tradeoff between model capacity and operational requirements: achieving the 100-millisecond latency target necessary for natural conversation flow required aggressive optimization while maintaining translation quality.

## Training Data and Model Development

The training pipeline demonstrates sophisticated approaches to addressing common production ML challenges. Roblox pretrained on available open-source translation data, supplemented with their own in-experience translation data, human-labeled chat translation results, and common chat sentences and phrases specific to their platform. This combination of general-purpose and domain-specific data is crucial for building models that perform well on platform-specific language patterns.

A particularly interesting aspect is how Roblox addressed the challenge of less common translation pairs like French to Thai, where high-quality parallel training data is scarce. The team applied iterative back translation, in which content is translated back into the original language and compared to the source text for accuracy.
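To make this concrete, the sketch below shows one minimal way such a round-trip filter could look in Python. It is an illustration under stated assumptions, not Roblox's actual pipeline: the `TranslationModel` interface, the language codes, and the use of a simple surface-similarity check in place of a learned adequacy metric are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


class TranslationModel:
    """Hypothetical interface standing in for the unified translation LLM."""

    def translate(self, text: str, src: str, tgt: str) -> str:
        raise NotImplementedError


@dataclass
class SyntheticPair:
    source: str             # original monolingual sentence (e.g., French)
    target: str             # model-generated translation (e.g., Thai)
    roundtrip_score: float  # similarity of the back-translation to the original


def back_translate(
    model: TranslationModel,
    monolingual: list[str],
    src: str = "fr",
    tgt: str = "th",
    min_score: float = 0.8,
) -> list[SyntheticPair]:
    """Generate synthetic parallel data for a low-resource pair (e.g., fr -> th).

    Each sentence is translated to the target language, translated back to the
    source language, and kept only if the round trip stays close to the original.
    """
    kept: list[SyntheticPair] = []
    for sentence in monolingual:
        forward = model.translate(sentence, src=src, tgt=tgt)
        backward = model.translate(forward, src=tgt, tgt=src)
        # Cheap surface similarity as a stand-in for a learned adequacy metric.
        score = SequenceMatcher(None, sentence.lower(), backward.lower()).ratio()
        if score >= min_score:
            kept.append(SyntheticPair(sentence, forward, score))
    return kept
```

In a real pipeline the `SequenceMatcher` heuristic would presumably be replaced with a proper adequacy or quality estimation model, but the shape of the loop stays the same.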
During training, the team strategically mixed this back-translated synthetic data with supervised labeled data to expand the available training corpus for underrepresented language pairs, a pragmatic response to the data scarcity problem that many production ML teams face.

To handle modern slang and platform-specific terminology like "obby," "afk," and "lol," Roblox employed human evaluators to translate popular and trending terms for each language and incorporated these translations into the training data. The team plans to repeat this process regularly to keep the system current, which reflects the ongoing maintenance requirements of production LLM systems. The model's robustness is evident in its ability to detect the correct source language even when it is not explicitly set or is set incorrectly, and to handle mixed-language inputs with reasonable accuracy.

## Custom Translation Quality Evaluation

One of the most sophisticated aspects of Roblox's LLMOps implementation is their custom translation quality estimation system. Most off-the-shelf translation quality metrics compare AI translations to ground-truth references and focus primarily on understandability. Roblox wanted to assess translation quality without requiring reference translations, which would be impractical at their scale. They built a custom ML model, trained on human-labeled error types and scores, that evaluates multiple dimensions: accuracy (checking for additions, omissions, or mistranslations), fluency (punctuation, spelling, and grammar), and incorrect references (discrepancies with the rest of the text).

The quality estimation model is built by fine-tuning a multilingual language model to predict word-level errors and their types; scores are then calculated using multidimensional criteria, and errors are classified into severity levels (critical, major, or minor). This approach enables continuous quality assessment in production without the overhead of constant human evaluation or of maintaining reference translations. The results feed back into model improvement cycles, creating a closed loop for iterative enhancement. This represents a mature approach to production ML monitoring and quality assurance.

## Production Serving Infrastructure

The production infrastructure reveals careful attention to efficiency and scalability. The serving pipeline includes several optimization layers: a request caching component (RCC) that checks whether a translation already exists before hitting the backend model servers, dynamic batching to improve throughput by processing multiple requests together, and an embedding cache layer between encoders and decoders to improve efficiency when translating into multiple target languages from the same source.

The API design modified Roblox's in-experience text chat service to send both the original and the translated message to each user's device, enabling recipients to view messages in their native language or quickly switch to see the sender's original, untranslated message. This design choice respects user agency while defaulting to the translated experience. The backend integration includes comprehensive trust and safety systems to ensure translated text receives the same level of scrutiny as other text for detecting and blocking policy-violating content. This integration of safety considerations directly into the translation pipeline reflects mature thinking about production AI systems.
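To tie the serving pieces described above together, here is a minimal Python sketch of how a single request might flow through such a pipeline: a cache lookup, a call to batched model servers on a miss, a trust-and-safety check on the translated text, and a payload carrying both the original and the translation. All names (`TranslationBackend`, `SafetyFilter`, `ChatTranslationService`, `ChatPayload`) are hypothetical stand-ins for the components described in the text, not Roblox's actual services, and the in-process dict is only a placeholder for the request caching component.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


class TranslationBackend:
    """Hypothetical client for the model servers, which apply dynamic batching
    and an encoder-embedding cache behind this call."""

    def translate_batch(self, texts: list[str], src: str, tgt: str) -> list[str]:
        raise NotImplementedError


class SafetyFilter:
    """Hypothetical trust-and-safety check applied to translated text."""

    def is_allowed(self, text: str) -> bool:
        raise NotImplementedError


@dataclass
class ChatPayload:
    original: str    # sender's untranslated message, available via a toggle
    translated: str  # text shown by default on the recipient's device
    source_lang: str
    target_lang: str


class ChatTranslationService:
    def __init__(self, backend: TranslationBackend, safety: SafetyFilter):
        self.backend = backend
        self.safety = safety
        # In-process dict as a placeholder for the request caching component.
        self.cache: dict[str, str] = {}

    def _key(self, text: str, src: str, tgt: str) -> str:
        return hashlib.sha256(f"{src}:{tgt}:{text}".encode()).hexdigest()

    def handle_message(self, text: str, src: str, tgt: str) -> ChatPayload | None:
        key = self._key(text, src, tgt)
        translated = self.cache.get(key)
        if translated is None:
            # Cache miss: call the model servers, batched with other in-flight requests.
            translated = self.backend.translate_batch([text], src=src, tgt=tgt)[0]
            self.cache[key] = translated
        # Translated text gets the same scrutiny as any other text on the platform.
        if not self.safety.is_allowed(translated):
            return None
        # Both versions are delivered so the recipient can switch to the original.
        return ChatPayload(original=text, translated=translated,
                           source_lang=src, target_lang=tgt)
```

Keying the cache on the (source language, target language, text) triple makes repeated greetings and common phrases cheap to serve; the real system presumably layers this with the encoder-embedding cache described above when one message fans out to many target languages.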
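For the quality estimation approach described earlier in this section, the sketch below shows one way predicted word-level errors could be collapsed into a reference-free sentence score. The error dimensions and severity levels mirror the text, but the class names, penalty weights, and normalization are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    MINOR = 1
    MAJOR = 5
    CRITICAL = 10


class Dimension(Enum):
    ACCURACY = "accuracy"    # additions, omissions, mistranslations
    FLUENCY = "fluency"      # punctuation, spelling, grammar
    REFERENCE = "reference"  # discrepancies with the rest of the text


@dataclass
class PredictedError:
    """A word-level error emitted by the fine-tuned multilingual tagger."""
    span: tuple[int, int]  # token indices of the flagged words
    dimension: Dimension
    severity: Severity


def quality_score(errors: list[PredictedError], num_tokens: int) -> float:
    """Collapse word-level error predictions into a 0-100 sentence score.

    The penalty weights and normalization are illustrative guesses, not
    Roblox's published formula.
    """
    if num_tokens == 0:
        return 0.0
    penalty = sum(e.severity.value for e in errors)
    # Normalize by length so long messages are not penalized unfairly.
    score = 100.0 * max(0.0, 1.0 - penalty / (num_tokens * Severity.CRITICAL.value))
    return round(score, 1)
```

The key point, as the text notes, is that no reference translation is needed at inference time; scores like this can be aggregated per language pair over time to spot regressions.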
## Performance and Operational Considerations

Roblox reports that their custom model outperforms commercial translation APIs on Roblox-specific content based on their internal metrics, validating their investment in a custom solution. However, the case study should be viewed with appropriate skepticism regarding specific performance claims, as the company has strong incentives to present their system favorably. The stated 100-millisecond latency target and 5,000-plus chats-per-second throughput are impressive if accurate, though the case study does not provide detailed benchmark comparisons or discuss failure modes and edge cases.

The architecture's design for extensibility is noteworthy: the unified model approach means adding new languages requires relatively little effort as sufficient training data becomes available, rather than an exponential increase in model count. This forward-looking design demonstrates consideration for long-term operational sustainability. The team also discusses plans to expand beyond text to translate content on images, textures, and 3D models, and to explore automatic voice chat translation that preserves tone, rhythm, and emotion.

## Continuous Improvement and Feedback Loops

Roblox plans to implement user feedback mechanisms allowing people to flag mistranslations and suggest better translations, which will be incorporated into training data for model improvement. This human-in-the-loop approach creates a continuous improvement cycle leveraging the platform's user base. The team also plans regular updates with the latest translation examples from within experiences and popular chat phrases and slang in every supported language. This commitment to ongoing model maintenance reflects an understanding that production LLM systems require continuous investment rather than one-time deployment.

## LLMOps Maturity and Challenges

This case study demonstrates several markers of mature LLMOps practice:

- custom model development tailored to specific use case requirements rather than relying solely on general-purpose solutions
- sophisticated model compression and optimization techniques to meet latency and cost constraints
- custom evaluation frameworks aligned with business objectives
- comprehensive integration with adjacent systems like trust and safety
- infrastructure designed for caching and efficiency at scale
- planned feedback loops for continuous improvement

However, the case study also reveals common LLMOps challenges:

- balancing model size and performance against operational constraints
- handling data scarcity for less common scenarios (language pairs in this case)
- maintaining model currency as language evolves
- integrating safety considerations without compromising user experience
- scaling to handle massive concurrent usage

The presentation is from Roblox's CTO and clearly serves partly as a showcase of technical capabilities, so claims about superior performance should be interpreted as the company's self-assessment rather than independent validation. The absence of detailed discussion of failure modes, edge cases, error rates, or comparative benchmarks against specific commercial alternatives limits the ability to fully assess the system's performance. Nonetheless, the technical details provided suggest a sophisticated, production-grade LLMOps implementation addressing real challenges at significant scale.
## Strategic Implications

Roblox's decision to build a custom translation model rather than rely entirely on third-party APIs reflects a strategic calculation that the investment in custom model development, training infrastructure, and ongoing maintenance would provide superior results for their specific use case while potentially reducing long-term costs and dependency on external providers. This build-versus-buy decision represents an important consideration for organizations deploying LLMs in production. The unified architecture's extensibility and the team's ability to integrate the latest research advances provide competitive advantages that would be difficult to achieve with purely off-the-shelf solutions.

The case study illustrates how production LLM deployments at scale require end-to-end thinking about the entire system: not just the model itself, but training data pipelines, quality evaluation frameworks, serving infrastructure, safety integrations, user experience considerations, and continuous improvement mechanisms. The approximately 100-millisecond latency requirement drove fundamental architectural decisions, from model size to caching strategies, demonstrating how production requirements shape technical choices throughout the stack.
