## Overview
DeepL represents a compelling LLMOps case study as an enterprise-focused neural machine translation company that launched in 2017, timing its entry to coincide with the industry shift from statistical to neural machine translation. The company serves hundreds of thousands of customers with specialized translation models that compete directly with tech giants like Google and with general-purpose LLMs from OpenAI and others. CEO and founder Jarek Kutylowski explains how DeepL maintains technical differentiation through focused model development, custom infrastructure, and a deep understanding of enterprise translation workflows.
## Technical Architecture and Model Development
DeepL's approach to model architecture reflects a nuanced understanding of the translation task that goes beyond general text generation. The company found that translation requires specialized architectures balancing two competing objectives: staying faithful to the source text (copying capability) and producing fluent, natural-sounding output in the target language (creative generation). This dual requirement led them to develop custom architectures that combine monolingual and bilingual modeling approaches, and those models now rival general-purpose LLMs in parameter count.
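DeepL has not published these architectures, but one classic decode-time way to combine a bilingual translation model with a monolingual language model is shallow fusion, where the two models' next-token scores are blended. A minimal sketch, with the caveat that the `fluency_weight` knob and the score arrays are illustrative assumptions rather than DeepL internals:

```python
import numpy as np

def fused_next_token_probs(translation_logprobs: np.ndarray,
                           lm_logprobs: np.ndarray,
                           fluency_weight: float = 0.2) -> np.ndarray:
    """Shallow fusion: blend a bilingual translation model (accuracy to
    the source) with a monolingual language model (target fluency).

    fluency_weight trades copying fidelity against natural phrasing.
    """
    fused = translation_logprobs + fluency_weight * lm_logprobs
    fused -= fused.max()                  # numerical stability
    probs = np.exp(fused)
    return probs / probs.sum()

# Toy vocabulary of 4 tokens: the fused distribution shifts toward
# tokens the monolingual LM also scores as fluent.
p = fused_next_token_probs(np.log([0.5, 0.3, 0.1, 0.1]),
                           np.log([0.1, 0.6, 0.2, 0.1]))
print(p)
```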
The company does leverage pre-trained models from sources like Meta's Llama series as starting points, but invests significant additional compute on top of these foundation models. This additional training uses specialized, curated datasets that DeepL has built over years, with particular attention to maintaining proper distribution across all supported languages. This is especially critical for smaller languages where general-purpose models may have insufficient representation. The training approach represents roughly a 50/50 split between research and engineering work, with all research required to have direct product applicability rather than being purely academic.
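The exact curation pipeline is proprietary, but a standard way to maintain proper distribution across languages of very different sizes is temperature-based data sampling, which up-weights low-resource languages during training. A minimal sketch (the corpus sizes and the `alpha` value here are hypothetical):

```python
import numpy as np

def sampling_weights(corpus_sizes: dict[str, int], alpha: float = 0.3) -> dict[str, float]:
    """Compute per-language sampling probabilities.

    alpha < 1 flattens the distribution, up-weighting low-resource
    languages relative to their raw corpus share.
    """
    langs = list(corpus_sizes)
    sizes = np.array([corpus_sizes[l] for l in langs], dtype=float)
    probs = sizes / sizes.sum()          # raw data distribution
    smoothed = probs ** alpha            # temperature smoothing
    smoothed /= smoothed.sum()           # renormalize
    return dict(zip(langs, smoothed))

# English dwarfs Estonian in raw data, but alpha=0.3 narrows the gap
# in how often each language is sampled during training.
print(sampling_weights({"en": 1_000_000_000, "pl": 50_000_000, "et": 2_000_000}))
```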
A key technical insight concerns context handling. DeepL has found that sentence-level translation without context is often inadequate even for human translators—understanding the document type, company domain, and surrounding text is essential for high-quality translation. Rather than training separate models per customer (which would not scale to hundreds of thousands of customers), DeepL developed mechanisms for context injection that allow models to dynamically incorporate customer-specific terminology, document context, and domain information at inference time without retraining.
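DeepL's injection mechanism is not described in detail, but the general pattern is to serialize customer-specific signals into the model input at inference time. A minimal sketch, where the tag format and field names are hypothetical stand-ins for whatever wire format DeepL actually uses:

```python
from dataclasses import dataclass, field

@dataclass
class TranslationContext:
    """Customer-specific signals injected at inference time (no retraining)."""
    domain: str = ""                                         # e.g. "legal", "marketing"
    glossary: dict[str, str] = field(default_factory=dict)   # source term -> required target term
    preceding_sentences: list[str] = field(default_factory=list)

def build_model_input(source: str, ctx: TranslationContext) -> str:
    """Serialize context and source into a single conditioned input.

    Hypothetical tag format -- the real format is internal to DeepL.
    """
    parts = []
    if ctx.domain:
        parts.append(f"<domain> {ctx.domain}")
    for src_term, tgt_term in ctx.glossary.items():
        parts.append(f"<term> {src_term} = {tgt_term}")
    if ctx.preceding_sentences:
        parts.append("<context> " + " ".join(ctx.preceding_sentences[-3:]))
    parts.append(f"<source> {source}")
    return "\n".join(parts)

ctx = TranslationContext(
    domain="legal",
    glossary={"agreement": "Vertrag"},
    preceding_sentences=["This Agreement is entered into by both parties."],
)
print(build_model_input("The agreement terminates on 31 December.", ctx))
```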
## Model Training and Data Strategy
DeepL's data strategy involves both web scraping for parallel corpora (sentence-aligned texts in multiple languages) and monolingual data collection. The monolingual data becomes particularly important for languages where parallel corpora are scarce, allowing the models to learn language-specific fluency patterns. The company notes that while web scraping was more challenging in 2017, pre-crawled corpora are now readily available, though extracting and matching parallel sentences at scale across large web domains remains an algorithmically demanding problem.
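DeepL's mining system isn't described, but the now-standard approach to this matching problem is embedding-based bitext mining: encode candidate sentences with a multilingual encoder and pair those whose vectors are close. A toy sketch over precomputed embeddings (production systems use approximate nearest-neighbor indexes and margin-based scoring to handle billions of candidates):

```python
import numpy as np

def mine_parallel_pairs(src_emb: np.ndarray, tgt_emb: np.ndarray,
                        threshold: float = 0.8) -> list[tuple[int, int, float]]:
    """Greedy one-to-one matching of sentences by cosine similarity
    of multilingual embeddings (e.g. LASER/LaBSE-style vectors)."""
    # Normalize so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                       # (n_src, n_tgt) similarity matrix
    pairs, used_tgt = [], set()
    for i in np.argsort(-sims.max(axis=1)):  # most confident sources first
        j = int(np.argmax(sims[i]))
        if sims[i, j] >= threshold and j not in used_tgt:
            pairs.append((int(i), j, float(sims[i, j])))
            used_tgt.add(j)
    return pairs

# Synthetic demo: three targets are noisy copies of sources 3, 0, 4.
rng = np.random.default_rng(0)
src = rng.normal(size=(5, 64))
tgt = np.vstack([src[[3, 0, 4]] + 0.01 * rng.normal(size=(3, 64)),
                 rng.normal(size=(2, 64))])
print(mine_parallel_pairs(src, tgt))
```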
The company maintains multiple model variants with different characteristics. Some are tuned for technical accuracy where consistency and precision matter (such as legal or technical documentation), while others allow more creativity for marketing content where fluency and natural expression are prioritized. This tuning affects how the models sample from probability distributions during generation. For technical use cases, customers can upload custom terminology glossaries that ensure consistent translation of domain-specific terms across their entire documentation base without requiring model retraining.
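The specific tuning DeepL applies isn't public, but the simplest lever over how a model samples from its probability distribution is decoding temperature. A minimal sketch; mapping greedy decoding to terminology-critical content and higher temperatures to marketing copy is an illustrative assumption:

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float) -> int:
    """Sample the next token. Low temperature favors the most probable
    (consistent) choice; higher temperature allows more varied phrasing."""
    if temperature <= 0:                       # greedy: most faithful, least varied
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()                     # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5])
print(sample_token(logits, temperature=0.0))   # always index 0 (legal/technical)
print(sample_token(logits, temperature=1.2))   # varied (marketing)
```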
Training infrastructure has been a major investment area. DeepL began building their own GPU data centers in 2017, with the CEO personally racking early machines. This early infrastructure investment was necessary because GPU compute was difficult to procure at the time, even with unlimited budget. The company currently runs entirely on NVIDIA GPUs and has scaled to significant compute footprints including DGX systems and newer Blackwell architecture. While they monitor alternative GPU vendors and conduct benchmarking, migration costs are substantial given their custom model architectures, and NVIDIA's speed advantages remain important for their business.
The company employs thousands of human translators not for production inference but for model training, quality assurance, and feedback collection. This human-in-the-loop approach during training helps ensure quality without requiring human review during production translation at the scale DeepL operates.
## Inference Infrastructure and Production Deployment
DeepL's production inference infrastructure represents one of the more sophisticated LLMOps implementations discussed in the interview. The company had to build much of their deployment stack from scratch because they started before standard tooling existed. Key infrastructure challenges include:
**Request routing and batch optimization**: The system must balance GPU utilization (which benefits from larger batch sizes) against user latency requirements (which favor immediate processing). DeepL developed custom technology to intelligently group incoming translation requests and route them to appropriate GPU resources; a minimal batching sketch follows after this list.
**Multi-model management**: The company manages multiple models for different language pairs and use cases. The infrastructure includes dynamic model scheduling that responds to load patterns—for example, spinning up more Japanese translation capacity during Asian business hours and spinning down models for other language pairs. This dynamic resource allocation is more complex for GPU compute compared to traditional CPU-based services.
**Language-specific optimization**: DeepL has experimented with both consolidated multilingual models and separate models per language pair or language group. The choice involves trade-offs between engineering complexity (version management, deployment) and model performance. Grouping similar languages lets them share learned representations, which particularly benefits lower-resource languages. For speech translation specifically, latency requirements may necessitate smaller models that cannot handle all languages simultaneously, leading to more specialized model variants.
The company has moved toward consolidating models into groups that can handle multiple related languages rather than maintaining hundreds of separate models, which simplifies operations while maintaining quality through shared linguistic features.
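The request-routing trade-off flagged above reduces to a simple loop: take the first waiting request, then keep filling the batch until it is full or a latency deadline expires. A minimal sketch with illustrative `max_batch` and `max_wait_ms` values (DeepL's actual router is custom and unpublished):

```python
import time
import queue

def batch_requests(q: "queue.Queue[str]", max_batch: int = 32,
                   max_wait_ms: float = 10.0) -> list[str]:
    """Collect translation requests into one GPU batch.

    Larger batches improve GPU utilization; the wait-time cap bounds
    the extra latency paid by the first request in the batch.
    """
    batch = [q.get()]                           # block until at least one request
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The timeout is the key knob: a few milliseconds of batching delay is invisible for document translation but would be significant for real-time speech.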
## Speech Translation and Latency Optimization
DeepL launched speech translation in 2024 (described as "last year" in the conversation), representing a newer market vertical that introduces additional LLMOps challenges. Speech translation requires integrating speech recognition with neural machine translation, dealing with messier input since spoken language is less structured than written text. Speech recognition errors propagate to the translation model, which must be robust enough to handle potentially garbled input, sometimes making intelligent substitutions when the recognized text doesn't make sense.
Latency is the dominant concern for speech translation production systems. Real-time translation requires extremely fast model inference to maintain conversation flow and allow speakers to match translations with visual cues like facial expressions. The system must also handle context-specific terminology and proper nouns (like CEO names) to make good impressions in business settings. These requirements led to architectural decisions favoring smaller, faster models even at some cost to handling all languages in a single model.
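One widely used way to keep latency low while ASR output is still being revised is to translate only the "stable" prefix that recent recognizer hypotheses agree on, a local-agreement policy from the simultaneous-translation literature; whether DeepL uses this exact policy is not stated. A minimal sketch:

```python
def stable_prefix(hypotheses: list[list[str]]) -> list[str]:
    """Return the token prefix shared by the given ASR hypotheses.

    Streaming pipelines often forward only this stable part to the
    translation model, so downstream output doesn't flicker as the
    recognizer revises its guess.
    """
    shared = hypotheses[0]
    for hyp in hypotheses[1:]:
        n = 0
        while n < min(len(shared), len(hyp)) and shared[n] == hyp[n]:
            n += 1
        shared = shared[:n]
    return shared

hyps = [
    "we should sign the agreement".split(),
    "we should sign the agreement today".split(),
]
print(stable_prefix(hyps))  # ['we', 'should', 'sign', 'the', 'agreement']
```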
## Quality Evaluation and Model Capabilities
DeepL's approach to quality emphasizes that translation requirements vary significantly by use case. A casual email between colleagues has different quality needs than legal contracts or terms of service that could have legal consequences if mistranslated. Higher quality unlocks new use cases—each quality improvement makes machine translation viable for more demanding applications.
Many enterprise workflows still include human post-editing, where translators review and refine machine translations. DeepL measures quality partly by how much editing is required, which directly impacts customer ROI since post-editors (like paralegals) have high hourly costs. Reducing the number of required edits provides measurable business value.
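DeepL's internal metric isn't named, but the standard way to quantify "how much editing is required" is an edit rate against the post-edited text, in the spirit of HTER (human-targeted translation edit rate). A minimal word-level sketch:

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (wa != wb)))  # substitution
        prev = curr
    return prev[-1]

def post_edit_rate(mt_output: str, post_edited: str) -> float:
    """Fraction of the final text that had to be changed (HTER-style)."""
    mt, pe = mt_output.split(), post_edited.split()
    return edit_distance(mt, pe) / max(len(pe), 1)

print(post_edit_rate("The contract ends at December 31.",
                     "The contract terminates on December 31."))  # ~0.33
```

Lower rates translate directly into fewer billable post-editor hours, which is the ROI argument the company makes to enterprise customers.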
The company observes that specialized translation models hallucinate less than general-purpose LLMs on translation tasks. They control creativity through the accuracy-versus-fluency balance mentioned earlier, and can validate translations against the source text after generation, since the original is always available for comparison.
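The validation mechanisms themselves are not described; an illustrative sketch of the kind of cheap post hoc checks that having the source always available makes possible (the specific checks and thresholds here are assumptions, not DeepL's):

```python
def sanity_check(source: str, translation: str, glossary: dict[str, str]) -> list[str]:
    """Cheap post-generation checks against the source text."""
    issues = []
    ratio = len(translation) / max(len(source), 1)
    if not 0.5 <= ratio <= 2.0:                 # gross truncation or runaway output
        issues.append(f"suspicious length ratio {ratio:.2f}")
    for src_term, tgt_term in glossary.items(): # required terminology honored?
        if src_term.lower() in source.lower() and tgt_term.lower() not in translation.lower():
            issues.append(f"glossary term '{src_term}' not rendered as '{tgt_term}'")
    return issues

print(sanity_check("Please review the agreement.",
                   "Bitte prüfen Sie den Entwurf.",
                   {"agreement": "Vertrag"}))
# -> ["glossary term 'agreement' not rendered as 'Vertrag'"]
```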
DeepL acknowledges that language models still don't understand the world as deeply as humans despite having seen vast amounts of text. This shows up in edge cases: very short UI strings, malformed text from parsing errors, or highly ambiguous passages where human world knowledge and understanding of intent become necessary. However, models are more consistent than humans and don't make the typos or mental slips humans occasionally produce.
## Competitive Positioning and Market Evolution
The case study reveals how DeepL thinks about competing against both established players (Google Translate) and general-purpose LLMs (OpenAI and others). Their strategy centers on specialization: models focused purely on translation, with appropriate architectural choices, curated training data, and quality tuning, outperform general-purpose systems at translation. However, CEO Kutylowski acknowledges this may change as general-purpose models become more powerful.
DeepL's evolution strategy involves moving "up the stack" from simple sentence translation to understanding complete enterprise workflows. Rather than just translating text from language A to B, they're embedding translation into broader business processes—understanding whether translations will be reviewed, incorporating previous translation versions and human edits as context, and building functionality that addresses higher-order translation workflow problems rather than just the core translation task. This workflow-level product development, informed by deep customer research, represents their moat as horizontal translation becomes commoditized.
The company started in 2017 at an opportune moment when everyone had to switch to neural approaches, creating an opening for a startup to build state-of-the-art models. They found better architectures than what existed in academia at the time. Now with transformers and massive LLMs as the baseline, maintaining competitive advantage requires this shift toward workflow integration and enterprise-specific features.
## Business Model and Enterprise Focus
DeepL generates significant revenue through enterprise customers who need translation for customer support, marketing materials, legal documents, technical documentation, and internal communications across globally distributed teams. The value proposition centers on speed (instant translation rather than waiting for human translation services) and enabling new use cases that weren't economically viable with human translation.
The business model has evolved translation from a centralized function (typically handled by specialized agencies) to self-service tools that individual departments like legal or marketing can use directly. This democratization has increased translation volume dramatically as more content becomes worth translating when the process is fast and affordable.
Language quality has crossed thresholds that unlock different use cases—what's sufficient for internal emails differs from what's needed for published legal terms across 20 languages. Each quality improvement opens new market segments. The company maintains focus on serving enterprise customers well rather than trying to handle all possible translation scenarios.
## Operational Challenges and Scale
The company is not purely compute-constrained in the sense that more GPUs would immediately generate more revenue. While GPU availability has been a challenge at various points (DeepL notes going through periods where GPUs couldn't be obtained even with unlimited money), they're currently able to procure what they need. The limiting factor is increasingly research and engineering talent to utilize compute effectively rather than raw hardware availability.
Infrastructure decisions involve tradeoffs between using standard tooling versus custom solutions. Migration costs are significant given custom model architectures, making it difficult to switch to off-the-shelf inference providers. The company continues monitoring alternative GPU vendors as the market diversifies beyond NVIDIA's near-monopoly.
DeepL serves diverse language pairs with varying data availability and quality requirements. They prioritize language investments based on customer demand and business ROI rather than purely linguistic considerations. This creates a natural tiering where major global languages receive the most investment, medium-sized languages (like Polish, Kutylowski's native language) get good coverage, and very small languages remain challenging to serve at the same quality level without breakthroughs in low-resource learning techniques.
The company employs techniques like tokenization strategy variations for different languages and continues exploring how neural networks represent multilingual concepts, noting interesting research (like Anthropic's findings about similar neurons firing for equivalent meanings across languages) that validates their multilingual model consolidation efforts.
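One simple diagnostic behind such tokenization decisions is "fertility", the average number of subword tokens per word, which reveals when a shared vocabulary under-serves a language; attributing this exact diagnostic to DeepL is an assumption. A sketch that works with any `tokenize` callable:

```python
def fertility(tokenize, sentences: list[str]) -> float:
    """Average subword tokens per whitespace word. High fertility for one
    language suggests its script or morphology is under-served by the
    shared vocabulary."""
    toks = sum(len(tokenize(s)) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return toks / max(words, 1)

def naive_subwords(s: str) -> list[str]:
    # Stand-in for a real subword tokenizer.
    return [p for w in s.split() for p in (w[:3], w[3:]) if p]

# German compounds fragment heavily under a crude splitter.
print(fertility(naive_subwords, ["Rechtsabteilung prüft den Vertragsentwurf"]))
```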
## Long-term Outlook and Industry Impact
The case study touches on broader implications of high-quality, accessible translation technology. Kutylowski sees it democratizing business communication for non-English speakers who have long been at a disadvantage in international commerce. However, he maintains that certain contexts, particularly personal relationships, will always benefit from humans actually learning languages rather than relying on AI intermediation, since language embeds cultural understanding that translation can't fully capture.
The translation industry itself is being disrupted, with traditional human translation companies shrinking as AI capabilities improve. Human translators will likely focus on the highest-stakes, most complex translation work rather than routine content. DeepL positions themselves as enabling this transition rather than simply replacing humans, though they're realistic that routine translation work will be increasingly automated.
The company's trajectory from 2017 to present demonstrates how a focused AI application with strong product-market fit can build a substantial business even in competition with tech giants. Their success stems from technical specialization, deep understanding of enterprise workflows, early infrastructure investments, and strategic focus on quality improvements that unlock successively more demanding use cases. As general-purpose LLMs become more capable, DeepL's moat increasingly depends on workflow integration and enterprise-specific features rather than pure translation quality alone.