## Overview
Duolingo's DuoRadio case study presents an interesting example of scaling educational content production through generative AI pipelines. DuoRadio is an audio feature that provides podcast-like radio shows to help language learners improve their listening comprehension using Duolingo's character-driven content. The case study, published in March 2025, describes how the team transformed a labor-intensive manual process into a largely automated end-to-end pipeline, achieving significant scale improvements.
The core problem was straightforward: DuoRadio launched in late 2023 and showed promise for learning outcomes, but the production process was extremely resource-intensive. Creating just 300 episodes for a handful of courses took nearly a year, requiring meticulous scripting, curriculum alignment, voice actors, and specialized audio editing. This bottleneck meant DuoRadio remained a niche offering despite its popularity.
## Initial AI Approaches and Failures
The case study is refreshingly honest about early failures with generative AI. The team's first approach—generating original scripts from scratch—produced subpar results requiring extensive manual editing. The second approach, automated translation of existing English episodes, also fell short on translation accuracy and proficiency-level appropriateness. Both required significant human intervention, defeating the purpose of automation.
These failures highlight a common LLMOps lesson: naive application of LLMs to content generation often produces outputs that don't meet production quality standards without substantial human review.
## The Breakthrough: Curriculum-Driven Prompting
The key insight came during an internal hackathon. Rather than adding more constraints to prompts (which didn't work well), the team found that feeding existing curriculum content directly into prompts produced dramatically better results. By supplying the LLM with well-crafted sentences and exercises already created by Learning Designers for Duolingo lessons, the model had specific patterns to follow rather than attempting to interpret complex instructions.
This is a significant prompt engineering insight: providing concrete examples from the target domain often outperforms elaborate instruction-based prompting. The curriculum content served as a form of few-shot learning, grounding the model's outputs in proven educational material.
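As an illustration of this pattern, a minimal sketch of assembling such a curriculum-grounded prompt might look like the following. The function name, prompt wording, and example sentences are hypothetical; the case study does not publish Duolingo's actual prompts.

```python
# Hypothetical sketch: ground generation in existing curriculum content
# by embedding vetted sentences directly in the prompt, rather than
# relying on elaborate instruction-only prompting.

def build_script_prompt(curriculum_items, target_language, cefr_level):
    """Assemble a generation prompt grounded in existing curriculum content."""
    examples = "\n".join(f"- {item}" for item in curriculum_items)
    return (
        f"Write a short {target_language} radio-show script at CEFR level "
        f"{cefr_level}.\n"
        "Base the dialogue only on vocabulary and sentence patterns from the "
        "curriculum material below, rather than inventing new constructions:\n"
        f"{examples}\n"
    )

prompt = build_script_prompt(
    ["¿Dónde está la biblioteca?", "Me gusta leer por la mañana."],
    target_language="Spanish",
    cefr_level="A2",
)
```

The point of the sketch is that the curriculum sentences act as in-context examples, constraining vocabulary and register far more reliably than abstract instructions alone.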
## Quality Evaluation and Filtering
A critical component of the production system was the AI-powered evaluation layer. The team recognized that while generative AI could produce many candidate scripts, not all would meet quality standards. To address this, they built a filtering process using additional generative AI prompts designed to assess scripts on multiple criteria:
- Naturalness of language
- Grammaticality
- Coherence and logical flow
- Alignment with learning criteria
Using LLMs for both generation and evaluation, sometimes called the LLM-as-judge pattern, is common in production LLMOps systems. The Learning Designers iteratively refined these evaluator prompts over time, continuously raising the quality bar. This ongoing refinement of evaluation criteria represents a human-in-the-loop approach to maintaining quality standards while scaling automation.
It's worth noting that the effectiveness of this LLM-as-evaluator approach depends heavily on how well the evaluation prompts capture actual quality criteria. The case study doesn't provide detailed metrics on false positive/negative rates or how the AI evaluations compare to human expert assessments.
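A minimal sketch of this overgenerate-then-filter idea is shown below, with the LLM evaluator stubbed out as a placeholder heuristic so the control flow runs end to end. All names, criteria weights, and the threshold are assumptions for illustration, not Duolingo's implementation.

```python
# Sketch of an overgenerate-then-filter loop: score each candidate script
# against multiple criteria and keep only those passing every one.

CRITERIA = ("naturalness", "grammaticality", "coherence", "learning_alignment")

def evaluate_script(script):
    # In production each criterion would be scored by an LLM evaluator
    # prompt; here a trivial length heuristic stands in so the code runs.
    base = min(len(script) / 100, 1.0)
    return {criterion: base for criterion in CRITERIA}

def filter_scripts(candidates, threshold=0.8):
    """Keep only candidates that pass every evaluation criterion."""
    kept = []
    for script in candidates:
        scores = evaluate_script(script)
        if all(score >= threshold for score in scores.values()):
            kept.append(script)
    return kept
```

Generating excess candidates and filtering aggressively, as sketched here, accepts the probabilistic nature of LLM output instead of trying to make every single generation perfect.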
## Language-Specific Challenges
An interesting operational finding was that English-only prompt instructions were less effective when generating content for courses teaching languages other than English. By leveraging language-specific content from each course's curriculum, they achieved better accuracy and relevance. This suggests that for multilingual content generation, prompts and examples should be tailored to the target language rather than relying on translation or English-centric approaches.
## Exercise Standardization
The team found that giving generative AI freedom to sequence and place exercises within episodes produced inconsistent quality. They solved this by leveraging learner session data to determine optimal exercise placement and standardizing the order and structure. This is an example of constraining the LLM's output space based on empirical data—reducing the degrees of freedom where user behavior data already suggests optimal patterns.
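One way to enforce that kind of structural constraint is to validate generated episodes against a fixed template; the template contents below are hypothetical, standing in for whatever ordering Duolingo's session data actually suggested.

```python
# Hypothetical fixed episode template: exercise types and positions are
# pinned by learner-session data rather than left to the LLM to choose.

EPISODE_TEMPLATE = [
    "intro",
    "dialogue_segment",
    "comprehension_exercise",
    "dialogue_segment",
    "vocabulary_exercise",
    "outro",
]

def validate_episode(segments):
    """Reject generated episodes that deviate from the standard structure."""
    return [segment["type"] for segment in segments] == EPISODE_TEMPLATE
```

Validating structure separately from content keeps the LLM responsible only for the parts it is good at (dialogue, phrasing) while empirical data dictates the layout.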
## Automated Audio Production Pipeline
Beyond script generation, the automation extended to audio production. The team integrated advanced Text-to-Speech (TTS) systems to automatically generate voiceovers in multiple languages. They also implemented audio hashing techniques for storing and retrieving pre-generated audio segments (like consistent intros and outros), reducing redundant audio generation and editing time.
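The audio-hashing idea can be sketched as content-addressed caching keyed on text plus voice, so recurring segments such as intros and outros are synthesized only once. The `synthesize` callable below stands in for a real TTS API; all names are assumptions.

```python
import hashlib

# Sketch of content-addressed caching for TTS output: identical
# (voice, text) pairs map to the same key, so repeated segments are
# generated once and reused across episodes.

_audio_cache = {}

def audio_key(text, voice):
    """Derive a stable cache key from the voice and segment text."""
    return hashlib.sha256(f"{voice}:{text}".encode()).hexdigest()

def get_audio(text, voice, synthesize):
    """Return cached audio for this segment, synthesizing on first use."""
    key = audio_key(text, voice)
    if key not in _audio_cache:
        _audio_cache[key] = synthesize(text, voice)
    return _audio_cache[key]
```

A content-addressed key also makes the cache safe to share across pipeline runs: any change to the text or voice produces a new key rather than silently serving stale audio.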
The full end-to-end pipeline required zero human intervention post-initiation, covering the entire lifecycle from script creation to final deployment. This level of automation is notable, though the case study doesn't detail the monitoring, error handling, or rollback capabilities that would typically be necessary for such a hands-off production system.
## Reported Results
The case study reports impressive metrics, though these should be considered with appropriate skepticism as they come from the company itself:
- Daily sessions grew from 500K to 5M in fewer than 6 months
- DAUs grew from 100K to 5.5M
- Courses expanded from 2 to 25+
- Episodes grew from 300 to 15,000+
- Claimed 99% cost reduction
- What would have taken 5+ years was accomplished in under two quarters
These numbers suggest substantial improvements in both reach and efficiency. However, the case study doesn't detail how quality was measured against the original manually-produced episodes, or provide learner outcome data comparing the automated versus manual content.
## Internal Tooling
The team mentions using "Workflow Builder"—described as their internal content generation prototyping tool—to automatically generate DuoRadio content at scale. This suggests Duolingo has invested in internal tooling infrastructure for LLM-powered content generation, which likely enables rapid iteration and experimentation across content teams.
## Key LLMOps Patterns Demonstrated
Several LLMOps patterns emerge from this case study:
**Domain-specific grounding**: Rather than relying solely on prompt instructions, the most effective approach was grounding the LLM's outputs in existing domain-specific content (curriculum materials). This reduced hallucination and improved alignment with educational standards.
**LLM-as-evaluator**: Using generative AI not just for content creation but also for quality assessment, with human experts designing and refining the evaluation criteria over time.
**Constraint-based generation**: Standardizing structural elements (like exercise placement) based on empirical user data, reducing the problem space where the LLM operates.
**Multi-stage pipelines**: Combining content generation, quality filtering, and audio synthesis into end-to-end automated pipelines with appropriate handoffs between stages.
**Iterative prompt refinement**: Learning Designers continuously refined prompts based on output quality, representing an ongoing human oversight role even in highly automated systems.
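Putting these patterns together, the multi-stage pipeline shape can be sketched as generate, then filter, then synthesize. All stage functions here are trivial stand-ins rather than real model calls, so only the handoff structure is illustrated.

```python
# Hypothetical end-to-end pipeline sketch: overgenerate candidate scripts,
# filter by evaluator score, then synthesize audio for the survivors.

def run_pipeline(n_candidates, generate, evaluate, synthesize, threshold=0.8):
    """Generate -> filter -> synthesize, returning finished episodes."""
    candidates = [generate(i) for i in range(n_candidates)]
    approved = [s for s in candidates if evaluate(s) >= threshold]
    return [synthesize(s) for s in approved]

episodes = run_pipeline(
    n_candidates=5,
    generate=lambda i: f"script-{i}",
    evaluate=lambda s: 0.9 if s.endswith(("0", "2", "4")) else 0.5,
    synthesize=lambda s: f"audio({s})",
)
# episodes == ["audio(script-0)", "audio(script-2)", "audio(script-4)"]
```

In a real deployment each stage boundary is also where monitoring, retries, and quality sampling would attach, which is exactly the operational detail the case study leaves unspecified.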
## Limitations and Considerations
While the case study presents a success story, a few considerations warrant mention. The reliance on existing curriculum content means this approach may be less applicable to truly novel content creation where no reference material exists. The 99% cost reduction claim is striking but lacks detailed breakdown—it's unclear whether this accounts for the development investment in building the automation infrastructure.
Additionally, the case study notes that simplification of certain feature aspects was necessary "to make automation more feasible." This suggests some tradeoffs were made between full feature fidelity and automation capability, though the core educational value was reportedly preserved.
The case study also doesn't address potential concerns around fully automated content pipelines, such as monitoring for model drift, handling edge cases, or quality assurance sampling in production. For educational content, where accuracy is particularly important, these would typically be important operational considerations.
Overall, this case study demonstrates a thoughtful approach to scaling content production through generative AI, with appropriate emphasis on quality controls and human expertise in designing evaluation criteria, while acknowledging that significant infrastructure and prompt engineering investment was required to achieve production-quality results.
## Future Directions
Duolingo plans to expand this approach to other forms of longform content, suggesting that the pipeline they've built is flexible enough to be adapted to different content types while maintaining quality standards. This scalability and adaptability is a crucial aspect of successful LLMOps implementations.