ZenML

Building an AI-Assisted Content Creation Platform for Language Learning

Babbel 2023

Babbel developed an AI-assisted content creation tool to streamline its traditional 35-hour content creation pipeline for language learning materials. The solution integrates LLMs with human expertise through a Gradio-based interface, enabling prompt management, content generation, and evaluation while maintaining quality standards. The system reduced content creation time while sustaining acceptance rates above 85% from editors.

Industry

Education

Overview

Babbel, a well-established language learning company that has been in the market for over 15 years, partnered with Inovex to develop an AI-aided content creation tool called the “Dynamic Content Creator.” This presentation, delivered at a conference by Leia (Data/AI Product Manager at Inovex) and Hector (Computational Linguist at Babbel), details their journey of integrating LLMs into Babbel’s traditionally rigorous content creation pipeline.

The core challenge they faced was that Babbel’s existing content creation process, while producing high-quality educational content, was extremely time-consuming and rigid. A single unit comprising three to four self-study lessons required approximately 35 hours of production time. The process involved five overarching steps: pre-production, production, localization, editorial checks, and staging/release. Each stage depended heavily on deliverables from the previous one, making it difficult to scale and personalize content for learners across their eight supported languages.

Technical Architecture and Stack

The Dynamic Content Creator was built on a relatively straightforward but effective stack: a Gradio front end for prompt management, generation, and review, backed by LLM APIs for content generation.

The architecture was intentionally designed to be modular, acknowledging that the LLM landscape is rapidly evolving and the underlying model might need to be exchanged in the future. This forward-thinking approach to system design is a notable LLMOps best practice.
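
The swap-friendly design the team describes maps naturally onto an adapter pattern. The sketch below is illustrative only (all class and method names are hypothetical, not from the talk): application code depends on a narrow interface, so the underlying model provider can be exchanged later.

```python
from typing import Protocol


class LLMBackend(Protocol):
    """Minimal interface any model provider must satisfy."""
    def complete(self, prompt: str) -> str: ...


class StubBackend:
    """Stand-in backend; a real deployment would wrap an API client
    (a hosted LLM API, a self-hosted model, etc.)."""
    def complete(self, prompt: str) -> str:
        return f"[completion for: {prompt[:40]}]"


class ContentCreator:
    """Application code depends only on the LLMBackend interface, so
    the underlying LLM can be exchanged without touching callers."""
    def __init__(self, backend: LLMBackend) -> None:
        self.backend = backend

    def generate(self, prompt: str) -> str:
        return self.backend.complete(prompt)
```

Replacing the model then means writing one new backend class rather than editing every call site.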

Product Development Process

The team followed a careful, iterative development approach organized into three phases, each reflecting core LLMOps principles:

Phase 1 - Process Onboarding and Internal POC: The team began by deeply understanding the existing workflow through extensive conversations with subject matter experts. They selected a high-value, low-risk use case for their initial proof of concept—generating learning items at scale for a grammar feature within the app. This strategy of starting with internal needs allowed them to work with a “friendly user group” who understood the tool’s limitations and could provide constructive feedback.

Phase 2 - Internal MVP: Based on POC feedback, they expanded use cases and added features. Crucially, while they focused on immediate use cases, they designed features with generalization in mind for future applications. This phase also introduced stable deployment, moving from local development setups to a proper hosted solution.

Phase 3 - Rollout and Enablement: The tool was released to a broader internal audience at Babbel, with onboarding sessions to help users understand capabilities and realize their own use cases. The team continues to collect feedback and support users in implementing new functionalities.

Core Features and Workflow

The Dynamic Content Creator implements a multi-stage content generation workflow with human-in-the-loop integration throughout:

Prompt Generation Stage: Users can access a prompt library with pre-built templates, create their own templates, and utilize AI-aided prompt optimization. The tool includes functionality to condense prompts to reduce API costs (though the presenters noted this is becoming less critical as costs decrease). A particularly useful feature is dynamic variable selection that connects to Babbel’s data systems, allowing users to select placeholders from existing content databases. This data integration is essential for scaling content generation.
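
The dynamic-variable mechanism can be sketched as template substitution over rows pulled from a content database. The template name, fields, and sample row below are illustrative assumptions, not Babbel's actual schema:

```python
import string

# Hypothetical prompt library: templates whose placeholders map onto
# fields in an existing content database.
PROMPT_LIBRARY = {
    "vocab_items": (
        "Create ${count} ${level}-level ${language} example sentences "
        "for the word '${word}'."
    ),
}

# Stand-in for rows retrieved from the content management system.
content_db = [
    {"word": "Fahrrad", "language": "German", "level": "A1"},
]


def render_prompt(template_name: str, row: dict, **extra) -> str:
    """Fill a library template from a content-database row plus
    any ad-hoc values supplied by the user."""
    template = string.Template(PROMPT_LIBRARY[template_name])
    return template.substitute({**row, **extra})


prompt = render_prompt("vocab_items", content_db[0], count=3)
# → "Create 3 A1-level German example sentences for the word 'Fahrrad'."
```

Because placeholders resolve against real database rows, the same template can be fanned out across hundreds of vocabulary items, which is what makes the data integration essential for scale.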

Content Generation Stage: Users can select different output formats and the system integrates Constitutional AI principles for model reflection and self-evaluation. Babbel’s own content principles around inclusivity and diversity are embedded into the evaluation criteria, ensuring generated content aligns with brand values and educational standards.
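
A Constitutional-AI-style self-evaluation can be sketched as a critique-and-revise loop over a list of principles. The principles below and the stubbed `critique`/`revise` functions are hypothetical stand-ins for real LLM calls; only the control flow is the point:

```python
from typing import Optional

# Hypothetical content principles standing in for Babbel's
# inclusivity and diversity criteria.
PRINCIPLES = [
    "Content must be inclusive and avoid stereotypes.",
    "Vocabulary must match the learner's stated CEFR level.",
]


def critique(text: str, principle: str) -> Optional[str]:
    """Ask the model whether the text violates the principle
    (stubbed here with a keyword check); return a critique or None."""
    if "stereotype" in text.lower() and "inclusive" in principle:
        return "Remove the stereotyped characterization."
    return None


def revise(text: str, critique_msg: str) -> str:
    """Ask the model to rewrite the text to address the critique (stubbed)."""
    return text.replace("stereotype", "varied portrayal")


def constitutional_pass(draft: str) -> str:
    """One self-evaluation sweep: critique the draft against each
    principle and revise whenever a critique comes back."""
    for principle in PRINCIPLES:
        msg = critique(draft, principle)
        if msg:
            draft = revise(draft, msg)
    return draft
```

In production, both helper functions would be additional LLM invocations, so each generated item costs a few extra calls in exchange for outputs pre-screened against brand values.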

Human Review and Feedback Loop: Content creators can provide explicit or implicit feedback by editing generated content. This feedback can then be used to optimize the initial prompt templates, creating a virtuous cycle of improvement. The presenters emphasized that the workflow is not strictly linear—users can loop back between stages and even start at different points depending on their needs.
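
One way to capture the implicit feedback described above is to log how far each edited item drifted from what the model produced, then flag heavily edited items as candidates for template optimization. This minimal sketch (class name and threshold are assumptions, not from the talk) uses Python's `difflib`:

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class FeedbackLog:
    """Collects implicit feedback: the similarity between what the
    model generated and what the editor actually shipped."""
    edits: list = field(default_factory=list)

    def record(self, generated: str, edited: str) -> None:
        ratio = difflib.SequenceMatcher(None, generated, edited).ratio()
        self.edits.append(
            {"generated": generated, "edited": edited, "similarity": ratio}
        )

    def heavy_edits(self, threshold: float = 0.8) -> list:
        """Items rewritten beyond the threshold: signals that the
        originating prompt template may need improvement."""
        return [e for e in self.edits if e["similarity"] < threshold]
```

Feeding the `heavy_edits` items back into prompt refinement is one concrete form the "virtuous cycle" can take.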

Content Transformation: This stage handles localization (translation to different target languages) and conversion to Babbel-specific formats for integration with their systems.

Content Saving and Export: The team is working on integrating direct upload to content management systems and report generation for tracking what was created.

Demonstration Insights

The live demo, presented by Pascal (the main developer), showcased the practical application of the tool for creating language learning dialogues.

Addressing Hallucinations and Quality

The team found that hallucinations were significantly reduced by incorporating existing Babbel content as examples in prompts. Rather than elaborate prompt engineering instructions alone, providing concrete examples of desired output grounded the LLM’s responses effectively. They described their approach as more “in-context prompting” than traditional RAG, though they acknowledged that their content management system’s structure posed challenges for more sophisticated retrieval approaches.
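
The "in-context prompting" approach amounts to few-shot prompting with approved content items. A minimal sketch, with hypothetical example structure and field names:

```python
def build_incontext_prompt(instruction: str, examples: list) -> str:
    """Ground the request in concrete, approved content examples
    rather than relying on prompt instructions alone."""
    parts = [instruction, "", "Examples of approved content:"]
    for ex in examples:
        parts.append(f"- Input: {ex['input']}")
        parts.append(f"  Output: {ex['output']}")
    parts.append("")
    parts.append("Now produce a new item in the same style.")
    return "\n".join(parts)


# Stand-in for approved items pulled from the content database.
approved = [
    {"input": "greeting, A1, German", "output": "Hallo! Wie geht es dir?"},
]

prompt = build_incontext_prompt("Write a short learner dialogue line.", approved)
```

Unlike full RAG, nothing is retrieved dynamically per query here; curated examples are injected directly, which sidesteps the retrieval challenges their content management system posed.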

For now, human reviewers remain the final quality gatekeepers. The team achieved an impressive 85%+ acceptance rate from human editors on generated items for their initial use case, demonstrating that the quality threshold was sufficient for practical use while still maintaining human oversight.

Organizational and Cultural Challenges

The presenters candidly discussed the challenge of building trust with existing content creators. Some staff were skeptical that AI could match their quality standards, while others were more curious and open to experimentation. Their strategy centered on starting with friendly user groups, running onboarding sessions, and collecting feedback continuously.

They noted that maintaining healthy skepticism is actually valuable—it keeps the team focused on quality and prevents over-reliance on AI outputs.

Evaluation Challenges and Future Directions

The team acknowledged that evaluation remains one of their most significant challenges. While they experimented with tracking edit rates and acceptance rates, they found that purely quantitative metrics can be misleading—an edit might reflect personal taste rather than actual quality issues. Traditional NLP metrics like ROUGE or BLEU have limitations because they don’t capture cultural relevance or educational value.
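
Acceptance rate and edit rate are straightforward to compute, which is both their appeal and their limitation. A sketch under assumed review-log fields (the field names are illustrative, not from the talk):

```python
def acceptance_rate(items: list) -> float:
    """Share of generated items that editors accepted."""
    accepted = sum(1 for i in items if i["status"] == "accepted")
    return accepted / len(items)


def edit_rate(items: list) -> float:
    """Share of accepted items that editors modified before shipping.
    Caveat from the talk: an edit may reflect personal taste rather
    than a genuine quality issue, so this number needs qualitative
    review alongside it."""
    accepted = [i for i in items if i["status"] == "accepted"]
    edited = sum(1 for i in accepted if i["edited"])
    return edited / len(accepted) if accepted else 0.0


review_log = [
    {"status": "accepted", "edited": False},
    {"status": "accepted", "edited": True},
    {"status": "rejected", "edited": False},
]
```

Metrics like these say nothing about cultural relevance or educational value, which is exactly the gap ROUGE and BLEU also leave open.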

Looking forward, the team is exploring evaluation approaches that better capture cultural relevance and educational value than raw edit rates or traditional NLP metrics.

Key Learnings and Best Practices

The presentation highlighted several transferable LLMOps lessons: start with a high-value, low-risk internal use case, design for model exchangeability, keep humans in the loop as quality gatekeepers, and ground generation in existing approved content rather than instructions alone.

The case study represents a thoughtful, production-focused approach to integrating LLMs into an established content creation workflow, balancing the efficiency gains from AI with the quality standards that define Babbel’s brand reputation.
