ZenML

Building an AI-Assisted Content Creation Platform for Language Learning

Babbel 2023

Babbel developed an AI-assisted content creation tool to streamline its traditional 35-hour content creation pipeline for language learning materials. The solution integrates LLMs with human expertise through a Gradio-based interface, enabling prompt management, content generation, and evaluation while maintaining quality standards. The system reduced content creation time while sustaining acceptance rates above 85% from editors.

Industry

Education

Overview

Babbel, a well-established language learning company that has been in the market for over 15 years, partnered with Inovex to develop an AI-aided content creation tool called the “Dynamic Content Creator.” This presentation, delivered at a conference by Leia (Data/AI Product Manager at Inovex) and Hector (Computational Linguist at Babbel), details their journey of integrating LLMs into Babbel’s traditionally rigorous content creation pipeline.

The core challenge they faced was that Babbel’s existing content creation process, while producing high-quality educational content, was extremely time-consuming and rigid. A single unit comprising three to four self-study lessons required approximately 35 hours of production time. The process involved five overarching steps: pre-production, production, localization, editorial checks, and staging/release. Each stage depended heavily on deliverables from the previous one, making it difficult to scale and personalize content for learners across their eight supported languages.

Technical Architecture and Stack

The Dynamic Content Creator was built on a relatively straightforward but effective stack: a Gradio front end for prompt management, generation, and review, backed by LLM APIs for content generation.

The architecture was intentionally designed to be modular, acknowledging that the LLM landscape is rapidly evolving and the underlying model might need to be exchanged in the future. This forward-thinking approach to system design is a notable LLMOps best practice.
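
The swap-friendly design the team describes maps naturally onto an adapter pattern. The sketch below is illustrative only (all class and method names are hypothetical, not from the talk): application code depends on a narrow interface, so the underlying model provider can be exchanged later.

```python
from typing import Protocol


class LLMBackend(Protocol):
    """Minimal interface any model provider must satisfy."""
    def complete(self, prompt: str) -> str: ...


class StubBackend:
    """Stand-in backend; a real deployment would wrap an API client
    (a hosted LLM API, a self-hosted model, etc.)."""
    def complete(self, prompt: str) -> str:
        return f"[completion for: {prompt[:40]}]"


class ContentCreator:
    """Application code depends only on the LLMBackend interface, so
    the underlying LLM can be exchanged without touching callers."""
    def __init__(self, backend: LLMBackend) -> None:
        self.backend = backend

    def generate(self, prompt: str) -> str:
        return self.backend.complete(prompt)
```

Replacing the model then means writing one new backend class rather than editing every call site.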

Product Development Process

The team followed a careful, iterative development approach organized into three phases, each reflecting core LLMOps principles:

Phase 1 - Process Onboarding and Internal POC: The team began by deeply understanding the existing workflow through extensive conversations with subject matter experts. They selected a high-value, low-risk use case for their initial proof of concept—generating learning items at scale for a grammar feature within the app. This strategy of starting with internal needs allowed them to work with a “friendly user group” who understood the tool’s limitations and could provide constructive feedback.

Phase 2 - Internal MVP: Based on POC feedback, they expanded use cases and added features. Crucially, while they focused on immediate use cases, they designed features with generalization in mind for future applications. This phase also introduced stable deployment, moving from local development setups to a proper hosted solution.

Phase 3 - Rollout and Enablement: The tool was released to a broader internal audience at Babbel, with onboarding sessions to help users understand capabilities and realize their own use cases. The team continues to collect feedback and support users in implementing new functionalities.

Core Features and Workflow

The Dynamic Content Creator implements a multi-stage content generation workflow with human-in-the-loop integration throughout:

Prompt Generation Stage: Users can access a prompt library with pre-built templates, create their own templates, and utilize AI-aided prompt optimization. The tool includes functionality to condense prompts to reduce API costs (though the presenters noted this is becoming less critical as costs decrease). A particularly useful feature is dynamic variable selection that connects to Babbel’s data systems, allowing users to select placeholders from existing content databases. This data integration is essential for scaling content generation.
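
The dynamic-variable mechanism can be sketched as template substitution over rows pulled from a content database. The template name, fields, and sample row below are illustrative assumptions, not Babbel's actual schema:

```python
import string

# Hypothetical prompt library: templates whose placeholders map onto
# fields in an existing content database.
PROMPT_LIBRARY = {
    "vocab_items": (
        "Create ${count} ${level}-level ${language} example sentences "
        "for the word '${word}'."
    ),
}

# Stand-in for rows retrieved from the content management system.
content_db = [
    {"word": "Fahrrad", "language": "German", "level": "A1"},
]


def render_prompt(template_name: str, row: dict, **extra) -> str:
    """Fill a library template from a content-database row plus
    any ad-hoc values supplied by the user."""
    template = string.Template(PROMPT_LIBRARY[template_name])
    return template.substitute({**row, **extra})


prompt = render_prompt("vocab_items", content_db[0], count=3)
# → "Create 3 A1-level German example sentences for the word 'Fahrrad'."
```

Because placeholders resolve against real database rows, the same template can be fanned out across hundreds of vocabulary items, which is what makes the data integration essential for scale.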

Content Generation Stage: Users can select different output formats and the system integrates Constitutional AI principles for model reflection and self-evaluation. Babbel’s own content principles around inclusivity and diversity are embedded into the evaluation criteria, ensuring generated content aligns with brand values and educational standards.
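
A Constitutional-AI-style self-evaluation can be sketched as a critique-and-revise loop over a list of principles. The principles below and the stubbed `critique`/`revise` functions are hypothetical stand-ins for real LLM calls; only the control flow is the point:

```python
from typing import Optional

# Hypothetical content principles standing in for Babbel's
# inclusivity and diversity criteria.
PRINCIPLES = [
    "Content must be inclusive and avoid stereotypes.",
    "Vocabulary must match the learner's stated CEFR level.",
]


def critique(text: str, principle: str) -> Optional[str]:
    """Ask the model whether the text violates the principle
    (stubbed here with a keyword check); return a critique or None."""
    if "stereotype" in text.lower() and "inclusive" in principle:
        return "Remove the stereotyped characterization."
    return None


def revise(text: str, critique_msg: str) -> str:
    """Ask the model to rewrite the text to address the critique (stubbed)."""
    return text.replace("stereotype", "varied portrayal")


def constitutional_pass(draft: str) -> str:
    """One self-evaluation sweep: critique the draft against each
    principle and revise whenever a critique comes back."""
    for principle in PRINCIPLES:
        msg = critique(draft, principle)
        if msg:
            draft = revise(draft, msg)
    return draft
```

In production, both helper functions would be additional LLM invocations, so each generated item costs a few extra calls in exchange for outputs pre-screened against brand values.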

Human Review and Feedback Loop: Content creators can provide explicit or implicit feedback by editing generated content. This feedback can then be used to optimize the initial prompt templates, creating a virtuous cycle of improvement. The presenters emphasized that the workflow is not strictly linear—users can loop back between stages and even start at different points depending on their needs.
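
One way to capture the implicit feedback described above is to log how far each edited item drifted from what the model produced, then flag heavily edited items as candidates for template optimization. This minimal sketch (class name and threshold are assumptions, not from the talk) uses Python's `difflib`:

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class FeedbackLog:
    """Collects implicit feedback: the similarity between what the
    model generated and what the editor actually shipped."""
    edits: list = field(default_factory=list)

    def record(self, generated: str, edited: str) -> None:
        ratio = difflib.SequenceMatcher(None, generated, edited).ratio()
        self.edits.append(
            {"generated": generated, "edited": edited, "similarity": ratio}
        )

    def heavy_edits(self, threshold: float = 0.8) -> list:
        """Items rewritten beyond the threshold: signals that the
        originating prompt template may need improvement."""
        return [e for e in self.edits if e["similarity"] < threshold]
```

Feeding the `heavy_edits` items back into prompt refinement is one concrete form the "virtuous cycle" can take.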

Content Transformation: This stage handles localization (translation to different target languages) and conversion to Babbel-specific formats for integration with their systems.

Content Saving and Export: The team is working on integrating direct upload to content management systems and report generation for tracking what was created.

Demonstration Insights

The live demo, presented by Pascal (the main developer), showcased the practical application of the tool for creating language learning dialogues.

Addressing Hallucinations and Quality

The team found that hallucinations were significantly reduced by incorporating existing Babbel content as examples in prompts. Rather than elaborate prompt engineering instructions alone, providing concrete examples of desired output grounded the LLM’s responses effectively. They described their approach as more “in-context prompting” than traditional RAG, though they acknowledged that their content management system’s structure posed challenges for more sophisticated retrieval approaches.
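
The "in-context prompting" approach amounts to few-shot prompting with approved content items. A minimal sketch, with hypothetical example structure and field names:

```python
def build_incontext_prompt(instruction: str, examples: list) -> str:
    """Ground the request in concrete, approved content examples
    rather than relying on prompt instructions alone."""
    parts = [instruction, "", "Examples of approved content:"]
    for ex in examples:
        parts.append(f"- Input: {ex['input']}")
        parts.append(f"  Output: {ex['output']}")
    parts.append("")
    parts.append("Now produce a new item in the same style.")
    return "\n".join(parts)


# Stand-in for approved items pulled from the content database.
approved = [
    {"input": "greeting, A1, German", "output": "Hallo! Wie geht es dir?"},
]

prompt = build_incontext_prompt("Write a short learner dialogue line.", approved)
```

Unlike full RAG, nothing is retrieved dynamically per query here; curated examples are injected directly, which sidesteps the retrieval challenges their content management system posed.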

For now, human reviewers remain the final quality gatekeepers. The team achieved an impressive 85%+ acceptance rate from human editors on generated items for their initial use case, demonstrating that the quality threshold was sufficient for practical use while still maintaining human oversight.

Organizational and Cultural Challenges

The presenters candidly discussed the challenge of building trust with existing content creators. Some staff were skeptical that AI could match their quality standards, while others were more curious and open to experimentation. Their strategy centered on starting with friendly user groups, running onboarding sessions, and collecting feedback continuously.

They noted that maintaining healthy skepticism is actually valuable—it keeps the team focused on quality and prevents over-reliance on AI outputs.

Evaluation Challenges and Future Directions

The team acknowledged that evaluation remains one of their most significant challenges. While they experimented with tracking edit rates and acceptance rates, they found that purely quantitative metrics can be misleading—an edit might reflect personal taste rather than actual quality issues. Traditional NLP metrics like ROUGE or BLEU have limitations because they don’t capture cultural relevance or educational value.
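
Acceptance rate and edit rate are straightforward to compute, which is both their appeal and their limitation. A sketch under assumed review-log fields (the field names are illustrative, not from the talk):

```python
def acceptance_rate(items: list) -> float:
    """Share of generated items that editors accepted."""
    accepted = sum(1 for i in items if i["status"] == "accepted")
    return accepted / len(items)


def edit_rate(items: list) -> float:
    """Share of accepted items that editors modified before shipping.
    Caveat from the talk: an edit may reflect personal taste rather
    than a genuine quality issue, so this number needs qualitative
    review alongside it."""
    accepted = [i for i in items if i["status"] == "accepted"]
    edited = sum(1 for i in accepted if i["edited"])
    return edited / len(accepted) if accepted else 0.0


review_log = [
    {"status": "accepted", "edited": False},
    {"status": "accepted", "edited": True},
    {"status": "rejected", "edited": False},
]
```

Metrics like these say nothing about cultural relevance or educational value, which is exactly the gap ROUGE and BLEU also leave open.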

Looking forward, the team is exploring evaluation approaches that better capture cultural relevance and educational value than raw edit rates or traditional NLP metrics.

Key Learnings and Best Practices

The presentation highlighted several transferable LLMOps lessons: start with a high-value, low-risk internal use case, design for model exchangeability, keep humans in the loop as quality gatekeepers, and ground generation in existing approved content rather than instructions alone.

The case study represents a thoughtful, production-focused approach to integrating LLMs into an established content creation workflow, balancing the efficiency gains from AI with the quality standards that define Babbel’s brand reputation.
