
AI-Powered Lesson Generation System for Language Learning

Duolingo 2023

Duolingo implemented an LLM-based system to accelerate their lesson creation process, enabling their teaching experts to generate language learning content more efficiently. The system uses carefully crafted prompts that combine fixed rules and variable parameters to generate exercises that meet specific educational requirements. This has resulted in faster course development, allowing Duolingo to expand their course offerings and deliver more advanced content while maintaining quality through human expert oversight.

Industry

Education

Overview

Duolingo, the popular language learning platform with over 21 million daily active users, has integrated Large Language Models (LLMs) into their content creation workflow to accelerate the production of language learning exercises. This case study, published in June 2023, describes how the company transitioned from a fully manual content creation process to an AI-assisted workflow where Learning Designers use LLMs as a productivity tool while maintaining full editorial control over the final output.

The context is important here: Duolingo operates with fewer than 1,000 employees serving a massive user base, which creates significant resource constraints. Prior to this implementation, building, updating, and maintaining courses required substantial time investments, with most courses releasing new content only a few times per year. The company already had experience with AI through their “Birdbrain” model, which personalizes exercise difficulty based on individual learner performance, but this new initiative extends AI usage into the content creation pipeline itself.

The LLM-Assisted Content Creation Workflow

Duolingo’s approach to integrating LLMs into production follows a structured, human-in-the-loop methodology that deserves careful examination. The workflow consists of three main stages:

Curriculum Design Phase: Learning Designers first plan the pedagogical elements of a lesson, including theme, grammar focus, vocabulary targets, and exercise types. For example, they might design a Spanish lesson around “nostalgic memories” to align with teaching the preterite and imperfect tenses. This crucial step remains entirely human-driven, ensuring that the educational strategy and learning objectives are set by qualified teaching experts rather than delegated to the AI.

Prompt Preparation Phase: The company has developed what they describe as a “Mad Lib” style prompt template system. Some elements of the prompt are automatically populated by their engineering infrastructure (such as language, CEFR level, and theme), while Learning Designers manually specify other parameters like exercise type and grammar focus. The prompt structure includes fixed rules (e.g., “The exercise must have two answer options” and character limits) combined with variable parameters that change based on the specific lesson requirements.
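
To make the template mechanics concrete, here is a minimal sketch of what a “Mad Lib”-style prompt could look like in Python. The split between auto-populated slots (language, CEFR level, theme) and designer-specified slots (exercise type, grammar focus) follows the article; the field names, rule wording, character limit, and example values (such as the A2 level) are illustrative assumptions, not Duolingo’s actual template.

```python
from string import Template

# Fixed rules stay constant across lessons; the article cites constraints such
# as a required number of answer options and character limits. The exact
# wording and the 60-character limit here are hypothetical.
FIXED_RULES = (
    "- The exercise must have exactly two answer options.\n"
    "- Each sentence must be at most 60 characters long.\n"
    "- Exactly one option may be correct."
)

# "Mad Lib"-style template: $-placeholders are filled in per lesson.
EXERCISE_TEMPLATE = Template(
    "Write $num_exercises $exercise_type exercises for a $language lesson\n"
    'at CEFR level $cefr_level, on the theme "$theme".\n'
    "Grammar focus: $grammar_focus.\n"
    "Rules:\n$fixed_rules"
)

# Slots populated automatically by engineering infrastructure...
auto_params = {"language": "Spanish", "cefr_level": "A2", "theme": "nostalgic memories"}

# ...and slots the Learning Designer fills in by hand.
designer_params = {
    "exercise_type": "multiple choice",
    "grammar_focus": "preterite vs. imperfect",
    "num_exercises": 10,
}

prompt = EXERCISE_TEMPLATE.substitute(**auto_params, **designer_params,
                                      fixed_rules=FIXED_RULES)
print(prompt)
```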

Generation and Review Phase: The LLM generates multiple exercise options (the example shows ten exercises produced in seconds), from which Learning Designers select their preferred options and apply edits before publication. The article explicitly notes that generated content may “sound a little stilted or unnatural,” requiring human refinement for naturalness, learning value, and appropriate vocabulary selection.
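
A generation step of this shape might be wired up as follows, reusing the `prompt` string assembled in the previous sketch. The OpenAI client, model name, and sampling settings are illustrative assumptions; the article does not disclose which model or provider Duolingo uses, and the selection loop is a stand-in for their internal review tooling.

```python
from openai import OpenAI  # illustrative provider choice; not disclosed in the article

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_candidates(prompt: str, n: int = 10) -> list[str]:
    """Request n candidate exercises from a single prepared prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
        n=n,                  # several drafts at once, mirroring the ~10 in the article
        temperature=0.9,      # encourage variety across candidates
    )
    return [choice.message.content for choice in response.choices]

# Learning Designers pick and edit from the candidates; nothing is published directly.
for i, text in enumerate(generate_candidates(prompt), start=1):
    print(f"--- candidate {i} ---\n{text}\n")
```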

Prompt Engineering Approach

The case study walks through a concrete example of their prompt structure, which reveals their prompt engineering methodology. The prompts include explicit constraints around output format (e.g., “The exercise must have two answer options”), length (character limits on generated text), language and CEFR level, and the lesson’s theme, exercise type, and grammar focus.
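
Because these constraints are explicit and mechanical, some of them can be checked programmatically before a draft ever reaches a reviewer. The following sketch validates a draft against rules of the kind quoted above; the exercise schema and the 60-character limit are invented for illustration.

```python
def violates_fixed_rules(exercise: dict, char_limit: int = 60) -> list[str]:
    """Return rule violations for a draft; an empty list means it passes.

    Assumes a draft shaped like:
    {"sentence": "...", "options": ["...", "..."], "answer": "..."}
    """
    problems = []
    if len(exercise.get("options", [])) != 2:
        problems.append("exercise must have exactly two answer options")
    if len(exercise.get("sentence", "")) > char_limit:  # char_limit is hypothetical
        problems.append(f"sentence exceeds {char_limit} characters")
    if exercise.get("answer") not in exercise.get("options", []):
        problems.append("answer must be one of the options")
    return problems

draft = {"sentence": "Cuando era niño, jugaba en el parque.",
         "options": ["jugaba", "jugué"], "answer": "jugaba"}
assert violates_fixed_rules(draft) == []
```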

This structured approach to prompting represents a relatively sophisticated production use of LLMs, where the prompts serve as configurable templates rather than ad-hoc queries. The engineering team has built tooling to automate the population of certain prompt parameters, suggesting an investment in infrastructure to scale this approach across their content creation teams.

Human-in-the-Loop Quality Control

A notable aspect of this implementation is the strong emphasis on human oversight. The article explicitly states that “our Spanish teaching experts always have the final say,” positioning the LLM as an assistant that generates drafts rather than a replacement for human expertise. This approach addresses several production concerns:

The Learning Designers review all generated content before it reaches users, providing a quality gate that catches grammatical issues, unnatural phrasing, and pedagogically suboptimal constructions. The example output demonstrates that even with well-crafted prompts, LLM outputs can vary in quality and naturalness, reinforcing the need for expert review.
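
In internal tooling, this kind of quality gate is often modeled as a simple draft lifecycle. The states and fields below are hypothetical, but they capture the publish-only-after-human-review behavior the article describes.

```python
from dataclasses import dataclass
from enum import Enum, auto

class DraftStatus(Enum):
    GENERATED = auto()  # fresh LLM output, not yet reviewed
    APPROVED = auto()   # accepted as-is by a Learning Designer
    EDITED = auto()     # accepted after human refinement
    REJECTED = auto()   # discarded (stilted, off-level, ungrammatical, ...)

@dataclass
class ExerciseDraft:
    text: str
    status: DraftStatus = DraftStatus.GENERATED
    reviewer: str | None = None
    final_text: str | None = None

    def approve(self, reviewer: str, edited_text: str | None = None) -> None:
        """Only a named human reviewer can move a draft toward publication."""
        self.reviewer = reviewer
        self.final_text = edited_text or self.text
        self.status = DraftStatus.EDITED if edited_text else DraftStatus.APPROVED

    def reject(self, reviewer: str) -> None:
        self.reviewer = reviewer
        self.status = DraftStatus.REJECTED

def publishable(drafts: list[ExerciseDraft]) -> list[ExerciseDraft]:
    """Nothing reaches learners without passing the human gate."""
    return [d for d in drafts
            if d.status in (DraftStatus.APPROVED, DraftStatus.EDITED)]
```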

This human-in-the-loop approach also maintains the educational integrity of the content. Language teaching requires nuanced understanding of learner progression, cultural context, and pedagogical best practices that current LLMs cannot reliably produce autonomously. By keeping humans in the critical evaluation role, Duolingo balances efficiency gains with quality assurance.

Claimed Benefits and Critical Assessment

The article claims three main benefits: convenience, speed, and productivity. However, it’s worth noting that the case study is published by Duolingo itself on their company blog, so these claims should be considered with appropriate skepticism regarding potential selection bias in the examples shown.

The stated goals for this implementation include building and updating courses faster, expanding the range of courses on offer, and delivering more advanced content deeper into the curriculum, all while keeping teaching experts in control of quality.

What the case study does not provide is quantitative evidence of these improvements. There are no specific metrics shared about content creation speed improvements, quality metrics, or user satisfaction with AI-generated versus human-written content. The comparison to calculators and GPS systems, while illustrative, does not substitute for empirical evidence of effectiveness.

Technical Infrastructure Considerations

While the article focuses primarily on the workflow rather than technical infrastructure, several LLMOps considerations can be inferred:

Tooling Integration: The engineering team has built internal tooling that integrates LLM capabilities into the Learning Designers’ workflow, with automated parameter population and presumably a user interface for prompt submission and output review. This suggests investment in making LLM capabilities accessible to non-technical content creators.

Prompt Management: The “Mad Lib” template approach implies some form of prompt management system where templates can be maintained, versioned, and updated as the team refines their prompting strategies. The article mentions “constantly adjusting the instructions we give the model,” indicating an iterative optimization process.
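
The article does not describe the underlying system, but iterative refinement of this kind is typically backed by a versioned template store, so that a shift in output quality can be traced back to a specific prompt change. A minimal sketch of that pattern:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class PromptVersion:
    version: int
    template: str
    note: str  # why the instructions were adjusted
    created_at: datetime

class PromptRegistry:
    """Keeps every revision so quality shifts can be traced to prompt changes."""

    def __init__(self) -> None:
        self._versions: dict[str, list[PromptVersion]] = {}

    def publish(self, name: str, template: str, note: str) -> PromptVersion:
        history = self._versions.setdefault(name, [])
        pv = PromptVersion(len(history) + 1, template, note,
                           datetime.now(timezone.utc))
        history.append(pv)
        return pv

    def latest(self, name: str) -> PromptVersion:
        return self._versions[name][-1]

registry = PromptRegistry()
registry.publish("multiple_choice_exercise", "Write $num_exercises ...", "initial template")
registry.publish("multiple_choice_exercise", "Write $num_exercises ... Rules: ...",
                 "tightened rule wording after reviewer feedback")
assert registry.latest("multiple_choice_exercise").version == 2
```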

Quality Metrics: While not explicitly discussed, an organization of Duolingo’s scale would presumably have mechanisms for tracking the quality of AI-generated content over time, though this is not detailed in the case study.
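
As a purely speculative sketch of what such tracking could look like (the case study describes nothing of the sort), acceptance and edit rates over reviewed drafts are cheap proxies for generation quality. This reuses the ExerciseDraft lifecycle sketched earlier.

```python
import difflib

def review_metrics(drafts: list[ExerciseDraft]) -> dict[str, float]:
    """Proxy signals: how often drafts survive review, and how heavily they get edited."""
    reviewed = [d for d in drafts if d.status is not DraftStatus.GENERATED]
    if not reviewed:
        return {"acceptance_rate": 0.0, "mean_edit_similarity": 1.0}
    accepted = [d for d in reviewed if d.status is not DraftStatus.REJECTED]
    similarities = [
        difflib.SequenceMatcher(None, d.text, d.final_text or d.text).ratio()
        for d in accepted
    ]
    return {
        # share of generated drafts a Learning Designer kept (with or without edits)
        "acceptance_rate": len(accepted) / len(reviewed),
        # 1.0 means the published text is identical to the raw LLM output
        "mean_edit_similarity": (sum(similarities) / len(similarities)
                                 if similarities else 1.0),
    }
```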

Broader Context

The article mentions that Duolingo has also launched “Duolingo Max,” which brings AI capabilities directly to learners, suggesting that this internal content creation use case is part of a broader AI strategy at the company. The existence of the Birdbrain recommendation model also indicates organizational experience with deploying ML models at scale, which likely informed their approach to LLM integration.

Limitations and Open Questions

Several aspects of this implementation remain unclear from the available information: which model or provider powers the generation, how prompt templates are evaluated and versioned over time, what quantitative gains in content creation speed or quality have actually been realized, and how errors that slip past an individual reviewer are caught at scale.

The case study presents an optimistic view of LLM integration in content creation, but production deployments often encounter challenges not visible in introductory blog posts. The emphasis on human oversight is prudent and represents a responsible approach to deploying generative AI in an educational context where content quality directly impacts learning outcomes.
