
AI-Powered Lesson Generation System for Language Learning

Duolingo 2023

Duolingo implemented an LLM-based system to accelerate their lesson creation process, enabling their teaching experts to generate language learning content more efficiently. The system uses carefully crafted prompts that combine fixed rules and variable parameters to generate exercises that meet specific educational requirements. This has resulted in faster course development, allowing Duolingo to expand their course offerings and deliver more advanced content while maintaining quality through human expert oversight.

Industry

Education

Overview

Duolingo, the popular language learning platform with over 21 million daily active users, has integrated Large Language Models (LLMs) into their content creation workflow to accelerate the production of language learning exercises. This case study, published in June 2023, describes how the company transitioned from a fully manual content creation process to an AI-assisted workflow where Learning Designers use LLMs as a productivity tool while maintaining full editorial control over the final output.

The context is important here: Duolingo operates with fewer than 1,000 employees serving a massive user base, which creates significant resource constraints. Prior to this implementation, building, updating, and maintaining courses required substantial time investments, with most courses releasing new content only a few times per year. The company already had experience with AI through their “Birdbrain” model, which personalizes exercise difficulty based on individual learner performance, but this new initiative extends AI usage into the content creation pipeline itself.

The LLM-Assisted Content Creation Workflow

Duolingo’s approach to integrating LLMs into production follows a structured, human-in-the-loop methodology that deserves careful examination. The workflow consists of three main stages:

Curriculum Design Phase: Learning Designers first plan the pedagogical elements of a lesson, including theme, grammar focus, vocabulary targets, and exercise types. For example, they might design a Spanish lesson around “nostalgic memories” to align with teaching the preterite and imperfect tenses. This crucial step remains entirely human-driven, ensuring that the educational strategy and learning objectives are set by qualified teaching experts rather than delegated to the AI.

Prompt Preparation Phase: The company has developed what they describe as a “Mad Lib” style prompt template system. Some elements of the prompt are automatically populated by their engineering infrastructure (such as language, CEFR level, and theme), while Learning Designers manually specify other parameters like exercise type and grammar focus. The prompt structure includes fixed rules (e.g., “The exercise must have two answer options” and character limits) combined with variable parameters that change based on the specific lesson requirements.
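
To make the template mechanics concrete, here is a minimal sketch of what a “Mad Lib”-style prompt could look like in Python. The split between auto-populated slots (language, CEFR level, theme) and designer-specified slots (exercise type, grammar focus) follows the article; the field names, rule wording, character limit, and example values (such as the A2 level) are illustrative assumptions, not Duolingo’s actual template.

```python
from string import Template

# Fixed rules stay constant across lessons; the article cites constraints such
# as a required number of answer options and character limits. The exact
# wording and the 60-character limit here are hypothetical.
FIXED_RULES = (
    "- The exercise must have exactly two answer options.\n"
    "- Each sentence must be at most 60 characters long.\n"
    "- Exactly one option may be correct."
)

# "Mad Lib"-style template: $-placeholders are filled in per lesson.
EXERCISE_TEMPLATE = Template(
    "Write $num_exercises $exercise_type exercises for a $language lesson\n"
    'at CEFR level $cefr_level, on the theme "$theme".\n'
    "Grammar focus: $grammar_focus.\n"
    "Rules:\n$fixed_rules"
)

# Slots populated automatically by engineering infrastructure...
auto_params = {"language": "Spanish", "cefr_level": "A2", "theme": "nostalgic memories"}

# ...and slots the Learning Designer fills in by hand.
designer_params = {
    "exercise_type": "multiple choice",
    "grammar_focus": "preterite vs. imperfect",
    "num_exercises": 10,
}

prompt = EXERCISE_TEMPLATE.substitute(**auto_params, **designer_params,
                                      fixed_rules=FIXED_RULES)
print(prompt)
```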

Generation and Review Phase: The LLM generates multiple exercise options (the example shows ten exercises produced in seconds), from which Learning Designers select their preferred options and apply edits before publication. The article explicitly notes that generated content may “sound a little stilted or unnatural,” requiring human refinement for naturalness, learning value, and appropriate vocabulary selection.
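
A generation step of this shape might be wired up as follows, reusing the `prompt` string assembled in the previous sketch. The OpenAI client, model name, and sampling settings are illustrative assumptions; the article does not disclose which model or provider Duolingo uses, and the selection loop is a stand-in for their internal review tooling.

```python
from openai import OpenAI  # illustrative provider choice; not disclosed in the article

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_candidates(prompt: str, n: int = 10) -> list[str]:
    """Request n candidate exercises from a single prepared prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
        n=n,                  # several drafts at once, mirroring the ~10 in the article
        temperature=0.9,      # encourage variety across candidates
    )
    return [choice.message.content for choice in response.choices]

# Learning Designers pick and edit from the candidates; nothing is published directly.
for i, text in enumerate(generate_candidates(prompt), start=1):
    print(f"--- candidate {i} ---\n{text}\n")
```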

Prompt Engineering Approach

The case study walks through a concrete example of their prompt structure, which reveals their prompt engineering methodology. The prompts include explicit constraints around output format (e.g., “The exercise must have two answer options”), length (character limits on generated text), language and CEFR level, and the lesson’s theme, exercise type, and grammar focus.
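
Because these constraints are explicit and mechanical, some of them can be checked programmatically before a draft ever reaches a reviewer. The following sketch validates a draft against rules of the kind quoted above; the exercise schema and the 60-character limit are invented for illustration.

```python
def violates_fixed_rules(exercise: dict, char_limit: int = 60) -> list[str]:
    """Return rule violations for a draft; an empty list means it passes.

    Assumes a draft shaped like:
    {"sentence": "...", "options": ["...", "..."], "answer": "..."}
    """
    problems = []
    if len(exercise.get("options", [])) != 2:
        problems.append("exercise must have exactly two answer options")
    if len(exercise.get("sentence", "")) > char_limit:  # char_limit is hypothetical
        problems.append(f"sentence exceeds {char_limit} characters")
    if exercise.get("answer") not in exercise.get("options", []):
        problems.append("answer must be one of the options")
    return problems

draft = {"sentence": "Cuando era niño, jugaba en el parque.",
         "options": ["jugaba", "jugué"], "answer": "jugaba"}
assert violates_fixed_rules(draft) == []
```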

This structured approach to prompting represents a relatively sophisticated production use of LLMs, where the prompts serve as configurable templates rather than ad-hoc queries. The engineering team has built tooling to automate the population of certain prompt parameters, suggesting an investment in infrastructure to scale this approach across their content creation teams.

Human-in-the-Loop Quality Control

A notable aspect of this implementation is the strong emphasis on human oversight. The article explicitly states that “our Spanish teaching experts always have the final say,” positioning the LLM as an assistant that generates drafts rather than a replacement for human expertise. This approach addresses several production concerns:

The Learning Designers review all generated content before it reaches users, providing a quality gate that catches grammatical issues, unnatural phrasing, and pedagogically suboptimal constructions. The example output demonstrates that even with well-crafted prompts, LLM outputs can vary in quality and naturalness, reinforcing the need for expert review.
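
In internal tooling, this kind of quality gate is often modeled as a simple draft lifecycle. The states and fields below are hypothetical, but they capture the publish-only-after-human-review behavior the article describes.

```python
from dataclasses import dataclass
from enum import Enum, auto

class DraftStatus(Enum):
    GENERATED = auto()  # fresh LLM output, not yet reviewed
    APPROVED = auto()   # accepted as-is by a Learning Designer
    EDITED = auto()     # accepted after human refinement
    REJECTED = auto()   # discarded (stilted, off-level, ungrammatical, ...)

@dataclass
class ExerciseDraft:
    text: str
    status: DraftStatus = DraftStatus.GENERATED
    reviewer: str | None = None
    final_text: str | None = None

    def approve(self, reviewer: str, edited_text: str | None = None) -> None:
        """Only a named human reviewer can move a draft toward publication."""
        self.reviewer = reviewer
        self.final_text = edited_text or self.text
        self.status = DraftStatus.EDITED if edited_text else DraftStatus.APPROVED

    def reject(self, reviewer: str) -> None:
        self.reviewer = reviewer
        self.status = DraftStatus.REJECTED

def publishable(drafts: list[ExerciseDraft]) -> list[ExerciseDraft]:
    """Nothing reaches learners without passing the human gate."""
    return [d for d in drafts
            if d.status in (DraftStatus.APPROVED, DraftStatus.EDITED)]
```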

This human-in-the-loop approach also maintains the educational integrity of the content. Language teaching requires nuanced understanding of learner progression, cultural context, and pedagogical best practices that current LLMs cannot reliably produce autonomously. By keeping humans in the critical evaluation role, Duolingo balances efficiency gains with quality assurance.

Claimed Benefits and Critical Assessment

The article claims three main benefits: convenience, speed, and productivity. However, it’s worth noting that the case study is published by Duolingo itself on their company blog, so these claims should be considered with appropriate skepticism regarding potential selection bias in the examples shown.

The stated goals for this implementation include building and updating courses faster, expanding the range of courses on offer, and delivering more advanced content deeper into the curriculum, all while keeping teaching experts in control of quality.

What the case study does not provide is quantitative evidence of these improvements. There are no specific metrics shared about content creation speed improvements, quality metrics, or user satisfaction with AI-generated versus human-written content. The comparison to calculators and GPS systems, while illustrative, does not substitute for empirical evidence of effectiveness.

Technical Infrastructure Considerations

While the article focuses primarily on the workflow rather than technical infrastructure, several LLMOps considerations can be inferred:

Tooling Integration: The engineering team has built internal tooling that integrates LLM capabilities into the Learning Designers’ workflow, with automated parameter population and presumably a user interface for prompt submission and output review. This suggests investment in making LLM capabilities accessible to non-technical content creators.

Prompt Management: The “Mad Lib” template approach implies some form of prompt management system where templates can be maintained, versioned, and updated as the team refines their prompting strategies. The article mentions “constantly adjusting the instructions we give the model,” indicating an iterative optimization process.
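
The article does not describe the underlying system, but iterative refinement of this kind is typically backed by a versioned template store, so that a shift in output quality can be traced back to a specific prompt change. A minimal sketch of that pattern:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class PromptVersion:
    version: int
    template: str
    note: str  # why the instructions were adjusted
    created_at: datetime

class PromptRegistry:
    """Keeps every revision so quality shifts can be traced to prompt changes."""

    def __init__(self) -> None:
        self._versions: dict[str, list[PromptVersion]] = {}

    def publish(self, name: str, template: str, note: str) -> PromptVersion:
        history = self._versions.setdefault(name, [])
        pv = PromptVersion(len(history) + 1, template, note,
                           datetime.now(timezone.utc))
        history.append(pv)
        return pv

    def latest(self, name: str) -> PromptVersion:
        return self._versions[name][-1]

registry = PromptRegistry()
registry.publish("multiple_choice_exercise", "Write $num_exercises ...", "initial template")
registry.publish("multiple_choice_exercise", "Write $num_exercises ... Rules: ...",
                 "tightened rule wording after reviewer feedback")
assert registry.latest("multiple_choice_exercise").version == 2
```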

Quality Metrics: While not explicitly discussed, an organization of Duolingo’s scale would presumably have mechanisms for tracking the quality of AI-generated content over time, though this is not detailed in the case study.
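
As a purely speculative sketch of what such tracking could look like (the case study describes nothing of the sort), acceptance and edit rates over reviewed drafts are cheap proxies for generation quality. This reuses the ExerciseDraft lifecycle sketched earlier.

```python
import difflib

def review_metrics(drafts: list[ExerciseDraft]) -> dict[str, float]:
    """Proxy signals: how often drafts survive review, and how heavily they get edited."""
    reviewed = [d for d in drafts if d.status is not DraftStatus.GENERATED]
    if not reviewed:
        return {"acceptance_rate": 0.0, "mean_edit_similarity": 1.0}
    accepted = [d for d in reviewed if d.status is not DraftStatus.REJECTED]
    similarities = [
        difflib.SequenceMatcher(None, d.text, d.final_text or d.text).ratio()
        for d in accepted
    ]
    return {
        # share of generated drafts a Learning Designer kept (with or without edits)
        "acceptance_rate": len(accepted) / len(reviewed),
        # 1.0 means the published text is identical to the raw LLM output
        "mean_edit_similarity": (sum(similarities) / len(similarities)
                                 if similarities else 1.0),
    }
```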

Broader Context

The article mentions that Duolingo has also launched “Duolingo Max,” which brings AI capabilities directly to learners, suggesting that this internal content creation use case is part of a broader AI strategy at the company. The existence of the Birdbrain recommendation model also indicates organizational experience with deploying ML models at scale, which likely informed their approach to LLM integration.

Limitations and Open Questions

Several aspects of this implementation remain unclear from the available information: which model or provider powers the generation, how prompt templates are evaluated and versioned over time, what quantitative gains in content creation speed or quality have actually been realized, and how errors that slip past an individual reviewer are caught at scale.

The case study presents an optimistic view of LLM integration in content creation, but production deployments often encounter challenges not visible in introductory blog posts. The emphasis on human oversight is prudent and represents a responsible approach to deploying generative AI in an educational context where content quality directly impacts learning outcomes.
