ZenML

LLM Integration in EdTech: Lessons from Duolingo, Brainly, and SoloLearn

Various 2023

Leaders from three major EdTech companies share their experiences implementing LLMs in production for language learning, coding education, and homework help. They discuss challenges around cost-effective scaling, fact generation accuracy, and content personalization, while highlighting successful approaches like retrieval-augmented generation, pre-generation of options, and using LLMs to create simpler production rules. The companies focus on using AI not just for content generation but for improving the actual teaching and learning experience.

Industry

Education

Overview

This case study is drawn from a panel discussion featuring AI and product leaders from three prominent EdTech companies: Duolingo (Clinton Bicknell, Head of AI), Brainly (Bill Slawski, CTO), and SoloLearn (Yeva Hyusyan, Founder and CEO). Together, these platforms serve close to a billion learners globally. The discussion, framed around the theme of “LLMs in Production,” provides a candid view of the challenges, lessons learned, and practical strategies these companies have developed while integrating large language models into their educational products.

The panelists share a healthy skepticism about LLM capabilities while also expressing cautious optimism about the transformative potential of these technologies in education. A recurring theme throughout the discussion is the gap between what LLMs can do out of the box versus what is required to build genuinely useful, personalized, and pedagogically sound learning experiences.

Company Contexts and Use Cases

SoloLearn

SoloLearn is a mobile-first platform for learning to code, with over 30 million registered users. Their approach emphasizes bite-sized, practice-heavy content with strong community support. Yeva Hyusyan describes how the company has been experimenting with AI for approximately 18 months, with some features now in production (notably an AI assistant that explains coding concepts in human language), while other experiments were discontinued when they didn’t work.

A particularly interesting project currently in development involves training models not just to produce content, but to produce it in a format that actually teaches. This reflects a key insight: generating factual content is relatively easy, but generating content that follows sound pedagogical principles is a much harder problem that LLMs don’t solve out of the box.

Brainly

Brainly operates as a community learning platform—essentially the world’s largest study group—with hundreds of millions of learners monthly. Their goal is to provide explanations that match what a learner’s teacher would have given, accounting for vocabulary, concepts the learner has been exposed to, and personalization at the individual level.

Bill Slawski emphasizes that Brainly doesn’t own the curriculum—they support any school system and any curriculum. This creates unique challenges for personalization, as they must infer what each learner is learning based on behavioral patterns, cohort analysis, and content traversal patterns.

Duolingo

Duolingo, primarily known for language learning but expanding into math and music, has used AI for personalization almost since its founding about a decade ago. Clinton Bicknell describes how the latest generation of LLMs has enabled new interactive features, such as holding conversations in the language being learned, which was not technically feasible before.

Key use cases include interactive conversation practice (e.g., ordering coffee at a restaurant), explanations for incorrect answers, and curriculum generation. However, Clinton is careful to note that generating content that teaches well is fundamentally different from just generating content.

LLMOps Challenges and Lessons Learned

The Fact Generation Problem

The panel discussed where LLMs should and shouldn’t be trusted. Bill Slawski was clear that relying on LLMs as “fact generation engines” leads to disappointment. Instead, Brainly finds success using LLMs to synthesize or reframe information that it has already validated: the model is put on “rails” through augmented prompts and given trusted information to work with.
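As a rough illustration of keeping a model on “rails,” the sketch below wraps an already-validated explanation in a prompt that asks only for reframing. It is not Brainly's implementation: the `VerifiedAnswer` type, the `build_reframing_prompt` function, and the prompt wording are all assumptions made for this example, and the call to an actual model is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifiedAnswer:
    """A question plus an explanation that has already been validated (hypothetical type)."""
    question: str
    explanation: str


def build_reframing_prompt(answer: VerifiedAnswer, instruction: str) -> str:
    """Build a prompt that asks the model to reframe trusted content, not to supply facts."""
    if not answer.explanation.strip():
        # Refuse instead of letting the model generate facts from nothing.
        raise ValueError("no validated explanation to reframe")
    return (
        "You are rewriting an explanation that has already been verified.\n"
        "Use only the facts in VERIFIED EXPLANATION. Do not add new facts.\n\n"
        f"QUESTION:\n{answer.question}\n\n"
        f"VERIFIED EXPLANATION:\n{answer.explanation}\n\n"
        f"TASK:\n{instruction}\n"
    )
```

A caller might pass an instruction such as "Simplify this explanation" and send the resulting string to whichever model the product uses. The point of the design is that the facts travel in the prompt, so the model's job is limited to presentation.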

Clinton Bicknell noted some nuance here, suggesting that both generating facts and presenting them pedagogically are hard problems, but generating reliably accurate facts is often the harder challenge to solve completely. This led to an interesting exchange where both panelists acknowledged that work is needed on both ends—neither pure fact generation nor pure presentation is solved out of the box.

Retrieval-Augmented Generation (RAG)

Bill Slawski described how Brainly’s journey with LLMs progressed from constrained use cases to more sophisticated RAG implementations. They started with simple features like “simplify this explanation” or “explain this in more detail”—cases where the facts were already established and trusted, and the LLM was just reformulating the presentation. Only after gaining confidence with these constrained applications did they move into RAG use cases for generating more complete responses.

RAG at Brainly involves understanding learner behavior patterns, identifying which cohort of learners a user resembles, and using that understanding to augment prompts with relevant context. Bill compares this to recommendation systems at Netflix or YouTube but acknowledges it is actually quite difficult to implement well.
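The cohort idea can be sketched in a few lines, with heavy simplification. The example below assumes each cohort is described by the set of topics its learners typically visit, matches a learner to the most similar cohort by Jaccard overlap, and prepends that cohort's notes to the prompt. The `Cohort` type, the topic names, and the choice of Jaccard similarity are illustrative assumptions, not details from the panel; a production system would likely use far richer behavioral signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cohort:
    """A group of similar learners (hypothetical structure for illustration)."""
    name: str
    topics: frozenset[str]   # topics this cohort typically visits
    context_notes: str       # vocabulary and concepts the cohort has seen


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Overlap between two topic sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_cohort(visited: frozenset[str], cohorts: list[Cohort]) -> Cohort:
    """Pick the cohort whose topic set best overlaps the learner's history."""
    return max(cohorts, key=lambda c: jaccard(visited, c.topics))


def augment_prompt(question: str, visited: frozenset[str], cohorts: list[Cohort]) -> str:
    """Prepend the matched cohort's context so the answer fits the learner."""
    cohort = closest_cohort(visited, cohorts)
    return (
        f"LEARNER CONTEXT ({cohort.name}):\n{cohort.context_notes}\n\n"
        f"QUESTION:\n{question}\n\n"
        "Explain using only vocabulary and concepts from the learner context."
    )


# Made-up cohorts for demonstration only.
COHORTS = [
    Cohort("algebra_1", frozenset({"linear_equations", "inequalities", "slope"}),
           "Knows variables and graphing lines; has not seen quadratics."),
    Cohort("geometry", frozenset({"triangles", "angles", "area"}),
           "Knows angle sums and area formulas; has not seen trigonometry."),
]
```

Because Brainly does not own the curriculum, something like `visited` has to be inferred from behavior rather than read from a syllabus, which is a large part of why the panel describes this as hard to do well.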

Cost Management at Scale

A critical operational challenge that all three companies face is the cost of running LLM inference at scale. Clinton Bicknell was particularly emphatic that even cheaper models become cost-prohibitive when you’re processing every user response for hundreds of millions of learners.

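Duolingo has developed several strategies to address this, such as pre-generating options ahead of time and using LLMs to create simpler rules that can run in production.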

Yeva Hyusyan from SoloLearn recounted a specific lesson where they got excited about initial results from a small cohort test, did the math on scaling costs, and realized they couldn’t scale the approach. This led to training custom models to achieve cost efficiency.
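The kind of back-of-the-envelope math described here can be written down directly. The sketch below compares calling a model on every learner response against the pre-generation approach mentioned in the summary, where options are generated once and reused. Every number in it (learner counts, responses per learner, token counts, price per 1,000 tokens) is a made-up placeholder, not a figure from any of the three companies.

```python
def llm_cost_usd(calls: int, tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """Total spend for a number of LLM calls at a flat per-token price."""
    return calls * tokens_per_call / 1000 * usd_per_1k_tokens


def on_the_fly_monthly_cost(learners: int, responses_per_learner: int,
                            tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """Monthly cost when every learner response triggers one model call."""
    return llm_cost_usd(learners * responses_per_learner, tokens_per_call, usd_per_1k_tokens)


def pregeneration_cost(exercises: int, options_per_exercise: int,
                       tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """One-time cost of generating a fixed pool of options ahead of time."""
    return llm_cost_usd(exercises * options_per_exercise, tokens_per_call, usd_per_1k_tokens)


# Placeholder inputs, chosen only to show how the totals scale.
PRICE = 0.002   # USD per 1,000 tokens (assumed)
TOKENS = 500    # tokens per call (assumed)

pilot = on_the_fly_monthly_cost(1_000, 200, TOKENS, PRICE)
full_scale = on_the_fly_monthly_cost(100_000_000, 200, TOKENS, PRICE)
pregenerated = pregeneration_cost(500_000, 5, TOKENS, PRICE)

print(f"pilot, on the fly:      ${pilot:,.0f} per month")
print(f"full scale, on the fly: ${full_scale:,.0f} per month")
print(f"pre-generated pool:     ${pregenerated:,.0f} one time")
```

With these placeholder inputs, a pilot that costs about $200 a month grows to about $20 million a month at 100 million learners, while a one-time pre-generation pass costs about $2,500. On-the-fly cost grows with usage, whereas pre-generation cost grows with the size of the content catalog.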

Differentiation and Competitive Advantage

Bill Slawski made a strategic point about differentiation: relying on LLMs to do what everyone else can do with them puts you at the “lowest common denominator.” True competitive advantage comes from your proprietary data—what you know about your learners, their behavior patterns, the content they engage with, and how you augment LLM capabilities with this unique knowledge.

This perspective reframes the value proposition: LLMs are tools that amplify your existing assets rather than replacements for domain expertise. Companies that treat LLMs as black-box solutions without layering their own insights will produce generic, undifferentiated experiences.

The Teaching vs. Content Generation Problem

Perhaps the most pedagogically interesting insight came from the discussion about what it means to actually teach versus simply generating content. Yeva Hyusyan noted that the last decade of EdTech has focused heavily on content generation—creating video lessons on any topic. But replicating a great classroom experience online has been much harder.

The companies are now working on training models to produce content in formats that actually teach, incorporating feedback from real teachers into the training loop. This represents a shift from viewing LLMs as content factories to viewing them as apprentice tutors that need to learn pedagogical principles.

Clinton Bicknell echoed this, noting that LLMs can generate content easily, but generating content that teaches well is something that “large language models don’t know how to do out of the box, except in kind of obvious common sense ways.”

Language Level Calibration

Duolingo faces a specific challenge in language learning: keeping generated content at the appropriate proficiency level for each learner. While you can add instructions to prompts like “write at a third-grade level,” this only gets you so far. More sophisticated AI work is required to properly calibrate the language difficulty—fine-tuning has proven helpful but the problem remains challenging.
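One simple guardrail that can sit alongside prompting and fine-tuning is a post-generation check against the vocabulary a learner is expected to know. The sketch below is an assumption for illustration, not Duolingo's method: `known_words` stands in for a per-level word list, the 10% threshold is arbitrary, and the tokenizer is deliberately naive (it splits on anything that is not a letter, so contractions are broken apart).

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return runs of letters (naive, Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def out_of_level_words(text: str, known_words: set[str]) -> list[str]:
    """Distinct words in the text that are not on the learner's word list."""
    return sorted({t for t in tokenize(text) if t not in known_words})


def is_at_level(text: str, known_words: set[str], max_unknown_ratio: float = 0.1) -> bool:
    """Accept text only if the share of unknown words stays under the threshold."""
    tokens = tokenize(text)
    if not tokens:
        return True  # nothing to judge
    unknown = sum(1 for t in tokens if t not in known_words)
    return unknown / len(tokens) <= max_unknown_ratio
```

A generation loop could reject or regenerate any candidate for which `is_at_level` returns `False`. A word-list check says nothing about grammar or sentence complexity, which is one reason the calibration problem remains hard.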

Interactive Features vs. Chatbots

Clinton Bicknell raised an important product insight: it’s easy to make a chatbot, but creating a compelling interactive product based on chat that users want to return to daily is a fundamentally different challenge. Some of this is product work (figuring out what experiences resonate with users), but there are also AI problems embedded within—like generating goal-directed conversations with character personalities while staying at the right language level for the learner.

Production Considerations

The discussion touched on several operational realities of running LLMs in production at EdTech scale.

Looking Forward

The panelists expressed cautious optimism about the future while acknowledging that progress will take time and continued iteration.

The overall message is one of measured optimism tempered by hard-won lessons: LLMs offer genuine potential to transform education, but realizing that potential requires deep domain expertise, careful operational planning, significant investment in evaluation and iteration, and a clear-eyed view of where these models excel versus where they fall short.
