
Automated Synopsis Generation Pipeline with Human-in-the-Loop Quality Control

Netflix 2025

Netflix developed an automated pipeline for generating show and movie synopses using LLMs, replacing a highly manual context-gathering process. The system uses Metaflow to orchestrate LLM-based content summarization and synopsis generation, with multiple human feedback loops and automated quality control checks. While maintaining human writers and editors in the process, the system has significantly improved efficiency and enabled the creation of more synopses per title while maintaining quality standards.

Industry

Media & Entertainment

Overview

This case study comes from a Netflix engineering talk describing how the company integrated LLMs into its content synopsis generation workflow. Synopses are the short descriptions viewers see when browsing, helping them make informed decisions about what to watch. At Netflix’s scale, creating multiple synopses for every title in the catalog requires significant creative resources, and the team sought to automate parts of this process while keeping human creativity at the center.

The speaker emphasizes a key philosophical point that shapes their entire approach: Netflix is explicitly not in the business of replacing creative workers. The goal is to augment writers’ capabilities by handling the time-consuming context gathering and initial drafting, allowing them to focus on what they do best—editing and ensuring quality.

The Original Manual Process

Before LLM integration, the synopsis fulfillment flow was highly manual and unpredictable. Content would arrive from production in various forms—viewables, audio, closed caption scripts, and sometimes full scripts—in different formats and languages. A post-production team would gather these sources and create formatted briefs for the company. Writers would then need to read through all materials (sometimes watching entire shows for sensitive content) to gather enough context to write synopses.

The challenges with this approach included heavy manual effort, unpredictable turnaround times, and the sheer amount of source material writers had to review (sometimes entire shows) before they could start writing.

The LLM-Powered Solution Architecture

The team built their solution on top of Metaflow, an open-source workflow orchestration framework. The speaker expresses enthusiasm for Metaflow’s out-of-the-box capabilities, noting that much of the infrastructure complexity was already solved by the framework.

Generation Flow

The generation pipeline follows a modular architecture designed for flexibility. The key components include:

Asset Store and Prompt Library: These are curated in collaboration with writers, containing the building blocks for generation. This collaborative curation is important—the team explicitly notes that some of them “are not even English native speakers,” making writer input essential for quality prompts.

Context Summarization: The first step processes the raw source materials (closed caption scripts, viewables, audio descriptions, sometimes full scripts) into summaries. This addresses the context window limitations of LLMs while extracting the most relevant information.
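
This summarization step can be sketched as a simple map-reduce over the source material: split the transcript into chunks that fit the model’s context window, summarize each chunk, and join the partial summaries. This is a minimal illustration, not Netflix’s implementation; `summarize_chunk` stands in for whatever LLM call the pipeline actually makes, and the chunk size is an arbitrary placeholder.

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split a long transcript into chunks on word boundaries so each
    chunk fits in the model's context window."""
    chunks, current, size = [], [], 0
    for word in text.split():
        if size + len(word) + 1 > max_chars and current:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_source(text: str, summarize_chunk) -> str:
    """Map-reduce summarization: summarize each chunk independently,
    then concatenate the partial summaries for the generation step."""
    return " ".join(summarize_chunk(chunk) for chunk in chunk_text(text))
```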

Synopsis Generation: Using the summarized context, the system generates draft synopses from a dedicated prompt library curated together with the writers.

The team currently uses OpenAI models but is actively testing open-source alternatives including Llama. They’ve designed the system to be modular specifically because they recognize foundational models are updated every few months—they can swap out models without rebuilding the entire pipeline.
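
The modularity the team describes can be captured with a thin backend interface: the pipeline depends only on a `generate` method, so an OpenAI-backed model can be swapped for a Llama-backed one without touching the rest of the flow. The class and method names below are illustrative, with the real API calls stubbed out.

```python
from typing import Protocol


class SynopsisModel(Protocol):
    """The only contract the pipeline relies on."""
    def generate(self, prompt: str) -> str: ...


class OpenAIBackend:
    def generate(self, prompt: str) -> str:
        # A real implementation would call the OpenAI API here.
        return f"[openai] {prompt}"


class LlamaBackend:
    def generate(self, prompt: str) -> str:
        # A real implementation would call a self-hosted Llama endpoint.
        return f"[llama] {prompt}"


def generate_synopsis(model: SynopsisModel, context: str, template: str) -> str:
    """The orchestration code never names a concrete provider,
    so swapping models does not require rebuilding the pipeline."""
    return model.generate(template.format(context=context))
```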

Evaluation Flow: LLM-as-a-Judge

A critical component of the system is the evaluation pipeline that acts as a quality control filter before exposing drafts to writers. The speaker emphasizes this was essential for building trust: “if we just were to show them what we can generate just by prompting ChatGPT they probably run away.”

The evaluation uses what the speaker calls “LLM as a judge” or “Constitutional AI”—giving models a set of guidelines (a “constitution”) and having them score outputs against criteria. Five specialized models each evaluate the synopsis against one of five criteria.

Synopses must pass all five criteria to be exposed via the API. Failed synopses go back, and the system catches bad batches that might result from poor-quality source material before they waste writers’ time.
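
The hard pass/fail gate can be sketched as follows. The five criterion names and the threshold are placeholders (the talk does not enumerate them), and `score_fn` stands in for a judge-model call that scores one criterion.

```python
PASS_THRESHOLD = 0.7  # assumed cutoff; the real threshold is not public

# Placeholder criterion names; the talk does not list the actual five.
CRITERIA = ["accuracy", "tone", "length", "grammar", "no_spoilers"]


def passes_quality_gate(synopsis: str, score_fn) -> bool:
    """Expose a draft via the API only if every judge passes it."""
    return all(score_fn(synopsis, criterion) >= PASS_THRESHOLD
               for criterion in CRITERIA)


def filter_batch(drafts: list[str], score_fn) -> list[str]:
    """Drop failing drafts so a bad batch never reaches writers."""
    return [d for d in drafts if passes_quality_gate(d, score_fn)]
```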

Spoiler Prevention

An interesting operational challenge addressed in the Q&A: preventing spoilers in generated synopses. The team’s solution is multi-layered rather than relying on any single safeguard.

Human Feedback Loops

One of the most exciting aspects of the system, according to the speaker, is the instrumentation of multiple human feedback loops—described as “a treasure to have if you’re in the business of training models.” The system captures four human feedback signals and one model-based signal.
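
A minimal event log for those signals might look like the sketch below. The five signal names are assumptions for illustration (the talk does not name them); the point is that every writer interaction and judge score lands in a form a later training job can consume.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class Signal(Enum):
    # Illustrative names; the talk does not enumerate the actual signals.
    DRAFT_PULLED = "draft_pulled"      # writer started from a generated draft
    DRAFT_EDITED = "draft_edited"      # writer modified the draft
    DRAFT_ACCEPTED = "draft_accepted"  # draft published with little or no change
    DRAFT_REJECTED = "draft_rejected"  # writer discarded the draft
    JUDGE_SCORE = "judge_score"        # the one model-based signal


@dataclass
class FeedbackEvent:
    title_id: str
    synopsis_id: str
    signal: Signal
    payload: dict


def log_event(store: list, event: FeedbackEvent) -> None:
    """Append-only log, kept JSON-serializable for later training jobs."""
    record = asdict(event)
    record["signal"] = event.signal.value
    store.append(record)
```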

Model Training and Fine-Tuning Strategy

The team is building toward fine-tuned models using the feedback data collected through the system. OpenAI allows limited fine-tuning, but open-source models provide more flexibility. The speaker notes they don’t yet have enough data to fully justify a fine-tuned model but expects to reach that point “in the short term.” They’re still evaluating which of the four feedback sources will prove most useful for training.
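
When enough feedback has accumulated, writer-approved synopses paired with their summarized context become supervised fine-tuning examples. The sketch below uses the chat-style JSONL layout common to fine-tuning APIs; the system prompt text is a placeholder, not Netflix’s actual instruction.

```python
import json


def to_finetune_record(context_summary: str, final_synopsis: str) -> dict:
    """One supervised example: summarized context in,
    writer-approved synopsis out."""
    return {
        "messages": [
            # Placeholder instruction, not the production prompt.
            {"role": "system", "content": "Write a short, spoiler-free synopsis."},
            {"role": "user", "content": context_summary},
            {"role": "assistant", "content": final_synopsis},
        ]
    }


def export_jsonl(pairs: list[tuple[str, str]], path: str) -> None:
    """Write (context, synopsis) pairs as one JSON object per line."""
    with open(path, "w") as f:
        for context, synopsis in pairs:
            f.write(json.dumps(to_finetune_record(context, synopsis)) + "\n")
```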

A/B Testing and Model Comparison

For testing new models, the team runs parallel generation—putting two models side by side generating for the same title, then measuring which outputs writers prefer. This allows them to evaluate whether switching from OpenAI to Llama or other models makes sense for their use case.
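
The side-by-side comparison reduces to tallying which model’s draft writers preferred for each title. A minimal sketch of that tally, assuming one preference vote per title:

```python
from collections import Counter


def preference_winner(votes: list[str]) -> tuple[str, float]:
    """Given one preferred-model label per title, return the winning
    model and its share of the votes."""
    counts = Counter(votes)
    model, wins = counts.most_common(1)[0]
    return model, wins / len(votes)
```

In practice a switch between providers would presumably also weigh cost and latency, not just the preference share.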

Writer-Facing Integration

The system exposes an API that the writer platform queries. When writers need synopses for an upcoming title (the example given is shows launching in about three months), they can pull available drafts, pre-populate their editing interface, and modify as needed. The system tracks how writers interact with these drafts.

All of this feeds back into the quality improvement cycle.

Results and Business Impact

While the speaker couldn’t share exact numbers, they confirmed significant efficiency gains and an increase in the number of synopses created per title, with quality standards maintained.

The speaker is careful to frame this as scaling capability rather than headcount reduction: “we kept the room, we kept every single writer, we actually hired more.”

Architectural Philosophy

A key theme throughout the talk is modularity and flexibility. The team explicitly designed for model obsolescence—recognizing that building tightly coupled to any single provider (like OpenAI) would be problematic given the rapid pace of foundational model development. Metaflow’s workflow abstraction enables them to swap components “as much as we want.”

The human-in-the-loop requirement is presented as both a quality necessity and a philosophical commitment. Even looking to the future, the speaker maintains that “for the foreseeable future we will not be removing any humans in the loop” because humans remain “a little better at writing and creative things.”

Technical Considerations and Lessons

Several practical LLMOps insights emerge from this case study:

The importance of quality gates before human exposure cannot be overstated—bad outputs would erode trust and actually increase time spent rather than saving it. The five-criteria evaluation system with hard pass/fail requirements ensures only viable drafts consume writer attention.

Prompt engineering is treated as a collaborative discipline requiring domain expertise. Writers helped craft the prompt libraries because they understand what makes good synopses.

The feedback loop architecture isn’t just about model improvement—it’s about building organizational trust and demonstrating value through measurable adoption metrics.
