Vimeo developed a sophisticated video Q&A system that enables users to interact with video content through natural language queries. The system uses RAG (Retrieval Augmented Generation) to process video transcripts at multiple granularities, combined with an innovative speaker detection system that identifies speakers without facial recognition. The solution generates accurate answers, provides relevant video timestamps, and suggests related questions to maintain user engagement.
Vimeo, a well-known video hosting and sharing platform, built a video Q&A system that allows viewers to interact with video content using natural language queries. The system is designed primarily for knowledge-sharing videos such as meetings, lectures, presentations, and tutorials. Rather than requiring users to watch an entire video to find specific information, the Q&A system can summarize content, answer specific questions, and provide playable moments that link directly to relevant portions of the video.
The core technology powering this system is Retrieval Augmented Generation (RAG), which has become one of the most widely adopted patterns for building production LLM applications. The article was published in July 2024, positioning it during what many in the industry termed “the year of RAG.”
The team chose RAG over alternative approaches for several reasons that are relevant to any production LLM deployment, chief among them that retrieval limits the context the model must process, which keeps answers grounded in the actual transcript and response times fast.
The team made a strategic decision to rely solely on the transcript of spoken words in the video, at least for the initial implementation. This decision was pragmatic for two reasons: Vimeo already transcribes videos automatically for closed captioning, so no new infrastructure was needed, and for knowledge-sharing videos, the transcript alone provides sufficient information to answer most important questions. They noted plans to incorporate visual information in the future.
One of the more interesting technical contributions described is the “bottom-up processing” approach for transcript registration into the vector database. This addresses a fundamental challenge in RAG systems: different types of questions require different amounts of context. Questions about specific details might need only a few sentences, while questions about the overall video theme require understanding the entire content.
The multi-level chunking strategy indexes the transcript at three levels of granularity, from fine-grained chunks of a few sentences up to a summary of the entire video.
All three levels are stored in the same vector database, with each entry containing the text representation (original or summarized), the vector embedding, original word timings from the transcript, and start/end timestamps. This hierarchical approach allows the retrieval system to return appropriate context regardless of whether the question is about specific details or overall themes.
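The registration flow described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not Vimeo's code: the `embed` and `summarize` functions are toy stand-ins for a real embedding model and an LLM summarization call, and the field names on `VectorEntry` are assumptions.

```python
# Hypothetical sketch of "bottom-up" transcript registration at three
# granularities. embed() and summarize() are placeholders, not real APIs.
from dataclasses import dataclass

@dataclass
class VectorEntry:
    text: str            # original chunk or generated summary
    vector: list         # embedding of the text
    level: str           # "chunk", "section", or "video"
    start: float         # start timestamp (seconds)
    end: float           # end timestamp (seconds)

def embed(text: str) -> list:
    # Toy embedding; a production system would call an embedding model.
    return [float(ord(c)) for c in text[:8]]

def summarize(texts: list) -> str:
    # Placeholder for an LLM summarization call.
    return " ".join(t.split(".")[0] for t in texts)[:200]

def register_transcript(chunks: list) -> list:
    """Index the same transcript at three levels of granularity.

    chunks: list of (text, start_seconds, end_seconds) tuples.
    """
    # Bottom level: the raw transcript chunks themselves.
    entries = [VectorEntry(t, embed(t), "chunk", s, e) for t, s, e in chunks]
    # Middle level: summaries of groups of adjacent chunks.
    group = 4
    for i in range(0, len(chunks), group):
        grp = chunks[i:i + group]
        text = summarize([t for t, _, _ in grp])
        entries.append(VectorEntry(text, embed(text), "section",
                                   grp[0][1], grp[-1][2]))
    # Top level: a single summary covering the whole video.
    text = summarize([t for t, _, _ in chunks])
    entries.append(VectorEntry(text, embed(text), "video",
                               chunks[0][1], chunks[-1][2]))
    return entries
```

Because every entry carries its own start/end timestamps, a retrieval hit at any level can be mapped directly back to a playable moment in the video.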
A notable feature is the speaker identification system that works without any visual analysis or facial recognition. This is important for privacy considerations and also practical since many videos may not show speakers’ faces clearly.
The approach first uses audio-based speaker clustering to segment the conversation into numbered speaker IDs. The harder challenge is then mapping these numeric IDs to the actual names mentioned in the video.
The team observed that speaker names are most commonly revealed during conversation transitions—moments where speakers hand off to each other, introduce themselves, or thank previous speakers. Their algorithm focuses on these transitions and uses multiple LLM prompts to extract names at those moments.
The system uses a voting mechanism across multiple prompts to increase confidence in name assignments. Importantly, they prefer leaving a speaker unidentified rather than assigning an incorrect name—a sensible approach for production systems where false positives can be more damaging than false negatives.
A clever optimization is the use of masking: when analyzing transitions, they hide irrelevant portions of text to reduce noise and focus the LLM on the relevant context.
The production system separates the answering task into two distinct LLM calls: one generates the answer from the retrieved context, and a second finds the supporting quotes that can be linked back to timestamps in the video.
This separation was motivated by performance issues observed when both tasks were attempted in a single prompt, at least with GPT-3.5. The article attributes this to "capacity issues" in the LLM, a practical observation that reflects real-world constraints when deploying language models.
To make the quote-finding more efficient, they embed the generated answer and compute similarity with the retrieved matches, forming a new filtered context that is more likely to contain relevant quotes.
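The answer-guided filtering step amounts to re-ranking the retrieved chunks by their similarity to the generated answer. A minimal sketch, with a toy bag-of-characters embedding standing in for a real embedding model:

```python
# Sketch of answer-guided quote filtering: embed the generated answer,
# then keep only the retrieved chunks most similar to it. The embed()
# function is a toy stand-in for a real embedding model.
import math

def embed(text: str) -> list:
    # Bag-of-letters vector, for illustration only.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_quotes(answer: str, retrieved: list, top_k: int = 3) -> list:
    """Re-rank retrieved chunks by similarity to the generated answer."""
    av = embed(answer)
    ranked = sorted(retrieved, key=lambda c: cosine(av, embed(c)),
                    reverse=True)
    return ranked[:top_k]
```

The quote-finding prompt then runs over this smaller, answer-focused context instead of the full retrieval set, which is cheaper and more likely to surface the supporting passages.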
Beyond answering user questions, the system includes features designed to keep viewers engaged, most notably automatically suggested related questions.
The related questions feature demonstrates careful prompt engineering to ensure suggested questions are actually useful—avoiding trivial questions and unanswerable ones.
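The constraints described can be made concrete with a prompt-building sketch. The wording below is an assumption about the kind of instructions involved; Vimeo's actual prompts are not published.

```python
# Hypothetical related-questions prompt encoding the stated constraints:
# questions must be answerable from the video and non-trivial.
def related_questions_prompt(video_summary: str,
                             asked_question: str,
                             n: int = 3) -> str:
    return (
        f"Video summary:\n{video_summary}\n\n"
        f"The viewer just asked: {asked_question}\n\n"
        f"Suggest {n} follow-up questions. Each question must:\n"
        "- be answerable from the spoken content of the video,\n"
        "- not be trivial or answerable without watching,\n"
        "- differ meaningfully from the question already asked.\n"
        "Return one question per line."
    )
```

Encoding the "answerable" and "non-trivial" constraints directly in the prompt is what keeps the suggestions useful rather than generic.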
Several aspects of this case study reflect mature thinking about production LLM deployments:
Latency optimization: RAG inherently supports fast response times by limiting the context that needs to be processed, but the system also pre-computes summaries and video descriptions during registration to avoid real-time summarization.
Confidence thresholds: The speaker detection system explicitly prefers not assigning a name over assigning a wrong name, reflecting appropriate conservatism for user-facing features.
Task decomposition: Breaking complex tasks (answering + finding quotes) into separate prompts improved reliability, a common pattern in production LLM systems.
Hybrid retrieval: The use of embeddings for retrieval combined with similarity filtering against generated answers shows a multi-stage approach to context curation.
Graceful degradation: The system is designed to work even when speaker names cannot be identified, visual information is unavailable, or other metadata is missing.
The article acknowledges several limitations of the current system, most notably its reliance on the spoken transcript alone, which leaves visual information untapped.
The team mentions plans to incorporate visual information in the future, suggesting ongoing development and improvement of the system.
While the article doesn't provide a complete technology inventory, the following components are mentioned or implied: automatic speech-to-text transcription (already in place for closed captioning), a vector database storing embeddings at multiple granularities, audio-based speaker clustering, and GPT-3.5-era OpenAI models for answer generation.
Overall, this case study represents a solid example of applying RAG to a novel domain (video content) with thoughtful adaptations for the specific challenges of the medium, including multi-scale context requirements, speaker identification, and the need to link textual answers back to playable video moments.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as their evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.
Dropbox faced the challenge of enabling users to search and query their work content scattered across 50+ SaaS applications and tabs, which proprietary LLMs couldn't access. They built Dash, an AI-powered universal search and agent platform using a sophisticated context engine that combines custom connectors, content understanding, knowledge graphs, and index-based retrieval (primarily BM25) over federated approaches. The system addresses MCP scalability challenges through "super tools," uses LLM-as-a-judge for relevancy evaluation (achieving high agreement with human evaluators), and leverages DSPy for prompt optimization across 30+ prompts in their stack. This infrastructure enables cross-app intelligence with fast, accurate, and ACL-compliant retrieval for agentic queries at enterprise scale.