ZenML

LLMOps Tag: spacy

32 tools with this tag

A Practical Blueprint for Evaluating Conversational AI at Scale

Dropbox

Dropbox shares their comprehensive approach to building and evaluating Dropbox Dash, their conversational AI product. The company faced challenges with ad-hoc testing leading to unpredictable regressions where changes to any part of their LLM pipeline—intent classification, retrieval, ranking, prompt construction, or inference—could cause previously correct answers to fail. They developed a systematic evaluation-first methodology treating every experimental change like production code, requiring rigorous testing before merging. Their solution involved curating diverse datasets (both public and internal), defining actionable metrics using LLM-as-judge approaches that outperformed traditional metrics like BLEU and ROUGE, implementing the Braintrust evaluation platform, and automating evaluation throughout the development-to-production pipeline. This resulted in a robust system with layered gates catching regressions early, continuous live-traffic scoring for production monitoring, and a feedback loop for continuous improvement that significantly improved reliability and deployment safety.

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

AI Agents for Interpretability Research: Experimenter Agents in Production

Goodfire

Goodfire, an AI interpretability research company, deployed AI agents extensively for conducting experiments in their research workflow over several months. They distinguish between "developer agents" (for software development) and "experimenter agents" (for research and discovery), identifying key architectural differences needed for the latter. Their solution, code-named Scribe, leverages Jupyter notebooks with interactive, stateful access via MCP (Model Context Protocol), enabling agents to iteratively run experiments across domains like genomics, vision transformers, and diffusion models. Results showed agents successfully discovering features in genomics models, performing circuit analysis, and executing complex interpretability experiments, though validation, context engineering, and preventing reward hacking remain significant challenges that require human oversight and critic systems.

AI-Powered Code Review Platform Using Abstract Syntax Trees and LLM Context

Baz

Baz is building an AI code review agent that addresses the challenge of understanding complex codebases at scale. The platform combines Abstract Syntax Trees (AST) with LLM semantic understanding to provide automated code reviews that go beyond traditional static analysis. By integrating context from multiple sources including code structure, Jira/Linear tickets, CI logs, and deployment patterns, Baz aims to replicate the knowledge of a staff engineer who understands not just the code but the entire business context. The solution has evolved from basic reviews to catching performance issues and schema changes, with customers using it to review code generated by AI coding assistants like Cursor and Codex.

AI-Powered Contact Center Copilot: From Research to Enterprise-Scale Production

Cresta / OpenAI

Cresta, founded in 2017 by Stanford PhD students with OpenAI research experience, developed an AI copilot system for contact center agents that provides real-time suggestions during customer conversations. The company tackled the challenge of transforming academic NLP and reinforcement learning research into production-grade enterprise software by building domain-specific models fine-tuned on customer conversation data. Starting with Intuit as their first customer through an unconventional internship arrangement, they demonstrated measurable ROI through A/B testing, showing improved conversion rates and agent productivity. The solution evolved from custom LSTM and transformer models to leveraging pre-trained foundation models like GPT-3/4 with fine-tuning, ultimately serving Fortune 500 customers across telecommunications, airlines, and banking with demonstrated value including a pilot generating $100 million in incremental revenue.

AI-Powered Real Estate Transaction Newsworthiness Detection System

The Globe and Mail

A collaboration between journalists and technologists from multiple news organizations (Hearst, Gannett, The Globe and Mail, and E24) developed an AI system to automatically detect newsworthy real estate transactions. The system combines anomaly detection, LLM-based analysis, and human feedback to identify significant property transactions, with a particular focus on celebrity involvement and price anomalies. Early results showed promise with few-shot prompting, and the system successfully identified several newsworthy transactions that might have otherwise been missed by traditional reporting methods.

AI-Powered Skills Extraction and Mapping for the LinkedIn Skills Graph

LinkedIn

LinkedIn deployed a sophisticated machine learning pipeline to extract and map skills from unstructured content across their platform (job postings, profiles, resumes, learning courses) to power their Skills Graph. The solution combines token-based and semantic skill tagging using BERT-based models, multitask learning frameworks for domain-specific scoring, and knowledge distillation to serve models at scale while meeting strict latency requirements (100ms for 200 profile edits/second). Product-driven feedback loops from recruiters and job seekers continuously improve model performance, resulting in measurable business impact including 0.46% increase in predicted confirmed hires for job recommendations and 0.76% increase in PPC revenue for job search.

Building and Scaling Conversational Voice AI Agents for Enterprise Go-to-Market

Thoughtly / Gladia

Thoughtly, a voice AI platform founded in late 2023, provides conversational AI agents for enterprise sales and customer support operations. The company orchestrates speech-to-text, large language models, and text-to-speech systems to handle millions of voice calls with sub-second latency requirements. By optimizing every layer of their stack—from telephony providers to LLM inference—and implementing sophisticated caching, conditional navigation, and evaluation frameworks, Thoughtly delivers 3x conversion rates over traditional methods and 15x ROI for customers. The platform serves enterprises with HIPAA and SOC 2 compliance while handling both inbound customer support and outbound lead activation at massive scale across multiple languages and regions.

Building LinkedIn's First Production Agent: Hiring Assistant Platform and Architecture

LinkedIn

LinkedIn evolved from simple GPT-based collaborative articles to sophisticated AI coaches and finally to production-ready agents, culminating in their Hiring Assistant product announced in October 2025. The company faced the challenge of moving from conversational assistants with prompt chains to task automation using agent-based architectures that could handle high-scale candidate evaluation while maintaining quality and enabling rapid iteration. They built a comprehensive agent platform with modular sub-agent architecture, centralized prompt management, LLM inference abstraction, messaging-based orchestration for resilience, and a skill registry for dynamic tool discovery. The solution enabled parallel development of agent components, independent quality evaluation, and the ability to serve both enterprise recruiters and SMB customers with variations of the same underlying platform, processing thousands of candidate evaluations at scale while maintaining the flexibility to iterate on product design.

Building Production LLM Applications with DSPy Framework

AlixPartners

A technical consultant presents a comprehensive workshop on using DSPy, a declarative framework for building modular LLM-powered applications in production. The presenter demonstrates how DSPy enables rapid iteration on LLM applications by treating LLMs as first-class citizens in Python programs, with built-in support for structured outputs, type guarantees, tool calling, and automatic prompt optimization. Through multiple real-world use cases including document classification, contract analysis, time entry correction, and multi-modal processing, the workshop shows how DSPy's core primitives—signatures, modules, tools, adapters, optimizers, and metrics—allow teams to build production-ready systems that are transferable across models, optimizable without fine-tuning, and maintainable at scale.

Climate Tech Foundation Models for Environmental AI Applications

Various

Climate tech startups are leveraging Amazon SageMaker HyperPod to build specialized foundation models that address critical environmental challenges including weather prediction, sustainable material discovery, ecosystem monitoring, and geological modeling. Companies like Orbital Materials and Hum.AI are training custom models from scratch on massive environmental datasets, achieving significant breakthroughs such as tenfold performance improvements in carbon capture materials and the ability to see underwater from satellite imagery. These startups are moving beyond traditional LLM fine-tuning to create domain-specific models with billions of parameters that process multimodal environmental data including satellite imagery, sensor networks, and atmospheric measurements at scale.

Context Rot: Evaluating LLM Performance Degradation with Increasing Input Tokens

ChromaDB

ChromaDB's technical report examines how large language models (LLMs) experience performance degradation as input context length increases, challenging the assumption that models process context uniformly. Through evaluation of 18 state-of-the-art models including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 across controlled experiments, the research reveals that model reliability decreases significantly with longer inputs, even on simple tasks like retrieval and text replication. The study demonstrates that factors like needle-question similarity, presence of distractors, haystack structure, and semantic relationships all impact performance non-uniformly as context length grows, suggesting that current long-context benchmarks may not adequately reflect real-world performance challenges.

Context-Aware AI Code Generation and Assistant at Scale

Windsurf

Windsurf, an AI coding toolkit company, addresses the challenge of generating contextually relevant code for individual developers and organizations. While generating generic code has become straightforward, the real challenge lies in producing code that fits into existing large codebases, adheres to organizational standards, and aligns with personal coding preferences. Windsurf's solution centers on a sophisticated context management system that combines user behavioral heuristics (cursor position, open files, clipboard content, terminal activity) with hard evidence from the codebase (code, documentation, rules, memories). Their approach optimizes for relevant context selection rather than simply expanding context windows, leveraging their background in GPU optimization to efficiently find and process relevant context at scale.

Context-Aware Item Recommendations Using Hybrid LLM and Embedding-Based Retrieval

DoorDash

DoorDash's Core Consumer ML team developed a GenAI-powered context shopping engine to address the challenge of lost user intent during in-app searches for items like "fresh vegetarian sushi." The traditional search system struggled to preserve specific user context, leading to generic recommendations and decision fatigue. The team implemented a hybrid approach combining embedding-based retrieval (EBR) using FAISS with LLM-based reranking to balance speed and personalization. The solution achieved end-to-end latency of approximately six seconds with store page loads under two seconds, while significantly improving user satisfaction through dynamic, personalized item carousels that maintained user context and preferences. This hybrid architecture proved more practical than pure LLM or deep neural network approaches by optimizing for both performance and cost efficiency.
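The retrieve-then-rerank pattern described above can be sketched in a few lines. This is a toy illustration only, not DoorDash's implementation: a brute-force cosine search stands in for the FAISS index, and `prefers_vegetarian` is a hypothetical stub standing in for the LLM reranker. The item names and vectors are made up.

```python
import math
from typing import Callable, Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: List[float], items: Dict[str, List[float]], k: int) -> List[str]:
    """Stage 1: cheap embedding-based retrieval of the top-k candidates."""
    ranked = sorted(items, key=lambda name: cosine(query, items[name]), reverse=True)
    return ranked[:k]


def rerank(candidates: List[str], score: Callable[[str], float]) -> List[str]:
    """Stage 2: reorder the short candidate list with a slower, richer scorer."""
    return sorted(candidates, key=score, reverse=True)


# Made-up 2-D embeddings; a real system would use learned vectors.
ITEMS = {
    "veggie roll": [0.9, 0.1],
    "tuna roll": [0.8, 0.6],
    "burger": [0.1, 0.9],
}


def prefers_vegetarian(name: str) -> float:
    """Hypothetical stand-in for an LLM relevance score given user context."""
    return 1.0 if "veggie" in name else 0.0


shortlist = retrieve([0.8, 0.6], ITEMS, k=2)
final = rerank(shortlist, prefers_vegetarian)
print(shortlist)
print(final)
```

The point of the split is that the expensive scorer only ever sees the `k` shortlisted items, which is how a hybrid design keeps latency and cost bounded.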

Contextual Agent Playbooks and Tools: Enterprise-Scale AI Coding Agent Integration

LinkedIn

LinkedIn faced the challenge that while AI coding agents were powerful, they lacked organizational context about the company's thousands of microservices, internal frameworks, data infrastructure, and specialized systems. To address this, they built CAPT (Contextual Agent Playbooks & Tools), a unified framework built on the Model Context Protocol (MCP) that provides AI agents with access to internal tools and executable playbooks encoding institutional workflows. The system enables over 1,000 engineers to perform complex tasks like experiment cleanup, data analysis, incident debugging, and code review with significant productivity gains: 70% reduction in issue triage time, 3× faster data analysis workflows, and automated debugging that cuts time spent by more than half in many cases.

Evolution of AI Systems and LLMOps from Research to Production: Infrastructure Challenges and Application Design

NVIDIA / Lepton

This lecture transcript from Yangqing Jia, VP at NVIDIA and founder of Lepton AI (acquired by NVIDIA), explores the evolution of AI system design from an engineer's perspective. The talk covers the progression from research frameworks (Caffe, TensorFlow, PyTorch) to production AI infrastructure, examining how LLM applications are built and deployed at scale. Jia discusses the emergence of "neocloud" infrastructure designed specifically for AI workloads, the challenges of GPU cluster management, and practical considerations for building consumer and enterprise LLM applications. Key insights include the trade-offs between open-source and closed-source models, the importance of RAG and agentic AI patterns, infrastructure design differences between conventional cloud and AI-specific platforms, and the practical challenges of operating LLMs in production, including supply chain management for GPUs and cost optimization strategies.

Healthcare Search Discovery Using ML and Generative AI on E-commerce Platform

Amazon Health Services

Amazon Health Services faced the challenge of integrating healthcare services into Amazon's e-commerce search experience, where traditional product search algorithms weren't designed to handle complex relationships between symptoms, conditions, treatments, and healthcare services. They developed a comprehensive solution combining machine learning for query understanding, vector search for product matching, and large language models for relevance optimization. The solution uses AWS services including Amazon SageMaker for ML models, Amazon Bedrock for LLM capabilities, and Amazon EMR for data processing, implementing a three-component architecture: query understanding pipeline to classify health searches, LLM-enhanced product knowledge base for semantic search, and hybrid relevance optimization using both human labeling and LLM-based classification. This system now serves daily health-related search queries, helping customers find everything from prescription medications to primary care services through improved discovery pathways.

Large-Scale LLM Infrastructure for E-commerce Applications

Coupang

Coupang, a major e-commerce platform operating primarily in South Korea and Taiwan, faced challenges in scaling their ML infrastructure to support LLM applications across search, ads, catalog management, and recommendations. The company addressed GPU supply shortages and infrastructure limitations by building a hybrid multi-region architecture combining cloud and on-premises clusters, implementing model parallel training with DeepSpeed, and establishing GPU-based serving using Nvidia Triton and vLLM. This infrastructure enabled production applications including multilingual product understanding, weak label generation at scale, and unified product categorization, with teams using patterns ranging from in-context learning to supervised fine-tuning and continued pre-training depending on resource constraints and quality requirements.

Large-Scale Personalization and Product Knowledge Graph Enhancement Through LLM Integration

DoorDash

DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.

Leveraging NLP and LLMs for Music Industry Royalty Recovery

Love Without Sound

Love Without Sound developed an AI-powered system to help the music industry recover royalties lost to incorrect metadata and unauthorized usage. The solution combines NLP pipelines for metadata standardization and legal document processing, and is now expanding to include RAG-based querying and audio embedding models. The system processes billions of tracks, operates in real time, and runs in a fully data-private environment, helping recover millions in revenue for artists.

LLM-Powered Information Extraction from Pediatric Cardiac MRI Reports

UK National Health Service (NHS)

Great Ormond Street Hospital NHS Trust developed a solution to extract information from 15,000 unstructured cardiac MRI reports spanning 10 years. They implemented a hybrid approach using small LLMs for entity extraction and few-shot learning for table structure classification. The system successfully extracted patient identifiers and clinical measurements from heterogeneous reports, enabling linkage with structured data and improving clinical research capabilities. The solution demonstrated significant improvements in extraction accuracy when using contextual prompting with models like FLAN-T5 and RoBERTa, while operating within NHS security constraints.

LLM-Powered Relevance Assessment for Search Results

Pinterest

Pinterest Search faced significant limitations in measuring search relevance due to the high cost and low availability of human annotations, which resulted in large minimum detectable effects (MDEs) that could only identify significant topline metric movements. To address this, they fine-tuned open-source multilingual LLMs on human-annotated data to predict relevance scores on a 5-level scale, then deployed these models to evaluate ranking results across A/B experiments. This approach reduced labeling costs dramatically, enabled stratified query sampling designs, and achieved an order of magnitude reduction in MDEs (from 1.3-1.5% down to ≤0.25%), while maintaining strong alignment with human labels (73.7% exact match, 91.7% within 1 point deviation) and enabling rapid evaluation of 150,000 rows within 30 minutes on a single GPU.
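The two alignment figures quoted above (exact match and within-1-point agreement) are simple to compute once human and model labels sit side by side. A minimal sketch, assuming integer labels on a 1 to 5 scale; the function name and the sample labels are illustrative and not taken from Pinterest's pipeline.

```python
from typing import Sequence, Tuple


def agreement_rates(human: Sequence[int], model: Sequence[int]) -> Tuple[float, float]:
    """Return (exact-match rate, within-1-point rate) for paired labels."""
    if len(human) != len(model) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(human)
    exact = sum(h == m for h, m in zip(human, model)) / n
    within_one = sum(abs(h - m) <= 1 for h, m in zip(human, model)) / n
    return exact, within_one


# Made-up labels on a 5-level relevance scale.
human_labels = [5, 4, 3, 1, 2, 5, 3, 4]
model_labels = [5, 3, 3, 1, 4, 5, 2, 4]

exact, within_one = agreement_rates(human_labels, model_labels)
print(f"exact match: {exact:.3f}, within 1 point: {within_one:.3f}")
```

Reporting both rates is useful because adjacent-level disagreements are usually far less harmful on an ordinal scale than disagreements of two or more levels.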

MultiCare: A Large-Scale Medical Case Report Dataset for AI Model Training

National University of the South

The MultiCare dataset project addresses the challenge of training AI models for medical applications by creating a comprehensive, multimodal dataset of clinical cases. The dataset contains over 75,000 case report articles, including 135,000 medical images with associated labels and captions, spanning multiple medical specialties. The project implements sophisticated data processing pipelines to extract, clean, and structure medical case reports, images, and metadata, making it suitable for training language models, computer vision models, or multimodal AI systems in the healthcare domain.

Open Source vs. Closed Source Agentic Stacks: Panel Discussion on Production Deployment Strategies

Various (Alation, GrottoAI, Nvidia, OLX)

This panel discussion brings together experts from Nvidia, OLX, Alation, and GrottoAI to discuss practical considerations for deploying agentic AI systems in production. The conversation explores when to choose open source versus closed source tooling, the challenges of standardizing agent frameworks across enterprise organizations, and the tradeoffs between abstraction levels in agent orchestration platforms. Key themes include starting with closed source models for rapid prototyping before transitioning to open source for compliance and cost reasons, the importance of observability across heterogeneous agent frameworks, the difficulty of enabling non-technical users to build agents, and the critical difference between internal tooling with lower precision requirements versus customer-facing systems demanding 95%+ accuracy.

Production Monitoring and Issue Discovery for AI Agents

Raindrop

Raindrop's CTO Ben presents a comprehensive framework for building reliable AI agents in production, addressing the challenge that traditional offline evaluations cannot capture the full complexity of real-world user behavior. The core problem is that AI agents fail in subtle ways without concrete errors, making issues difficult to detect and fix. Raindrop's solution centers on a "discover, track, and fix" loop that combines explicit signals like thumbs up/down with implicit signals detected semantically in conversations, such as user frustration, task failures, and agent forgetfulness. By clustering these signals with user intents and tracking them over time, teams can identify the most impactful issues and systematically improve their agents. The approach emphasizes experimentation and production monitoring over purely offline testing, drawing parallels to how traditional software engineering shifted from extensive QA to tools like Sentry for error monitoring.

Production Vector Search and Retrieval System Optimization at Scale

Superlinked

Superlinked, a company focused on vector search infrastructure, shares production insights from deploying information retrieval systems for e-commerce and enterprise knowledge management with indexes up to 2 terabytes. The presentation addresses challenges in relevance, latency, and cost optimization when deploying vector search systems at scale. Key solutions include avoiding vector pooling/averaging, implementing late interaction models, fine-tuning embeddings for domain-specific needs, combining sparse and dense representations, leveraging graph embeddings, and using template-based query generation instead of unconstrained text-to-SQL. Results demonstrate 5%+ precision improvements through targeted fine-tuning, significant latency reductions through proper database selection and query optimization, and improved relevance through multi-encoder architectures that combine text, graph, and metadata signals.

Scaling Customer Support, Compliance, and Developer Productivity with Gen AI

Coinbase

Coinbase, a cryptocurrency exchange serving millions of users across 100+ countries, faced challenges scaling customer support amid volatile market conditions, managing complex compliance investigations, and improving developer productivity. They built a comprehensive Gen AI platform integrating multiple LLMs through standardized interfaces (OpenAI API, Model Context Protocol) on AWS Bedrock to address these challenges. Their solution includes AI-powered chatbots handling 65% of customer contacts automatically (saving ~5 million employee hours annually), compliance investigation tools that synthesize data from multiple sources to accelerate case resolution, and developer productivity tools where 40% of daily code is now AI-generated or influenced. The implementation uses a multi-layered agentic architecture with RAG, guardrails, memory systems, and human-in-the-loop workflows, resulting in significant cost savings, faster resolution times, and improved quality across all three domains.

Semantic Data Processing at Scale with AI-Powered Query Optimization

DocETL

Shreya Shankar presents DocETL, an open-source system for semantic data processing that addresses the challenges of running LLM-powered operators at scale over unstructured data. The system tackles two major problems: how to make semantic operator pipelines scalable and cost-effective through novel query optimization techniques, and how to make them steerable through specialized user interfaces. DocETL introduces rewrite directives that decompose complex tasks and data to improve accuracy and reduce costs, achieving up to 86% cost reduction while maintaining target accuracy. The companion tool DocWrangler provides an interactive interface for iteratively authoring and debugging these pipelines. Real-world applications include public defenders analyzing court transcripts for racial bias and medical analysts extracting information from doctor-patient conversations, demonstrating significant accuracy improvements (2x in some cases) compared to baseline approaches.

System Prompt Learning for Coding Agents Using LLM-as-Judge Evaluation

Arize

This case study explores how Arize applied "system prompt learning" to improve the performance of production coding agents (Claude and Cline) without model fine-tuning. The problem addressed was that coding agents rely heavily on carefully crafted system prompts that require continuous iteration, but traditional reinforcement learning approaches are sample-inefficient and resource-intensive. Arize's solution involved an iterative process using LLM-as-judge evaluations to generate English-language feedback on agent failures, which was then fed into a meta-prompt to automatically generate improved system prompt rules. Testing on the SWE-bench benchmark with just 150 examples, they achieved a 5% improvement in GitHub issue resolution for Claude and 15% for Cline, demonstrating that well-engineered evaluation prompts can efficiently optimize agent performance with minimal training data compared to approaches like DSPy's MIPRO optimizer.

Systematic AI Application Improvement Through Evaluation-Driven Development

Ragas, Various

This case study presents Ragas' comprehensive approach to improving AI applications through systematic evaluation practices, drawn from their experience working with various enterprises and early-stage startups. The problem addressed is the common challenge of AI engineers making improvements to LLM applications without clear measurement frameworks, leading to ineffective iteration cycles and poor user experiences. The solution involves a structured evaluation methodology encompassing dataset curation, human annotation, LLM-as-judge scaling, error analysis, experimentation, and continuous feedback loops. The results demonstrate that teams can move from subjective "vibe checks" to objective, data-driven improvements that systematically enhance AI application performance and user satisfaction.

T-RAG: Tree-Based RAG Architecture for Question Answering Over Organizational Documents

Qatar Computing Research Institute

Qatar Computing Research Institute developed a novel question-answering system for organizational documents combining RAG, fine-tuning, and a tree-based entity structure. The system, called T-RAG, handles confidential documents on-premise using open-source LLMs and achieves 73% accuracy on test questions, outperforming baseline approaches while maintaining robust entity tracking through a custom tree structure.

Using LLMs for Automated Opinion Summary Evaluation in E-commerce

Flipkart

Flipkart faced the challenge of evaluating AI-generated opinion summaries of customer reviews, where traditional metrics like ROUGE failed to align with human judgment and couldn't comprehensively assess summary quality across multiple dimensions. The company developed OP-I-PROMPT, a novel single-prompt framework that uses LLMs as evaluators across seven critical dimensions (fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, and specificity), along with SUMMEVAL-OP, a new benchmark dataset with 2,912 expert annotations. The solution achieved a 0.70 Spearman correlation with human judgments, significantly outperforming previous approaches especially on open-source models like Mistral-7B, while demonstrating that high-quality summaries directly impact business metrics like conversion rates and product return rates.
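Spearman correlation, the statistic cited above for agreement between LLM and human scores, is the Pearson correlation of the two score lists after each is converted to ranks. A minimal dependency-free sketch follows; the function names and sample scores are illustrative and not taken from the Flipkart work.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five summaries on one evaluation dimension.
human_scores = [1, 2, 3, 4, 5]
llm_scores = [2, 1, 4, 3, 5]
rho = spearman(human_scores, llm_scores)
print(round(rho, 3))
```

Because only the ordering matters, an LLM judge that is systematically harsher or more lenient than human annotators can still score well, as long as it ranks summaries in a similar order.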