ZenML

LLMOps Tag: latency_optimization

614 tools with this tag

← Back to LLMOps Database

Common industries

View all industries →

A Practical Blueprint for Evaluating Conversational AI at Scale

Dropbox

Dropbox shares their comprehensive approach to building and evaluating Dropbox Dash, their conversational AI product. The company faced challenges with ad-hoc testing leading to unpredictable regressions where changes to any part of their LLM pipeline—intent classification, retrieval, ranking, prompt construction, or inference—could cause previously correct answers to fail. They developed a systematic evaluation-first methodology treating every experimental change like production code, requiring rigorous testing before merging. Their solution involved curating diverse datasets (both public and internal), defining actionable metrics using LLM-as-judge approaches that outperformed traditional metrics like BLEU and ROUGE, implementing the Braintrust evaluation platform, and automating evaluation throughout the development-to-production pipeline. This resulted in a robust system with layered gates catching regressions early, continuous live-traffic scoring for production monitoring, and a feedback loop for continuous improvement that significantly improved reliability and deployment safety.

Abstractive Conversation Summarization for Google Chat Spaces

Google

Google deployed an abstractive summarization system to automatically generate conversation summaries in Google Chat Spaces to address information overload from unread messages, particularly in hybrid work environments. The solution leveraged the Pegasus transformer model fine-tuned on a custom ForumSum dataset of forum conversations, then distilled into a hybrid transformer-encoder/RNN-decoder architecture for lower latency. The system surfaces summaries through cards when users enter Spaces with unread messages, with quality controls including heuristics for triggering, detection of low-quality summaries, and ephemeral caching of pre-generated summaries to reduce latency, ultimately delivering production value to premium Google Workspace business customers.

Accelerating Drug Development with AI-Powered Clinical Trial Transformation

Novartis

Novartis partnered with AWS Professional Services and Accenture to modernize their drug development infrastructure and integrate AI across clinical trials with the ambitious goal of reducing trial development cycles by at least six months. The initiative involved building a next-generation GXP-compliant data platform on AWS that consolidates fragmented data from multiple domains, implements data mesh architecture with self-service capabilities, and enables AI use cases including protocol generation and an intelligent decision system (digital twin). Early results from the patient safety domain showed 72% query speed improvements, 60% storage cost reduction, and 160+ hours of manual work eliminated. The protocol generation use case achieved 83-87% acceleration in producing compliant protocols, demonstrating significant progress toward their goal of bringing life-saving medicines to patients faster.

Accelerating Game Asset Creation with Fine-Tuned Diffusion Models

Rovio

Rovio, the Finnish gaming company behind Angry Birds, faced challenges in meeting the high demand for game art assets across multiple games and seasonal events, with artists spending significant time on repetitive tasks. The company developed "Beacon Picasso," a suite of generative AI tools powered by fine-tuned diffusion models running on AWS infrastructure (SageMaker, Bedrock, EC2 with GPUs). By training custom models on proprietary Angry Birds art data and building multiple user interfaces tailored to different user needs—from a simple Slackbot to advanced cloud-based workflows—Rovio achieved an 80% reduction in production time for specific use cases like season pass backgrounds, while maintaining brand quality standards and keeping artists in creative control. The solution enabled artists to focus on high-value creative work while AI handled repetitive variations, ultimately doubling content production capacity.

Accelerating LLM Inference with Speculative Decoding for AI Agent Applications

LinkedIn

LinkedIn's Hiring Assistant, an AI agent for recruiters, faced significant latency challenges when generating long structured outputs (1,000+ tokens) from thousands of input tokens including job descriptions and candidate profiles. To address this, LinkedIn implemented n-gram speculative decoding within their vLLM serving stack, a technique that drafts multiple tokens ahead and verifies them in parallel without compromising output quality. This approach proved ideal for their use case due to the structured, repetitive nature of their outputs (rubric-style summaries with ratings and evidence) and high lexical overlap with prompts. The implementation resulted in nearly 4× higher throughput at the same QPS and SLA ceiling, along with a 66% reduction in P90 end-to-end latency, all while maintaining identical output quality as verified by their evaluation pipelines.

Advanced Context-Aware Code Generation with Custom Infrastructure and Parallel LLM Processing

Codeium

Codeium addressed the limitations of traditional embedding-based retrieval in code generation by developing a novel approach called M-query, which leverages vertical integration and custom infrastructure to run thousands of parallel LLM calls for context analysis. Instead of relying solely on vector embeddings, they implemented a system that can process entire codebases efficiently, resulting in more accurate and contextually aware code generation. Their approach has led to improved user satisfaction and code generation acceptance rates while maintaining rapid response times.

Agent-Based AI Assistants for Enterprise and E-commerce Applications

Prosus

Prosus developed two major AI agent applications: Toan, an internal enterprise AI assistant used by 15,000+ employees across 24 companies, and OLX Magic, an e-commerce assistant that enhances product discovery. Toan achieved significant reduction in hallucinations (from 10% to 1%) through agent-based architecture, while saving users approximately 50 minutes per day. OLX Magic transformed the traditional e-commerce experience by incorporating generative AI features for smarter product search and comparison.

Agentic AI Architecture for Meeting Intelligence and Productivity Automation

Zoom

Zoom developed AI Companion 3.0, an agentic AI system that transforms meeting conversations into actionable outcomes through automated planning, reasoning, and execution. The system addresses the challenge of turning hours of meeting content across distributed teams into coordinated action by implementing a federated AI approach combining small language models (SLMs) with large language models (LLMs), deployed on AWS infrastructure including Bedrock and OpenSearch. The solution enables users to automatically generate meeting summaries, perform cross-meeting analysis, schedule meetings with intelligent calendar management, and prepare meeting agendas—reducing what typically takes days of administrative work to minutes while maintaining low latency and cost-effectiveness at scale.

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

Agentic AI for Automated Absence Reporting and Shift Management at Airport Operations

Manchester Airports Group

Manchester Airports Group (MAG) implemented an agentic AI solution to automate unplanned absence reporting and shift management across their three UK airports handling over 1,000 flights daily. The problem involved complex, non-deterministic workflows requiring coordination across multiple systems, with different processes at each airport and high operational costs from overtime payments when staff couldn't make shifts. MAG built a multi-agent system using Amazon Bedrock Agent Core with both text-to-text and speech-to-speech interfaces, allowing employees to report absences conversationally while the system automatically authenticated users, classified absence types, updated HR and rostering systems, and notified relevant managers. The solution achieved 99% consistency in absence reporting (standardizing previously variable processes) and reduced recording time by 90%, with measurable cost reductions in overtime payments and third-party service fees.

Agentic AI Platform for Clinical Development and Commercial Operations in Pharmaceutical Drug Development

AstraZeneca

AstraZeneca partnered with AWS to deploy agentic AI systems across their clinical development and commercial operations to accelerate their goal of delivering 20 new medicines by 2030. The company built two major production systems: a Development Assistant serving over 1,000 users across 21 countries that integrates 16 data products with 9 agents to enable natural language queries across clinical trials, regulatory submissions, patient safety, and quality domains; and an AZ Brain commercial platform that uses 500+ AI models and agents to provide precision insights for patient identification, HCP engagement, and content generation. The implementation reduced time-to-market for various workflows from months to weeks, with field teams using the commercial assistant generating 2x more prescriptions, and reimbursement dossier authoring timelines dramatically shortened through automated agent workflows.

Agentic AI Search with Custom Evaluation Framework for Church Management

Pushpay

Pushpay, a digital giving and engagement platform for churches and faith-based organizations, developed an agentic AI search feature to help ministry leaders query community data using natural language. The initial solution achieved only 60-70% accuracy and faced challenges in systematic evaluation and improvement. To address these limitations, Pushpay built a comprehensive generative AI evaluation framework on Amazon Bedrock, incorporating a curated golden dataset of over 300 queries, an LLM-as-judge evaluator, domain-based categorization, and performance dashboards. This framework enabled rapid iteration, strategic domain-level feature rollout, and implementation of dynamic prompt construction with semantic search. The solution ultimately achieved 95% accuracy in high-priority domains, reduced time-to-insight from 120 seconds to under 4 seconds, and provided the confidence needed for production deployment.

Agentic AI System for Document Summarization and Analysis

Moveworks

Moveworks developed "Brief Me," an AI-powered productivity tool that enables employees to upload documents (PDF, Word, PPT) and interact with them conversationally through their Copilot assistant. The system addresses the time-consuming challenge of manually processing lengthy documents for tasks like summarization, Q&A, comparisons, and insight extraction. By implementing a sophisticated two-stage agentic architecture with online content ingestion and generation capabilities, including hybrid search with custom-trained embeddings, multi-turn conversation support, operation planning, and a novel map-reduce approach for long context handling, the system achieves high accuracy metrics (97.24% correct actions, 89.21% groundedness, 97.98% completeness) with P90 latency under 10 seconds for ingestion, significantly reducing the hours typically required for document analysis tasks.

Agentic News Analysis Platform for Digital Asset Market Making

FSI

Digital asset market makers face the challenge of rapidly analyzing news events and social media posts to adjust trading strategies within seconds to avoid adverse selection and inventory risk. Traditional dictionary-based and statistical machine learning approaches proved too slow or required extensive labeled data. The solution involved building an agentic LLM-based platform on AWS that processes streaming news in near real-time, using fine-tuned embeddings for deduplication, reasoning models for sentiment analysis and impact assessment, and optimized inference infrastructure. Through progressive optimization from SageMaker JumpStart to VLLM to SGLNG, the team achieved 180 output tokens per second, enabling end-to-end latency under 10 seconds and doubling news processing capacity compared to initial deployment.

Agentic Platform Engineering Hub for Cloud Operations Automation

Thomson Reuters

Thomson Reuters' Platform Engineering team transformed their manual, labor-intensive operational processes into an automated agentic system to address challenges in providing self-service cloud infrastructure and enablement services at scale. Using Amazon Bedrock AgentCore as the foundational orchestration layer, they built "Aether," a custom multi-agent system featuring specialized agents for cloud account provisioning, database patching, network configuration, and architecture review, coordinated through a central orchestrator agent. The solution delivered a 15-fold productivity gain, achieved 70% automation rate at launch, and freed engineering teams from repetitive tasks to focus on higher-value innovation work while maintaining security and compliance standards through human-in-the-loop validation.

AI Agent Evaluation Framework for Travel and Accommodation Platform

Booking.com

Booking.com developed a comprehensive evaluation framework for LLM-based agents that power their AI Trip Planner and other customer-facing features. The framework addresses the unique complexity of evaluating autonomous agents that can use external tools, reason through multi-step problems, and engage in multi-turn conversations. Their solution combines black box evaluation (focusing on task completion using judge LLMs) with glass box evaluation (examining internal decision-making, tool usage, and reasoning trajectories). The framework enables data-driven decisions about deploying agents versus simpler baselines by measuring performance gains against cost and latency tradeoffs, while also incorporating advanced metrics for consistency, reasoning quality, memory effectiveness, and trajectory optimality.

AI Agent for Automated Feature Flag Removal

Duolingo

Duolingo developed an AI agent to automate the removal of feature flags from their codebase, addressing the common engineering problem of technical debt accumulation from abandoned flags. The solution leverages Anthropic's Codex CLI running on Temporal workflow orchestration, allowing engineers to initiate automated code cleanup through an internal self-service UI. The agent clones repositories, uses AI to identify and remove obsolete feature flags across Python and Kotlin codebases, and automatically creates pull requests assigned to the requesting engineer. The tool was developed rapidly—moving from prototype to production in approximately one week—and serves as a foundation pattern for future autonomous coding agents at Duolingo.

AI Agent for Automated Merchant Classification and Transaction Matching

Ramp

Ramp built an AI agent using LLMs, embeddings, and RAG to automatically fix incorrect merchant classifications that previously required hours of manual intervention from customer support teams. The agent processes user requests to reclassify transactions in under 10 seconds, handling nearly 100% of requests compared to the previous 1.5-3% manual handling rate, while maintaining 99% accuracy according to LLM-based evaluation and reducing customer support costs from hundreds of dollars to cents per request.

AI Agent for Real Estate Legal Document Analysis and Lease Reporting

Orbital

Orbital Witness developed Orbital Copilot, an AI agent specifically designed for real estate legal work, to address the time-intensive nature of legal due diligence and lease reporting. The solution evolved from classical machine learning models through LLM-based approaches to a sophisticated agentic architecture that combines planning, memory, and tool use capabilities. The system analyzes hundreds of pages across multiple legal documents, answers complex queries by following information trails across documents, and provides transparent reasoning with source citations. Deployed with prestigious law firms including BCLP, Clifford Chance, and others, Orbital Copilot demonstrated up to 70% time savings on lease reporting tasks, translating to significant cost reductions for complex property analyses that typically require 2-10+ hours of lawyer time.

AI Agent System for Automated B2B Research and Sales Pipeline Generation

Unify

UniFi built an AI agent system that automates B2B research and sales pipeline generation by deploying research agents at scale to answer customer-defined questions about companies and prospects. The system evolved from initial React-based agents using GPT-4 and O1 models to a more sophisticated architecture incorporating browser automation, enhanced internet search capabilities, and cost-optimized model selection, ultimately processing 36+ billion tokens monthly while reducing per-query costs from 35 cents to 10 cents through strategic model swapping and architectural improvements.

AI Agent-Powered Compliance Review Automation for Financial Services

Stripe

Stripe developed an AI agent-based solution to address the growing complexity and resource intensity of compliance reviews in financial services, where enterprises spend over $206 billion annually on financial crime operations. The company implemented ReAct agents powered by Amazon Bedrock to automate the investigative and research portions of Enhanced Due Diligence (EDD) reviews while keeping human analysts in the decision-making loop. By decomposing complex compliance workflows into bite-sized tasks orchestrated through a directed acyclic graph (DAG), the agents perform autonomous investigations across multiple data sources and jurisdictions. The solution achieved a 96% helpfulness rating from reviewers and reduced average handling time by 26%, enabling compliance teams to scale without linearly increasing headcount while maintaining complete auditability for regulatory requirements.

AI Agents and Intelligent Observability for DevOps Modernization

HRS Group / Netflix / Harness

This panel discussion brings together engineering leaders from HRS Group, Netflix, and Harness to explore how AI is transforming DevOps and SRE practices. The panelists address the challenge of teams spending excessive time on reactive monitoring, alert triage, and incident response, often wading through thousands of logs and ambiguous signals. The solution involves integrating AI agents and generative models into CI/CD pipelines, observability workflows, and incident management to enable predictive analysis, intelligent rollouts, automated summarization, and faster root cause analysis. Results include dramatically reduced mean time to resolution (from hours to minutes), elimination of low-level toil, improved context-aware decision making, and the ability to move from reactive monitoring to proactive, machine-speed remediation while maintaining human accountability for critical business decisions.

AI Agents for Data Labeling and Infrastructure Maintenance at Scale

Plaid

Plaid, a financial data connectivity platform, developed two internal AI agents to address operational challenges at scale. The AI Annotator agent automates the labeling of financial transaction data for machine learning model training, achieving over 95% human alignment while dramatically reducing annotation costs and time. The Fix My Connection agent proactively detects and repairs bank integration issues, having enabled over 2 million successful logins and reduced average repair time by 90%. These agents represent Plaid's strategic use of LLMs to improve data quality, maintain reliability across thousands of financial institution connections, and enhance their core product experiences.

AI Agents for Travel Booking and Customer Service Automation

TPConnects

TPConnects, a software solutions provider for airlines and travel sellers, transformed their legacy travel booking APIs and UI into a production-ready AI agent system built on Amazon Bedrock. The company implemented a supervised multi-agent orchestration architecture that handles the complete travel journey from shopping and booking to order management and customer servicing. Key challenges included managing latency with large API responses (2000+ flight offers), orchestrating multiple APIs in a pipeline, handling industry-specific IATA codes, and ensuring JSON formatting consistency. The solution uses Claude 3.5 Sonnet as the primary model, incorporates prompt engineering and knowledge bases for travel domain expertise, and extends beyond traditional chat to WhatsApp Business API integration for proactive disruption management and upselling. The system took 3-4 months to develop with AWS support and represents a shift from manual UI interactions to conversational AI-driven travel experiences.

AI Agents in Production: Multi-Enterprise Implementation Strategies

Canva / KPMG / Autodesk / Lightspeed

This comprehensive case study examines how multiple enterprises (Autodesk, KPMG, Canva, and Lightspeed) are deploying AI agents in production to transform their go-to-market operations. The companies faced challenges around scaling AI from proof-of-concept to production, managing agent quality and accuracy, and driving adoption across diverse teams. Using the Relevance AI platform, these organizations built multi-agent systems for use cases including personalized marketing automation, customer outreach, account research, data enrichment, and sales enablement. Results include significant time savings (tasks taking hours reduced to minutes), improved pipeline generation, increased engagement rates, faster customer onboarding, and the successful scaling of AI agents across multiple departments while maintaining data security and compliance standards.

AI Assistant for Global Customer Service Automation

Klarna

Klarna implemented an OpenAI-powered AI assistant for customer service that successfully handled two-thirds of all customer service chats within its first month of global deployment. The system processes 2.3 million conversations, matches human agent satisfaction scores, reduces repeat inquiries by 25%, and cuts resolution time from 11 to 2 minutes, while operating in 23 markets with support for over 35 languages, projected to deliver $40 million in profit improvement for 2024.

AI Lab: A Pre-Production Framework for ML Performance Testing and Optimization

Meta

Meta developed AI Lab, a pre-production framework for continuously testing and optimizing machine learning workflows, with a focus on minimizing Time to First Batch (TTFB). The system enables both proactive improvements and automatic regression prevention for ML infrastructure changes. Using AI Lab, Meta was able to achieve up to 40% reduction in TTFB through the implementation of the Python Cinder runtime, while ensuring no regressions occurred during the rollout process.

AI Managed Services and Agent Operations at Enterprise Scale

PriceWaterhouseCooper

PriceWaterhouseCooper (PWC) addresses the challenge of deploying and maintaining AI systems in production through their managed services practice focused on data analytics and AI. The organization has developed frameworks for deploying AI agents in enterprise environments, particularly in healthcare and back-office operations, using their Agent OS framework built on Python. Their approach emphasizes process standardization, human-in-the-loop validation, continuous model tuning, and comprehensive measurement through evaluations to ensure sustainable AI operations at scale. Results include successful deployments in healthcare pre-authorization processes and the establishment of specialized AI managed services teams comprising MLOps engineers and data scientists who continuously optimize production models.

AI Sales Representatives for Inbound Lead Conversion

ShowMe

ShowMe builds AI sales representatives that function as digital teammates for companies selling primarily through inbound channels. The company was founded in April 2025 after the co-founders identified a critical problem at their previous company: website visitors weren't converting to customers unless engaged directly by human sales representatives, but scaling human engagement was too expensive for unqualified leads. ShowMe's solution involves multi-agent voice and video systems that can conduct sales calls, share screens, demo products, qualify leads, and orchestrate follow-up actions across multiple channels. The AI agents use sophisticated prompt engineering, RAG-based knowledge bases, and workflow orchestration to guide prospects through the sales funnel, ultimately creating qualified meetings or closing contracts directly while reducing the need for human sales intervention by approximately 70%.

AI SRE Agents for Production System Diagnostics

Cleric

Cleric is developing an AI Site Reliability Engineering (SRE) agent system that helps diagnose and troubleshoot production system issues. The system uses knowledge graphs to map relationships between system components, background scanning to maintain system awareness, and confidence scoring to minimize alert fatigue. The solution aims to reduce the burden on human engineers by efficiently narrowing down problem spaces and providing actionable insights, while maintaining strict security controls and read-only access to production systems.

AI-Driven Clinical Trial Transformation with Next-Generation Data Platform

Novartis

Novartis embarked on a comprehensive data and AI modernization journey to accelerate drug development by at least 6 months per clinical trial. The company partnered with AWS Professional Services and Accenture to build a next-generation, GXP-compliant data platform that integrates fragmented data across multiple domains (including patient safety, medical imaging, and regulatory data), enabling both operational AI use cases and ambitious moonshot projects like a digital twin for clinical trial simulation. The initial implementation with the patient safety domain achieved significant results: 16 data pipelines processing 17 terabytes of data, 72% faster query speeds, 60% storage cost reduction, and over 160 hours of manual work eliminated, while protocol generation use cases demonstrated 83-87% acceleration in generating compliance-acceptable protocols.

AI-Driven Collateral Allocation Optimization in Fintech

Mercado Libre

Mercado Pago, the fintech arm of Mercado Libre, faced the challenge of optimizing collateral allocation across billions of dollars in credit lines secured from major banks, requiring daily selection from millions of loans with complex contractual constraints. The company developed Enigma, a solution leveraging linear programming via Google OR-Tools combined with a custom grouping heuristic to handle scalability challenges. While the article primarily focuses on traditional optimization techniques rather than LLMs, it hints at future AI agent exploration for enhanced analytics, strategic constraint proposals, and automated translation of contractual conditions into mathematical constraints, representing a potential future evolution toward LLM integration in financial operations.

AI-Driven Digital Twins for Industrial Infrastructure Optimization

Geminus

Geminus addresses the challenge of optimizing large industrial machinery operations by combining traditional ML models with high-fidelity simulations to create fast, trustworthy digital twins. Their solution reduces model development time from 24 months to just days, while building operator trust through probabilistic approaches and uncertainty bounds. The system provides optimization advice through existing control systems, ensuring safety and reliability while significantly improving machine performance.

AI-Driven Incident Response and Automated Remediation for Digital Media Platform

iHeart

iHeart Media, serving 250 million monthly users across broadcast radio, digital streaming, and podcasting platforms, faced significant operational challenges with incident response requiring engineers to navigate multiple monitoring systems, VPNs, and dashboards during critical 3 AM outages. The company implemented a multi-agent AI system using AWS Bedrock Agent Core and the Strands AI framework to automate incident triage, root cause analysis, and remediation. The solution reduced triage response time dramatically (from minutes of manual investigation to 30-60 seconds), improved operational efficiency by eliminating repetitive manual tasks, and enabled knowledge preservation across incidents while maintaining 24/7 uptime requirements for their infrastructure handling 5-7 billion requests per month.

AI-Driven Student Services and Prescriptive Pathways at UCLA Anderson School of Management

UCLA

UCLA Anderson School of Management partnered with Kindle to address the challenge of helping MBA students navigate their intensive two-year program more effectively. Students were overwhelmed with coursework, career decisions, club activities, and internship searches, receiving extensive information without clear guidance. The solution involved digitizing over 2 million paper records and building an AI-powered application that provides personalized, prescriptive roadmaps for students based on their career goals. The system integrates data from multiple sources including student records, career placement systems, clubs, and course catalogs to recommend specific courses, internships, clubs, and target companies. The project took approximately 8 months (December 2023 to August 2024) and demonstrates how educational institutions can leverage agentic AI frameworks to deliver better student experiences while maintaining data security and privacy standards.

AI-Driven User Memory System for Dynamic Real Estate Personalization

Zillow

Zillow developed a sophisticated user memory system to address the challenge of personalizing real estate discovery for home shoppers whose preferences evolve significantly over time. The solution combines AI-driven preference profiles, embedding models, affordability-aware quantile models, and raw interaction history into a unified memory layer that operates across three dimensions: recency/frequency, flexibility/rigidity, and prediction/planning. This system is powered by a dual-layered architecture blending batch processing for long-term preferences with real-time streaming pipelines for short-term behavioral signals, enabling personalized experiences across search, recommendations, and notifications while maintaining user trust through privacy-centered design.

AI-Powered .NET Application Modernization at Scale

Thomson Reuters

Thomson Reuters faced the challenge of modernizing over 400 legacy .NET Framework applications comprising more than 500 million lines of code, which were running on costly Windows servers and slowing down innovation. By adopting AWS Transform for .NET during its beta phase, the company leveraged agentic AI capabilities powered by Amazon Bedrock LLMs with deep .NET expertise to automate the analysis, dependency mapping, code transformation, and validation process. This approach accelerated their modernization from months of planning to weeks of execution, enabling them to transform over 1.5 million lines of code per month while running 10 parallel modernization projects. The solution not only promised substantial cost savings by migrating to Linux containers and Graviton instances but also freed developers from maintaining legacy systems to focus on delivering customer value.

AI-Powered Autonomous Infrastructure Monitoring and Self-Healing System

Railway

This case study presents a proof-of-concept system for autonomous infrastructure monitoring and self-healing using AI coding agents. The presenter demonstrates a workflow that automatically detects issues in deployed services on Railway (memory leaks, slow database queries, high error rates), analyzes metrics and logs using LLMs to generate diagnostic plans, and then deploys OpenCode—an open-source AI coding agent—to automatically create pull requests with fixes. The system leverages durable workflows via Inngest for reliability, combines multiple data sources (CPU/memory metrics, HTTP metrics, logs), and uses LLMs to analyze infrastructure health and generate remediation plans. While presented as a demo/concept, the approach showcases how LLMs can move from alerting engineers to autonomously proposing code-level fixes for production issues.

AI-Powered Betting Assistant for Sports Wagering Platform

FanDuel

FanDuel, America's leading sportsbook platform handling over 16.6 million bets during Super Bowl Sunday 2025, developed AAI (an AI-powered betting assistant) to address friction in the customer betting journey. Previously, customers would leave the FanDuel app to research bets on external platforms, often getting distracted and missing betting opportunities. Working with AWS's Generative AI Innovation Center, FanDuel built an in-app conversational assistant using Amazon Bedrock that guides customers through research, discovery, bet construction, and execution entirely within their platform. The solution reduced bet construction time from hours to seconds (particularly for complex parlays), improved customer engagement, and was rolled out incrementally across states and sports using a rigorous evaluation framework with thousands of test cases to ensure accuracy and responsible gaming safeguards.

AI-Powered Business Assistant for Solopreneurs

Jimdo

Jimdo, a European website builder serving over 35 million solopreneurs across 190 countries, needed to help their customers—who often lack expertise in marketing, sales, and business strategy—drive more traffic and conversions to their websites. The company built Jimdo Companion, an AI-powered business advisor using LangChain.js and LangGraph.js for orchestration and LangSmith for observability. The system features two main components: Companion Dashboard (an agentic business advisor that queries 10+ data sources to deliver personalized insights) and Companion Assistant (a ChatGPT-like interface that adapts to each business's tone of voice). The solution resulted in 50% more first customer contacts within 30 days and 40% more overall customer activity for users with access to Companion.

AI-Powered Call Center Agents for Healthcare Operations

HeyRevia

HeyRevia has developed an AI call center solution specifically for healthcare operations, where over 30% of operations run on phone calls. Their system uses AI agents to handle complex healthcare-related calls, including insurance verifications, prior authorizations, and claims processing. The solution incorporates real-time audio processing, context understanding, and sophisticated planning capabilities to achieve performance that reportedly exceeds human operators while maintaining compliance with healthcare regulations.

AI-Powered Call Intelligence System for Multi-Location Marketing Analysis

Netsertive

Netsertive, a digital marketing solutions provider for multi-location brands and franchises, implemented an AI-powered call intelligence system using Amazon Bedrock and Amazon Nova Micro to automatically analyze customer call tracking data and extract actionable insights. The solution processes real-time phone call transcripts to provide sentiment analysis, call summaries, keyword identification, coaching suggestions, and performance tracking across locations, reducing analysis time from hours or days to minutes while enabling better customer service optimization and conversion rate improvements for their franchise clients.

AI-Powered Chief of Staff: Scaling Agent Architecture from Monolith to Distributed System

Outropy

Outropy initially built an AI-powered Chief of Staff for engineering leaders that attracted 10,000 users within a year. The system evolved from a simple Slack bot to a sophisticated multi-agent architecture handling complex workflows across team tools. They tackled challenges in agent memory management, event processing, and scaling, ultimately transitioning from a monolithic architecture to a distributed system using Temporal for workflow management while maintaining production reliability.

AI-Powered Clinical Documentation and Data Infrastructure for Point-of-Care Transformation

Veradigm

Veradigm, a healthcare IT company, partnered with AWS to integrate generative AI into their Practice Fusion electronic health record (EHR) system to address clinician burnout caused by excessive documentation tasks. The solution leverages AWS HealthScribe for autonomous AI scribing that generates clinical notes from patient-clinician conversations, and AWS HealthLake as a FHIR-based data foundation to provide patient context at scale. The implementation resulted in clinicians saving approximately 2 hours per day on charting, 65% of users requiring no training to adopt the technology, and high satisfaction with note quality. The system processes 60 million patient visits annually and enables ambient documentation that allows clinicians to focus on patient care rather than typing, with a clear path toward zero-edit note generation.

AI-Powered Clinical Documentation with Multi-Region Healthcare Compliance

Heidi Health

Heidi Health developed an ambient AI scribe to reduce the administrative burden on healthcare clinicians by automatically generating clinical notes from patient consultations. The company faced significant LLMOps challenges including building confidence in non-deterministic AI outputs through "clinicians in the loop" evaluation processes, scaling clinical validation beyond small teams using synthetic data generation and LLM-as-judge approaches, and managing global expansion across regions with different data sovereignty requirements, model availability constraints, and regulatory compliance needs. Their solution involved standardizing infrastructure-as-code deployments across AWS regions, using a hybrid approach of Amazon Bedrock for immediate availability and EKS for self-hosted model control, and integrating clinical ambassadors in each region to validate medical accuracy and local practice patterns. The platform now serves over 370,000 clinicians processing 10 million consultations per month globally.

AI-Powered Code Review Platform at Scale

Uber

Uber developed uReview, an AI-powered code review platform, to address the challenge of reviewing over 65,000 code changes weekly across six monorepos. Traditional peer reviews were becoming overwhelmed by the volume of code and struggled to consistently catch subtle bugs, security issues, and best practice violations. The solution employs a modular, multi-stage GenAI system using prompt chaining with multiple specialized assistants (Standard, Best Practices, and AppSec) that generate, filter, validate, and deduplicate code review comments. The system achieves a 75% usefulness rating from engineers, with 65% of comments being addressed, outperforming human reviewers (51% address rate), and saves approximately 1,500 developer hours weekly across Uber's engineering organization.

AI-Powered Code Review Platform Using Abstract Syntax Trees and LLM Context

Baz

Baz is building an AI code review agent that addresses the challenge of understanding complex codebases at scale. The platform combines Abstract Syntax Trees (AST) with LLM semantic understanding to provide automated code reviews that go beyond traditional static analysis. By integrating context from multiple sources including code structure, Jira/Linear tickets, CI logs, and deployment patterns, Baz aims to replicate the knowledge of a staff engineer who understands not just the code but the entire business context. The solution has evolved from basic reviews to catching performance issues and schema changes, with customers using it to review code generated by AI coding assistants like Cursor and Codex.

AI-Powered Compliance Investigation Agents for Enhanced Due Diligence

Stripe

Stripe developed an LLM-powered AI research agent system to address the scalability challenges of enhanced due diligence (EDD) compliance reviews in financial services. The manual review process was resource-intensive, with compliance analysts spending significant time navigating fragmented data sources across different jurisdictions rather than performing high-value analysis. Stripe built a React-based agent system using Amazon Bedrock that orchestrates autonomous investigations across multiple data sources, pre-fetches analysis before reviewers open cases, and provides comprehensive audit trails. The solution maintains human oversight for final decision-making while enabling agents to handle data gathering and initial research. This resulted in a 26% reduction in average handling time for compliance reviews, with agents achieving 96% helpfulness ratings from reviewers, allowing Stripe to scale compliance operations alongside explosive business growth without proportionally increasing headcount.

AI-Powered Contact Center Transformation for Pet Retail

PetCo

PetCo transformed its contact center operations serving over 10,000 daily customer interactions by implementing Amazon Connect with integrated AI capabilities. The company faced challenges balancing cost efficiency with customer satisfaction while managing 400 care team members handling everything from e-commerce inquiries to veterinary appointments across 1,500+ stores. By deploying call summaries, automated QA, AI-supported agent assistance, and generative AI-powered chatbots using Amazon Q and Connect, PetCo achieved reduced handle times, improved routing efficiency, and launched conversational self-service capabilities. The implementation emphasized starting with high-friction use cases like order status inquiries and grooming salon call routing, with plans to expand into conversational IVR and appointment booking through voice and chat interfaces.

AI-Powered Contact Center Transformation with Amazon Connect

Traeger

Traeger Grills transformed their customer experience operations from a legacy contact center with poor performance metrics (35% CSAT, 30% first contact resolution) into a modern AI-powered system built on Amazon Connect. The company implemented generative AI capabilities for automated case note generation, email composition, and chatbot interactions while building a "single pane of glass" agent experience using Amazon Connect Cases. This eliminated their legacy CRM, reduced new hire training time by 40%, improved agent satisfaction, and enabled seamless integration of their acquired Meater thermometer brand. The implementation leveraged AI to handle non-value-added work while keeping human agents focused on building emotional connections with customers in the "Traeger Hood" community, demonstrating a shift from cost center to profit center thinking.

AI-Powered Content Curation for Financial Crime Detection

LSEG

London Stock Exchange Group (LSEG) Risk Intelligence modernized its WorldCheck platform—a global database used by financial institutions to screen for high-risk individuals, politically exposed persons (PEPs), and adverse media—by implementing generative AI to accelerate data curation. The platform processes thousands of news sources in 60+ languages to help 10,000+ customers combat financial crime including fraud, money laundering, and terrorism financing. By adopting a maturity-based approach that progressed from simple prompt-only implementations to agent orchestration with human-in-the-loop validation, LSEG reduced content curation time from hours to minutes while maintaining accuracy and regulatory compliance. The solution leverages AWS Bedrock for LLM operations, incorporating summarization, entity extraction, classification, RAG for cross-referencing articles, and multi-agent orchestration, all while keeping human analysts at critical decision points to ensure trust and regulatory adherence.

AI-Powered Content Generation and Shot Commentary System for Live Golf Tournament Coverage

PGA Tour

The PGA Tour faced the challenge of engaging fans with golf content across multiple tournaments running nearly every week of the year, generating meaningful content from 31,000+ shots per tournament across 156 players, and maintaining relevance during non-tournament days. They implemented an agentic AI system using AWS Bedrock that generates up to 800 articles per week across eight different content types (betting profiles, tournament previews, player recaps, round recaps, purse breakdowns, etc.) and a real-time shot commentary system that provides contextual narration for live tournament play. The solution achieved 95% cost reduction (generating articles at $0.25 each), enabled content publication within 5-10 minutes of live events, resulted in billions of annual page views for AI-generated content, and became their highest-engaged content on non-tournament days while maintaining brand voice and factual accuracy through multi-agent validation workflows.

AI-Powered Content Moderation at Platform Scale

Roblox

Roblox moderates billions of pieces of user-generated content daily across 28 languages using a sophisticated AI-driven system that combines large transformer-based models with human oversight. The platform processes an average of 6.1 billion chat messages and 1.1 million hours of voice communication per day, requiring ML models that can make moderation decisions in milliseconds. The system achieves over 750,000 requests per second for text filtering, with specialized models for different violation types (PII, profanity, hate speech). The solution integrates GPU-based serving infrastructure, model quantization and distillation for efficiency, real-time feedback mechanisms that reduce violations by 5-6%, and continuous model improvement through diverse data sampling strategies including synthetic data generation via LLMs, uncertainty sampling, and AI-assisted red teaming.

AI-Powered Content Moderation at Scale: SafeChat Platform

DoorDash

DoorDash developed SafeChat, an AI-powered content moderation system to handle millions of daily messages, hundreds of thousands of images, and voice calls exchanged between delivery drivers (Dashers) and customers. The platform employs a multi-layered architecture that evolved from using three external LLMs to a more efficient two-layer approach combining an internally trained model with a precise external LLM, processing text, images, and voice communications in real-time. Since launch, SafeChat has achieved a 50% reduction in low to medium-severity safety incidents while maintaining low latency (under 300ms for most messages) and cost-effectiveness by intelligently routing only 0.2% of content to expensive, high-precision models.

AI-Powered Content Understanding and Ad Targeting Platform

Dotdash

Dotdash Meredith, a major digital publisher, developed an AI-powered system called Decipher that understands user intent from content consumption to deliver more relevant advertising. Through a strategic partnership with OpenAI, they enhanced their content understanding capabilities and expanded their targeting platform across the premium web. The system outperforms traditional cookie-based targeting while maintaining user privacy, proving that high-quality content combined with AI can drive better business outcomes.

AI-Powered Conversational Assistant for Streamlined Home Buying Experience

Rocket

Rocket Companies, a Detroit-based FinTech company, developed Rocket AI Agent to address the overwhelming complexity of the home buying process by providing 24/7 personalized guidance and support. Built on Amazon Bedrock Agents, the AI assistant combines domain knowledge, personalized guidance, and actionable capabilities to transform client engagement across Rocket's digital properties. The implementation resulted in a threefold increase in conversion rates from web traffic to closed loans, 85% reduction in transfers to customer care, and 68% customer satisfaction scores, while enabling seamless transitions between AI assistance and human support when needed.

AI-Powered Conversational Contact Center for Healthcare Patient Communication

Clarus Care

Clarus Care, a healthcare contact center solutions provider serving over 16,000 users and handling 15 million patient calls annually, partnered with AWS Generative AI Innovation Center to transform their traditional menu-driven IVR system into a generative AI-powered conversational contact center. The solution uses Amazon Connect, Amazon Lex, and Amazon Bedrock (with Claude 3.5 Sonnet and Amazon Nova models) to enable natural language interactions that can handle multiple patient intents in a single conversation—such as appointment scheduling, prescription refills, and billing inquiries. The system achieves sub-3-second latency requirements, maintains 99.99% availability SLA, supports both voice and web chat interfaces, and includes smart transfer capabilities for urgent cases. The architecture leverages multi-model selection through Bedrock to optimize for specific tasks based on accuracy and latency requirements, with comprehensive analytics pipelines for monitoring system performance and patient interactions.

AI-Powered CRM Insights with RAG and Text-to-SQL

TP ICAP

TP ICAP faced the challenge of extracting actionable insights from tens of thousands of vendor meeting notes stored in their Salesforce CRM system, where business users spent hours manually searching through records. Using Amazon Bedrock, their Innovation Lab built ClientIQ, a production-ready solution that combines Retrieval Augmented Generation (RAG) and text-to-SQL approaches to transform hours of manual analysis into seconds. The solution uses Amazon Bedrock Knowledge Bases for unstructured data queries, automated evaluations for quality assurance, and maintains enterprise-grade security through permission-based access controls. Since launch with 20 initial users, ClientIQ has driven a 75% reduction in time spent on research tasks and improved insight quality with more comprehensive and contextual information being surfaced.

AI-Powered Customer Conversation Analytics at Scale

GoDaddy

GoDaddy faced the challenge of extracting actionable insights from over 100,000 daily customer service transcripts, which were previously analyzed through limited manual review that couldn't surface systemic issues or emerging problems quickly enough. To address this, they developed Lighthouse, an internal AI analytics platform that uses large language models, prompt engineering, and lexical search to automatically analyze massive volumes of unstructured customer interaction data. The platform successfully processes the full daily volume of 100,000+ transcripts in approximately 80 minutes, enabling teams to identify pain points and operational issues within hours instead of weeks, as demonstrated in a real case where they quickly detected and resolved a spike in customer calls caused by a malfunctioning link before it escalated into a major service disruption.

AI-Powered Customer Service Agent for Healthcare Navigation

Alan

Alan, a healthcare company supporting 1 million members, built AI agents to help members navigate complex healthcare questions and processes. The company transitioned from traditional workflows to playbook-based agent architectures, implementing a multi-agent system with classification and specialized agents (particularly for claims handling) that uses a ReAct loop for tool calling. The solution achieved 30-35% automation of customer service questions with quality comparable to human care experts, with 60% of reimbursements processed in under 5 minutes. Critical to their success was building custom orchestration frameworks and extensive internal tooling that empowered domain experts (customer service operators) to configure, debug, and maintain agents without engineering bottlenecks.

AI-Powered Ecommerce Content Optimization Platform

Pattern

Pattern developed Content Brief, an AI-driven tool that processes over 38 trillion ecommerce data points to optimize product listings across multiple marketplaces. Using Amazon Bedrock and other AWS services, the system analyzes consumer behavior, content performance, and competitive data to provide actionable insights for product content optimization. In one case study, their solution helped Select Brands achieve a 21% month-over-month revenue increase and 14.5% traffic improvement through optimized product listings.

AI-Powered Email Search Assistant with Advanced Cognitive Architecture

Superhuman

Superhuman developed Ask AI to solve the challenge of inefficient email and calendar searching, where users spent up to 35 minutes weekly trying to recall exact phrases and sender names. They evolved from a single-prompt RAG system to a sophisticated cognitive architecture with parallel processing for query classification and metadata extraction. The solution achieved sub-2-second response times and reduced user search time by 14% (5 minutes per week), while maintaining high accuracy through careful prompt engineering and systematic evaluation.

AI-Powered Fan Engagement and Content Personalization for Global Football Audiences

DFL / Bundesliga

DFL / Bundesliga, the organization behind Germany's premier football league, partnered with AWS to enhance fan engagement for their 1 billion global fans through AI and generative AI solutions. The primary challenges included personalizing content at scale across diverse geographies and languages, automating manual content creation processes, and making decades of archival footage searchable and accessible. The solutions implemented included an AI-powered live ticker providing real-time commentary in multiple languages and styles within 7 seconds of events, an intelligent metadata generation (IGM) system to analyze 9+ petabytes of historical footage using multimodal AI, automated content localization for speech-to-speech and speech-to-text translation, AI-generated "Stories" format content from existing articles, and personalized app experiences. Results demonstrated significant impact: 20% increase in overall app usage, 67% increase in articles read through personalization, 75% reduction in processing time for localized content with 5x content output, 2x increase in app dwell time from AI-generated stories, and 67% story retention rate indicating strong user engagement.

AI-Powered Financial Assistant for Automated Expense Management

Brex

Brex developed an AI-powered financial assistant to automate expense management workflows, addressing the pain points of manual data entry, policy compliance, and approval bottlenecks that plague traditional finance operations. Using Amazon Bedrock with Claude models, they built a comprehensive system that automatically processes expenses, generates compliant documentation, and provides real-time policy guidance. The solution achieved 75% automation of expense workflows, saving hundreds of thousands of hours monthly across customers while improving compliance rates from 70% to the mid-90s, demonstrating how LLMs can transform enterprise financial operations when properly integrated with existing business processes.

AI-Powered Fraud Detection in E-commerce Using AWS Fraud Detector

Awaze

E-commerce companies face significant fraud challenges, with UK e-commerce fraud reaching £1 billion stolen in 2024 despite preventing £1.5 billion. The speaker describes implementing AWS Fraud Detector, a fully managed machine learning service, to detect various fraud types including promo abuse, credit card chargeback fraud, account hijacking, and triangulation fraud. The solution uses historical labeled data to build predictive models that score orders between 0-1000 based on fraud likelihood, requiring human review for GDPR compliance. The implementation covers evaluation strategies focusing on true positives and false positives, feature engineering including geolocation enrichment, deployment options via SageMaker or Lambda, and continuous improvement through model retraining at different frequencies depending on fraud trend velocity.

AI-Powered Fraud Detection Using Mixture of Experts and Federated Learning

Feedzai

Feedzai developed TrustScore, an AI-powered fraud detection system that addresses the limitations of traditional rule-based and custom AI models in financial crime detection. The solution leverages a Mixture of Experts (MoE) architecture combined with federated learning to aggregate fraud intelligence from across Feedzai's network of financial institutions processing $8.02T in yearly transactions. Unlike traditional systems that require months of historical data and constant manual updates, TrustScore provides a zero-day, ready-to-use solution that continuously adapts to emerging fraud patterns while maintaining strict data privacy. Real-world deployments have demonstrated significant improvements in fraud detection rates and reductions in false positives compared to traditional out-of-the-box rule systems.

AI-Powered Hormonal Health Platform Built in 8 Weeks

FemmFlo

FemmFlo, a women's health tech startup, developed an LLM-powered platform to address the massive data gap in women's hormonal health, where millions of women wait over seven years for accurate diagnoses. Working with Millio AI and leveraging AWS services, they built a full MVP in just eight weeks that integrates hormonal tracking, lab diagnostics, mental health support, and personalized care recommendations through an AI agent named Gabby. The platform was designed for rapid deployment with beta users, lab integrations, and partnerships, specifically targeting underserved women with culturally relevant, localized healthcare guidance. The solution uses AWS Bedrock agents, API Gateway, DynamoDB, S3, and other managed services to deliver a scalable, cost-effective system that translates complex lab results into actionable health insights while maintaining clinical rigor through a controlled testing environment.

AI-Powered Incident Response System with Multi-Agent Investigation

Incident.io

Incident.io developed an AI SRE product to automate incident investigation and response for tech companies. The product uses a multi-agent system to analyze incidents by searching through GitHub pull requests, Slack messages, historical incidents, logs, metrics, and traces to build hypotheses about root causes. When incidents occur, the system automatically creates investigations that run parallel searches, generate findings, formulate hypotheses, ask clarifying questions through sub-agents, and present actionable reports in Slack within 1-2 minutes. The system demonstrates significant value by reducing mean time to detection and resolution while providing continuous ambient monitoring throughout the incident lifecycle, working collaboratively with human responders.

AI-Powered Insurance Claims Chatbot with Continuous Feedback Loop

Allianz

Allianz Benelux tackled their complex insurance claims process by implementing an AI-powered chatbot using Landbot. The system processed over 92,000 unique search terms, categorized insurance products, and implemented a real-time feedback loop with Slack and Trello integration. The solution achieved 90% positive ratings from 18,000+ customers while significantly simplifying the claims process and improving operational efficiency.

AI-Powered Legal Document Review and Analysis Platform

Lexbe

Lexbe, a legal document review software company, developed Lexbe Pilot, an AI-powered Q&A assistant integrated into their eDiscovery platform using Amazon Bedrock and associated AWS services. The solution addresses the challenge of legal professionals needing to analyze massive document sets (100,000 to over 1 million documents) to identify critical evidence for litigation. By implementing a RAG-based architecture with Amazon Bedrock Knowledge Bases, the system enables legal teams to query entire datasets and retrieve contextually relevant results that go beyond traditional keyword searches. Through an eight-month collaborative development process with AWS, Lexbe achieved a 90% recall rate with the final implementation, enabling the generation of comprehensive findings-of-fact reports and deep automated inference capabilities that can identify relationships and connections across multilingual document collections.

AI-Powered Marketing Content Generation and Compliance Platform at Scale

Volkswagen

Volkswagen Group Services partnered with AWS to build a production-scale generative AI platform for automotive marketing content generation and compliance evaluation. The problem was a slow, manual content supply chain that took weeks to months, created confidentiality risks with pre-production vehicles, and faced massive compliance bottlenecks across 10 brands and 200+ countries. The solution involved fine-tuning diffusion models on proprietary vehicle imagery (including digital twins from CAD), automated prompt enhancement using LLMs, and multi-stage image evaluation using vision-language models for both component-level accuracy and brand guideline compliance. Results included massive time savings (weeks to minutes), automated compliance checks across legal and brand requirements, and a reusable shared platform supporting multiple use cases across the organization.

AI-Powered Marketing Intelligence Platform Accelerates Industry Analysis

CLICKFORCE

CLICKFORCE, a digital advertising leader in Taiwan, faced challenges with generic AI outputs, disconnected internal datasets, and labor-intensive analysis processes that took two to six weeks to complete industry reports. The company built Lumos, an AI-powered marketing analysis platform using Amazon Bedrock Agents for contextualized reasoning, Amazon SageMaker for Text-to-SQL fine-tuning, Amazon OpenSearch for vector embeddings, and AWS Glue for data integration. The solution reduced industry analysis time from weeks to under one hour, achieved a 47% reduction in operational costs, and enabled multiple stakeholder groups to independently generate insights without centralized analyst teams.

AI-Powered Multi-Agent System for Global Compliance Screening at Scale

Amazon

Amazon developed an AI-driven compliance screening system to handle approximately 2 billion daily transactions across 160+ businesses globally, ensuring adherence to sanctions and regulatory requirements. The solution employs a three-tier approach: a screening engine using fuzzy matching and vector embeddings, an intelligent automation layer with traditional ML models, and an AI-powered investigation system featuring specialized agents built on Amazon Bedrock AgentCore Runtime. These agents work collaboratively to analyze matches, gather evidence, and make recommendations following standardized operating procedures. The system achieves 96% accuracy with 96% precision and 100% recall, automating decision-making for over 60% of case volume while reserving human intervention only for edge cases requiring nuanced judgment.

AI-Powered Natural Language Search for Vehicle Marketplace

Coches.net

Coches.net, Spain's leading vehicle marketplace, implemented an AI-powered natural language search system to replace traditional filter-based search. The team completed a 15-day sprint using Amazon Bedrock and Anthropic's Claude Haiku model to translate natural language queries like "family-friendly SUV for mountain trips" into structured search filters. The solution includes content moderation, few-shot prompting, and costs approximately €19 per day to operate. While user adoption remains limited, early results show that users utilizing the AI search generate more value compared to traditional search methods, demonstrating improved efficiency and user experience through automated filter application.

AI-Powered Neurosurgery: From Brain Tumor Classification to Surgical Planning

Cedars Sinai

Cedars Sinai and various academic institutions have implemented AI and machine learning solutions to improve neurosurgical outcomes across multiple areas. The applications include brain tumor classification using CNNs achieving 95% accuracy (surpassing traditional radiologists), hematoma prediction and management using graph neural networks with 80%+ accuracy, and AI-assisted surgical planning and intraoperative guidance. The implementations demonstrate significant improvements in patient outcomes while highlighting the importance of balanced innovation with appropriate regulatory oversight.

AI-Powered On-Call Assistant for Airflow Pipeline Debugging

Wix

Wix developed AirBot, an AI-powered Slack agent to address the operational burden of managing over 3,500 Apache Airflow pipelines processing 4 billion daily HTTP transactions across a 7 petabyte data lake. The traditional manual debugging process required engineers to act as "human error parsers," navigating multiple distributed systems (Airflow, Spark, Kubernetes) and spending approximately 45 minutes per incident to identify root causes. AirBot leverages LLMs (GPT-4o Mini and Claude 4.5 Opus) in a Chain of Thought architecture to automatically investigate failures, generate diagnostic reports, create pull requests with fixes, and route alerts to appropriate team owners. The system achieved measurable impact by saving approximately 675 engineering hours per month (equivalent to 4 full-time engineers), generating 180 candidate pull requests with a 15% fully automated fix rate, and reducing debugging time by at least 15 minutes per incident while maintaining cost efficiency at $0.30 per AI interaction.

AI-Powered Personalized Content Recommendations for Sports and Entertainment Venue

Golden State Warriors

The Golden State Warriors implemented a recommendation engine powered by Google Cloud's Vertex AI to personalize content delivery for their fans across multiple platforms. The system integrates event data, news content, game highlights, retail inventory, and user analytics to provide tailored recommendations for both sports events and entertainment content at Chase Center. The solution enables personalized experiences for 18,000+ venue seats while operating with limited technical resources.

AI-Powered Personalized Year-in-Review Campaign at Scale

Canva

Canva launched DesignDNA, a year-in-review campaign in December 2024 to celebrate their community's design achievements. The campaign needed to create personalized, shareable experiences for millions of users while respecting privacy constraints. Canva leveraged generative AI to match users to design trends using keyword analysis, generate design personalities, and create over a million unique personalized poems across 9 locales. The solution combined template metadata analysis, prompt engineering, content generation at scale, and automated review processes to produce 95 million unique DesignDNA stories. Each story included personalized statistics, AI-generated poems, design personality profiles, and predicted emerging design trends, all dynamically assembled using URL parameters and tagged template elements.

AI-Powered PLC Code Generation for Industrial Automation

Wipro PARI

Wipro PARI, a global automation company, partnered with AWS and ShellKode to develop an AI-powered solution that transforms the manual process of generating Programmable Logic Controller (PLC) ladder text code from complex process requirements. Using Amazon Bedrock with Anthropic's Claude models, advanced prompt engineering techniques, and custom validation logic, the system reduces PLC code generation time from 3-4 days to approximately 10 minutes per requirement while achieving up to 85% code accuracy. The solution automates validation against IEC 61131-3 industry standards, handles complex state management and transition logic, and provides a user-friendly interface for industrial engineers, resulting in 5,000 work-hours saved across projects and enabling Wipro PARI to win key automotive clients.

AI-Powered Real-Time Content Moderation with Prevalence Measurement

Pinterest

Pinterest built a real-time AI-assisted system to measure the prevalence of policy-violating content—the percentage of daily views that went to harmful content—to address the limitations of relying solely on user reports. The company developed a workflow combining ML-assisted impression-weighted sampling with multimodal LLM labeling to process daily samples at scale. This approach reduced labeling turnaround time by 15x compared to human-only review while maintaining comparable decision quality, enabling continuous monitoring across multiple policy areas, faster intervention testing, and proactive risk detection that was previously impossible with infrequent manual studies.

AI-Powered Revenue Operating System with Multi-Agent Orchestration

Rox

Rox built a revenue operating system to address the challenge of fragmented sales data across CRM, marketing automation, finance, support, and product usage systems that create silos and slow down sales teams. The solution uses Amazon Bedrock with Anthropic's Claude Sonnet 4 to power intelligent AI agent swarms that unify disparate data sources into a knowledge graph and execute multi-step GTM workflows including research, outreach, opportunity management, and proposal generation. Early customers reported 50% higher representative productivity, 20% faster sales velocity, 2x revenue per rep, 40-50% increase in average selling price, 90% reduction in prep time, and 50% faster ramp time for new reps.

AI-Powered Security Operations Center with Agentic AI for Threat Detection and Response

Trellix

Trellix, in partnership with AWS, developed an AI-powered Security Operations Center (SOC) using agentic AI to address the challenge of overwhelming security alerts that human analysts cannot effectively process. The solution leverages AWS Bedrock with multiple models (Amazon Nova for classification, Claude Sonnet for analysis) to automatically investigate security alerts, correlate data across multiple sources, and provide detailed threat assessments. The system uses a multi-agent architecture where AI agents autonomously select tools, gather context from various security platforms, and generate comprehensive incident reports, significantly reducing the burden on human analysts while improving threat detection accuracy.

AI-Powered Self-Remediation Loop for Large-Scale Kubernetes Operations

Salesforce

Salesforce's Hyperforce Kubernetes platform team manages over 1,400 clusters scaling millions of pods, facing significant operational challenges with engineers spending over 1,000 hours monthly on support tasks. They developed a multi-agent AI-powered self-remediation loop built on AWS Bedrock's multi-agent collaboration framework, integrating with their existing monitoring and automation tools (Prometheus, K8sGPT, Argo CD, and custom tools like Sloop and Periscope). The solution features a manager AI agent that orchestrates multiple specialized worker agents to retrieve telemetry data, perform root cause analysis using RAG-augmented runbooks, and execute safe remediation actions with human-in-the-loop approval via Slack. The implementation achieved a 30% improvement in troubleshooting time and saved approximately 150 hours per month in operational toil, with plans to expand capabilities using knowledge graphs and advanced anomaly detection.

AI-Powered Semantic Job Search at Scale

Linkedin

LinkedIn transformed their traditional keyword-based job search into an AI-powered semantic search system to serve 1.2 billion members. The company addressed limitations of exact keyword matching by implementing a multi-stage LLM architecture combining retrieval and ranking models, supported by synthetic data generation, GPU-optimized embedding-based retrieval, and cross-encoder ranking models. The solution enables natural language job queries like "Find software engineer jobs that are mostly remote with above median pay" while maintaining low latency and high relevance at massive scale through techniques like model distillation, KV caching, and exhaustive GPU-based nearest neighbor search.

AI-Powered Similar Issues Detection for Project Management

Linear

Linear developed a Similar Issues matching feature to address the persistent challenge of duplicate issues and backlog management in large team workflows. The solution uses large language models to generate vector embeddings that capture the semantic meaning of issue descriptions, enabling accurate detection of related or duplicate issues across their project management platform. The feature integrates at multiple touchpoints—during issue creation, in the Triage inbox, and within support integrations like Intercom—allowing teams to identify duplicates before they enter the system. The implementation uses PostgreSQL with pgvector on Google Cloud Platform for vector storage and search, with partitioning strategies to handle tens of millions of issues at scale.

AI-Powered Skills Extraction and Mapping for the LinkedIn Skills Graph

Linkedin

LinkedIn deployed a sophisticated machine learning pipeline to extract and map skills from unstructured content across their platform (job postings, profiles, resumes, learning courses) to power their Skills Graph. The solution combines token-based and semantic skill tagging using BERT-based models, multitask learning frameworks for domain-specific scoring, and knowledge distillation to serve models at scale while meeting strict latency requirements (100ms for 200 profile edits/second). Product-driven feedback loops from recruiters and job seekers continuously improve model performance, resulting in measurable business impact including 0.46% increase in predicted confirmed hires for job recommendations and 0.76% increase in PPC revenue for job search.

AI-Powered Slack Conversation Summarization System

Salesforce

Salesforce AI Research developed AI Summarist, a conversational AI-powered tool to address information overload in Slack workspaces. The system uses state-of-the-art AI to automatically summarize conversations, channels, and threads, helping users manage their information consumption based on work preferences. The solution processes messages through Slack's API, disentangles conversations, and generates concise summaries while maintaining data privacy by not storing any summarized content.

AI-Powered Supply Chain Visibility and ETA Prediction System

Toyota / IBM

Toyota partnered with IBM and AWS to develop an AI-powered supply chain visibility platform that addresses the automotive industry's challenges with delivery prediction accuracy and customer transparency. The system uses machine learning models (XGBoost, AdaBoost, random forest) for time series forecasting and regression to predict estimated time of arrival (ETA) for vehicles throughout their journey from manufacturing to dealer delivery. The solution integrates real-time event streaming, feature engineering with Amazon SageMaker, and batch inference every four hours to provide near real-time predictions. Additionally, the team implemented an agentic AI chatbot using AWS Bedrock to enable natural language queries about vehicle status. The platform provides customers and dealers with visibility into vehicle journeys through a "pizza tracker" style interface, improving customer satisfaction and enabling proactive delay management.

AI-Powered Transformation of AWS Support for Mission-Critical Workloads

Whoop

AWS Support transformed from a reactive firefighting model to a proactive AI-augmented support system to handle the increasing complexity of cloud operations. The transformation involved building autonomous agents, context-aware systems, and structured workflows powered by Amazon Bedrock and Connect to provide faster incident response and proactive guidance. WHOOP, a health wearables company, utilized AWS's new Unified Operations offering to successfully launch two new hardware products with 10x mobile traffic and 200x e-commerce traffic scaling, achieving 100% availability in May 2025 and reducing critical case response times from 8 minutes to under 2.5 minutes, ultimately improving quarterly availability from 99.85% to 99.95%.

AI-Powered Travel Assistant for Rail and Coach Platform

Trainline

Trainline, the world's leading rail and coach ticketing platform serving 27 million customers across 40 countries, developed an AI-powered travel assistant to address underserved customer needs during the travel experience. The company identified that while they excelled at selling tickets, customers lacked support during their journeys when disruptions occurred or they had questions about their travel. They built an agentic AI system using LLMs that could answer diverse customer questions ranging from refund requests to real-time train information to unusual queries like bringing pets or motorbikes on trains. The solution went from concept to production in five months, launching in February 2025, and now handles over 300,000 conversations monthly. The system uses a central orchestrator with multiple tools including RAG with 700,000 pages of curated content, real-time train data APIs, terms and conditions lookups, and automated refund capabilities, all protected by multiple layers of guardrails to ensure safety and factual accuracy.

AI-Powered Video Analysis and Highlight Generation Platform

Accenture

Accenture developed Spotlight, a scalable video analysis and highlight generation platform using Amazon Nova foundation models and Amazon Bedrock Agents to automate the creation of video highlights across multiple industries. The solution addresses the traditional bottlenecks of manual video editing workflows by implementing a multi-agent system that can analyze long-form video content and generate personalized short clips in minutes rather than hours or days. The platform demonstrates 10x cost savings over conventional approaches while maintaining quality through human-in-the-loop validation and supporting diverse use cases from sports highlights to retail personalization.

AI-Powered Video Workflow Orchestration Platform for Broadcasting

Cires21

Cires21, a Spanish live streaming services company, developed MediaCoPilot to address the fragmented ecosystem of applications used by broadcasters, which resulted in slow content delivery, high costs, and duplicated work. The solution is a unified serverless platform on AWS that integrates custom AI models for video and audio processing (ASR, diarization, scene detection) with Amazon Bedrock for generating complex metadata like subtitles, highlights, and summaries. The platform uses AWS Step Functions for orchestration, exposes capabilities via API for integration into client workflows, and recently added AI agents powered by AWS Agent Core that can handle complex multi-step tasks like finding viral moments, creating social media clips, and auto-generating captions. The architecture delivers faster time-to-market, improved scalability, and automated content workflows for broadcast clients.

AI-Powered Voice Agents for Proactive Hotel Payment Verification

Perk

Perk, a business travel management platform, faced a critical problem where virtual credit cards sent to hotels sometimes weren't charged before guest arrival, leading to catastrophic check-in experiences for exhausted travelers. To prevent this, their customer care team was making approximately 10,000 proactive phone calls per week to hotels. The team built an AI voice agent system that autonomously calls hotels to verify and request payment processing. Starting with a rapid prototype using Make.com, they iterated through extensive prompt engineering, call structure refinement, and comprehensive evaluation frameworks. The solution now successfully handles tens of thousands of calls weekly across multiple languages (English, German), matching or exceeding human performance while dramatically reducing manual workload and uncovering additional operational insights through systematic call classification.

Automated Evaluation Framework for LLM-Powered Features

Slack

Slack's machine learning team developed a comprehensive evaluation framework for their LLM-powered features, including message summarization and natural language search. They implemented a three-tiered evaluation approach using golden sets, validation sets, and A/B testing, combined with automated quality metrics to assess various aspects like hallucination detection and system integration. This framework enabled rapid prototyping and continuous improvement of their generative AI products while maintaining quality standards.

Automated GPU Kernel Generation Using LLMs and Inference-Time Scaling

NVIDIA

NVIDIA engineers developed a novel approach to automatically generate optimized GPU attention kernels using the DeepSeek-R1 language model combined with inference-time scaling. They implemented a closed-loop system where the model generates code that is verified and refined through multiple iterations, achieving 100% accuracy for Level-1 problems and 96% for Level-2 problems in Stanford's KernelBench benchmark. This approach demonstrates how additional compute resources during inference can improve code generation capabilities of LLMs.

Automated LLM Evaluation and Quality Monitoring in Customer Support Analytics

Echo AI

Echo AI, leveraging Log10's platform, developed a system for analyzing customer support interactions at scale using LLMs. They faced the challenge of maintaining accuracy and trust while processing high volumes of customer conversations. The solution combined Echo AI's conversation analysis capabilities with Log10's automated feedback and evaluation system, resulting in a 20-point F1 score improvement in accuracy and the ability to automatically evaluate LLM outputs across various customer-specific use cases.

Automated Log Classification System for Device Security Infrastructure

Palo Alto Networks

Palo Alto Networks' Device Security team faced challenges with reactively processing over 200 million daily service and application log entries, resulting in delayed response times to critical production issues. In partnership with AWS Generative AI Innovation Center, they developed an automated log classification pipeline powered by Amazon Bedrock using Anthropic's Claude Haiku model and Amazon Titan Text Embeddings. The solution achieved 95% precision in detecting production issues while reducing incident response times by 83%, transforming reactive log monitoring into proactive issue detection through intelligent caching, context-aware classification, and dynamic few-shot learning.

Automated Performance Optimization with GenAI-Powered Code Analysis

Uber

Uber developed PerfInsights to address unsustainable compute costs from inefficient Go services, where traditionally manual performance optimization required deep expertise and days or weeks of effort. The system combines runtime CPU/memory profiling with GenAI-powered static analysis to automatically detect performance antipatterns in Go code, using LLM juries and rule-based validation (LLMCheck) to reduce hallucinations and false positives from over 80% to the low teens. Since deployment, PerfInsights has generated hundreds of merged optimization diffs, reduced antipattern detection time by 93% (from 14.5 hours to under 1 hour per issue), eliminated approximately 3,800 hours of manual engineering effort annually, and achieved a 33.5% reduction in codebase antipatterns over four months while delivering measurable compute cost savings.

Automated Product Attribute Extraction and Title Standardization Using Agentic AI

Delivery Hero

Delivery Hero Quick Commerce faced significant challenges managing vast product catalogs across multiple platforms and regions, where manual verification of product attributes was time-consuming, costly, and error-prone. They implemented an agentic AI system using Large Language Models to automatically extract 22 predefined product attributes from vendor-provided titles and images, then generate standardized product titles conforming to their format. Using a predefined agent architecture with two sequential LLM components, optimized through prompt engineering, Teacher/Student knowledge distillation for the title generation step, and confidence scoring for quality control, the system achieved significant improvements in efficiency, accuracy, data quality, and customer satisfaction while maintaining cost-effectiveness and predictability.

Automated Product Classification and Attribute Extraction Using Vision LLMs

Shopify

Shopify tackled the challenge of automatically understanding and categorizing millions of products across their platform by implementing a multi-step Vision LLM solution. The system extracts structured product information including categories and attributes from product images and descriptions, enabling better search, tax calculation, and recommendations. Through careful fine-tuning, evaluation, and cost optimization, they scaled the solution to handle tens of millions of predictions daily while maintaining high accuracy and managing hallucinations.

Automated Prompt Optimization for Intelligent Text Processing using Amazon Bedrock

Yuewen Group

Yuewen Group, a global online literature platform, transitioned from traditional NLP models to Claude 3.5 Sonnet on Amazon Bedrock for intelligent text processing. Initially facing challenges with unoptimized prompts performing worse than traditional models, they implemented Amazon Bedrock's Prompt Optimization feature to automatically enhance their prompts. This led to significant improvements in accuracy for tasks like character dialogue attribution, achieving 90% accuracy compared to the previous 70% with unoptimized prompts and 80% with traditional NLP models.

Automated Sign Language Translation Using Large Language Models

VSL Labs

VSL Labs is developing an automated system for translating English into American Sign Language (ASL) using generative AI models. The solution addresses the significant challenges faced by the deaf community, including limited availability and high costs of human interpreters. Their platform uses a combination of in-house and GPT-4 models to handle text processing, cultural adaptation, and generates precise signing instructions including facial expressions and body movements for realistic avatar-based sign language interpretation.

Automated Software Development Insights and Communication Platform

Blueprint AI

Blueprint AI addresses the challenge of communication and understanding between business and technical teams in software development by leveraging LLMs. The platform automatically analyzes data from various sources like GitHub and Jira, creating intelligent reports that surface relevant insights, track progress, and identify potential blockers. The system provides 24/7 monitoring and context-aware updates, helping teams stay informed about development progress without manual reporting overhead.

Automating Merchant Onboarding with Reinforcement Learning

Doordash

DoorDash faced challenges with menu accuracy during merchant onboarding, where their existing AI system struggled with diverse and messy real-world menu formats. Working with Applied Compute, they developed an automated grading system calibrated to internal expert standards, then used reinforcement learning to train a menu error correction model against this grader as a reward function. The solution achieved a 30% relative reduction in low-quality menus and was rolled out to all USA menu traffic, demonstrating how institutional knowledge can be encoded into automated training signals for production LLM systems.

Automating Supplier Ticket Management with LLM Agents

Wayfair

Wayfair developed Wilma, an LLM-based ticket automation system, to automate the manual triage of supplier support tickets in their SupportHub JIRA-based system. The solution uses LangGraph to orchestrate LLM calls and tool interactions for intent classification, language detection, and supplier ID lookup through a ReAct agent with BigQuery access. The system achieved better-than-human performance with 93% accuracy on question type identification (vs. 75% human accuracy), 98% on language detection, and 88% on supplier ID identification, while reducing processing time and allowing associates to focus on higher-value work.

Automating Translation Workflows with LLMs for Post-Editing and Transcreation

TransPerfect

TransPerfect integrated Amazon Bedrock into their GlobalLink translation management system to automate and improve translation workflows. The solution addressed two key challenges: automating post-editing of machine translations and enabling AI-assisted transcreation of creative content. By implementing LLM-powered workflows, they achieved up to 50% cost savings in translation post-editing, 60% productivity gains in transcreation, and up to 80% reduction in project turnaround times while maintaining high quality standards.

Automating Video Ad Classification with GenAI

MediaRadar | Vivvix

MediaRadar | Vivvix faced challenges with manual video ad classification and fragmented workflows that couldn't keep up with growing ad volumes. They implemented a solution using Databricks Mosaic AI and Apache Spark Structured Streaming to automate ad classification, combining GenAI models with their own classification systems. This transformation enabled them to process 2,000 ads per hour (up from 800), reduced experimentation time from 2 days to 4 hours, and significantly improved the accuracy of insights delivered to customers.

Automating Weather Forecast Text Generation Using Fine-Tuned Vision-Language Models

UK MetOffice

The UK Met Office partnered with AWS to automate the generation of the Shipping Forecast, a 100-year-old maritime weather forecast that traditionally required expert meteorologists several hours daily to produce. The solution involved fine-tuning Amazon Nova foundation models (both LLM and vision-language model variants) to convert complex multi-dimensional weather data into structured text forecasts. Within four weeks of prototyping, they achieved 52-62% accuracy using vision-language models and 62% accuracy using text-based LLMs, reducing forecast generation time from hours to under 5 minutes. The project demonstrated scalable architectural patterns for data-to-text conversion tasks involving massive datasets (45GB+ per forecast run) and established frameworks for rapid experimentation with foundation models in production weather services.

Autonomous Coding Agent Evolution: From Short-Burst to Extended Runtime Operations

Replit

Replit evolved their AI coding agent from V1 (running autonomously for only a couple of minutes) to V2 (running for 10-15 minutes of productive work) through significant rearchitecting and leveraging new frontier models. The company focuses on enabling non-technical users to build complete applications without writing code, emphasizing performance and cost optimization over latency while maintaining comprehensive observability through tools like Langsmith to manage the complexity of production AI agents at scale.

Autonomous Network Operations Using Agentic AI

British Telecom

British Telecom (BT) partnered with AWS to deploy agentic AI systems for autonomous network operations across their 5G standalone mobile network infrastructure serving 30 million subscribers. The initiative addresses major operational challenges including high manual operations costs (up to 20% of revenue), complex failure diagnosis in containerized networks with 20,000 macro sites generating petabytes of data, and difficulties in change impact analysis with 11,000 weekly network changes. The solution leverages AWS Bedrock Agent Core, Amazon SageMaker for multivariate anomaly detection, Amazon Neptune for network topology graphs, and domain-specific community agents for root cause analysis and service impact assessment. Early results focus on cost reduction through automation, improved service level agreements, faster customer impact identification, and enhanced change efficiency, with plans to expand coverage optimization, dynamic network slicing, and further closed-loop automation across all network domains.

Autonomous Software Development Using Multi-Model LLM System with Advanced Planning and Tool Integration

Factory.ai

Factory.ai has developed Code Droid, an autonomous software development system that leverages multiple LLMs and sophisticated planning capabilities to automate various programming tasks. The system incorporates advanced features like HyperCode for codebase understanding, ByteRank for information retrieval, and multi-model sampling for solution generation. In benchmark testing, Code Droid achieved 19.27% on SWE-bench Full and 31.67% on SWE-bench Lite, demonstrating strong performance in real-world software engineering tasks while maintaining focus on safety and explainability.

Autonomous SRE Agent for Cloud Infrastructure Monitoring Using FastMCP

FuzzyLabs

FuzzyLabs developed an autonomous Site Reliability Engineering (SRE) agent using Anthropic's Model Context Protocol (MCP) with FastMCP to automate the diagnosis of production incidents in cloud-native applications. The agent integrates with Kubernetes, GitHub, and Slack to automatically detect issues, analyze logs, identify root causes in source code, and post diagnostic summaries to development teams. While the proof-of-concept successfully demonstrated end-to-end incident response automation using a custom MCP client with optimizations like tool caching and filtering, the project raises important questions about effectiveness measurement, security boundaries, and cost optimization that require further research.

AWS Trainium & Metaflow: Democratizing Large-Scale ML Training Through Infrastructure Evolution

Outerbounds / AWS

The key lesson from this meetup is that we're seeing a fundamental shift in how organizations can approach large-scale ML training and deployment. Through the combination of purpose-built hardware (AWS Trainium/Inferentia) and modern MLOps frameworks (Metaflow), teams can now achieve enterprise-grade ML infrastructure without requiring deep expertise in distributed systems. The traditional approach of having ML experts manually manage infrastructure is being replaced by more automated, standardized workflows that integrate with existing software delivery practices. This democratization is enabled by significant cost reductions (up to 50-80% compared to traditional GPU deployments), simplified deployment patterns through tools like Optimum Neuron, and the ability to scale from small experiments to massive distributed training with minimal code changes. Perhaps most importantly, the barrier to entry for sophisticated ML infrastructure has been lowered to the point where even small teams can leverage these tools effectively.

BERT-Based Sequence Models for Contextual Product Recommendations

Instacart

Instacart built a centralized contextual retrieval system powered by BERT-like transformer models to provide real-time product recommendations across multiple shopping surfaces including search, cart, and item detail pages. The system replaced disparate legacy retrieval systems that relied on ad-hoc combinations of co-occurrence, similarity, and popularity signals with a unified approach that predicts next-product probabilities based on in-session user interaction sequences. The solution achieved a 30% lift in user cart additions for cart recommendations, 10-40% improvement in Recall@K metrics over randomized sequence baselines, and enabled deprecation of multiple legacy ad-hoc retrieval systems while serving both ads and organic recommendation surfaces.

Best Practices for Building Production-Grade MCP Servers for AI Agents

Prefect

This case study presents best practices for designing and implementing Model Context Protocol (MCP) servers for AI agents in production environments, addressing the widespread problem of poorly designed MCP servers that fail to account for agent-specific constraints. The speaker, founder and CEO of Prefect Technologies and creator of fastmcp (a widely-adopted framework downloaded 1.5 million times daily), identifies key design principles including outcome-oriented tool design, flattened arguments, comprehensive documentation, token budget management, and ruthless curation. The solution involves treating MCP servers as agent-optimized user interfaces rather than simple REST API wrappers, acknowledging fundamental differences between human and agent capabilities in discovery, iteration, and context management. Results include actionable guidelines that have shaped the MCP ecosystem, with the fastmcp framework becoming the de facto standard for building MCP servers and influencing the official Anthropic SDK design.

Best Practices for LLM Production Deployments: Evaluation, Prompt Management, and Fine-tuning

HumanLoop

HumanLoop, based on their experience working with companies from startups to large enterprises like Jingo, shares key lessons for successful LLM deployment in production. The talk emphasizes three critical aspects: systematic evaluation frameworks for LLM applications, treating prompts as serious code artifacts requiring proper versioning and collaboration, and leveraging fine-tuning for improved performance and cost efficiency. The presentation uses GitHub Copilot as a case study of successful LLM deployment at scale.

BM25 vs Vector Search for Large-Scale Code Repository Search

Github

Github faces the challenge of providing efficient search across 100+ billion documents while maintaining low latency and supporting diverse search use cases. They chose BM25 over vector search due to its computational efficiency, zero-shot capabilities, and ability to handle diverse query types. The solution involves careful optimization of search infrastructure, including strategic data routing and field-specific indexing approaches, resulting in a system that effectively serves Github's massive scale while keeping costs manageable.

Build vs. Buy AI Agents: Enterprise Deployment Lessons from 1,000+ Companies

Dust

Dust, an AI agent platform company, shares insights from deploying AI agents across over 1,000 enterprise customers to address the common build-versus-buy dilemma. The case study explores the hidden costs of building custom AI infrastructure—including longer time-to-value (6-12 months underestimation), ongoing maintenance burden, and opportunity costs that divert engineering resources from core business objectives. Multiple customer examples demonstrate that buying a platform enabled rapid deployment (20 minutes to functional agents at November Five, 70% adoption in two months at Wakam, 95% adoption in 90 days at Ardabelle) with enterprise-grade security, continuous improvements, and significant productivity gains. The study advocates that most companies should buy AI infrastructure and focus engineering talent on competitive differentiation, though building may make sense for truly unique requirements or when AI infrastructure is the core product itself.

Building a Bot Factory: Standardizing AI Agent Development with Multi-Agent Architecture

AutoScout24

AutoScout24, Europe's leading automotive marketplace, addressed the challenge of fragmented AI experimentation across their organization by building a "Bot Factory" - a standardized framework for creating and deploying AI agents. The initial use case targeted internal developer support, where platform engineers were spending 30% of their time on repetitive tasks like answering questions and granting access. By partnering with AWS, they developed a serverless, event-driven architecture using Amazon Bedrock AgentCore, Knowledge Bases, and the Strands Agents SDK to create a multi-agent system that handles both knowledge retrieval (RAG) and action execution. The solution produced a production-ready Slack support bot and a reusable blueprint that enables teams across the organization to rapidly build secure, scalable AI agents without reinventing infrastructure.

Building a Complex AI Answer Engine with Multi-Step Reasoning

Perplexity

Perplexity developed Pro Search, an advanced AI answer engine that handles complex, multi-step queries by breaking them down into manageable steps. The system combines careful prompt engineering, step-by-step planning and execution, and an interactive UI to deliver precise answers. The solution resulted in a 50% increase in query search volume, demonstrating its effectiveness in handling complex research questions efficiently.

Building a Comprehensive AI Platform with SageMaker and Bedrock for Experience Management

Qualtrics

Qualtrics built Socrates, an enterprise-level ML platform, to power their experience management solutions. The platform leverages Amazon SageMaker and Bedrock to enable the full ML lifecycle, from data exploration to model deployment and monitoring. It includes features like the Science Workbench, AI Playground, unified GenAI Gateway, and managed inference APIs, allowing teams to efficiently develop, deploy, and manage AI solutions while achieving significant cost savings and performance improvements through optimized inference capabilities.

Building a Comprehensive LLM Platform for Food Delivery Services

Swiggy

Swiggy implemented various generative AI solutions to enhance their food delivery platform, focusing on catalog enrichment, review summarization, and vendor support. They developed a platformized approach with a middle layer for GenAI capabilities, addressing challenges like hallucination and latency through careful model selection, fine-tuning, and RAG implementations. The initiative showed promising results in improving customer experience and operational efficiency across multiple use cases including image generation, text descriptions, and restaurant partner support.

Building a Context-Aware AI Assistant with RAG for Developer Support

Vectorize

Vectorize, a platform for building RAG pipelines, faced a challenge where users frequently asked questions already answered in their documentation but were reluctant to leave the UI to search for answers. To address this, they built an AI assistant integrated directly into their product interface using RAG technology. The solution leverages their own platform to ingest documentation from multiple sources (docs site, Discord, Intercom), implements context-sensitive retrieval using page topics, employs reranking models to filter irrelevant results, and uses anti-hallucination prompting with Llama 3.1 70B on Groq. The resulting assistant provides users with immediate, contextually relevant answers without requiring them to leave their workflow, while the system continuously improves as new support content and documentation are added.

Building a Conversational AI Agent for Slack Integration

Linear

Linear, a project management tool for product teams, developed an experimental AI agent that operates within Slack to allow users to create issues and query workspace data without leaving their communication platform. The project faced challenges around balancing context provision to the LLM, maintaining conversation continuity, and determining appropriate boundaries between LLM-driven decisions and programmatic logic. The team solved these issues by providing localized context (10 messages) rather than full conversation history, splitting the system early to distinguish between issue creation and data lookup requests, and limiting LLM involvement to tasks it excels at (summarization, title generation) while handling complex business logic programmatically. This approach resulted in higher accuracy for issue creation, faster response times, and improved user satisfaction as the agent could quickly generate well-formed issues that users could then refine manually.

Building a Conversational Shopping Assistant with Multi-Modal Search and Agent Architecture

OLX

OLX developed "OLX Magic", a conversational AI shopping assistant for their secondhand marketplace. The system combines traditional search with LLM-powered agents to handle natural language queries, multi-modal searches (text, image, voice), and comparative product analysis. The solution addresses challenges in e-commerce personalization and search refinement, while balancing user experience with technical constraints like latency and cost. Key innovations include hybrid search combining keyword and semantic matching, visual search with modifier capabilities, and an agent architecture that can handle both broad and specific queries.

Building a Custom LLM for Automated Documentation Generation

Databricks

Databricks developed an AI-generated documentation feature for automatically documenting tables and columns in Unity Catalog. After initially using SaaS LLMs that faced challenges with quality, performance, and cost, they built a custom fine-tuned 7B parameter model in just one month with two engineers and less than $1,000 in compute costs. The bespoke model achieved better quality than cheaper SaaS alternatives, 10x cost reduction, and higher throughput, now powering 80% of table metadata updates on their platform.

Building a Custom Vision LLM for Document Processing at Scale

Grab

Grab developed a custom lightweight vision LLM to address the challenges of extracting information from diverse user-submitted documents like ID cards and driver's licenses across Southeast Asia. Traditional OCR systems struggled with the variety of document templates and languages, while proprietary LLMs had high latency and poor SEA language support. The team fine-tuned and ultimately built a custom ~1B parameter vision LLM from scratch, achieving performance comparable to larger 2B models while significantly reducing latency. The solution involved a four-stage training process using synthetic OCR datasets, an auto-labeling framework called Documint, and full-parameter fine-tuning, resulting in dramatic accuracy improvements (+70pp for Thai, +40pp for Vietnamese) and establishing a unified model to replace traditional OCR pipelines.

Building a Generic Recommender System API with Privacy-First Design

Slack

Slack developed a generic recommendation API to serve multiple internal use cases for recommending channels and users. They started with a simple API interface hiding complexity, used hand-tuned models for cold starts, and implemented strict privacy controls to protect customer data. The system achieved over 10% improvement when switching from hand-tuned to ML models while maintaining data privacy and gaining internal customer trust through rapid iteration cycles.

Building a Global Product Catalogue with Multimodal LLMs at Scale

Shopify

Shopify addressed the challenge of fragmented product data across millions of merchants by building a Global Catalogue using multimodal LLMs to standardize and enrich billions of product listings. The system processes over 10 million product updates daily through a four-layer architecture involving product data foundation, understanding, matching, and reconciliation. By fine-tuning open-source vision language models and implementing selective field extraction, they achieve 40 million LLM inferences daily with 500ms median latency while reducing GPU usage by 40%. The solution enables improved search, recommendations, and conversational commerce experiences across Shopify's ecosystem.

Building a Healthcare Copilot for Biology and Life Science Research

Owkin

Owkin, a company focused on drug discovery and AI for healthcare, developed a copilot system in four months to help biology and life science researchers navigate complex healthcare data and answer scientific questions. The system addresses challenges unique to healthcare including strict regulations, semantic complexity, and data sensitivity by implementing two main tools: a text-to-SQL system that queries structured biological databases (using natural language to SQL translation with Polars), and a RAG-based literature search tool that retrieves relevant information from PubMed's 26 million abstracts. The copilot was deployed for academic researchers with monitoring via LangFuse and OpenTelemetry, though the team faced challenges with evaluation in a domain where questions rarely have binary answers, and noted that frameworks and models change rapidly in the LLM space.

Building a High-Quality RAG-based Support System with LLM Guardrails and Quality Monitoring

Doordash

Doordash implemented a RAG-based chatbot system to improve their Dasher support automation, replacing a traditional flow-based system. They developed a comprehensive quality control approach combining LLM Guardrail for real-time response verification, LLM Judge for quality monitoring, and an iterative improvement pipeline. The system successfully reduced hallucinations by 90% and severe compliance issues by 99%, while handling thousands of support requests daily and allowing human agents to focus on more complex cases.

Building a Hybrid Cloud AI Infrastructure for Large-Scale ML Inference

Roblox

Roblox underwent a three-phase transformation of their AI infrastructure to support rapidly growing ML inference needs across 250+ production models. They built a comprehensive ML platform using Kubeflow, implemented a custom feature store, and developed an ML gateway with vLLM for efficient large language model operations. The system now processes 1.5 billion tokens weekly for their AI Assistant, handles 1 billion daily personalization requests, and manages tens of thousands of CPUs and over a thousand GPUs across hybrid cloud infrastructure.

Building a Hyper-Personalized Food Ordering Agent for E-commerce at Scale

iFood

iFood, Brazil's largest food delivery platform with 160 million monthly orders and 55 million users, built ISO, an AI agent designed to address the paradox of choice users face when ordering food. The agent uses hyper-personalization based on user behavior, interprets complex natural language intents, and autonomously takes actions like applying coupons, managing carts, and processing payments. Deployed on both the iFood app and WhatsApp, ISO handles millions of users while maintaining sub-10 second P95 latency through aggressive prompt optimization, context window management, and intelligent tool routing. The team achieved this by moving from a 30-second to a 10-second P95 latency through techniques including asynchronous processing, English-only prompts to avoid tokenization penalties, and deflating bloated system prompts by improving tool naming conventions.

Building a Low-Latency Global Code Completion Service

Github

Github built Copilot, a global code completion service handling hundreds of millions of daily requests with sub-200ms latency. The system uses a proxy architecture to manage authentication, handle request cancellation, and route traffic to the nearest available LLM model. Key innovations include using HTTP/2 for efficient connection management, implementing a novel request cancellation system, and deploying models across multiple global regions for improved latency and reliability.

Building a Multi-Agent Healthcare Analytics Assistant with LLM-Powered Natural Language Queries

Komodo Health

Komodo Health, a company with a large database of anonymized American patient medical events, developed an AI assistant over two years to answer complex healthcare analytics queries through natural language. The system evolved from a simple chaining architecture with fine-tuned models to a sophisticated multi-agent system using a supervisor pattern, where an intelligent agent-based supervisor routes queries to either deterministic workflows or sub-agents as needed. The architecture prioritizes trust by ensuring raw database outputs are presented directly to users rather than LLM-generated content, with LLMs primarily handling natural language to structured query conversion and explanations. The production system balances autonomous AI capabilities with control, avoiding the cost and latency issues of pure agentic approaches while maintaining flexibility for unexpected user queries.

Building a Multi-Agent LLM Platform for Customer Service Automation

Deutsche Telekom

Deutsche Telekom developed a comprehensive multi-agent LLM platform to automate customer service across multiple European countries and channels. They built their own agent computing platform called LMOS to manage agent lifecycles, routing, and deployment, moving away from traditional chatbot approaches. The platform successfully handled over 1 million customer queries with an 89% acceptable answer rate and showed 38% better performance compared to vendor solutions in A/B testing.

Building a Multi-Agent Research System for Complex Information Tasks

Anthropic

Anthropic developed a production multi-agent system for their Claude Research feature that uses multiple specialized AI agents working in parallel to conduct complex research tasks across web and enterprise sources. The system employs an orchestrator-worker architecture where a lead agent coordinates and delegates to specialized subagents that operate simultaneously, achieving 90.2% performance improvement over single-agent systems on internal evaluations. The implementation required sophisticated prompt engineering, robust evaluation frameworks, and careful production engineering to handle the stateful, non-deterministic nature of multi-agent interactions at scale.

Building a Multi-Model LLM API Marketplace and Infrastructure Platform

OpenRouter

OpenRouter was founded in early 2023 to address the fragmented landscape of large language models by creating a unified API marketplace that aggregates over 400 models from 60+ providers. The company identified that the LLM inference market would not be winner-take-all, and built infrastructure to normalize different model APIs, provide intelligent routing, caching, and uptime guarantees. Their platform enables developers to switch between models with near-zero switching costs while providing better prices, uptime, and choice compared to using individual model providers directly.

Building a Multi-Model LLM Marketplace and Routing Platform

OpenRouter

OpenRouter was founded in 2023 to address the challenge of choosing between rapidly proliferating language models by creating a unified API marketplace that aggregates over 400 models from 60+ providers. The platform solves the problem of model selection, provider heterogeneity, and high switching costs by providing normalized access, intelligent routing, caching, and real-time performance monitoring. Results include 10-100% month-over-month growth, sub-30ms latency, improved uptime through provider aggregation, and evidence that the AI inference market is becoming multi-model rather than winner-take-all.

Building a Multi-Provider GenAI Gateway for Enterprise-Scale LLM Access

Grab

Grab developed an AI Gateway to provide centralized, secure access to multiple GenAI providers (including OpenAI, Azure, AWS Bedrock, and Google VertexAI) for their internal developers. The gateway handles authentication, cost management, auditing, and rate limiting while providing a unified API interface. Since its launch in 2023, it has enabled over 300 unique use cases across the organization, from real-time audio analysis to content moderation, while maintaining security and cost efficiency through centralized management.

Building a Next-Generation AI-Enhanced Code Editor with Real-Time Inference

Cursor

Cursor built a modern AI-enhanced code editor by forking VS Code and incorporating advanced LLM capabilities. Their approach focused on creating a more responsive and predictive coding environment that goes beyond simple autocompletion, using techniques like mixture of experts (MoE) models, speculative decoding, and sophisticated caching strategies. The editor aims to eliminate low-entropy coding actions and predict developers' next actions, while maintaining high performance and low latency.

Building a Next-Generation AI-Powered Code Editor

Cursor

Cursor, founded by MIT graduates, developed an AI-powered code editor that goes beyond simple code completion to reimagine how developers interact with AI while coding. By focusing on innovative features like instructed edits and codebase indexing, along with developing custom models for specific tasks, they achieved rapid growth to $100M in revenue. Their success demonstrates how combining frontier LLMs with custom-trained models and careful UX design can transform developer productivity.

Building a Production AI Translation and Lip-Sync System at Scale

Meta

Meta developed an AI-powered system for automatically translating and lip-syncing video content across multiple languages. The system combines Meta's Seamless universal translator model with custom lip-syncing technology to create natural-looking translated videos while preserving the original speaker's voice characteristics and emotions. The solution includes comprehensive safety measures, complex model orchestration, and handles challenges like background noise and timing alignment. Early alpha testing shows 90% eligibility rates for submitted content and meaningful increases in content impressions due to expanded language accessibility.

Building a Production Coding Agent Model with Speed and Intelligence

Cursor

Cursor developed Composer, a specialized coding agent model designed to balance speed and intelligence for real-world software engineering tasks. The challenge was creating a model that could perform at near-frontier levels while being four times more efficient at token generation than comparable models, moving away from the "airplane Wi-Fi" problem where agents were either too slow for synchronous work or required long async waits. The solution involved extensive reinforcement learning (RL) training in an environment that closely mimicked production, using custom kernels for low-precision training, parallel tool calling capabilities, semantic search with custom embeddings, and a fleet of cloud VMs to simulate the real Cursor IDE environment. The result was a model that performs close to frontier models like GPT-4.5 and Claude Sonnet 3.5 on coding benchmarks while maintaining significantly faster token generation, enabling developers to stay in flow state rather than context-switching during long agent runs.

Building a Production Fantasy Football AI Assistant in 8 Weeks

NFL

The NFL, in collaboration with AWS Generative AI Innovation Center, developed a fantasy football AI assistant for NFL Plus users that went from concept to production in just 8 weeks. Fantasy football managers face overwhelming amounts of data and conflicting expert advice, making roster decisions stressful and time-consuming. The team built an agentic AI system using Amazon Bedrock, Strands Agent framework, and Model Context Protocol (MCP) to provide analyst-grade fantasy advice in under 5 seconds, achieving 90% analyst approval ratings. The system handles complex multi-step reasoning, accesses NFL NextGen Stats data through semantic data layers, and successfully manages peak Sunday traffic loads with zero reported incidents in the first month of 10,000+ questions.

Building a Production MCP Server for AI Assistant Integration

Hugging Face

Hugging Face developed an official Model Context Protocol (MCP) server to enable AI assistants to access their AI model hub and thousands of AI applications through a simple URL. The team faced complex architectural decisions around transport protocols, choosing Streamable HTTP over deprecated SSE transport, and implementing a stateless, direct response configuration for production deployment. The server provides customizable tools for different user types and integrates seamlessly with existing Hugging Face infrastructure including authentication and resource quotas.

Building a Production Voice AI Agent for Customer Support in 100 Days

Intercom

Intercom developed Finn Voice, a voice AI agent for phone-based customer support, in approximately 100 days. The solution builds on their existing text-based AI agent Finn, which already served over 5,000 customers with a 56% average resolution rate. Finn Voice handles phone calls, answers customer questions using knowledge base content, and escalates to human agents when needed. The system uses a speech-to-text, language model, text-to-speech architecture with RAG capabilities and achieved deployment across several enterprise customers' main phone lines, offering significant cost savings compared to human-only support.

Building a Production-Grade GenAI Customer Support Assistant with Comprehensive Observability

Elastic

Elastic developed a customer support chatbot using generative AI and RAG, focusing heavily on production-grade observability practices. They implemented a comprehensive observability strategy using Elastic's own stack, including APM traces, custom dashboards, alerting systems, and detailed monitoring of LLM interactions. The system successfully launched with features like streaming responses, rate limiting, and abuse prevention, while maintaining high reliability through careful monitoring of latency, errors, and usage patterns.

Building a Production-Grade LLM Orchestration System for Conversational Search

Perplexity

Perplexity has built a conversational search engine that combines LLMs with various tools and knowledge sources. They tackled key challenges in LLM orchestration including latency optimization, hallucination prevention, and reliable tool integration. Through careful engineering and prompt management, they reduced query latency from 6-7 seconds to near-instant responses while maintaining high quality results. The system uses multiple specialized LLMs working together with search indices, tools like Wolfram Alpha, and custom embeddings to deliver personalized, accurate responses at scale.

Building a Production-Ready AI Phone Call Assistant with Multi-Modal Processing

RealChar

RealChar is developing an AI assistant that can handle customer service phone calls on behalf of users, addressing the frustration of long wait times and tedious interactions. The system uses a complex architecture combining traditional ML and generative AI, running multiple models in parallel through an event bus system, with fallback mechanisms for reliability. The solution draws inspiration from self-driving car systems, implementing real-time processing of multiple input streams and maintaining millisecond-level observability.

Building a Resilient Embedding System for Semantic Search

Airtable

Airtable built a production-scale embedding system to enable semantic search across customer data, allowing teams to ask questions like "find past campaigns similar to this one" or "find engineers whose expertise matches this project." The system manages the complete lifecycle of embeddings including generation, storage, consistency tracking, and migrations while handling the challenge of maintaining eventual consistency between their primary in-memory database (MemApp) and a separate vector database. Their approach centers on a flexible "embedding config" abstraction and a reset-based strategy for handling migrations and failures, trading off temporary downtime and regeneration costs for operational simplicity and resilience across diverse scenarios like database migrations, model changes, and data residency requirements.

Building a Rust-Based AI Agentic Framework for Multimodal Data Quality Monitoring

Zectonal

Zectonal, a data quality monitoring company, developed a custom AI agentic framework in Rust to scale their multimodal data inspection capabilities beyond traditional rules-based approaches. The framework enables specialized AI agents to autonomously call diagnostic function tools for detecting defects, errors, and anomalous conditions in large datasets, while providing full audit trails through "Agent Provenance" tracking. The system supports multiple LLM providers (OpenAI, Anthropic, Ollama) and can operate both online and on-premise, packaged as a single binary executable that the company refers to as their "genie-in-a-binary."

Building a Scalable Chatbot Platform with Edge Computing and Multi-Layer Security

Fastmind

Fastmind developed a chatbot builder platform that focuses on scalability, security, and performance. The solution combines edge computing via Cloudflare Workers, multi-layer rate limiting, and a distributed architecture using Next.js, Hono, and Convex. The platform uses Cohere's AI models and implements various security measures to prevent abuse while maintaining cost efficiency for thousands of users.

Building a Scalable Conversational Video Agent with LangGraph and Twelve Labs APIs

Jockey

Jockey is an open-source conversational video agent that leverages LangGraph and Twelve Labs' video understanding APIs to process and analyze video content intelligently. The system evolved from v1.0 to v1.1, transitioning from basic LangChain to a more sophisticated LangGraph architecture, enabling better scalability and precise control over video workflows through a multi-agent system consisting of a Supervisor, Planner, and specialized Workers.

Building a Scalable LLM Gateway for E-commerce Recommendations

Mercado Libre

Mercado Libre developed a centralized LLM gateway to handle large-scale generative AI deployments across their organization. The gateway manages multiple LLM providers, handles security, monitoring, and billing, while supporting 50,000+ employees. A key implementation was a product recommendation system that uses LLMs to generate personalized recommendations based on user interactions, supporting multiple languages across Latin America.

Building a Scalable ML Platform with Metaflow for Distributed LLM Training

Autodesk

Autodesk built a machine learning platform from scratch using Metaflow as the foundation for their managed training infrastructure. The platform enables data scientists to construct end-to-end ML pipelines, with particular focus on distributed training of large language models. They successfully integrated AWS services, implemented security measures, and created a user-friendly interface that supported both experimental and production workflows. The platform has been rolled out to 50 users and demonstrated successful fine-tuning of large language models, including a 6B parameter model in 50 minutes using 16 A10 GPUs.

Building a Scalable Retriever-Ranker Architecture: Malt's Journey with Vector Databases and LLM-Powered Freelancer Matching

Malt

Malt's implementation of a retriever-ranker architecture for their freelancer recommendation system, leveraging a vector database (Qdrant) to improve matching speed and scalability. The case study highlights the importance of carefully selecting and integrating vector databases in LLM-powered systems, emphasizing performance benchmarking, filtering capabilities, and deployment considerations to achieve significant improvements in response times and recommendation quality.

Building a Search Engine for AI Agents: Infrastructure, Product Development, and Production Deployment

Exa.ai

Exa.ai has built the first search engine specifically designed for AI agents rather than human users, addressing the fundamental problem that existing search engines like Google are optimized for consumer clicks and keyword-based queries rather than semantic understanding and agent workflows. The company trained its own models, built its own index, and invested heavily in compute infrastructure (including purchasing their own GPU cluster) to enable meaning-based search that returns raw, primary data sources rather than listicles or summaries. Their solution includes both an API for developers building AI applications and an agentic search tool called Websites that can find and enrich complex, multi-criteria queries. The results include serving hundreds of millions of queries across use cases like sales intelligence, recruiting, market research, and research paper discovery, with 95% inbound growth and expanding from 7 to 28+ employees within a year.

Building a Silicon Brain for Universal Enterprise Search

Dropbox

Dropbox is transforming from a file storage company to an AI-powered universal search and organization platform. Through their Dash product, they are implementing LLM-powered search and organization capabilities across enterprise content, while maintaining strict data privacy and security. The engineering approach combines open-source LLMs, custom inference stacks, and hybrid architectures to deliver AI features to 700M+ users cost-effectively.

Building a Structured AI Evaluation Framework for Educational Tools

Coursera

Coursera developed a robust AI evaluation framework to support the deployment of their Coursera Coach chatbot and AI-assisted grading tools. They transitioned from fragmented offline evaluations to a structured four-step approach involving clear evaluation criteria, curated datasets, combined heuristic and model-based scoring, and rapid iteration cycles. This framework resulted in faster development cycles, increased confidence in AI deployments, and measurable improvements in student engagement and course completion rates.

Building a Tool Calling Platform for LLM Agents

Arcade AI

Arcade AI developed a comprehensive tool calling platform to address key challenges in LLM agent deployments. The platform provides a dedicated runtime for tools separate from orchestration, handles authentication and authorization for agent actions, and enables scalable tool management. It includes three main components: a Tool SDK for easy tool development, an engine for serving APIs, and an actor system for tool execution, making it easier to deploy and manage LLM-powered tools in production.

Building a Universal Search Product with RAG and AI Agents

Dropbox

Dropbox developed Dash, a universal search and knowledge management product that addresses the challenges of fragmented business data across multiple applications and formats. The solution combines retrieval-augmented generation (RAG) and AI agents to provide powerful search capabilities, content summarization, and question-answering features. They implemented a custom Python interpreter for AI agents and developed a sophisticated RAG system that balances latency, quality, and data freshness requirements for enterprise use.

Building a Voice Assistant from Open Source LLMs: A Home Project Case Study

Weights & Biases

A developer built a custom voice assistant similar to Alexa using open-source LLMs, demonstrating the journey from prototype to production-ready system. The project used Whisper for speech recognition and various LLM models (Llama 2, Mistral) running on consumer hardware, with systematic improvements through prompt engineering and fine-tuning to achieve 98% accuracy in command interpretation, showing how iterative improvement and proper evaluation frameworks are crucial for LLM applications.

Building a Voice Assistant with Open Source LLMs: From Demo to Production

Weights & Biases

A case study of building an open-source Alexa alternative using LLMs, demonstrating the journey from prototype to production. The project used Llama 2 and Mistral models running on affordable hardware, combined with Whisper for speech recognition. Through iterative improvements including prompt engineering and fine-tuning with QLoRA, the system's accuracy improved from 0% to 98%, while maintaining real-time performance requirements.

Building Agent-Native Infrastructure for Autonomous AI Development

Daytona

Daytona addresses the challenge of building infrastructure specifically designed for AI agents rather than humans, recognizing that agents will soon be the primary users of development tools. The company created an "agent-native runtime" - secure, elastic sandboxes that spin up in 27 milliseconds, providing agents with computing environments to run code, perform data analysis, and execute tasks autonomously. Their solution includes declarative image builders, shared volume systems, and parallel execution capabilities, all accessible via APIs to enable agents to operate without human intervention in the loop.

Building Agentic AI Assistant for Observability Platform

Grafana

Grafana Labs developed an agentic AI assistant integrated into their observability platform to help users query data, create dashboards, troubleshoot issues, and learn the platform. The team started with a hackathon project that ran entirely in the browser, iterating rapidly from a proof-of-concept to a production system. The assistant uses Claude as the primary LLM, implements tool calling with extensive context about Grafana's features, and employs multiple techniques including tool overloading, error feedback loops, and natural language tool responses. The solution enables users to investigate incidents, generate queries across multiple data sources, and modify visualizations through conversational interfaces while maintaining transparency by showing all intermediate steps and data to keep humans in the loop.

Building AI Memory Layers with File-Based Vector Storage and Knowledge Graphs

Cognee

Cognee, a platform that helps AI agents retrieve, reason, and remember with structured context, needed a vector storage solution that could support per-workspace isolation for parallel development and testing without the operational overhead of managing multiple database services. The company implemented LanceDB, a file-based vector database, which enables each developer, user, or test instance to have its own fully independent vector store. This solution, combined with Cognee's Extract-Cognify-Load pipeline that builds knowledge graphs alongside embeddings, allows teams to develop locally with complete isolation and then seamlessly transition to production through Cognee's hosted service (cogwit). The results include faster development cycles due to eliminated shared state conflicts, improved multi-hop reasoning accuracy through graph-aware retrieval, and a simplified path from prototype to production without architectural redesign.

Building AI-Native Platforms: Agentic Systems, Infrastructure Evolution, and Production LLM Deployment

Delphi / Seam AI / APIsec

This panel discussion features three AI-native companies—Delphi (personal AI profiles), Seam AI (sales/marketing automation agents), and APIsec (API security testing)—discussing their journeys building production LLM systems over three years. The companies address infrastructure evolution from single-shot prompting to fully agentic systems, the shift toward serverless and scalable architectures, managing costs at scale (including burning through a trillion OpenAI tokens), balancing deterministic workflows with model autonomy, and measuring ROI through outcome-based metrics rather than traditional productivity gains. Key technical themes include moving away from opinionated architectures to let models reason autonomously, implementing state machines for high-confidence decisions, using tools like Pydantic AI and Logfire for instrumentation, and leveraging Pinecone for vector search at scale.

Building Alfred: Production-Ready Agentic Orchestration Layer for E-commerce

Loblaws

Loblaws Digital, the technology arm of one of Canada's largest retail companies, developed Alfred—a production-ready orchestration layer for running agentic AI workflows across their e-commerce, pharmacy, and loyalty platforms. The system addresses the challenge of moving agent prototypes into production at enterprise scale by providing a reusable template-based architecture built on LangGraph, FastAPI, and Google Cloud Platform components. Alfred enables teams across the organization to quickly deploy conversational commerce applications and agentic workflows (such as recipe-based shopping) while handling critical enterprise requirements including security, privacy, PII masking, observability, and integration with 50+ platform APIs through their Model Context Protocol (MCP) ecosystem.

Building Alyx: An AI Agent for LLM Observability and Debugging

Arize AI

Arize AI built "Alyx," an AI agent embedded in their observability platform to help users debug and optimize their machine learning and LLM applications. The problem they addressed was that their platform had advanced features that required significant expertise to use effectively, with customers needing guidance from solutions architects to extract maximum value. Their solution was to create an AI agent that emulates an expert solutions architect, capable of performing complex debugging workflows, optimizing prompts, generating evaluation templates, and educating users on platform features. Starting in November 2023 with GPT-3.5 and launching at their July 2024 conference, Alyx evolved from a highly structured, on-rails decision tree architecture to a more autonomous agent leveraging modern LLM capabilities. The team used their own platform to build and evaluate Alex, establishing comprehensive evaluation frameworks across multiple levels (tool calls, tasks, sessions, traces) and involving cross-functional stakeholders in defining success criteria.

Building an AI Agent Platform with Cloud-Based Virtual Machines and Extended Context

Manus

Manus AI, founded in late 2024, developed a consumer-focused AI agent platform that addresses the limitation of frontier LLMs having intelligence but lacking the ability to take action in digital environments. The company built a system where each user task is assigned a fully functional cloud-based virtual machine (Linux, with plans for Windows and Android) running real applications including file systems, terminals, VS Code, and Chromium browsers. By adopting a "less structure, more intelligence" philosophy that avoids predefined workflows and multi-role agent systems, and instead provides rich context to foundation models (primarily Anthropic's Claude), Manus created an agent capable of handling diverse long-horizon tasks from office location research to furniture shopping to data extraction, with users reporting up to 2 hours of daily GPU consumption. The platform launched publicly in March 2024 after five months of development and reportedly spent $1 million on Claude API usage in its first 14 days.

Building an AI Legal Assistant: From Early Testing to Production Deployment

Casetext

Casetext transformed their legal research platform into an AI-powered legal assistant called Co-Counsel using GPT-4, leading to a $650M acquisition by Thomson Reuters. The company shifted their entire 120-person team to focus on building this AI assistant after early access to GPT-4 showed promising results. Through rigorous testing, prompt engineering, and a test-driven development approach, they created a reliable AI system that could perform complex legal tasks like document review and research that previously took lawyers days to complete. The product achieved rapid market acceptance and true product-market fit within months of launch.

Building an AI-Native Code Editor in a Competitive Market

Cursor

Cursor, an AI-powered code editor startup, entered an extremely competitive market dominated by Microsoft's GitHub Copilot and well-funded competitors like Poolside, Augment, and Magic.dev. Despite initial skepticism from advisors about competing against Microsoft's vast resources and distribution, Cursor succeeded by focusing on the right short-term product decisions—specifically deep IDE integration through forking VS Code and delivering immediate value through "Cursor Tab" code completion. The company differentiated itself through rapid iteration, concentrated talent, bottom-up adoption among developers, and eventually building their own fast agent models. Cursor demonstrated that startups can compete against tech giants by moving quickly, dog-fooding their own product, and correctly identifying what developers need in the near term rather than betting solely on long-term agent capabilities.

Building an AI-Powered Browser Extension for Product Documentation with RAG and Chain-of-Thought

Reforge

Reforge developed a browser extension to help product professionals draft and improve documents like PRDs by integrating expert knowledge directly into their workflow. The team evolved from simple RAG (Retrieve and Generate) to a sophisticated Chain-of-Thought approach that classifies document types, generates tailored suggestions, and filters content based on context. Operating with a lean team of 2-3 people, they built the extension through rapid prototyping and iterative development, integrating into popular tools like Google Docs, Notion, and Confluence. The extension uses OpenAI models with Pinecone for vector storage, emphasizing privacy by not storing user data, and leverages innovative testing approaches like analyzing course recommendation distributions and reference counts to optimize model performance without accessing user content.

Building an AI-Powered IDE at Scale: Architectural Deep Dive

Cursor

Cursor, an AI-powered IDE built by Anysphere, faced the challenge of scaling from zero to serving billions of code completions daily while handling 1M+ queries per second and 100x growth in load within 12 months. The solution involved building a sophisticated architecture using TypeScript and Rust, implementing a low-latency sync engine for autocomplete suggestions, utilizing Merkle trees and embeddings for semantic code search without storing source code on servers, and developing Anyrun, a Rust-based orchestrator service. The results include reaching $500M+ in annual revenue, serving more than half of the Fortune 500's largest tech companies, and processing hundreds of millions of lines of enterprise code written daily, all while maintaining privacy through encryption and secure indexing practices.

Building an AI-Powered Software Development Platform with Multiple LLM Integration

Lovable

Lovable addresses the challenge of making software development accessible to non-programmers by creating an AI-powered platform that converts natural language descriptions into functional applications. The solution integrates multiple LLMs (including OpenAI and Anthropic models) in a carefully orchestrated system that prioritizes speed and reliability over complex agent architectures. The platform has achieved significant success, with over 1,000 projects being built daily and a rapidly growing user base that doubled its paying customers in a recent month.

Building an Enterprise GenAI Platform with Standardized LLMOps Framework

FactSet

FactSet, a financial data and analytics provider, faced challenges with fragmented LLM development approaches across teams, leading to collaboration barriers and inconsistent quality. They implemented a standardized LLMOps framework using Databricks Mosaic AI and MLflow, enabling unified governance, efficient model development, and improved deployment capabilities. This transformation resulted in significant performance improvements, including a 70% reduction in response time for code generation and 60% reduction in end-to-end latency for formula generation, while maintaining high accuracy and enabling cost-effective use of fine-tuned open-source models alongside commercial LLMs.

Building an Enterprise LLMOps Stack: Lessons from Doordash

Doordash

The ML Platform team at Doordash shares their exploration and strategy for building an enterprise LLMOps stack, discussing the unique challenges of deploying LLM applications at scale. The presentation covers key components needed for production LLM systems, including gateway services, prompt management, RAG implementations, and fine-tuning capabilities, while drawing insights from industry leaders like LinkedIn and Uber's approaches to LLMOps architecture.

Building an Enterprise-Grade AI Agent for Recruiting at Scale

LinkedIn

LinkedIn developed Hiring Assistant, an AI agent designed to transform the recruiting workflow by automating repetitive tasks like candidate sourcing, evaluation, and engagement across 1.2+ billion profiles. The system addresses the challenge of recruiters spending excessive time on pattern-recognition tasks rather than high-value decision-making and relationship building. Using a plan-and-execute agent architecture with specialized sub-agents for intake, sourcing, evaluation, outreach, screening, and learning, Hiring Assistant combines real-time conversational interfaces with large-scale asynchronous execution. The solution leverages LinkedIn's Economic Graph for talent insights, custom fine-tuned LLMs for candidate evaluation, and cognitive memory systems that learn from recruiter behavior over time. The result is a globally available agentic product that enables recruiters to work with greater speed, scale, and intelligence while maintaining human-in-the-loop control for critical decisions.

Building an Internal Background Coding Agent with Full Development Environment Integration

Ramp

Ramp built Inspect, an internal background coding agent that automates code generation while closing the verification loop with comprehensive testing and validation capabilities. The agent runs in sandboxed VMs on Modal with full access to all engineering tools including databases, CI/CD pipelines, monitoring systems, and feature flags. Within months of deployment, Inspect reached approximately 30% of all pull requests merged to frontend and backend repositories, demonstrating rapid adoption without mandating usage. The system's key innovation is providing agents with the same context and tools as human engineers while enabling unlimited concurrent sessions with near-instant startup times.

Building an On-Premise Health Insurance Appeals Generation System

HealthInsuranceLLM

Development of an LLM-based system to help generate health insurance appeals, deployed on-premise with limited resources. The system uses fine-tuned models trained on publicly available medical review board data to generate appeals for insurance claim denials. The implementation includes Kubernetes deployment, GPU inference, and a Django frontend, all running on personal hardware with multiple internet providers for reliability.

Building and Deploying Enterprise-Grade LLMs: Lessons from Mistral

Mistral

Mistral, a European AI company, evolved from developing academic LLMs to building and deploying enterprise-grade language models. They started with the successful launch of Mistral-7B in September 2023, which became one of the top 10 most downloaded models on Hugging Face. The company focuses not just on model development but on providing comprehensive solutions for enterprise deployment, including custom fine-tuning, on-premise deployment infrastructure, and efficient inference optimization. Their approach demonstrates the challenges and solutions in bringing LLMs from research to production at scale.

Building and Deploying Production LLM Code Review Agents: Architecture and Best Practices

Ellipsis

Ellipsis developed an AI-powered code review system that uses multiple specialized LLM agents to analyze pull requests and provide feedback. The system employs parallel comment generators, sophisticated filtering pipelines, and advanced code search capabilities backed by vector stores. Their approach emphasizes accuracy over latency, uses extensive evaluation frameworks including LLM-as-judge, and implements robust error handling. The system successfully processes GitHub webhooks and provides automated code reviews with high accuracy and low false positive rates.

Building and Deploying the Codex App: A Multi-Agent AI Development Environment

OpenAI

OpenAI's Codex team developed a dedicated GUI application for AI-powered coding that serves as a command center for multi-agent systems, moving beyond traditional IDE and terminal interfaces. The team addressed the challenge of making AI coding agents accessible to broader audiences while maintaining professional-grade capabilities for software developers. By combining the GPT-5.3 Codex model with agent skills, automations, and a purpose-built interface, they created a production system that enables delegation-based development workflows where users supervise AI agents performing complex coding tasks. The result was over one million downloads in the first week, widespread internal adoption at OpenAI including by research teams, and a strategic shift positioning AI coding tools for mainstream use, culminating in a Super Bowl advertisement.

Building and Evaluating Production AI Agents: From Function Calling to Complex Multi-Agent Systems

Google Deepmind

This case study explores the evolution of LLM-based systems in production through discussions with Raven Kumar from Google DeepMind about building products like Notebook LM, Project Mariner, and working with the Gemini and Gemma model families. The conversation covers the rapid progression from simple function calling to complex agentic systems capable of multi-step reasoning, the critical importance of evaluation harnesses as competitive advantages, and practical considerations around context engineering, tool orchestration, and model selection. Key insights include how model improvements are causing teams to repeatedly rebuild agent architectures, the importance of shipping products quickly to learn from real users, and strategies for evaluating increasingly complex multi-modal agentic systems across different scales from edge devices to cloud-based deployments.

Building and Evaluating Production Voice Agents: From Custom Infrastructure to Platform Solutions

Nomore Engineering

A team explored building a phone agent system for handling doctor appointments in Polish primary care, initially attempting to build their own infrastructure before evaluating existing platforms. They implemented a complex system involving speech-to-text, LLMs, text-to-speech, and conversation orchestration, along with comprehensive testing approaches. After building the complete system, they ultimately decided to use a third-party platform (Vapi.ai) due to the complexities of maintaining their own infrastructure, while gaining valuable insights into voice agent architecture and testing methodologies.

Building and Evolving a Production GenAI Application Stack

LinkedIn

LinkedIn's journey in developing their GenAI application tech stack, transitioning from simple prompt-based solutions to complex conversational agents. The company evolved from Java-based services to a Python-first approach using LangChain, implemented comprehensive prompt management, developed a skill-based task automation framework, and built robust conversational memory infrastructure. This transformation included migrating existing applications while maintaining production stability and enabling both commercial and fine-tuned open-source LLM deployments.

Building and Managing Production Agents with Testing and Evaluation Infrastructure

Nearpod

Nearpod, an edtech company, implemented a sophisticated agent-based architecture to help teachers generate educational content. They developed a framework for building, testing, and deploying AI agents with robust evaluation capabilities, ensuring 98-100% accuracy while managing costs. The system includes specialized agents for different tasks, an agent registry for reuse across teams, and extensive testing infrastructure to ensure reliable production deployment of non-deterministic systems.

Building and Operating a CLI-Based LLM Coding Assistant

Anthropic

Anthropic developed Claude Code, a CLI-based coding assistant that provides direct access to their Sonnet LLM for software development tasks. The tool started as an internal experiment but gained rapid adoption within Anthropic, leading to its public release. The solution emphasizes simplicity and Unix-like utility design principles, achieving an estimated 2-10x developer productivity improvement for active users while maintaining a pay-as-you-go pricing model averaging $6/day per active user.

Building and Operating an MCP Server for LLM-Powered Cloud Infrastructure Queries

CloudQuery

CloudQuery built a Model Context Protocol (MCP) server in Go to enable Claude and Cursor to directly query their cloud infrastructure database. They encountered significant challenges with LLM tool selection, context window limitations, and non-deterministic behavior. By rewriting tool descriptions to be longer and more domain-specific, renaming tools to better match user intent, implementing schema filtering to reduce token usage by 90%, and embedding recommended multi-tool workflows, they dramatically improved how the LLM engaged with their system. The solution transformed Claude's interaction from hallucinating queries to systematically following a discovery-to-execution pipeline.

Building and Optimizing a RAG-based Customer Service Chatbot

HDI

HDI, a German insurance company, implemented a RAG-based chatbot system to help customer service agents quickly find and access information across multiple knowledge bases. The system processes complex insurance documents, including tables and multi-column layouts, using various chunking strategies and vector search optimizations. After 120 experiments to optimize performance, the production system now serves 800+ users across multiple business lines, handling 26 queries per second with 88% recall rate and 6ms query latency.

Building and Optimizing AI Programming Agents with MLOps Infrastructure at Scale

Weights & Biases

This case study describes Weights & Biases' development of programming agents that achieved top performance on the SWEBench benchmark, demonstrating how MLOps infrastructure can systematically improve AI agent performance through experimental workflows. The presenter built "Tiny Agent," a command-line programming agent, then optimized it through hundreds of experiments using OpenAI's O1 reasoning model to achieve the #1 position on SWEBench leaderboard. The approach emphasizes systematic experimentation with proper tracking, evaluation frameworks, and infrastructure scaling, while introducing tools like Weave for experiment management and WB Launch for distributed computing. The work also explores reinforcement learning for agent improvement and introduces the concept of "researcher agents" that can autonomously improve AI systems.

Building and Scaling a Production Generative AI Assistant for Professional Networking

LinkedIn

LinkedIn developed a generative AI-powered experience to enhance job searches and professional content browsing. The system uses a RAG-based architecture with specialized AI agents to handle different query types, integrating with internal APIs and external services. Key challenges included evaluation at scale, API integration, maintaining consistent quality, and managing computational resources while keeping latency low. The team achieved basic functionality quickly but spent significant time optimizing for production-grade reliability.

Building and Scaling AI-Powered Password Detection in Production

Github

Github developed and deployed Copilot secret scanning to detect generic passwords in codebases using AI/LLMs, addressing the limitations of traditional regex-based approaches. The team iteratively improved the system through extensive testing, prompt engineering, and novel resource management techniques, ultimately achieving a 94% reduction in false positives while maintaining high detection accuracy. The solution successfully scaled to handle enterprise workloads through sophisticated capacity management and workload-aware request handling.

Building and Scaling AI-Powered Visual Search Infrastructure

Figma

Figma implemented AI-powered search features to help users find designs and components across their organization using text descriptions or visual references. The solution leverages the CLIP multimodal embedding model, with infrastructure built to handle billions of embeddings while keeping costs down. The system combines traditional lexical search with vector similarity search, using AWS services including SageMaker, OpenSearch, and DynamoDB to process and index designs at scale. Key optimizations included vector quantization, software rendering, and cluster autoscaling to manage computational and storage costs.

Building and Scaling Codex: OpenAI's Production Coding Agent

OpenAI

OpenAI developed Codex, a coding agent that serves as an AI-powered software engineering teammate, addressing the challenge of accelerating software development workflows. The solution combines a specialized coding model (GPT-5.1 Codex Max), a custom API layer with features like context compaction, and an integrated harness that works through IDE extensions and CLI tools using sandboxed execution environments. Since launching and iterating based on user feedback in August, Codex has grown 20x, now serves many trillions of tokens per week, has become the most-served coding model both in first-party use and via API, and has enabled dramatic productivity gains including shipping the Sora Android app (which became the #1 app in the app store) in just 28 days with 2-3 engineers, demonstrating significant acceleration in production software development at scale.

Building and Scaling Conversational Voice AI Agents for Enterprise Go-to-Market

Thoughtly / Gladia

Thoughtly, a voice AI platform founded in late 2023, provides conversational AI agents for enterprise sales and customer support operations. The company orchestrates speech-to-text, large language models, and text-to-speech systems to handle millions of voice calls with sub-second latency requirements. By optimizing every layer of their stack—from telephony providers to LLM inference—and implementing sophisticated caching, conditional navigation, and evaluation frameworks, Thoughtly delivers 3x conversion rates over traditional methods and 15x ROI for customers. The platform serves enterprises with HIPAA and SOC 2 compliance while handling both inbound customer support and outbound lead activation at massive scale across multiple languages and regions.

Building and Scaling Enterprise LLMOps Platforms: From Team Topology to Production

Various

A comprehensive overview of how enterprises are implementing LLMOps platforms, drawing from DevOps principles and experiences. The case study explores the evolution from initial AI adoption to scaling across teams, emphasizing the importance of platform teams, enablement, and governance. It highlights the challenges of testing, model management, and developer experience while providing practical insights into building robust AI infrastructure that can support multiple teams within an organization.

Building and Scaling GitHub Copilot: From Prototype to Enterprise AI Coding Assistant

GitHub

GitHub shares the three-year journey of developing GitHub Copilot, an LLM-powered code completion tool, from concept to general availability. The team followed a "find it, nail it, scale it" framework to identify the problem space (helping developers code faster), create a smooth product experience through rapid iteration and A/B testing, and scale to enterprise readiness. Starting with a focused problem of function-level code completion in IDEs, they leveraged OpenAI's LLMs and Microsoft Azure infrastructure, implementing techniques like neighboring tabs processing, caching for consistency, and security filters. Through technical previews and community feedback, they achieved a 55% faster coding speed and 74% reduction in developer frustration, while addressing responsible AI concerns through code reference tools and vulnerability filtering.

Building and Scaling Internal Data Agents and AI-Powered Frontend Development Tools

Vercel

Vercel developed two significant production AI applications: DZ, an internal text-to-SQL data agent that enables employees to query Snowflake using natural language in Slack, and V0, a public-facing AI tool for generating full-stack web applications. The company initially built DZ as a traditional tool-based agent but completely rebuilt it as a coding-style agent with simplified architecture (just two tools: bash and SQL execution), dramatically improving performance by leveraging models' native coding capabilities. V0 evolved from a 2023 prototype targeting frontend engineers into a comprehensive full-stack development tool as models improved, finding strong product-market fit with tech-adjacent users and enabling significant internal productivity gains. Both products demonstrate Vercel's philosophy that building custom agents is straightforward and preferable to buying off-the-shelf solutions, with the company successfully deploying these AI systems at scale while maintaining reliability and supporting their core infrastructure business.

Building and Scaling LLM Applications at Discord

Discord

Discord shares their comprehensive approach to building and deploying LLM-powered features, from ideation to production. They detail their process of identifying use cases, defining requirements, prototyping with commercial LLMs, evaluating prompts using AI-assisted evaluation, and ultimately scaling through either hosted or self-hosted solutions. The case study emphasizes practical considerations around latency, quality, safety, and cost optimization while building production LLM applications.

Building and Scaling Production Code Agents: Lessons from Replit

Replit

Replit developed and deployed a production-grade code agent that helps users create and modify code through natural language interaction. The team faced challenges in defining their target audience, detecting failure cases, and implementing comprehensive evaluation systems. They scaled from 3 to 20 engineers working on the agent, developed custom evaluation frameworks, and successfully launched features like rapid build mode that reduced initial application setup time from 7 to 2 minutes. The case study highlights key learnings in agent development, testing, and team scaling in a production environment.

Building and Scaling Production-Ready AI Agents: Lessons from Agent Force

Salesforce

Salesforce introduced Agent Force, a low-code/no-code platform for building, testing, and deploying AI agents in enterprise environments. The case study explores the challenges of moving from proof-of-concept to production, emphasizing the importance of comprehensive testing, evaluation, monitoring, and fine-tuning. Key insights include the need for automated evaluation pipelines, continuous monitoring, and the strategic use of fine-tuning to improve performance while reducing costs.

Building and Sunsetting Ada: An Internal LLM-Powered Chatbot Assistant

Leboncoin

Leboncoin, a French e-commerce platform, built Ada—an internal LLM-powered chatbot assistant—to provide employees with secure access to GenAI capabilities while protecting sensitive data from public LLM services. Starting in late 2023, the project evolved from a general-purpose Claude-based chatbot to a suite of specialized RAG-powered assistants integrated with internal knowledge sources like Confluence, Backstage, and organizational data. Despite achieving strong technical results and valuable learning outcomes around evaluation frameworks, retrieval optimization, and enterprise LLM deployment, the project was phased out in early 2025 in favor of ChatGPT Enterprise with EU data residency, allowing the team to redirect their expertise toward more user-facing use cases while reducing operational overhead.

Building and Testing a Production LLM-Powered Quiz Application

Google

A case study of transforming a traditional trivia quiz application into an LLM-powered system using Google's Vertex AI platform. The team evolved from using static quiz data to leveraging PaLM and later Gemini models for dynamic quiz generation, addressing challenges in prompt engineering, validation, and testing. They achieved significant improvements in quiz accuracy from 70% with Gemini Pro to 91% with Gemini Ultra, while implementing robust validation methods using LLMs themselves to evaluate quiz quality.

Building ART·E: Reinforcement Learning for Email Search Agent Development

OpenPipe

OpenPipe developed ART·E, an email research agent that outperforms OpenAI's o3 model on email search tasks. The project involved creating a synthetic dataset from the Enron email corpus, implementing a reinforcement learning training pipeline using Group Relative Policy Optimization (GRPO), and developing a multi-objective reward function. The resulting model achieved higher accuracy while being faster and cheaper than o3, taking fewer turns to answer questions correctly and hallucinating less frequently, all while being trained on a single H100 GPU for under $80.

Building Ask Learn: A Large-Scale RAG-Based Knowledge Service for Azure Documentation

Microsoft

Microsoft's Skilling organization built "Ask Learn," a retrieval-augmented generation (RAG) system that powers AI-driven question-answering capabilities for Microsoft Q&A and serves as ground truth for Microsoft Copilot for Azure. Starting from a 2023 hackathon project, the team evolved a naïve RAG implementation into an advanced RAG system featuring sophisticated pre- and post-processing pipelines, continuous content ingestion from Microsoft Learn documentation, vector database management, and comprehensive evaluation frameworks. The system handles massive scale, provides accurate and verifiable answers, and serves multiple use cases including direct question answering, grounding data for other chat handlers, and fallback functionality when the Copilot cannot complete requested tasks.

Building Claude Code: Scaling AI-Powered Development from Terminal Prototype to Production

Anthropic

Anthropic's Boris Churnney, creator of Claude Code, describes the journey from an accidental terminal prototype in September 2024 to a production coding tool used by 70% of startups and responsible for 4% of all public commits globally. Starting as a simple API testing tool, Claude Code evolved through continuous user feedback and rapid iteration, with the entire codebase rewritten every few months to adapt to improving model capabilities. The tool achieved remarkable productivity gains at Anthropic itself, with engineers seeing 70% productivity increases per capita despite team doubling, and total productivity improvements of 150% since launch. The development philosophy centered on building for future model capabilities rather than current ones, anticipating improvements 6 months ahead, and minimizing scaffolding that would become obsolete with each new model release.

Building Cursor Composer: A Fast, Intelligent Agent-Based Coding Model with Reinforcement Learning

Cursor

Cursor's AI research team built Composer, an agent-based LLM designed for coding that combines frontier-level intelligence with four times faster token generation than comparable models. The problem they addressed was creating an agentic coding assistant that feels fast enough for interactive use while maintaining high intelligence for realistic software engineering tasks. Their solution involved training a large mixture-of-experts model using reinforcement learning (RL) at scale, developing custom low-precision training kernels, and building infrastructure that integrates their production environment directly into the training loop. The result is a model that performs nearly as well as the best frontier models on their internal benchmarks while delivering edits and tool calls in seconds rather than minutes, fundamentally changing how developers interact with AI coding assistants.

Building Economic Infrastructure for AI with Foundation Models and Agentic Commerce

Stripe

Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.

Building Effective Agents: Practical Framework and Design Principles

Anthropic

Anthropic presents a practical framework for building production-ready AI agents, addressing the challenge of when and how to deploy agentic systems effectively. The presentation introduces three core principles: selective use of agents for appropriate use cases, maintaining simplicity in design, and adopting the agent's perspective during development. The solution emphasizes a checklist-based approach for evaluating agent suitability considering task complexity, value justification, capability validation, and error costs. Results include successful deployment of coding agents and other domain-specific agents that share a common backbone of environment, tools, and system prompts, demonstrating that simple architectures can deliver sophisticated behavior when properly designed and iterated upon.

Building Enterprise-Grade GenAI Platform with Multi-Cloud Architecture

Coinbase

Coinbase developed CB-GPT, an enterprise GenAI platform, to address the challenges of deploying LLMs at scale across their organization. Initially focused on optimizing cost versus accuracy, they discovered that enterprise-grade LLM deployment requires solving for latency, availability, trust and safety, and adaptability to the rapidly evolving LLM landscape. Their solution was a multi-cloud, multi-LLM platform that provides unified access to models across AWS Bedrock, GCP VertexAI, and Azure, with built-in RAG capabilities, guardrails, semantic caching, and both API and no-code interfaces. The platform now serves dozens of internal use cases and powers customer-facing applications including a conversational chatbot launched in June 2024 serving all US consumers.

Building Enterprise-Ready AI Development Infrastructure from Day One

Windsurf

Codeium's journey in building their AI-powered development tools showcases how investing early in enterprise-ready infrastructure, including containerization, security, and comprehensive deployment options, enabled them to scale from individual developers to large enterprise customers. Their "go slow to go fast" approach in building proprietary infrastructure for code completion, retrieval, and agent-based development culminated in Windsurf IDE, demonstrating how thoughtful early architectural decisions can create a more robust foundation for AI tools in production.

Building Fully Autonomous Coding Agents for Non-Technical Users

Replit

Replit developed autonomous coding agents designed specifically for non-technical users, evolving from basic code completion tools to fully autonomous agents capable of running for hours while handling all technical decisions. The company identified that autonomy shouldn't be conflated with long runtimes but rather defined by the agent's ability to make technical decisions without user intervention. Their solution involved three key pillars: leveraging frontier model capabilities, implementing comprehensive autonomous testing using browser automation and Playwright, and sophisticated context management through sub-agent orchestration. The approach reduced context compression needs significantly (from 35 to 45-50 memories per compression), enabled agents to run coherently for extended periods without technical user input, and achieved order-of-magnitude improvements in testing cost and latency compared to computer vision approaches.

Building Gemini Deep Research: An Agentic Research Assistant with Custom-Tuned Models

Google Deepmind

Google DeepMind developed Gemini Deep Research, an AI-powered research assistant that autonomously browses the web for 5-10 minutes to generate comprehensive research reports with citations. The product addresses the challenge of users wanting to go from "zero to 50" on new topics quickly, automating what would typically require opening dozens of browser tabs and hours of manual research. The team solved key technical challenges around agentic planning, transparent UX design with editable research plans, asynchronous orchestration, and post-training custom models (initially Gemini 1.5 Pro, moving toward 2.0 Flash) to reliably perform iterative web search and synthesis. The product launched in December 2024 and has been widely praised as potentially the most useful public-facing AI agent to date, with users reporting it can compress hours or days of research work into minutes.

Building GitHub Copilot: Working with OpenAI's LLMs in Production

GitHub

GitHub developed GitHub Copilot by integrating OpenAI's large language models, starting with GPT-3 and evolving through multiple iterations of the Codex model. The problem was creating an effective AI-powered code generation tool that could work seamlessly within developer IDEs. The solution involved extensive prompt crafting to create optimal "pseudo-documents" that guide the model toward better completions, fine-tuning on specific codebases, and implementing contextual improvements such as incorporating code from neighboring editor tabs and file paths. The results included dramatic improvements in code acceptance rates, with the multilingual model eventually solving over 90% of test problems compared to about 50% initially, and noticeable quality improvements particularly for non-top-five programming languages when new model versions were deployed.

Building Goal-Oriented Retrieval Agents for Low-Latency Recommendations at Scale

Faber Labs

Faber Labs developed Gora (Goal-Oriented Retrieval Agents), a system that transforms subjective relevance ranking using cutting-edge technologies. The system optimizes for specific KPIs like conversion rates and average order value in e-commerce, or minimizing surgical engagements in healthcare. They achieved this through a combination of real-time user feedback processing, unified goal optimization, and high-performance infrastructure built with Rust, resulting in consistent 200%+ improvements in key metrics while maintaining sub-second latency.

Building ISO: A Hyperpersonalized AI Food Ordering Agent for Millions of Users

iFood

iFood, Brazil's largest food delivery company, built Ailo, an AI-powered food ordering agent to address the decision paralysis users face when choosing what to eat from overwhelming options. The agent operates both within the iFood app and on WhatsApp, providing hyperpersonalized recommendations based on user behavior, handling complex intents beyond simple search, and autonomously taking actions like applying coupons, managing carts, and facilitating payments. Through careful context management, latency optimization (reducing P95 from 30 to 10 seconds), and sophisticated evaluation frameworks, the team deployed ISO to millions of users in Brazil, demonstrating significant improvements in user experience through proactive engagement and intelligent personalization.

Building LinkedIn's First Production Agent: Hiring Assistant Platform and Architecture

LinkedIn

LinkedIn evolved from simple GPT-based collaborative articles to sophisticated AI coaches and finally to production-ready agents, culminating in their Hiring Assistant product announced in October 2025. The company faced the challenge of moving from conversational assistants with prompt chains to task automation using agent-based architectures that could handle high-scale candidate evaluation while maintaining quality and enabling rapid iteration. They built a comprehensive agent platform with modular sub-agent architecture, centralized prompt management, LLM inference abstraction, messaging-based orchestration for resilience, and a skill registry for dynamic tool discovery. The solution enabled parallel development of agent components, independent quality evaluation, and the ability to serve both enterprise recruiters and SMB customers with variations of the same underlying platform, processing thousands of candidate evaluations at scale while maintaining the flexibility to iterate on product design.

Building Low-Latency Voice AI Agents for Home Services

Elyos AI

Elyos AI built end-to-end voice AI agents for home services companies (plumbers, electricians, HVAC installers) to handle customer calls, emails, and messages 24/7. The company faced challenges achieving human-like conversation latency (targeting sub-400ms response times) while maintaining reliability and accuracy for complex workflows including appointment booking, payment processing, and emergency dispatch. Through careful orchestration, they optimized speech-to-text, LLM, and text-to-speech components, implemented just-in-time context engineering, state machine-based workflows, and parallel monitoring streams to achieve consistent performance with approximately 85% call automation (15% requiring human involvement).

Building Modular and Scalable RAG Systems with Hybrid Batch/Incremental Processing

Bell

Bell developed a sophisticated hybrid RAG (Retrieval Augmented Generation) system combining batch and incremental processing to handle both static and dynamic knowledge bases. The solution addresses challenges in managing constantly changing documentation while maintaining system performance. They created a modular architecture using Apache Beam, Cloud Composer (Airflow), and GCP services, allowing for both scheduled batch updates and real-time document processing. The system has been successfully deployed for multiple use cases including HR policy queries and dynamic Confluence documentation management.

Building Multi-Agent AI Systems for Developer Support and Infrastructure Operations

Electrolux

Electrolux, a Swedish home appliances manufacturer with over 100 years of history, developed "Infra Assistant," an AI-powered multi-agent system to support their internal development teams and reduce bottlenecks in their platform engineering organization. The company faced challenges with their small Site Reliability Engineering (SRE) team being overwhelmed with repetitive support requests via Slack channels. Using Amazon Bedrock agents with both retrieval-augmented generation (RAG) and multi-agent collaboration patterns, they built a sophisticated system that answers questions based on organizational documentation, executes operations via API integrations, and can even troubleshoot cloud infrastructure issues autonomously. The system has proven cost-efficient compared to manual effort, successfully handles repetitive tasks like access management, and provides context-aware responses by accessing multiple organizational knowledge sources, though challenges remain around response latency and achieving consistent accuracy across all interactions.

Building Multi-Agent Systems with MCP and Pydantic AI for Document Processing

Deepsense

Deepsense AI built a multi-agent system for a customer who operates a document processing platform that handles various file types and data sources at scale. The problem was to create both an MCP (Model Context Protocol) server for the platform's internal capabilities and a demonstration multi-agent system that could structure data on demand from documents. Using Pydantic AI as the core agent framework and Anthropic's Claude models, the team developed a solution where users specify goals for document processing, and the system automatically extracts structured information into tables. The implementation involved creating custom MCP servers, integrating with Databricks MCP, and applying 10 key lessons learned around tool design, token optimization, model selection, observability, testing, and security. The result was a modular, scalable system that demonstrates practical patterns for building production-ready agentic applications.

Building Production Agentic AI Systems for IT Operations and Support Automation

WEX

WEX, a global commerce platform processing over $230 billion in transactions annually, built a production agentic AI system called "Chat GTS" to address their 40,000+ annual IT support requests. The company's Global Technology Services team developed specialized agents using AWS Bedrock and Agent Core Runtime to automate repetitive operational tasks, including network troubleshooting and autonomous EBS volume management. Starting with Q&A capabilities, they evolved into event-driven agents that can autonomously respond to CloudWatch alerts, execute remediation playbooks via SSM documents exposed as MCP tools, and maintain infrastructure drift through automated pull requests. The system went from pilot to production in under 3 months, now serving over 2,000 internal users, with multi-agent architectures handling both user-initiated chat interactions and autonomous incident response workflows.

Building Production Agentic Systems with Platform-Level LLMOps Features

Anthropic

Anthropic's presentation at the AI Engineer conference outlined their platform evolution for building high-performance agentic systems, using Claude Code as the primary example. The company identified three core challenges in production LLM deployments: harnessing model capabilities through API features, managing context windows effectively, and providing secure computational infrastructure for autonomous agent operation. Their solution involved developing platform-level features including extended thinking modes, tool use APIs, Model Context Protocol (MCP) for standardized external system integration, memory management for selective context retrieval, context editing capabilities, and secure code execution environments with container orchestration. The combination of memory tools and context editing demonstrated a 39% performance improvement on internal benchmarks, while their infrastructure solutions enabled Claude Code to run autonomously on web and mobile platforms with session persistence and secure sandboxing.

Building Production AI Agents and Agentic Platforms at Scale

Vercel

This AWS re:Invent 2025 session explores the challenges organizations face moving AI projects from proof-of-concept to production, addressing the statistic that 46% of AI POC projects are canceled before reaching production. AWS Bedrock team members and Vercel's director of AI engineering present a comprehensive framework for production AI systems, focusing on three critical areas: model switching, evaluation, and observability. The session demonstrates how Amazon Bedrock's unified APIs, guardrails, and Agent Core capabilities combined with Vercel's AI SDK and Workflow Development Kit enable rapid development and deployment of durable, production-ready agentic systems. Vercel showcases real-world applications including V0 (an AI-powered prototyping platform), Vercel Agent (an AI code reviewer), and various internal agents deployed across their organization, all powered by Amazon Bedrock infrastructure.

Building Production AI Agents for Enterprise HR, IT, and Finance Platform

Rippling

Rippling, an enterprise platform providing HR, payroll, IT, and finance solutions, has evolved its AI strategy from simple content summarization to building complex production agents that assist administrators and employees across their entire platform. Led by Anker, their head of AI, the company has developed agents that handle payroll troubleshooting, sales briefing automation, interview transcript summarization, and talent performance calibration. They've transitioned from deterministic workflow-based approaches to more flexible deep agent paradigms, leveraging LangChain and LangSmith for development and tracing. The company maintains a dual focus: embedding AI capabilities within their product for customers running businesses on their platform, and deploying AI internally to increase productivity across all teams. Early results show promise in handling complex, context-dependent queries that traditional rule-based systems couldn't address.

Building Production AI Agents with Advanced Testing, Voice Architecture, and Multi-Model Orchestration

Sierra

Sierra, an AI agent platform company, discusses their comprehensive approach to deploying LLMs in production for customer service automation across voice and chat channels. The company addresses fundamental challenges in productionizing AI agents including non-deterministic behavior, latency requirements, and quality assurance through novel solutions like simulation-based testing that runs thousands of parallel test scenarios, speculative execution for voice latency optimization, and constellation-based multi-model orchestration where 10-20 different models handle various aspects of each conversation. Their outcome-based pricing model aligns incentives with customer success, while their hybrid no-code/code platform enables both business and technical teams to collaboratively build, test, and deploy agents. The platform serves large enterprise customers across multiple industries, with agents handling millions of customer interactions in production environments.

Building Production AI Agents: Lessons from Claude Code and Enterprise Deployments

Anthropic

Anthropic's Applied AI team shares learnings from building and deploying AI agents in production throughout 2024-2025, focusing on their Claude Code product and enterprise customer implementations. The presentation covers the evolution from simple Q&A chatbots and RAG systems to sophisticated agentic architectures that run LLMs in loops with tools. Key technical challenges addressed include context engineering, prompt optimization, tool design, memory management, and handling long-running tasks that exceed context windows. The team transitioned from workflow-based architectures (chained LLM calls with deterministic logic) to agent-based systems where models autonomously use tools to solve open-ended problems, resulting in more robust error handling and the ability to tackle complex tasks like multi-hour coding sessions.

Building Production AI Products: A Framework for Continuous Calibration and Development

OpenAI / Various

AI practitioners Aishwarya Raanti and Kiti Bottom, who have collectively supported over 50 AI product deployments across major tech companies and enterprises, present their framework for successfully building AI products in production. They identify that building AI products differs fundamentally from traditional software due to non-determinism on both input and output sides, and the agency-control tradeoff inherent in autonomous systems. Their solution involves a phased approach called Continuous Calibration Continuous Development (CCCD), which recommends starting with high human control and low AI agency, then gradually increasing autonomy as trust is built through behavior calibration. This iterative methodology, combined with a balanced approach to evaluation metrics and production monitoring, has helped companies avoid common pitfalls like premature full automation, inadequate reliability, and user trust erosion.

Building Production Analytics Agents with Semantic Layer Integration

Wobby

Wobby, a company that helps business teams get insights from their data warehouses in under one minute, shares their journey building production-ready analytics agents over two years. The team developed three specialized agents (Quick, Deep, and Steward) that work with semantic layers to answer business questions. Their solution emphasizes Slack/Teams integration for adoption, building their own semantic layer to encode business logic, preferring prompt-based logic over complex workflows, implementing comprehensive testing strategies beyond just evals, and optimizing for latency through caching and progressive disclosure. The approach led to successful adoption by clients, with analytics agents being actively used in production to handle ad-hoc business intelligence queries.

Building Production Audio Agents with Real-Time Speech-to-Speech Models

OpenAI

OpenAI's solution architecture team presents their learnings on building practical audio agents using speech-to-speech models in production environments. The presentation addresses the evolution from slow, brittle chained architectures combining speech-to-text, LLM processing, and text-to-speech into unified real-time APIs that reduce latency and improve user experience. Key considerations include balancing trade-offs across latency, cost, accuracy, user experience, and integrations depending on use case requirements. The talk covers architectural patterns like tool delegation to specialized agents, prompt engineering for voice expressiveness, evaluation strategies including synthetic conversations, and asynchronous guardrails implementation. Examples from Lemonade and Tinder demonstrate successful production deployments focusing on evaluation frameworks and brand customization respectively.

Building Production Evaluation Systems for GitHub Copilot at Scale

Github

This case study examines the challenges of building evaluation systems for AI products in production, drawing from the author's experience leading the evaluation team at GitHub Copilot serving 100M developers. The problem addressed was the gap between evaluation tooling and developer workflows, as most AI teams consist of engineers rather than data scientists, yet evaluation tools are designed for data science workflows. The solution involved building a comprehensive evaluation stack including automated harnesses for code completion testing, A/B testing infrastructure, and implicit user behavior metrics like acceptance rates. The results showed that while sophisticated evaluation systems are valuable, successful AI products in practice rely heavily on rapid iteration, monitoring in production, and "vibes-based" testing, with the dominant strategy being to ship fast and iterate based on real user feedback rather than extensive offline evaluation.

Building Production Multi-Agent Research Systems with Claude

Anthropic

Anthropic developed a production-grade multi-agent research system for their Claude Research feature that uses multiple LLM agents working in parallel to explore complex topics across web, Google Workspace, and integrated data sources. The system employs an orchestrator-worker pattern where a lead agent coordinates specialized subagents that search and filter information simultaneously, addressing challenges in agent coordination, evaluation, and reliability. Internal evaluations showed the multi-agent approach with Claude Opus 4 and Sonnet 4 outperformed single-agent Claude Opus 4 by 90.2% on research tasks, with token usage explaining 80% of performance variance, though the architecture consumes approximately 15× more tokens than standard chat interactions, requiring careful consideration of economic viability and deployment strategies.

Building Production Web Agents for Food Ordering

iFood

A team at Prosus built web agents to help automate food ordering processes across their e-commerce platforms. Rather than relying on APIs, they developed web agents that could interact directly with websites, handling complex tasks like searching, navigating menus, and placing orders. Through iterative development and optimization, they achieved an 80% success rate target for specific e-commerce tasks by implementing a modular architecture that separated planning and execution, combined with various operational modes for different scenarios.

Building Production-Grade Agentic AI Analytics: Lessons from Real-World Deployment

Tellius

Tellius shares hard-won lessons from building their agentic analytics platform that transforms natural language questions into trustworthy SQL-based insights. The core problem addressed is that chat-based analytics requires far more than simple text-to-SQL conversion—it demands deterministic planning, governed semantic layers, ambiguity management, multi-step consistency, transparency, performance engineering, and comprehensive observability. Their solution architecture separates language understanding from execution through typed plan artifacts that validate against schemas and policies before execution, implements clarification workflows for ambiguous queries, maintains plan/result fingerprinting for consistency, provides inline transparency with preambles and lineage, enforces latency budgets across execution hops, and treats feedback as governed policy changes. The result is a production system that achieves determinism, explainability, and sub-second interactive performance while avoiding the common pitfalls that cause 95% of AI pilot failures.

Building Production-Grade AI Agents with Guardrails, Context Management, and Security

Portia / Riff / Okta

This panel discussion features founders from Portia AI and Rift.ai (formerly Databutton) discussing the challenges of moving AI agents from proof-of-concept to production. The speakers address critical production concerns including guardrails for agent reliability, context engineering strategies, security and access control challenges, human-in-the-loop patterns, and identity management. They share real-world customer examples ranging from custom furniture makers to enterprise CRM enrichment, emphasizing that while approximately 40% of companies experimenting with AI have agents in production, the journey requires careful attention to trust, security, and supportability. Key solutions include conditional example-based prompting, sandboxed execution environments, role-based access controls, and keeping context windows smaller for better precision rather than utilizing maximum context lengths.

Building Production-Grade Generative AI Applications with Comprehensive LLMOps

Block (Square)

Block (Square) implemented a comprehensive LLMOps strategy across multiple business units using a combination of retrieval augmentation, fine-tuning, and pre-training approaches. They built a scalable architecture using Databricks' platform that allowed them to manage hundreds of AI endpoints while maintaining operational efficiency, cost control, and quality assurance. The solution enabled them to handle sensitive data securely, optimize model performance, and iterate quickly while maintaining version control and monitoring capabilities.

Building Production-Grade LLM Applications: An Architectural Guide

Github

A comprehensive technical guide on building production LLM applications, covering the five key steps from problem definition to evaluation. The article details essential components including input processing, enrichment tools, and responsible AI implementations, using a practical customer service example to illustrate the architecture and deployment considerations.

Building Production-Grade RAG Systems for Financial Document Analysis

Microsoft

Microsoft's team shares their experience implementing a production RAG system for analyzing financial documents, including analyst reports and SEC filings. They tackled complex challenges around metadata extraction, chart/graph analysis, and evaluation methodologies. The system needed to handle tens of thousands of documents, each containing hundreds of pages with tables, graphs, and charts spanning different time periods and fiscal years. Their solution incorporated multi-modal models for image analysis, custom evaluation frameworks, and specialized document processing pipelines.

Building Production-Ready Agentic AI Systems in Financial Services

Fitch Group

Jayeeta Putatunda, Director of AI Center of Excellence at Fitch Group, shares lessons learned from deploying agentic AI systems in the financial services industry. The discussion covers the challenges of moving from proof-of-concept to production, emphasizing the importance of evaluation frameworks, observability, and the "data prep tax" required for reliable AI agent deployments. Key insights include the need to balance autonomous agents with deterministic workflows, implement comprehensive logging at every checkpoint, combine LLMs with traditional predictive models for numerical accuracy, and establish strong business-technical partnerships to define success metrics. The conversation highlights that while agentic frameworks enable powerful capabilities, production success requires careful system design, multi-layered evaluation, human-in-the-loop validation patterns, and a focus on high-ROI use cases rather than chasing the latest model architectures.

Building Production-Ready Agentic Systems with the Claude Developer Platform

Anthropic

Anthropic's Claude Developer Platform team discusses their evolution from a simple API to a comprehensive platform for building autonomous AI agents in production. The conversation covers their philosophy of "unhobbling" models by reducing scaffolding and giving Claude more autonomous decision-making capabilities through tools like web search, code execution, and context management. They introduce the Claude Code SDK as a general-purpose agentic harness that handles the tool-calling loop automatically, making it easier for developers to prototype and deploy agents. The platform addresses key production challenges including prompt caching, context window management, observability for long-running tasks, and agentic memory, with a roadmap focused on higher-order abstractions and self-improving systems.

Building Production-Ready AI Agent Systems: Multi-Agent Orchestration and LLMOps at Scale

Galileo / Crew AI

This podcast discussion between Galileo and Crew AI leadership explores the challenges and solutions for deploying AI agents in production environments at enterprise scale. The conversation covers the technical complexities of multi-agent systems, the need for robust evaluation and observability frameworks, and the emergence of new LLMOps practices specifically designed for non-deterministic agent workflows. Key topics include authentication protocols, custom evaluation metrics, governance frameworks for regulated industries, and the democratization of agent development through no-code platforms.

Building Production-Ready AI Analytics with LLMs: Lessons from Jira Integration

Luna

Luna developed an AI-powered Jira analytics system using GPT-4 and Claude 3.7 to extract actionable insights from complex project management data, helping engineering and product teams track progress, identify risks, and predict delays. Through iterative development, they identified seven critical lessons for building reliable LLM applications in production, including the importance of data quality over prompt engineering, explicit temporal context handling, optimal temperature settings for structured outputs, chain-of-thought reasoning for accuracy, focused constraints to reduce errors, leveraging reasoning models effectively, and addressing the "yes-man" effect where models become overly agreeable rather than critically analytical.

Building Production-Ready Conversational AI Voice Agents: Latency, Voice Quality, and Integration Challenges

Deepgram

Deepgram, a leader in transcription services, shares insights on building effective conversational AI voice agents. The presentation covers critical aspects of implementing voice AI in production, including managing latency requirements (targeting 300ms benchmark), handling end-pointing challenges, ensuring voice quality through proper prosody, and integrating LLMs with speech-to-text and text-to-speech services. The company introduces their new text-to-speech product Aura, designed specifically for conversational AI applications with low latency and natural voice quality.

Building Production-Ready CRM Integration for ChatGPT using Model Context Protocol

Hubspot

HubSpot developed the first third-party CRM connector for ChatGPT using the Model Context Protocol (MCP), creating a remote MCP server that enables 250,000+ businesses to perform deep research through conversational AI without requiring local installations. The solution involved building a homegrown MCP server infrastructure using Java and Dropwizard, implementing OAuth-based user-level permissions, creating a distributed service discovery system for automatic tool registration, and designing a query DSL that allows AI models to generate complex CRM searches through natural language interactions.

Building Production-Ready Customer Support AI Agents: Challenges and Solutions

Gradient Labs

Gradient Labs shares their experience building and deploying AI agents for customer support automation in production. While prototyping with LLMs is relatively straightforward, deploying agents to production introduces complex challenges around state management, knowledge integration, tool usage, and handling race conditions. The company developed a state machine-based architecture with durable execution engines to manage these challenges, successfully handling hundreds of conversations per day with high customer satisfaction.

Building Production-Ready LLMs for Automated Code Repair: A Scalable IDE Integration Case Study

Replit

Replit tackled the challenge of automating code repair in their IDE by developing a specialized 7B parameter LLM that integrates directly with their Language Server Protocol (LSP) diagnostics. They created a production-ready system that can automatically fix Python code errors by processing real-time IDE events, operational transformations, and project snapshots. Using DeepSeek-Coder-Instruct-v1.5 as their base model, they implemented a comprehensive data pipeline with serverless verification, structured input/output formats, and GPU-accelerated inference. The system achieved competitive results against much larger models like GPT-4 and Claude-3, with their finetuned 7B model matching or exceeding the performance of these larger models on both academic benchmarks and real-world error fixes. The production system features low-latency inference, load balancing, and real-time code application, demonstrating successful deployment of an LLM system in a high-stakes development environment where speed and accuracy are crucial.

Building Production-Scale Code Completion Tools with Continuous Evaluation and Prompt Engineering

Gitlab

Gitlab's ModelOps team developed a sophisticated code completion system using multiple LLMs, implementing a continuous evaluation and improvement pipeline. The system combines both open-source and third-party LLMs, featuring a comprehensive architecture that includes continuous prompt engineering, evaluation benchmarks, and reinforcement learning to consistently improve code completion accuracy and usefulness for developers.

Building QueryAnswerBird: An AI Data Analyst with Text-to-SQL and RAG

Delivery Hero

Woowa Brothers, part of Delivery Hero, developed QueryAnswerBird (QAB), an LLM-based AI data analyst to address employee challenges with SQL query generation and data literacy. Through a company-wide survey, they identified that 95% of employees used data for work, but over half struggled with SQL due to time constraints or difficulty translating business logic into queries. The solution leveraged RAG, LangChain, and GPT-4 to build a Slack-integrated assistant that automatically generates SQL queries from natural language, interprets queries, validates syntax, and explores tables. After winning first place at an internal hackathon in 2023, a dedicated task force spent six months developing the production system with comprehensive LLMOps practices including A/B testing, monitoring dashboards, API load balancing, GPT caching, and CI/CD deployment, conducting over 500 tests to optimize performance.

Building QueryAnswerBird: An LLM-Powered AI Data Analyst with RAG and Text-to-SQL

Delivery Hero

Woowa Brothers, part of Delivery Hero, developed QueryAnswerBird (QAB), an LLM-based AI data analyst to address the challenge that while 95% of employees used data in their work, over half struggled with SQL proficiency and data extraction reliability. The solution leveraged GPT-4, RAG architecture, LangChain, and comprehensive LLMOps practices to create a Slack-based chatbot that could generate SQL queries from natural language, interpret queries, validate syntax, and provide data discovery features. The development involved building automated unstructured data pipelines with vector stores, implementing multi-chain RAG architecture with router supervisors, establishing LLMOps infrastructure including A/B testing and monitoring dashboards, and conducting over 500 experiments to optimize performance, resulting in a 24/7 accessible service that provides high-quality query responses within 30 seconds to 1 minute.

Building Resilient Multi-Provider AI Agent Infrastructure for Financial Services

Gradient Labs

Gradient Labs built an AI agent that handles customer interactions for financial services companies, requiring high reliability in production. The company architected a sophisticated failover system that spans multiple LLM providers (OpenAI, Anthropic, Google) and hosting platforms (native APIs, Azure, AWS, GCP), enabling both traffic distribution across rate limits and automatic failover during errors, rate limiting, or latency spikes. They use Temporal for durable execution to checkpoint progress across long-running agentic workflows, and have implemented both provider-level and model-level failover strategies with tailored prompts for backup models, ensuring continuous operation even during catastrophic provider outages.

Building Robust Enterprise Search with LLMs and Traditional IR

Glean

Glean tackles enterprise search by combining traditional information retrieval techniques with modern LLMs and embeddings. Rather than relying solely on AI techniques, they emphasize the importance of rigorous ranking algorithms, personalization, and hybrid approaches that combine classical IR with vector search. The company has achieved unicorn status and serves major enterprises by focusing on holistic search solutions that include personalization, feed recommendations, and cross-application integrations.

Building Robust Evaluation Systems for GitHub Copilot

Github

This case study explores how Github developed and evolved their evaluation systems for Copilot, their AI code completion tool. Initially skeptical about the feasibility of code completion, the team built a comprehensive evaluation framework called "harness lib" that tested code completions against actual unit tests from open source repositories. As the product evolved to include chat capabilities, they developed new evaluation approaches including LLM-as-judge for subjective assessments, along with A/B testing and algorithmic evaluations for function calls. This systematic approach to evaluation helped transform Copilot from an experimental project to a robust production system.

Building Secure and Private Enterprise LLM Infrastructure

Slack

Slack implemented AI features by developing a secure architecture that ensures customer data privacy and compliance. They used AWS SageMaker to host LLMs in their VPC, implemented RAG instead of fine-tuning models, and maintained strict data access controls. The solution resulted in 90% of AI-adopting users reporting increased productivity while maintaining enterprise-grade security and compliance requirements.

Building Secure and Private Enterprise Search with LLMs

Slack

Slack built an enterprise search feature that extends their AI-powered search capabilities to external sources like Google Drive and GitHub while maintaining strict security and privacy standards. The problem was enabling users to search across multiple knowledge sources without compromising data security or violating privacy principles. Their solution uses a federated, real-time approach with OAuth-based authentication, Retrieval Augmented Generation (RAG), and LLMs hosted in an AWS escrow VPC to ensure customer data never leaves Slack's trust boundary, isn't used for model training, and respects user permissions. The result is a production system that surfaces relevant, up-to-date, permissioned content from both internal and external sources while maintaining enterprise-grade security standards, with explicit user and admin control over data access.

Building Voice-Enabled AI Assistants with Real-Time Processing

Bee

A detailed exploration of building real-time voice-enabled AI assistants, featuring multiple approaches from different companies and developers. The case study covers how to achieve low-latency voice processing, transcription, and LLM integration for interactive AI assistants. Solutions demonstrated include both commercial services like Deepgram and open-source implementations, with a focus on achieving sub-second latency, high accuracy, and cost-effective deployment.

Business Intelligence Agent for Automotive Dealers with Dynamic UI and Instant Actions

Prosus

Prosus, a machine learning engineering team, built an AI-powered business intelligence assistant for Otomoto, Poland's largest secondhand car dealer platform with thousands of dealers and millions of users. The problem was that dealers were overwhelmed by the platform's rich data and struggled to organize listings and take actionable insights. The initial chat-based agent achieved only 10% engagement with negligible repeat usage, revealing "chat fatigue" - users didn't know what to ask and found the open text box intimidating. The solution involved moving away from pure chat interfaces to a dynamic UI with context-aware action buttons, interactive responses with clickable elements, streaming for perceived faster responses, and purpose-built data aggregation tools using CSV format to reduce token consumption. Results showed that users were significantly more likely to engage when presented with clickable buttons rather than open-ended questions, with button clicks leading to follow-up questions and improved engagement metrics.

Challenges in Building Enterprise Chatbots with LLMs: A Banking Case Study

Invento Robotics

A bank's attempt to implement a customer support chatbot using GPT-4 and RAG reveals the complexities and challenges of deploying LLMs in production. What was initially estimated as a three-month project struggled to deliver after a year, highlighting key challenges in domain knowledge management, retrieval effectiveness, conversation flow design, state management, latency, and regulatory compliance.

Climate Tech Foundation Models for Environmental AI Applications

Various

Climate tech startups are leveraging Amazon SageMaker HyperPod to build specialized foundation models that address critical environmental challenges including weather prediction, sustainable material discovery, ecosystem monitoring, and geological modeling. Companies like Orbital Materials and Hum.AI are training custom models from scratch on massive environmental datasets, achieving significant breakthroughs such as tenfold performance improvements in carbon capture materials and the ability to see underwater from satellite imagery. These startups are moving beyond traditional LLM fine-tuning to create domain-specific models with billions of parameters that process multimodal environmental data including satellite imagery, sensor networks, and atmospheric measurements at scale.

Cloud-Based Generative AI for Preliminary Engineering Design

Rolls-Royce

Rolls-Royce implemented a cloud-based generative AI approach using GANs (Generative Adversarial Networks) to support preliminary engineering design tasks. The system combines geometric parameters and simulation data to generate and validate new design concepts, with a particular focus on aerospace applications. By leveraging Databricks' cloud infrastructure, they reduced training time from one week to 4-6 hours while maintaining data security through careful governance and transfer learning approaches.

Cloud-Based Integrated Diagnostics Platform with AI-Assisted Digital Pathology

Philips

Philips partnered with AWS to transform medical imaging and diagnostics by moving their entire healthcare informatics portfolio to the cloud, with particular focus on digital pathology. The challenge was managing petabytes of medical imaging data across multiple modalities (radiology, cardiology, pathology) stored in disparate silos, making it difficult for clinicians to access comprehensive patient information efficiently. Philips leveraged AWS Health Imaging and other cloud services to build a scalable, cloud-native integrated diagnostics platform that reduces workflow time from 11+ hours to 36 minutes in pathology, enables real-time collaboration across geographies, and supports AI-assisted diagnosis. The solution now manages 134 petabytes of data covering 34 million patient exams and 11 billion medical records, with 95 of the top 100 US hospitals using Philips healthcare informatics solutions.

Comprehensive LLM Evaluation Framework for Production AI Code Assistants

Github

Github describes their robust evaluation framework for testing and deploying new LLM models in their Copilot product. The team runs over 4,000 offline tests, including automated code quality assessments and chat capability evaluations, before deploying any model changes to production. They use a combination of automated metrics, LLM-based evaluation, and manual testing to assess model performance, quality, and safety across multiple programming languages and frameworks.

Contact Center Transformation with AI-Powered Customer Service and Agent Assistance

Canada Life

Canada Life, a leading financial services company serving 14 million customers (one in three Canadians), faced significant contact center challenges including 5-minute average speed to answer, wait times up to 40 minutes, complex routing, high transfer rates, and minimal self-service options. The company migrated 21 business units from a legacy system to Amazon Connect in 7 months, implementing AI capabilities including chatbots, call summarization, voice-to-text, automated authentication, and proficiency-based routing. Results included 94% reduction in wait time, 10% reduction in average handle time, $7.5 million savings in first half of 2025, 92% reduction in average speed to answer (now 18 seconds), 83% chatbot containment rate, and 1900 calls deflected per week. The company plans to expand AI capabilities including conversational AI, agent assist, next best action, and fraud detection, projecting $43 million in cost savings over five years.

Context Engineering and Agent Development at Scale: Building Open Deep Research

LangChain

Lance Martin from LangChain discusses the emerging discipline of "context engineering" through his experience building Open Deep Research, a deep research agent that evolved over a year to become the best-performing open-source solution on Deep Research Bench. The conversation explores how managing context in production agent systems—particularly across dozens to hundreds of tool calls—presents challenges distinct from simple prompt engineering, requiring techniques like context offloading, summarization, pruning, and multi-agent isolation. Martin's iterative development journey illustrates the "bitter lesson" for AI engineering: structured workflows that work well with current models can become bottlenecks as models improve, requiring engineers to continuously remove structure and embrace more general approaches to capture exponential model improvements.

Context Engineering for Production AI Agents at Scale

Manus

Manus, a general AI agent platform, addresses the challenge of context explosion in long-running autonomous agents that can accumulate hundreds of tool calls during typical tasks. The company developed a comprehensive context engineering framework encompassing five key dimensions: context offloading (to file systems and sandbox environments), context reduction (through compaction and summarization), context retrieval (using file-based search tools), context isolation (via multi-agent architectures), and context caching (for KV cache optimization). This approach has been refined through five major refactors since launch in March, with the system supporting typical tasks requiring around 50 tool calls while maintaining model performance and managing token costs effectively through their layered action space architecture.

Context Engineering for Production AI Assistants at Scale

Spotify

Shopify developed Sidekick, an AI assistant serving millions of merchants on their commerce platform. The challenge was managing context windows effectively while maintaining performance, latency, and cost efficiency for an agentic system operating at massive scale. Their solution involved sophisticated "context engineering" techniques including aggressive token management (removing processed tool messages, trimming old conversation turns), a three-tier memory system (explicit user preferences, implicit user profiles, and episodic memory via RAG), and just-in-time instruction injection that collocates instructions with tool outputs. These techniques reportedly improved instruction adherence by 5-10% while reducing jailbreak likelihood and maintaining acceptable latency despite the system managing over 20 tools and handling complex multi-step agentic workflows.

Context Engineering Platform for Multi-Domain RAG and Agentic Systems

Contextual

Contextual has developed an end-to-end context engineering platform designed to address the challenges of building production-ready RAG and agentic systems across multiple domains including e-commerce, code generation, and device testing. The platform combines multimodal ingestion, hierarchical document processing, hybrid search with reranking, and dynamic agents to enable effective reasoning over large document collections. In a recent context engineering hackathon, Contextual's dynamic agent achieved competitive results on a retail dataset of nearly 100,000 documents, demonstrating the value of constrained sub-agents, turn limits, and intelligent tool selection including MCP server management.

Context Engineering Strategies for Production AI Agents

Manus

Manus AI developed a production AI agent system that uses context engineering instead of fine-tuning to enable rapid iteration and deployment. The company faced the challenge of building an effective agentic system that could operate reliably at scale while managing complex multi-step tasks. Their solution involved implementing several key strategies including KV-cache optimization, tool masking instead of removal, file system-based context management, attention manipulation through task recitation, and deliberate error preservation for learning. These approaches allowed Manus to achieve faster development cycles, improved cost efficiency, and better agent performance across millions of users while maintaining system stability and scalability.

Context-Aware AI Code Generation and Assistant at Scale

Windsurf

Windsurf, an AI coding toolkit company, addresses the challenge of generating contextually relevant code for individual developers and organizations. While generating generic code has become straightforward, the real challenge lies in producing code that fits into existing large codebases, adheres to organizational standards, and aligns with personal coding preferences. Windsurf's solution centers on a sophisticated context management system that combines user behavioral heuristics (cursor position, open files, clipboard content, terminal activity) with hard evidence from the codebase (code, documentation, rules, memories). Their approach optimizes for relevant context selection rather than simply expanding context windows, leveraging their background in GPU optimization to efficiently find and process relevant context at scale.

Context-Aware Item Recommendations Using Hybrid LLM and Embedding-Based Retrieval

DoorDash

DoorDash's Core Consumer ML team developed a GenAI-powered context shopping engine to address the challenge of lost user intent during in-app searches for items like "fresh vegetarian sushi." The traditional search system struggled to preserve specific user context, leading to generic recommendations and decision fatigue. The team implemented a hybrid approach combining embedding-based retrieval (EBR) using FAISS with LLM-based reranking to balance speed and personalization. The solution achieved end-to-end latency of approximately six seconds with store page loads under two seconds, while significantly improving user satisfaction through dynamic, personalized item carousels that maintained user context and preferences. This hybrid architecture proved more practical than pure LLM or deep neural network approaches by optimizing for both performance and cost efficiency.

Cost Optimization and Performance Panel Discussion: Strategies for Running LLMs in Production

Various

A panel discussion featuring experts from Neva, Intercom, Prompt Layer, and OctoML discussing strategies for optimizing costs and performance when running LLMs in production. The panel explores various approaches from using API services to running models in-house, covering topics like model compression, hardware selection, latency optimization, and monitoring techniques. Key insights include the trade-offs between API usage and in-house deployment, strategies for cost reduction, and methods for performance optimization.

Cost-Effective LLM Transaction Categorization for Business Banking

ANNA

ANNA, a UK business banking provider, implemented LLMs to automate transaction categorization for tax and accounting purposes across diverse business types. They achieved this by combining traditional ML with LLMs, particularly focusing on context-aware categorization that understands business-specific nuances. Through strategic optimizations including offline predictions, improved context utilization, and prompt caching, they reduced their LLM costs by 75% while maintaining high accuracy in their AI accountant system.

CPU-Based Deployment of Large MoE Models Using Intel Xeon 6 Processors

Lmsys

Intel PyTorch Team collaborated with the SGLang project to develop a cost-effective CPU-only deployment solution for large Mixture of Experts (MoE) models like DeepSeek R1, addressing the challenge of high memory requirements that typically necessitate multiple expensive AI accelerators. Their solution leverages Intel Xeon 6 processors with Advanced Matrix Extensions (AMX) and implements highly optimized kernels for attention mechanisms and MoE computations, achieving 6-14x speedup in time-to-first-token (TTFT) and 2-4x speedup in time-per-output-token (TPOT) compared to llama.cpp, while supporting multiple quantization formats including BF16, INT8, and FP8.

Customer Service Transformation with AI-Based Email Automation and Chatbot Implementation

Sixt

Sixt, a mobility service provider with over €4 billion in revenue, transformed their customer service operations using generative AI to handle the complexity of multiple product lines across 100+ countries. The company implemented "Project AIR" (AI-based Replies) to automate email classification, generate response proposals, and deploy chatbots across multiple channels. Within five months of ideation, they moved from proof-of-concept to production, achieving over 90% classification accuracy using Amazon Bedrock with Anthropic Claude models (up from 70% with out-of-the-box solutions), while reducing classification costs by 70%. The solution now handles customer inquiries in multiple languages, integrates with backend reservation systems, and has expanded from email automation to messaging and chatbot services deployed across all corporate countries by Q1 2025.

Data Flywheels for Cost-Effective AI Agent Optimization

Nvidia

NVIDIA implemented a data flywheel approach to optimize their internal employee support AI agent, addressing the challenge of maintaining accuracy while reducing inference costs. The system continuously collects user feedback and production data to fine-tune smaller, more efficient models that can replace larger, expensive foundational models. Through this approach, they achieved comparable accuracy (94-96%) with significantly smaller models (1B-8B parameters instead of 70B), resulting in 98% cost savings and 70x lower latency while maintaining the agent's effectiveness in routing employee queries across HR, IT, and product documentation domains.

Deploying Agentic AI in Financial Services at Scale

Nvidia

Financial institutions including Capital One, Royal Bank of Canada (RBC), and Visa are deploying agentic AI systems in production to handle real-time financial transactions and complex workflows. These multi-agent systems go beyond simple generative AI by reasoning through problems and taking action autonomously, requiring 100-200x more computational resources than traditional single-shot inference. The implementations focus on use cases like automotive purchasing assistance, investment research automation, and fraud detection, with organizations building proprietary models using open-source foundations (like Llama or Mistral) combined with bank-specific data to achieve 60-70% accuracy improvements. The results include 60% cycle time improvements in report generation, 10x more data analysis capacity, and enhanced fraud detection capabilities, though these gains require substantial investment in AI infrastructure and talent development.

Deploying Agentic Code Review at Scale with GPT-5 Codex

OpenAI

OpenAI addresses the challenge of verifying AI-generated code at scale by deploying an autonomous code reviewer built on GPT-5-Codex and GPT-5.1-Codex-Max. As autonomous coding systems produce code volumes that exceed human oversight capacity, the risk of severe bugs and vulnerabilities increases. The solution involves training a dedicated agentic code reviewer with repository-wide tool access and code execution capabilities, optimizing for precision over recall to maintain developer trust and minimize false alarms. The system now reviews over 100,000 external PRs daily, with authors making code changes in response to 52.7% of comments internally, demonstrating actionable impact while maintaining a low "alignment tax" on developer workflows.

Deploying AI Agents for Scalable Immigration Automation

Navismart AI

Navismart AI developed a multi-agent AI system to automate complex immigration processes that traditionally required extensive human expertise. The platform addresses challenges including complex sequential workflows, varying regulatory compliance across different countries, and the need for human oversight in high-stakes decisions. Built on a modular microservices architecture with specialized agents handling tasks like document verification, form filling, and compliance checks, the system uses Kubernetes for orchestration and scaling. The solution integrates REST APIs for inter-agent communication, implements end-to-end encryption for security, and maintains human-in-the-loop capabilities for critical decisions. The team started with US immigration processes due to their complexity and is expanding to other countries and domains like education.

Deploying Generative AI at Scale Across 5,000 Developers

Liberty IT

Liberty IT, the technology division of Fortune 100 insurance company Liberty Mutual, embarked on a large-scale deployment of generative AI tools across their global workforce of over 5,000 developers and 50,000+ employees. The initiative involved rolling out custom GenAI platforms including Liberty GPT (an internal ChatGPT variant) to 70% of employees and GitHub Copilot to over 90% of IT staff within the first year. The company faced challenges including rapid technology evolution, model availability constraints, cost management, RAG implementation complexity, and achieving true adoption beyond basic usage. Through building a centralized AI platform with governance controls, implementing comprehensive learning programs across six streams, supporting 28 different models optimized for various use cases, and developing custom dashboards for cost tracking and observability, Liberty IT successfully navigated these challenges while maintaining enterprise security and compliance requirements.

Deploying Secure AI Agents in Highly Regulated Financial and Gaming Environments

Sicoob / Holland Casino

Two organizations operating in highly regulated industries—Sicoob, a Brazilian cooperative financial institution, and Holland Casino, a government-mandated Dutch gaming operator—share their approaches to deploying generative AI workloads while maintaining strict compliance requirements. Sicoob built a scalable infrastructure using Amazon EKS with GPU instances, leveraging open-source tools like Karpenter, KEDA, vLLM, and Open WebUI to run multiple open-source LLMs (Llama, Mistral, DeepSeek, Granite) for code generation, robotic process automation, investment advisory, and document interaction use cases, achieving cost efficiency through spot instances and auto-scaling. Holland Casino took a different path, using Anthropic's Claude models via Amazon Bedrock and developing lightweight AI agents using the Strands framework, later deploying them through Bedrock Agent Core to provide management stakeholders with self-service access to cost, security, and operational insights. Both organizations emphasized the importance of security, governance, compliance frameworks (including ISO 42001 for AI), and responsible AI practices while demonstrating that regulatory requirements need not inhibit AI adoption when proper architectural patterns and AWS services are employed.

Document Metadata Extraction at Scale Using Generative AI for Healthcare and Financial Services

AArete

AArete, a management and technology consulting firm serving healthcare payers and financial services, developed Doxy AI to extract structured metadata from complex business documents like provider and vendor contracts. The company evolved from manual document processing (100 documents per week per person) through rules-based approaches (50-60% accuracy) to a generative AI solution built on AWS Bedrock using Anthropic's Claude models. The production system achieved 99% accuracy while processing up to 500,000 documents per week, resulting in a 97% reduction in manual effort and $330 million in client savings through improved contract analysis, claims overpayment identification, and operational efficiency.

Document-Wide AI Editing in Microsoft Word Add-In

Harvey

Harvey developed an AI-powered Word Add-In that enables comprehensive document-wide edits on 100+ page legal documents through a single query. The system addresses the challenges of OOXML complexity by creating reversible mappings between document structure and natural language, while using an orchestrator-subagent architecture to overcome position bias and ensure thorough coverage. The solution transforms hours of manual legal editing into seamless single-query interactions, supporting complex use cases like contract conformance, template creation, and jurisdiction-specific adaptations.

Domain-Adapted Foundation Models for Enterprise-Scale LLM Deployment

LinkedIn

LinkedIn developed a family of domain-adapted foundation models (EON models) to enhance their GenAI capabilities across their platform serving 1B+ members. By adapting open-source models like Llama through multi-task instruction tuning and safety alignment, they created cost-effective models that maintain high performance while being 75x more cost-efficient than GPT-4. The EON-8B model demonstrated significant improvements in production applications, including a 4% increase in candidate-job-requirements matching accuracy compared to GPT-4o mini in their Hiring Assistant product.

Domain-Specific AI Platform for Manufacturing and Supply Chain Optimization

Articul8

Articul8 developed a generative AI platform to address enterprise challenges in manufacturing and supply chain management, particularly for a European automotive manufacturer. The platform combines public AI models with domain-specific intelligence and proprietary data to create a comprehensive knowledge graph from vast amounts of unstructured data. The solution reduced incident response time from 90 seconds to 30 seconds (3x improvement) and enabled automated root cause analysis for manufacturing defects, helping experts disseminate daily incidents and optimize production processes that previously required manual analysis by experienced engineers.

Domain-Specific Small Language Models for Call Center Intelligence

Deepgram

Deepgram tackles the challenge of building efficient language AI products for call centers by advocating for small, domain-specific language models instead of large foundation models. They demonstrate this by creating a 500M parameter model fine-tuned on call center transcripts, which achieves better performance in call center tasks like conversation continuation and summarization while being more cost-effective and faster than larger models.

Dutch YouTube Interface Localization and Content Management

Tastewise

This appears to be the Dutch footer section of YouTube's interface, showcasing the platform's localization and content management system. However, without more context about specific LLMOps implementation details, we can only infer that YouTube likely employs language models for content translation, moderation, and user interface localization.

Dynamic LLM Selection and Prompt Optimization Through Automated Evaluation and User Feedback

Beekeeper

Beekeeper, a digital workplace platform for frontline workers, faced the challenge of selecting and optimizing LLMs and prompts across rapidly evolving models while personalizing responses for different users and use cases. They built an Amazon Bedrock-powered system that continuously evaluates multiple model/prompt combinations using synthetic test data and real user feedback, ranks them on a live leaderboard based on quality, cost, and speed metrics, and automatically routes requests to the best-performing option. The system also mutates prompts based on user feedback to create personalized variations while using drift detection to ensure quality standards are maintained. This approach resulted in 13-24% better ratings on responses when aggregated per tenant, reduced manual labor in model selection, and enabled rapid adaptation to new models and user preferences.

Edge AI Architecture for Wearable Smart Glasses with Real-Time Multimodal Processing

Meta / Ray Ban

Meta Reality Labs developed a production AI system for Ray-Ban Meta smart glasses that brings AI capabilities directly to wearable devices through a four-part architecture combining on-device processing, smartphone connectivity, and cloud-based AI services. The system addresses unique challenges of wearable AI including power constraints, thermal management, connectivity limitations, and real-time performance requirements while enabling features like visual question answering, photo capture, and voice commands with sub-second response times for on-device operations and under 3-second response times for cloud-based AI interactions.

End-to-End Foundation Models for Self-Driving Vehicles at Scale

Wayve

Wayve is developing self-driving technology that works across multiple vehicle types and global markets by leveraging end-to-end foundation models trained on driving data rather than traditional rule-based systems. The company moved away from intermediate representations like object detection to a more holistic approach where a single neural network learns to drive from examples, similar to how large language models learn language. This architecture enabled rapid global expansion from primarily driving in London to operating across 500 cities in Japan, Europe, the UK, and the US within a year. The system uses foundation models for multiple tasks including driving, simulation, scenario classification, and even natural language explanations of driving decisions, with all components compressed into a single 75-watt model deployable in production vehicles.

End-to-End LLM Observability for RAG-Powered AI Assistant

Splunk

Splunk built an AI Assistant leveraging Retrieval-Augmented Generation (RAG) to answer FAQs using curated public content from .conf24 materials. The system was developed in a hackathon-style sprint using their internal CIRCUIT platform. To operationalize this LLM-powered application at scale, Splunk integrated comprehensive observability across the entire RAG pipeline—from prompt handling and document retrieval to LLM generation and output evaluation. By instrumenting structured logs, creating unified dashboards in Splunk Observability Cloud, and establishing proactive alerts for quality degradation, hallucinations, and cost overruns, they achieved full visibility into response quality, latency, source document reliability, and operational health. This approach enabled rapid iteration, reduced mean time to resolution for quality issues, and established reproducible governance practices for production LLM deployments.

Engineering Principles and Practices for Production LLM Systems

Langchain

This case study captures insights from Lance Martin, ML engineer at Langchain, discussing the evolution from traditional ML to LLM-based systems and the emerging engineering discipline of building production GenAI applications. The discussion covers key challenges including the shift from model training to model orchestration, the need to continuously rearchitect systems as foundation models rapidly improve, and the critical importance of context engineering to manage token usage and prevent context degradation. Solutions explored include workflow versus agent architectures, the three-part context engineering playbook (reduce, offload, isolate), and evaluation strategies that emphasize user feedback and tracing over static benchmarks. Results demonstrate that teams like Manis have rearchitected their systems five times since March 2025, and that simpler approaches with proper observability often outperform complex architectures, with the understanding that today's solutions must be rebuilt as models improve.

Enhancing E-commerce Search with GPT-based Query Expansion

Whatnot

Whatnot improved their e-commerce search functionality by implementing a GPT-based query expansion system to handle misspellings and abbreviations. The system processes search queries offline through data collection, tokenization, and GPT-based correction, storing expansions in a production cache for low-latency serving. This approach reduced irrelevant content by more than 50% compared to their previous method when handling misspelled queries and abbreviations.

Enhancing E-commerce Search with LLM-Powered Semantic Retrieval

Picnic

Picnic, an e-commerce grocery delivery company, implemented LLM-enhanced search retrieval to improve product and recipe discovery across multiple languages and regions. They used GPT-3.5-turbo for prompt-based product description generation and OpenAI's text-embedding-3-small model for embedding generation, combined with OpenSearch for efficient retrieval. The system employs precomputation and caching strategies to maintain low latency while serving millions of customers across different countries.

Enhancing E-commerce Search with LLMs at Scale

Instacart

Instacart integrated LLMs into their search stack to improve query understanding, product attribute extraction, and complex intent handling across their massive grocery e-commerce platform. The solution addresses challenges with tail queries, product attribute tagging, and complex search intents while considering production concerns like latency, cost optimization, and evaluation metrics. The implementation combines offline and online LLM processing to enhance search relevance and enable new capabilities like personalized merchandising and improved product discovery.

Enterprise Agentic AI for Customer Support and Sales Using Amazon Bedrock AgentCore

Swisscom

Swisscom, Switzerland's leading telecommunications provider, implemented Amazon Bedrock AgentCore to build and scale enterprise AI agents for customer support and sales operations across their organization. The company faced challenges in orchestrating AI agents across different departments while maintaining Switzerland's strict data protection compliance, managing secure cross-departmental authentication, and preventing redundant efforts. By leveraging Amazon Bedrock AgentCore's Runtime, Identity, and Memory services along with the Strands Agents framework, Swisscom deployed two B2C use cases—personalized sales pitches and automated technical support—achieving stakeholder demos within 3-4 weeks, handling thousands of monthly requests with low latency, and establishing a scalable foundation that enables secure agent-to-agent communication while maintaining regulatory compliance.

Enterprise AI Platform Integration for Secure Production Deployment

Rubrik

Predibase, a fine-tuning and model serving platform, announced its acquisition by Rubrik, a data security and governance company, with the goal of combining Predibase's generative AI capabilities with Rubrik's secure data infrastructure. The integration aims to address the critical challenge that over 50% of AI pilots never reach production due to issues with security, model quality, latency, and cost. By combining Predibase's post-training and inference capabilities with Rubrik's data security posture management, the merged platform seeks to provide an end-to-end solution that enables enterprises to deploy generative AI applications securely and efficiently at scale.

Enterprise Challenges and Opportunities in Large-Scale LLM Deployment

Barclays

A senior leader in industry discusses the key challenges and opportunities in deploying LLMs at enterprise scale, highlighting the differences between traditional MLOps and LLMOps. The presentation covers critical aspects including cost management, infrastructure needs, team structures, and organizational adaptation required for successful LLM deployment, while emphasizing the importance of leveraging existing MLOps practices rather than completely reinventing the wheel.

Enterprise GenAI Implementation Strategies Across Industries

AstraZeneca / Adobe / Allianz Technology

A panel discussion featuring leaders from AstraZeneca, Adobe, and Allianz Technology sharing their experiences implementing GenAI in production. The case study covers how these enterprises prioritized use cases, managed legal considerations, and scaled AI adoption. Key successes included AstraZeneca's viral research assistant tool, Adobe's approach to legal frameworks for AI, and Allianz's code modernization efforts. The discussion highlights the importance of early legal engagement, focusing on impactful use cases, and treating AI implementation as a cultural transformation rather than just a tool rollout.

Enterprise LLM Application Development: GitHub Copilot's Journey

Github

GitHub shares their three-year journey of developing and scaling GitHub Copilot, their enterprise-grade AI code completion tool. The case study details their approach through three stages: finding the right problem space, nailing the product experience through rapid iteration and testing, and scaling the solution for enterprise deployment. The result was a successful launch that showed developers coding up to 55% faster and reporting 74% less frustration when coding.

Enterprise LLM Deployment with Multi-Cloud Data Platform Integration

Databricks

This presentation by Databricks' Product Management lead addresses the challenges large enterprises face when deploying LLMs into production, particularly around data governance, evaluation, and operational control. The talk centers on two primary case studies: FactSet's transformation of their query language translation system (improving from 59% to 85% accuracy while reducing latency from 15 to 6 seconds), and Databricks' internal use of Claude for automating analyst questionnaire responses. The solution involves decomposing complex prompts into multi-step agentic workflows, implementing granular governance controls across data and model access, and establishing rigorous evaluation frameworks to achieve production-grade reliability in high-risk enterprise environments.

Enterprise LLM Implementation Panel: Lessons from Box, Glean, Tyace, Security AI and Citibank

Various

A panel discussion featuring leaders from multiple enterprises sharing their experiences implementing LLMs in production. The discussion covers key challenges including data privacy, security, cost management, and enterprise integration. Speakers from Box discuss content management challenges, Glean covers enterprise search implementations, Tyace shares content generation experiences, Security AI addresses data safety, and Citibank provides CIO perspective on enterprise-wide AI deployment. The panel emphasizes the importance of proper data governance, security controls, and the need for systematic approach to move from POCs to production.

Enterprise LLMOps: Development, Operations and Security Framework

Cisco

At Cisco, the challenge of integrating LLMs into enterprise-scale applications required developing new DevSecOps workflows and practices. The presentation explores how Cisco approached continuous delivery, monitoring, security, and on-call support for LLM-powered applications, showcasing their end-to-end model for LLMOps in a large enterprise environment.

Enterprise Neural Machine Translation at Scale

DeepL

DeepL, a translation company founded in 2017, has built a successful enterprise-focused business using neural machine translation models to tackle the language barrier problem at scale. The company handles hundreds of thousands of customers by developing specialized neural translation models that balance accuracy and fluency, training them on curated parallel and monolingual corpora while leveraging context injection rather than per-customer fine-tuning for scalability. By building their own GPU infrastructure early on and developing custom frameworks for inference optimization, DeepL maintains a competitive edge over general-purpose LLMs and established players like Google Translate, demonstrating strong product-market fit in high-stakes enterprise use cases where translation quality directly impacts legal compliance, customer experience, and business operations.

Enterprise-Scale AI-First Translation Platform with Agentic Workflows

Smartling

Smartling operates an enterprise-scale AI-first agentic translation delivery platform serving major corporations like Disney and IBM. The company addresses challenges around automation, centralization, compliance, brand consistency, and handling diverse content types across global markets. Their solution employs multi-step agentic workflows where different model functions validate each other's outputs, combining neural machine translation with large language models, RAG for accessing validated linguistic assets, sophisticated prompting, and automated post-editing for hyper-localization. The platform demonstrates measurable improvements in throughput (from 2,000 to 6,000-7,000 words per day), cost reduction (4-10x cheaper than human translation), and quality approaching 70% human parity for certain language pairs and content types, while maintaining enterprise requirements for repeatability, compliance, and brand voice consistency.

Enterprise-Scale Cloud Event Management with Generative AI for Operational Intelligence

Fidelity Investments

Fidelity Investments faced the challenge of managing massive volumes of AWS health events and support case data across 2,000+ AWS accounts and 5 million resources in their multi-cloud environment. They built CENTS (Cloud Event Notification Transport Service), an event-driven data pipeline that ingests, enriches, routes, and acts on AWS health and support data at scale. Building upon this foundation, they developed and published the MAKI (Machine Augmented Key Insights) framework using Amazon Bedrock, which applies generative AI to analyze support cases and health events, identify trends, provide remediation guidance, and enable agentic workflows for vulnerability detection and automated code fixes. The solution reduced operational costs by 57%, improved stakeholder engagement through targeted notifications, and enabled proactive incident prevention by correlating patterns across their infrastructure.

Enterprise-Scale GenAI and Agentic AI Deployment in B2B Supply Chain Operations

Wesco

Wesco, a B2B supply chain and industrial distribution company, presents a comprehensive case study on deploying enterprise-grade AI applications at scale, moving from POC to production. The company faced challenges in transitioning from traditional predictive analytics to cognitive intelligence using generative AI and agentic systems. Their solution involved building a composable AI platform with proper governance, MLOps/LLMOps pipelines, and multi-agent architectures for use cases ranging from document processing and knowledge retrieval to fraud detection and inventory management. Results include deployment of 50+ use cases, significant improvements in employee productivity through "everyday AI" applications, and quantifiable ROI through transformational AI initiatives in supply chain optimization, with emphasis on proper observability, compliance, and change management to drive adoption.

Enterprise-Scale Healthcare LLM System for Unified Patient Journeys

John Snow Labs

John Snow Labs developed a comprehensive healthcare LLM system that integrates multimodal medical data (structured, unstructured, FHIR, and images) into unified patient journeys. The system enables natural language querying across millions of patient records while maintaining data privacy and security. It uses specialized healthcare LLMs for information extraction, reasoning, and query understanding, deployed on-premises via Kubernetes. The solution significantly improves clinical decision support accuracy and enables broader access to patient data analytics while outperforming GPT-4 in medical tasks.

Enterprise-Scale Prompt Engineering Toolkit with Lifecycle Management and Production Integration

Uber

Uber developed a comprehensive prompt engineering toolkit to address the challenges of managing and deploying LLMs at scale. The toolkit provides centralized prompt template management, version control, evaluation frameworks, and production deployment capabilities. It includes features for prompt creation, iteration, testing, and monitoring, along with support for both offline batch processing and online serving. The system integrates with their existing infrastructure and supports use cases like rider name validation and support ticket summarization.

Enterprise-Scale RAG Implementation for E-commerce Product Discovery

Grainger

Grainger, managing 2.5 million MRO products, faced challenges with their e-commerce product discovery and customer service efficiency. They implemented a RAG-based search system using Databricks Mosaic AI and Vector Search to handle 400,000 daily product updates and improve search accuracy. The solution enabled better product discovery through conversational interfaces and enhanced customer service capabilities while maintaining real-time data synchronization.

Enterprise-Wide LLM Assistant Deployment and Evolution Towards Fine-Tuned Models

Marsh McLennan

Marsh McLennan, a global professional services firm, implemented a comprehensive LLM-based assistant solution reaching 87% of their 90,000 employees worldwide, processing 25 million requests annually. Initially focused on productivity enhancement through API access and RAG, they evolved their strategy from using out-of-the-box models to incorporating fine-tuned models for specific tasks, achieving better accuracy than GPT-4 while maintaining cost efficiency. The implementation has conservatively saved over a million hours annually across the organization.

Enterprise-Wide LLM Framework for Manufacturing and Knowledge Management

Toyota

Toyota implemented a comprehensive LLMOps framework to address multiple production challenges, including battery manufacturing optimization, equipment maintenance, and knowledge management. The team developed a unified framework combining LangChain and LlamaIndex capabilities, with special attention to data ingestion pipelines, security, and multi-language support. Key applications include Battery Brain for manufacturing expertise, Gear Pal for equipment maintenance, and Project Cura for knowledge management, all showing significant operational improvements including reduced downtime and faster problem resolution.

Eval-Driven Development for AI Applications

Vercel

Vercel presents their approach to building and deploying AI applications through eval-driven development, moving beyond traditional testing methods to handle AI's probabilistic nature. They implement a comprehensive evaluation system combining code-based grading, human feedback, and LLM-based assessments to maintain quality in their v0 product, an AI-powered UI generation tool. This approach creates a positive feedback loop they call the "AI-native flywheel," which continuously improves their AI systems through data collection, model optimization, and user feedback.

Evaluation-Driven LLM Production Workflows with Morgan Stanley and Grab Case Studies

OpenAI

OpenAI's applied evaluation team presented best practices for implementing LLMs in production through two case studies: Morgan Stanley's internal document search system for financial advisors and Grab's computer vision system for Southeast Asian mapping. Both companies started with simple evaluation frameworks using just 5 initial test cases, then progressively scaled their evaluation systems while maintaining CI/CD integration. Morgan Stanley improved their RAG system's document recall from 20% to 80% through iterative evaluation and optimization, while Grab developed sophisticated vision fine-tuning capabilities for recognizing road signs and lane counts in Southeast Asian contexts. The key insight was that effective evaluation systems enable rapid iteration cycles and clear communication between teams and external partners like OpenAI for model improvement.

Evaluation-Driven Refactoring: How W&B Improved Their LLM Documentation Assistant Through Systematic Testing

Weights & Biases

Weights & Biases documented their journey refactoring Wandbot, their LLM-powered documentation assistant, achieving significant improvements in both accuracy (72% to 81%) and latency (84% reduction). The team initially attempted a "refactor-first, evaluate-later" approach but discovered the necessity of systematic evaluation throughout the process. Through methodical testing and iterative improvements, they replaced multiple components including switching from FAISS to ChromaDB for vector storage, transitioning to LangChain Expression Language (LCEL) for better async operations, and optimizing their RAG pipeline. Their experience highlighted the importance of continuous evaluation in LLM system development, with the team conducting over 50 unique evaluations costing approximately $2,500 to debug and optimize their refactored system.

Evolution from Centralized to Federated Generative AI Governance

Pictet AM

Pictet Asset Management faced the challenge of governing a rapidly proliferating landscape of generative AI use cases across marketing, compliance, investment research, and sales functions while maintaining regulatory compliance in the financial services industry. They initially implemented a centralized governance approach using a single AWS account with Amazon Bedrock, featuring a custom "Gov API" to track all LLM interactions. However, this architecture encountered resource limitations, cost allocation difficulties, and operational bottlenecks as the number of use cases scaled. The company pivoted to a federated model with decentralized execution but centralized governance, allowing individual teams to manage their own Bedrock services while maintaining cross-account monitoring and standardized guardrails. This evolution enabled better scalability, clearer cost ownership, and faster team iteration while preserving compliance and oversight capabilities.

Evolution from Monolithic to Task-Oriented LLM Pipelines in a Developer Assistant Product

Outropy

The case study details how Outropy evolved their LLM inference pipeline architecture while building an AI-powered assistant for engineering leaders. They started with simple pipelines for daily briefings and context-aware features, but faced challenges with context windows, relevance, and error cascades. The team transitioned from monolithic pipelines to component-oriented design, and finally to task-oriented pipelines using Temporal for workflow management. The product successfully scaled to 10,000 users and expanded from a Slack-only tool to a comprehensive browser extension.

Evolution from Task-Specific Models to Multi-Agent Orchestration Platform

AI21

AI21 Labs evolved their production AI systems from task-specific models (2022-2023) to RAG-as-a-Service, and ultimately to Maestro, a multi-agent orchestration platform. The company identified that while general-purpose LLMs demonstrated impressive capabilities, they weren't optimized for specific business use cases that enterprises actually needed, such as contextual question answering and summarization. AI21 developed smaller language models fine-tuned for specific tasks, wrapped them with pre- and post-processing operations (including hallucination filters), and eventually built a comprehensive RAG system when customers struggled to identify relevant context from large document corpora. The Maestro platform emerged to handle complex multi-hop queries by automatically breaking them into subtasks, parallelizing execution, and orchestrating multiple agents and tools, achieving dramatically improved quality with full traceability for enterprise requirements.

Evolution of AI Systems and LLMOps from Research to Production: Infrastructure Challenges and Application Design

NVIDA / Lepton

This lecture transcript from Yangqing Jia, VP at NVIDIA and founder of Lepton AI (acquired by NVIDIA), explores the evolution of AI system design from an engineer's perspective. The talk covers the progression from research frameworks (Caffe, TensorFlow, PyTorch) to production AI infrastructure, examining how LLM applications are built and deployed at scale. Jia discusses the emergence of "neocloud" infrastructure designed specifically for AI workloads, the challenges of GPU cluster management, and practical considerations for building consumer and enterprise LLM applications. Key insights include the trade-offs between open-source and closed-source models, the importance of RAG and agentic AI patterns, infrastructure design differences between conventional cloud and AI-specific platforms, and the practical challenges of operating LLMs in production, including supply chain management for GPUs and cost optimization strategies.

Evolution of Code Assistant Integration in a Cloud Development Platform

Val Town

Val Town's journey in implementing and evolving code assistance features showcases the challenges and opportunities in productionizing LLMs for code generation. Through iterative improvements and fast-following industry innovations, they progressed from basic ChatGPT integration to sophisticated features including error detection, deployment automation, and multi-file code generation, while addressing key challenges like generation speed and accuracy.

Evolution of Code Evaluation Benchmarks: From Single-Line Completion to Full Codebase Translation

Cursor

This research presentation details four years of work developing evaluation methodologies for coding LLMs across varying time horizons, from second-level code completions to hour-long codebase translations. The speaker addresses critical challenges in evaluating production coding AI systems including data contamination, insufficient test suites, and difficulty calibration. Key solutions include LiveCodeBench's dynamic evaluation approach with periodically updated problem sets, automated test generation using LLM-driven approaches, and novel reward hacking detection systems for complex optimization tasks. The work demonstrates how evaluation infrastructure must evolve alongside model capabilities, incorporating intermediate grading signals, latency-aware metrics, and LLM-as-judge approaches to detect non-idiomatic coding patterns that pass traditional tests but fail real-world quality standards.

Evolution of ML Model Deployment Infrastructure at Scale

Faire

Faire, a wholesale marketplace, evolved their ML model deployment infrastructure from a monolithic approach to a streamlined platform. Initially struggling with slow deployments, limited testing, and complex workflows across multiple systems, they developed an internal Machine Learning Model Management (MMM) tool that unified model deployment processes. This transformation reduced deployment time from 3+ days to 4 hours, enabled safe deployments with comprehensive testing, and improved observability while supporting various ML workloads including LLMs.

Evolving LLMOps Architecture for Enterprise Supplier Discovery

Various

A detailed case study of implementing LLMs in a supplier discovery product at Scoutbee, evolving from simple API integration to a sophisticated LLMOps architecture. The team tackled challenges of hallucinations, domain adaptation, and data quality through multiple stages: initial API integration, open-source LLM deployment, RAG implementation, and finally a comprehensive data expansion phase. The result was a production-ready system combining knowledge graphs, Chain of Thought prompting, and custom guardrails to provide reliable supplier discovery capabilities.

Evolving ML Infrastructure for Production Systems: From Traditional ML to LLMs

Doordash

A comprehensive overview of ML infrastructure evolution and LLMOps practices at major tech companies, focusing on Doordash's approach to integrating LLMs alongside traditional ML systems. The discussion covers how ML infrastructure needs to adapt for LLMs, the importance of maintaining guard rails, and strategies for managing errors and hallucinations in production systems, while balancing the trade-offs between traditional ML models and LLMs in production environments.

Federal Government AI Platform Adoption and Scalability Initiatives

Various

The U.S. federal government agencies are working to move AI applications from pilots to production, focusing on scalable and responsible deployment. The Department of Energy (DOE) has implemented Energy GPT using open models in their environment, while the Department of State is utilizing LLMs for diplomatic cable summarization. The U.S. Navy's Project AMMO showcases successful MLOps implementation, reducing model retraining time from six months to one week for underwater vehicle operations. Agencies are addressing challenges around budgeting, security compliance, and governance while ensuring user-friendly AI implementations.

Financial Transaction Categorization at Scale Using LLMs and Custom Embeddings

Mercado Libre

Mercado Libre (MELI) faced the challenge of categorizing millions of financial transactions across Latin America in multiple languages and formats as Open Finance unlocked access to customer financial data. Starting with a brittle regex-based system in 2021 that achieved only 60% accuracy and was difficult to maintain, they evolved through three generations: first implementing GPT-3.5 Turbo in 2023 to achieve 80% accuracy with 75% cost reduction, then transitioning to GPT-4o-mini in 2024, and finally developing custom BERT-based semantic embeddings trained on regional financial text to reach 90% accuracy with an additional 30% cost reduction. This evolution enabled them to scale from processing tens of millions of transactions per quarter to tens of millions per week, while enabling near real-time categorization that powers personalized financial insights across their ecosystem.

Fine-tuned LLM Deployment for Automotive Customer Engagement

Impel

Impel, an automotive retail AI company, migrated from a third-party LLM to a fine-tuned Meta Llama model deployed on Amazon SageMaker to power their Sales AI product, which provides 24/7 personalized customer engagement for dealerships. The transition addressed cost predictability concerns and customization limitations, resulting in 20% improved accuracy across core features including response personalization, conversation summarization, and follow-up generation, while achieving better security and operational control.

Fine-Tuned LLM Deployment for Insurance Document Processing

Roots

Roots, an insurance AI company, developed and deployed fine-tuned 7B Mistral models in production using the vLLM framework to process insurance documents for entity extraction, classification, and summarization. The company evaluated multiple inference frameworks and selected vLLM for its performance advantages, achieving up to 130 tokens per second throughput on A100 GPUs with the ability to handle 32 concurrent requests. Their fine-tuned models outperformed GPT-4 on specialized insurance tasks while providing cost-effective processing at $30,000 annually for handling 20-30 million documents, demonstrating the practical benefits of self-hosting specialized models over relying on third-party APIs.

Fine-tuning and Deploying LLMs for Customer Service Contact Centers

Swisscom

Swisscom, a leading telecommunications provider in Switzerland, partnered with AWS to deploy fine-tuned large language models in their customer service contact centers to enable personalized, fast, and efficient customer interactions. The problem they faced was providing 24/7 customer service with high accuracy, low latency (critical for voice interactions), and the ability to handle hundreds of requests per minute during peak times while maintaining control over the model lifecycle. Their solution involved using AWS SageMaker to fine-tune a smaller LLM (Llama 3.1 8B) using synthetic data generated by a larger teacher model, implementing LoRA for efficient training, and deploying the model with infrastructure-as-code using AWS CDK. The results achieved median latency below 250 milliseconds in production, accuracy comparable to larger models, cost-efficient scaling with hourly infrastructure charging instead of per-token pricing, and successful handling of 50% of production traffic with the ability to scale for unexpected peaks.

Fine-Tuning and Multi-Stage Model Optimization for Financial AI Agents

Robinhood Markets

Robinhood Markets developed a sophisticated LLMOps platform to deploy AI agents serving millions of users across multiple use cases including customer support, content generation (Cortex Digest), and code generation (custom indicators and scans). To address the "generative AI trilemma" of balancing cost, quality, and latency in production, they implemented a hierarchical tuning approach starting with prompt optimization, progressing to trajectory tuning with dynamic few-shot examples, and culminating in LoRA-based fine-tuning. Their CX AI agent achieved over 50% latency reduction (from 3-6 seconds to under 1 second) while maintaining quality parity with frontier models, supported by a comprehensive three-layer evaluation system combining LLM-as-judge, human feedback, and task-specific metrics.

Fine-Tuning and Quantizing LLMs for Dynamic Attribute Extraction

Mercari

Mercari tackled the challenge of extracting dynamic attributes from user-generated marketplace listings by fine-tuning a 2B parameter LLM using QLoRA. The team successfully created a model that outperformed GPT-3.5-turbo while being 95% smaller and 14 times more cost-effective. The implementation included careful dataset preparation, parameter efficient fine-tuning, and post-training quantization using llama.cpp, resulting in a production-ready model with better control over hallucinations.

Fine-tuning and Scaling LLMs for Search Relevance Prediction

Faire

Faire, an e-commerce marketplace, tackled the challenge of evaluating search relevance at scale by transitioning from manual human labeling to automated LLM-based assessment. They first implemented a GPT-based solution and later improved it using fine-tuned Llama models. Their best performing model, Llama3-8b, achieved a 28% improvement in relevance prediction accuracy compared to their previous GPT model, while significantly reducing costs through self-hosted inference that can handle 70 million predictions per day using 16 GPUs.

Fine-tuning Custom Embedding Models for Enterprise Search

Glean

Glean implements enterprise search and RAG systems by developing custom embedding models for each customer. They tackle the challenge of heterogeneous enterprise data by using a unified data model and fine-tuning embedding models through continued pre-training and synthetic data generation. Their approach combines traditional search techniques with semantic search, achieving a 20% improvement in search quality over 6 months through continuous learning from user feedback and company-specific language adaptation.

Fine-Tuning LLMs for Multi-Agent Orchestration in Code Generation

Cosine

Cosine, a company building enterprise coding agents, faced the challenge of deploying high-performance AI systems in highly constrained environments including on-premise and air-gapped deployments where large frontier models were not viable. They developed a multi-agent architecture using specialized orchestrator and worker models, leveraging model distillation, supervised fine-tuning, preference optimization, and reinforcement fine-tuning to create smaller models that could match or exceed the performance of much larger models. The result was a 31% performance increase on the SWE-bench Freelancer benchmark, 3X latency improvement, 60% reduction in GPU footprint, and 20% fewer errors in generated code, all while operating on as few as 4 H100 GPUs and maintaining full deployment flexibility across cloud, VPC, and on-premise environments.

Fine-tuning Mistral 7B for Multilingual Defense Intelligence Sentiment Analysis

Vannevar Labs

Vannevar Labs needed to improve their sentiment analysis capabilities for defense intelligence across multiple languages, finding that GPT-4 provided insufficient accuracy (64%) and high costs. Using Databricks Mosaic AI, they successfully fine-tuned a Mistral 7B model on domain-specific data, achieving 76% accuracy while reducing latency by 75%. The entire process from development to deployment took only two weeks, enabling efficient processing of multilingual content for defense-related applications.

Fine-tuning Multimodal Models for Banking Document Processing

Apoidea Group

Apoidea Group tackled the challenge of efficiently processing banking documents by developing a solution using multimodal large language models. They fine-tuned the Qwen2-VL-7B-Instruct model using LLaMA-Factory on Amazon SageMaker HyperPod to enhance visual information extraction from complex banking documents. The solution significantly improved table structure recognition accuracy from 23.4% to 81.1% TEDS score, approaching the performance of more advanced models while maintaining computational efficiency. This enabled reduction of financial spreading process time from 4-6 hours to just 10 minutes.

Forward Deployed Engineering: Bringing Enterprise LLM Applications to Production

OpenAI

OpenAI's Forward Deployed Engineering (FDE) team, led by Colin Jarvis, embeds with enterprise customers to solve high-value problems using LLMs and deliver production-grade AI applications. The team focuses on problems worth tens of millions to billions in value, working with companies across industries including finance (Morgan Stanley), manufacturing (semiconductors, automotive), telecommunications (T-Mobile, Klarna), and others. By deeply understanding customer domains, building evaluation frameworks, implementing guardrails, and iterating with users over months, the FDE team achieves 20-50% efficiency improvements and high adoption rates (98% at Morgan Stanley). The approach emphasizes solving hard, novel problems from zero-to-one, extracting learnings into reusable products and frameworks (like Swarm and Agent Kit), then scaling solutions across the market while maintaining strategic focus on product development over services revenue.

Foundation Model for Ads Recommendation at Scale

Meta

Meta developed GEM (Generative Ads Recommendation Model), an LLM-scale foundation model trained on thousands of GPUs to enhance ads recommendation across Facebook and Instagram. The model addresses challenges of sparse signals in billions of daily user-ad interactions, diverse multimodal data, and efficient large-scale training. GEM achieves 4x efficiency improvement over previous models through novel architecture innovations including stackable factorization machines, pyramid-parallel sequence processing, and cross-feature learning. The system employs sophisticated post-training knowledge transfer techniques achieving 2x the effectiveness of standard distillation, propagating learnings across hundreds of vertical models. Since launch in early 2025, GEM delivered a 5% increase in ad conversions on Instagram and 3% on Facebook Feed in Q2, with Q3 architectural improvements doubling performance gains from additional compute and data.

Foundation Model for Large-Scale Personalized Recommendation

Netflix

Netflix developed a foundation model approach to centralize and scale their recommendation system, transitioning from multiple specialized models to a unified architecture. The system processes hundreds of billions of user interactions, employing sophisticated tokenization, sparse attention mechanisms, and incremental training to handle cold-start problems and new content. The model demonstrates successful scaling properties similar to LLMs, while maintaining production-level latency requirements and addressing unique challenges in recommendation systems.

Foundation Model for Personalized Recommendation at Scale

Netflix

Netflix developed a foundation model for personalized recommendations to address the maintenance complexity and inefficiency of operating numerous specialized recommendation models. The company built a large-scale transformer-based model inspired by LLM paradigms that processes hundreds of billions of user interactions from over 300 million users, employing autoregressive next-token prediction with modifications for recommendation-specific challenges. The foundation model enables centralized member preference learning that can be fine-tuned for specific tasks, used directly for predictions, or leveraged through embeddings, while demonstrating clear scaling law benefits as model and data size increase, ultimately improving recommendation quality across multiple downstream applications.

Foundation Model for Unified Personalization at Scale

Netflix

Netflix developed a unified foundation model based on transformer architecture to consolidate their diverse recommendation systems, which previously consisted of many specialized models for different content types, pages, and use cases. The foundation model uses autoregressive transformers to learn user representations from interaction sequences, incorporating multi-token prediction, multi-layer representation, and long context windows. By scaling from millions to billions of parameters over 2.5 years, they demonstrated that scaling laws apply to recommendation systems, achieving notable performance improvements while creating high leverage across downstream applications through centralized learning and easier fine-tuning for new use cases.

Framework for Evaluating LLM Production Use Cases

Scale Venture Partners

Barak Turovsky, drawing from his experience leading Google Translate and other AI initiatives, presents a framework for evaluating LLM use cases in production. The framework analyzes use cases based on two key dimensions: accuracy requirements and fluency needs, along with consideration of stakes involved. This helps organizations determine which applications are suitable for current LLM deployment versus those that need more development. The framework suggests creative and workplace productivity applications are better immediate fits for LLMs compared to high-stakes information/decision support use cases.

From Mega-Prompts to Production: Lessons Learned Scaling LLMs in Enterprise Customer Support

GoDaddy

GoDaddy has implemented large language models across their customer support infrastructure, particularly in their Digital Care team which handles over 60,000 customer contacts daily through messaging channels. Their journey implementing LLMs for customer support revealed several key operational insights: the need for both broad and task-specific prompts, the importance of structured outputs with proper validation, the challenges of prompt portability across models, the necessity of AI guardrails for safety, handling model latency and reliability issues, the complexity of memory management in conversations, the benefits of adaptive model selection, the nuances of implementing RAG effectively, optimizing data for RAG through techniques like Sparse Priming Representations, and the critical importance of comprehensive testing approaches. Their experience demonstrates both the potential and challenges of operationalizing LLMs in a large-scale enterprise environment.

From Pilot to Profit: Three Enterprise GenAI Case Studies in Manufacturing, Aviation, and Telecommunications

Various

A comprehensive analysis of three enterprise GenAI implementations showcasing the journey from pilot to profit. The cases cover a top 10 automaker's use of GenAI for manufacturing maintenance, an aviation entertainment company's predictive maintenance system, and a telecom provider's sales automation solution. Each case study reveals critical "hidden levers" for successful GenAI deployment: adoption triggers, lean workflows, and revenue accelerators. The analysis demonstrates that while GenAI projects typically cost between $200K to $1M and take 15-18 months to achieve ROI, success requires careful attention to implementation details, user adoption, and business process integration.

GenAI Agent for Partner-Guest Messaging Automation

Booking.com

Booking.com developed a GenAI agent to assist accommodation partners in responding to guest inquiries more efficiently. The problem was that manual responses through their messaging platform were time-consuming, especially during busy periods, potentially leading to delayed responses and lost bookings. The solution involved building a tool-calling agent using LangGraph and GPT-4 Mini that can suggest relevant template responses, generate custom free-text answers, or abstain from responding when appropriate. The system includes guardrails for PII redaction, retrieval tools using embeddings for template matching, and access to property and reservation data. Early results show the system handles tens of thousands of daily messages, with pilots demonstrating 70% improvement in user satisfaction, reduced follow-up messages, and faster response times.

GenAI-Powered Accessory Recommendations for Large-Scale E-commerce Catalog

Target

Target's Product Recommendations Team developed GRAM (GenAI-based Related Accessory Model) to address the challenge of recommending appropriate accessories across their vast Electronics and Home categories. The system uses LLMs to automatically analyze product attributes, assign importance weights to different attribute combinations, and generate aesthetic matches that consider color harmony and stylistic coherence. By incorporating human-in-the-loop processes with site merchant insights, the solution balances algorithmic recommendations with cross-category expertise. An A/B test conducted in February 2025 showed approximately 11% increase in interaction rate, 12% increase in display-to-conversion rates, and over 9% growth in attributable demand. The model was fully rolled out to production in April 2025.

GenAI-Powered Document Classification for Community Management

Associa

Associa, North America's largest community management company managing 48 million documents across 26 TB of data, faced significant operational inefficiencies due to manual document classification processes that consumed employee hours and created bottlenecks. Collaborating with the AWS Generative AI Innovation Center, Associa built a generative AI-powered document classification system using Amazon Bedrock and the GenAI IDP Accelerator. The solution achieved 95% classification accuracy across eight document types at an average cost of 0.55 cents per document, using Amazon Nova Pro with a first-page-only approach combined with OCR and image inputs. The system processes documents automatically, integrates seamlessly into existing workflows, and delivers substantial cost savings while reducing manual classification effort and improving operational efficiency.

GenAI-Powered Invoice Document Processing and Automation

Uber

Uber faced significant challenges processing a high volume of invoices daily from thousands of global suppliers, with diverse formats, 25+ languages, and varying templates requiring substantial manual intervention. The company developed TextSense, a GenAI-powered document processing platform that leverages OCR, computer vision, and large language models (specifically OpenAI GPT-4 after evaluating multiple options including fine-tuned Llama 2 and Flan T5) to automate invoice data extraction. The solution achieved 90% overall accuracy, reduced manual processing by 2x, cut average handling time by 70%, and delivered 25-30% cost savings compared to manual processes, while providing a scalable, configuration-driven platform adaptable to diverse document types.

Generative AI Contact Center Solution with Amazon Bedrock and Claude

DoorDash

DoorDash implemented a generative AI-powered self-service contact center solution using Amazon Bedrock, Amazon Connect, and Anthropic's Claude to handle hundreds of thousands of daily support calls. The solution leverages RAG with Knowledge Bases for Amazon Bedrock to provide accurate responses to Dasher inquiries, achieving response latency of 2.5 seconds or less. The implementation reduced development time by 50% and increased testing capacity 50x through automated evaluation frameworks.

Generative AI Customer Service Agent Assist with RAG Implementation

Newday

NewDay, a UK financial services company handling 2.5 million customer calls annually, developed NewAssist, a real-time generative AI assistant to help customer service agents quickly find answers from nearly 200 knowledge articles. Starting as a hackathon project, the solution evolved from a voice assistant concept to a chatbot implementation using Amazon Bedrock and Claude 3 Haiku. Through iterative experimentation and custom data processing, the team achieved over 90% accuracy, reducing answer retrieval time from 90 seconds to 4 seconds while maintaining costs under $400 per month using a serverless AWS architecture.

Generative AI-Powered Intelligent Document Processing for Healthcare Operations

Myriad Genetics

Myriad Genetics, a genetic testing and precision medicine provider, faced challenges processing thousands of healthcare documents daily with their existing Amazon Comprehend and Amazon Textract solution, which cost $15,000 monthly per business unit with 8.5-minute processing times and required manual information extraction involving up to 10 full-time employees. Partnering with AWS Generative AI Innovation Center, they deployed the open-source GenAI IDP Accelerator using Amazon Bedrock with Amazon Nova models, implementing advanced prompt engineering techniques including AI-driven prompt engineering, negative prompting, few-shot learning, and chain-of-thought reasoning. The solution increased classification accuracy from 94% to 98%, reduced classification costs by 77%, decreased processing time by 80% (from 8.5 to 1.5 minutes), and automated key information extraction at 90% accuracy, projected to save $132K annually while reducing prior authorization processing time by 2 minutes per submission.

GitHub Copilot Deployment at Scale: Enhancing Developer Productivity

Mercado Libre

Mercado Libre, Latin America's largest e-commerce platform, implemented GitHub Copilot across their development team of 9,000+ developers to address the need for more efficient development processes. The solution resulted in approximately 50% reduction in code writing time, improved developer satisfaction, and enhanced productivity by automating repetitive tasks. The implementation was part of a broader GitHub Enterprise strategy that includes security features and automated workflows.

Global News Organization's AI-Powered Content Production and Verification System

Reuters

Reuters has implemented a comprehensive AI strategy to enhance its global news operations, focusing on reducing manual work, augmenting content production, and transforming news delivery. The organization developed three key tools: a press release fact extraction system, an AI-integrated CMS called Leon, and a content packaging tool called LAMP. They've also launched the Reuters AI Suite for clients, offering transcription and translation capabilities while maintaining strict ethical guidelines around AI-generated imagery and maintaining journalistic integrity.

Google Photos Magic Editor: Transitioning from On-Device ML to Cloud-Based Generative AI for Image Editing

Google

Google Photos evolved from using on-device machine learning models for basic image editing features like background blur and object removal to implementing cloud-based generative AI for their Magic Editor feature. The team transitioned from small, specialized models (10MB) running locally on devices to large-scale generative models hosted in the cloud to enable more sophisticated image editing capabilities like scene reimagination, object relocation, and advanced inpainting. This shift required significant changes in infrastructure, capacity planning, evaluation methodologies, and user experience design while maintaining focus on grounded, memory-preserving edits rather than fantastical image generation.

GPU Resource Optimization for Multi-Model LLM Deployment

Salesforce

Salesforce's AI Platform team addressed the challenge of inefficient GPU utilization and high costs when hosting multiple proprietary large language models (LLMs) including CodeGen on Amazon SageMaker. They implemented SageMaker AI inference components to deploy multiple foundation models on shared endpoints with granular resource allocation, enabling dynamic scaling and intelligent model packing. This solution achieved up to an eight-fold reduction in deployment and infrastructure costs while maintaining high performance standards, allowing smaller models to efficiently utilize high-performance GPUs and optimizing resource allocation across their diverse model portfolio.

Hardening AI Agents for E-commerce at Scale: Multi-Company Perspectives on RL Alignment and Reliability

Prosus / Microsoft / Inworld AI / IUD

This panel discussion features experts from Microsoft, Google Cloud, InWorld AI, and Brazilian e-commerce company IUD (Prosus partner) discussing the challenges of deploying reliable AI agents for e-commerce at scale. The panelists share production experiences ranging from Google Cloud's support ticket routing agent that improved policy adherence from 45% to 90% using DPO adapters, to Microsoft's shift away from prompt engineering toward post-training methods for all Copilot models, to InWorld AI's voice agent architecture optimization through cascading models, and IUD's struggles with personalization balance in their multi-channel shopping agent. Key challenges identified include model localization for UI elements, cost efficiency, real-time voice adaptation, and finding the right balance between automation and user control in commerce experiences.

Harness Engineering for Agentic Coding Systems

Langchain

LangChain improved their coding agent (deepagents-cli) from 52.8% to 66.5% on Terminal Bench 2.0, advancing from Top 30 to Top 5 performance, solely through harness engineering without changing the underlying model (gpt-5.2-codex). The solution focused on three key areas: system prompts emphasizing self-verification loops, enhanced tools and context injection to help agents understand their environment, and middleware hooks to detect problematic patterns like doom loops. The approach leveraged LangSmith tracing at scale to identify failure modes and iteratively optimize the harness through automated trace analysis, demonstrating that systematic engineering around the model can yield significant performance improvements in production agentic systems.

Healthcare Conversational AI and Multi-Model Cost Management in Production

Amberflo / Interactly.ai

A panel discussion featuring Interactly.ai's development of conversational AI for healthcare appointment management, and Amberflo's approach to usage tracking and cost management for LLM applications. The case study explores how Interactly.ai handles the challenges of deploying LLMs in healthcare settings with privacy and latency constraints, while Amberflo addresses the complexities of monitoring and billing for multi-model LLM applications in production.

Hierarchical Multi-Task Learning for Intent Prediction in Recommender Systems

Netflix

Netflix developed FM-Intent, a novel recommendation model that enhances their existing foundation model by incorporating hierarchical multi-task learning to predict user session intent alongside next-item recommendations. The problem addressed was that while their foundation model successfully predicted what users might watch next, it lacked understanding of underlying user intents (such as discovering new content versus continuing existing viewing, genre preferences, and content type preferences). FM-Intent solves this by establishing a hierarchical relationship where intent predictions inform item recommendations, using Transformer encoders to process interaction metadata and attention-based aggregation to combine multiple intent signals. The solution demonstrated a statistically significant 7.4% improvement in next-item prediction accuracy compared to the previous state-of-the-art baseline (TransAct) in offline experiments, and has been successfully integrated into Netflix's production recommendation ecosystem for applications including personalized UI optimization, analytics, and enhanced recommendation signals.

High-Performance AI Network Infrastructure for Distributed Training at Scale

Meta

Meta faced significant challenges with AI model training as checkpoint data grew from hundreds of gigabytes to tens of terabytes, causing network bottlenecks and GPU idle time. Their solution involved implementing bidirectional multi-NIC utilization through ECMP-based load balancing for egress traffic and BGP-based virtual IP injection for ingress traffic, enabling optimal use of all available network interfaces. The implementation resulted in dramatic performance improvements, reducing job read latency from 300 seconds to 1 second and checkpoint loading time from 800 seconds to 100 seconds, while achieving 4x throughput improvement through proper traffic distribution across multiple network interfaces.

High-Performance GPU Memory Transfer Optimization for Large Language Models

Perplexity

A technical exploration of achieving high-performance GPU memory transfer speeds (up to 3200 Gbps) on AWS SageMaker Hyperpod infrastructure, demonstrating the critical importance of optimizing memory bandwidth for large language model training and inference workloads.

High-Performance LLM Deployment with SageMaker AI

Salesforce

Salesforce's AI Model Serving team tackled the challenge of deploying and optimizing large language models at scale while maintaining performance and security. Using Amazon SageMaker AI and Deep Learning Containers, they developed a comprehensive hosting framework that reduced model deployment time by 50% while achieving high throughput and low latency. The solution incorporated automated testing, security measures, and continuous optimization techniques to support enterprise-grade AI applications.

Hybrid AI System for Large-Scale Product Categorization

Walmart

Walmart developed Ghotok, an innovative AI system that combines predictive and generative AI to improve product categorization across their digital platforms. The system addresses the challenge of accurately mapping relationships between product categories and types across 400 million SKUs. Using an ensemble approach with both predictive and generative AI models, along with sophisticated caching and deployment strategies, Ghotok successfully reduces false positives and improves the efficiency of product categorization while maintaining fast response times in production.

Implementing LLMOps in Restricted Networks with Long-Running Evaluations

Microsoft

A case study detailing Microsoft's experience implementing LLMOps in a restricted network environment using Azure Machine Learning. The team faced challenges with long-running evaluations (6+ hours) and network restrictions, developing solutions including opt-out mechanisms for lengthy evaluations, implementing Git Flow for controlled releases, and establishing a comprehensive CI/CE/CD pipeline. Their approach balanced the needs of data scientists, engineers, and platform teams while maintaining security and evaluation quality.

Implementing MCP Gateway for Large-Scale LLM Integration Infrastructure

Anthropic

Anthropic faced the challenge of managing an explosion of LLM-powered services and integrations across their organization, leading to duplicated functionality and integration chaos. They solved this by implementing a standardized MCP (Model Context Protocol) gateway that provides a single point of entry for all LLM integrations, handling authentication, credential management, and routing to both internal and external services. This approach reduced engineering overhead, improved security by centralizing credential management, and created a "pit of success" where doing the right thing became the easiest thing to do for their engineering teams.

Implementing RAG for Call Center Operations with Hybrid Data Sources

Manulife

Manulife implemented a Retrieval Augmented Generation (RAG) system in their call center to help customer service representatives quickly access and utilize information from both structured and unstructured data sources. They developed an innovative approach combining document chunks and structured data embeddings, achieving an optimized response time of 7.33 seconds in production. The system successfully handles both policy documents and database information, using GPT-3.5 for answer generation with additional validation from Llama 3 or GPT-4.

Implementing RAG for Enhanced Customer Care at Scale

Doctolib

Doctolib, a European e-health company, implemented a RAG-based system to improve their customer care services. Using GPT-4 hosted on Azure OpenAI, combined with OpenSearch as a vector database and a custom reranking system, they achieved a 20% reduction in customer care cases. The system includes comprehensive evaluation metrics through the Ragas framework, and overcame significant latency challenges to achieve response times under 5 seconds. While successful, they identified limitations with complex queries that led them to explore agentic frameworks as a next step.

Improving Contextual Understanding in GitHub Copilot Through Advanced Prompt Engineering

Github

GitHub's machine learning team enhanced GitHub Copilot's contextual understanding through several key innovations: implementing Fill-in-the-Middle (FIM) paradigm, developing neighboring tabs functionality, and extensive prompt engineering. These improvements led to significant gains in suggestion accuracy, with FIM providing a 10% boost in completion acceptance rates and neighboring tabs yielding a 5% increase in suggestion acceptance.

Improving GitHub Copilot's Contextual Understanding Through Advanced Prompt Engineering and Retrieval

GitHub

GitHub's machine learning team worked to enhance GitHub Copilot's contextual understanding of code to provide more relevant AI-powered coding suggestions. The problem was that large language models could only process limited context (approximately 6,000 characters), making it challenging to leverage all relevant information from a developer's codebase. The solution involved sophisticated prompt engineering, implementing neighboring tabs to process multiple open files, introducing a Fill-In-the-Middle (FIM) paradigm to consider code both before and after the cursor, and experimenting with vector databases and embeddings for semantic code retrieval. These improvements resulted in measurable gains: neighboring tabs provided a 5% relative increase in suggestion acceptance, FIM yielded a 10% relative boost in performance, and the overall enhancements contributed to developers coding up to 55% faster when using GitHub Copilot.

Improving LLM Accuracy and Evaluation in Enterprise Customer Analytics

Various

Echo.ai and Log10 partnered to solve accuracy and evaluation challenges in deploying LLMs for enterprise customer conversation analysis. Echo.ai's platform analyzes millions of customer conversations using multiple LLMs, while Log10 provides infrastructure for improving LLM accuracy through automated feedback and evaluation. The partnership resulted in a 20-point F1 score increase in accuracy and enabled Echo.ai to successfully deploy large enterprise contracts with improved prompt optimization and model fine-tuning.

Improving Local Search with Multimodal LLMs and Vector Search

OfferUp

OfferUp transformed their traditional keyword-based search system to a multimodal search solution using Amazon Bedrock's Titan Multimodal Embeddings and Amazon OpenSearch Service. The new system processes both text and images to generate vector embeddings, enabling more contextually relevant search results. The implementation led to significant improvements, including a 27% increase in relevance recall, 54% reduction in geographic spread for more local results, and a 6.5% increase in search depth.

Incremental LLM Adoption Strategy in Email Processing API Platform

Nylas

Nylas, an email/calendar/contacts API platform provider, implemented a systematic three-month strategy to integrate LLMs into their production systems. They started with development workflow automation using multi-agent systems, enhanced their annotation processes with LLMs, and finally integrated LLMs as a fallback mechanism in their core email processing product. This measured approach resulted in 90% reduction in bug tickets, 20x cost savings in annotation, and successful deployment of their own LLM infrastructure when usage reached cost-effective thresholds.

Infrastructure Challenges and Solutions for Agentic AI Systems in Production

Meta / Google / Monte Carlo / Microsoft

A panel discussion featuring experts from Meta, Google, Monte Carlo, and Microsoft examining the fundamental infrastructure challenges that arise when deploying autonomous AI agents in production environments. The discussion covers how agentic workloads differ from traditional software systems, requiring new approaches to networking, load balancing, caching, security, and observability, while highlighting specific challenges like non-deterministic behavior, massive search spaces, and the need for comprehensive evaluation frameworks to ensure reliable and secure AI agent operations at scale.

Infrastructure for AI Agents: Panel Discussion on Production Challenges and Solutions

Various

This panel discussion brings together infrastructure experts from Groq, NVIDIA, Lambda, and AMD to discuss the unique challenges of deploying AI agents in production. The panelists explore how agentic AI differs from traditional AI workloads, requiring significantly higher token generation, lower latency, and more diverse infrastructure spanning edge to cloud. They discuss the evolution from training-focused to inference-focused infrastructure, emphasizing the need for efficiency at scale, specialized hardware optimization, and the importance of smaller distilled models over large monolithic models. The discussion highlights critical operational challenges including power delivery, thermal management, and the need for full-stack engineering approaches to debug and optimize agentic systems in production environments.

Infrastructure Noise in Agentic Coding Evaluations

Anthropic

Anthropic discovered that infrastructure configuration alone can produce differences in agentic coding benchmark scores that exceed the typical margins between top models on leaderboards. Through systematic experiments running Terminal-Bench 2.0 across six resource configurations on Google Kubernetes Engine, they found a 6 percentage point gap between the most- and least-resourced setups. The research revealed that while moderate resource headroom (up to 3x specifications) primarily improves infrastructure stability by preventing spurious failures, more generous allocations actively help agents solve problems they couldn't solve before. These findings challenge the notion that small leaderboard differences represent pure model capability measurements and led to recommendations for specifying both guaranteed allocations and hard kill thresholds, calibrating resource bands empirically, and treating resource configuration as a first-class experimental variable in LLMOps practices.

Integrating Foundation Models into Production Personalization Systems

Netflix

Netflix developed a centralized foundation model for personalization to replace multiple specialized models powering their homepage recommendations. Rather than maintaining numerous individual models, they created one powerful transformer-based model trained on comprehensive user interaction histories and content data at scale. The challenge then became how to effectively integrate this large foundation model into existing production systems. Netflix experimented with and deployed three distinct integration approaches—embeddings via an Embedding Store, using the model as a subgraph within downstream models, and direct fine-tuning for specific applications—each with different tradeoffs in terms of latency, computational cost, freshness, and implementation complexity. These approaches are now used in production across different Netflix personalization use cases based on their specific requirements.

Integrating Gemini for Natural Language Analytics in IoT Fleet Management

Cox 2M

Cox 2M, facing challenges with a lean analytics team and slow insight generation (taking up to a week per request), partnered with Thoughtspot and Google Cloud to implement Gemini-powered natural language analytics. The solution reduced time to insights by 88% while enabling non-technical users to directly query complex IoT and fleet management data using natural language. The implementation includes automated insight generation, change analysis, and natural language processing capabilities.

Integrating Symbolic Reasoning with LLMs for AI-Native Telecom Infrastructure

Ericsson

Ericsson's System Comprehension Lab is exploring the integration of symbolic reasoning capabilities into telecom-oriented large language models to address critical limitations in current LLM architectures for telecommunications infrastructure management. The problem centers on LLMs' inability to provide deterministic, explainable reasoning required for telecom network optimization, security, and anomaly detection—domains where hallucinations, lack of logical consistency, and black-box behavior are unacceptable. The proposed solution involves hybrid neural-symbolic AI architectures that combine the pattern recognition strengths of transformer-based LLMs with rule-based reasoning engines, connected through techniques like symbolic chain-of-thought prompting, program-aided reasoning, and external solver integration. This approach aims to enable AI-native wireless systems for 6G infrastructure that can perform cross-layer optimization, real-time decision-making, and intent-driven network management while maintaining the explainability and logical rigor demanded by production telecom environments.

Iterative Development Process for Production AI Features

Zapier

Zapier's journey in developing and deploying AI products demonstrates a pragmatic, iterative approach to LLMOps. Their methodology focuses on rapid prototyping with advanced models like GPT-4 Turbo and Claude Opus, followed by quick deployment of initial versions (even with sub-50% accuracy), systematic collection of user feedback, and establishment of comprehensive evaluation frameworks. This approach has enabled them to improve their AI products from sub-50% to over 90% accuracy within 2-3 months, while successfully managing costs and maintaining product quality.

Journey Towards Autonomous Network Operations with AI/ML and Dark NOC

BT

BT is undertaking a major transformation of their network operations, moving from traditional telecom engineering to a software-driven approach with the goal of creating an autonomous "Dark NOC" (Network Operations Center). The initiative focuses on handling massive amounts of network data, implementing AI/ML for automated analysis and decision-making, and consolidating numerous specialized tools into a comprehensive intelligent system. The project involves significant organizational change, including upskilling teams and partnering with AWS to build data foundations and AI capabilities for predictive maintenance and autonomous network management.

JUDE: Large-Scale LLM-Based Embedding Generation for Job Recommendations

LinkedIn

LinkedIn developed JUDE (Job Understanding Data Expert), a production platform that leverages fine-tuned large language models to generate high-quality embeddings for job recommendations at scale. The system addresses the computational challenges of LLM deployment through a multi-component architecture including fine-tuned representation learning, real-time embedding generation, and comprehensive serving infrastructure. JUDE replaced standardized features in job recommendation models, resulting in +2.07% qualified applications, -5.13% dismiss-to-apply ratio, and +1.91% total job applications - representing the highest metric improvement from a single model change observed by the team.

Kubernetes as a Platform for LLM Operations: Practical Experiences and Trade-offs

Various

A panel discussion between experienced Kubernetes and ML practitioners exploring the challenges and opportunities of running LLMs on Kubernetes. The discussion covers key aspects including GPU management, cost optimization, training vs inference workloads, and architectural considerations. The panelists share insights from real-world implementations while highlighting both benefits (like workload orchestration and vendor agnosticism) and challenges (such as container sizes and startup times) of using Kubernetes for LLM operations.

Large Foundation Model for Unified Recommendation and Ranking at Scale

LinkedIn

LinkedIn developed a large foundation model called "Brew XL" with 150 billion parameters to unify all personalization and recommendation tasks across their platform, addressing the limitations of task-specific models that operate in silos. The solution involved training a massive language model on user interaction data through "promptification" techniques, then distilling it down to smaller, production-ready models (3B parameters) that could serve high-QPS recommendation systems with sub-second latency. The system demonstrated zero-shot capabilities for new tasks, improved performance on cold-start users, and achieved 7x latency reduction with 30x throughput improvement through optimization techniques including distillation, pruning, quantization, and sparsification.

Large Language Models for Game Player Sentiment Analysis and Retention

SEGA Europe

SEGA Europe faced challenges managing data from 50,000 events per second across 40 million players, making it difficult to derive actionable insights. They implemented a sentiment analysis LLM system on the Databricks platform that processes over 10,000 user reviews daily to identify and address gameplay issues. This led to up to 40% increase in player retention and significantly faster time to insight through AI-powered analytics.

Large Language Models for Search Relevance via Knowledge Distillation

Pinterest

Pinterest tackled the challenge of improving search relevance by implementing a large language model-based system. They developed a cross-encoder LLM teacher model trained on human-annotated data, which was then distilled into a lightweight student model for production deployment. The system processes rich Pin metadata including titles, descriptions, and synthetic image captions to predict relevance scores. The implementation resulted in a 2.18% improvement in search feed relevance (nDCG@20) and over 1.5% increase in search fulfillment rates globally, while successfully generalizing across multiple languages despite being trained primarily on US data.

Large Language Models in Production Round Table Discussion: Latency, Cost and Trust Considerations

Various

A panel of experts from various companies and backgrounds discusses the challenges and solutions of deploying LLMs in production. They explore three main themes: latency considerations in LLM deployments, cost optimization strategies, and building trust in LLM systems. The discussion includes practical examples from Digits, which uses LLMs for financial document processing, and insights from other practitioners about model optimization, deployment strategies, and the evolution of LLM architectures.

Large-Scale Analysis of AI Coding Tool Adoption and Productivity Impact Across 1,000 Companies

Jellyfish

Jellyfish, a software engineering analytics company, conducted a comprehensive study analyzing 20 million pull requests from 200,000 developers across 1,000 companies to understand real-world AI transformation patterns in software development. The study tracked adoption of AI coding tools (Copilot, Cursor, Claude Code) and autonomous agents (Devon, Codeex) from June 2024 onwards. Key findings include: median developer adoption rates grew from 22% to 90%, companies achieved approximately 2x gains in PR throughput with full AI adoption, cycle times decreased by 24%, and PR sizes increased by 18%. However, the study revealed that code architecture significantly impacts outcomes—centralized and balanced architectures saw 4x gains while highly distributed architectures showed minimal correlation between AI adoption and productivity, primarily due to context limitations across multiple repositories. Quality metrics showed no significant degradation, with bug resolution rates actually improving as teams used AI for well-scoped bug fixes.

Large-Scale Deployment of On-Device and Server Foundation Models for Consumer AI Features

Apple

Apple developed and deployed a comprehensive foundation model infrastructure consisting of a 3-billion parameter on-device model and a mixture-of-experts server model to power Apple Intelligence features across iOS, iPadOS, and macOS. The implementation addresses the challenge of delivering generative AI capabilities at consumer scale while maintaining privacy, efficiency, and quality across 15 languages. The solution involved novel architectural innovations including shared KV caches, parallel track mixture-of-experts design, and extensive optimization techniques including quantization and compression, resulting in production deployment across millions of devices with measurable performance improvements in text and vision tasks.

Large-Scale Foundation Model Training Infrastructure for National AI Initiative

AWS GENAIC (Japan)

Japan's GENIAC program partnered with AWS to provide 12 organizations with massive compute resources (127 P5 instances and 24 Trn1 instances) for foundation model development. The challenge revealed that successful FM training required far more than raw hardware access - it demanded structured organizational support, reference architectures, cross-functional teams, and comprehensive enablement programs. Through systematic deployment guides, monitoring infrastructure, and dedicated communication channels, multiple large-scale models were successfully trained including 100B+ parameter models, demonstrating that large-scale AI development is fundamentally an organizational rather than purely technical challenge.

Large-Scale GPU Infrastructure for Neural Web Search Training

Exa.ai

Exa.ai built a sophisticated GPU infrastructure combining a new 144 H200 GPU cluster with their existing 80 A100 GPU cluster to support their neural web search and retrieval models. They implemented a five-layer infrastructure stack using Pulumi, Ansible/Kubespray, NVIDIA operators, Alluxio for storage, and Flyte for orchestration, enabling efficient large-scale model training and inference while maintaining reproducibility and reliability.

Large-Scale Learned Retrieval System with Two-Tower Architecture

Pinterest

Pinterest developed and deployed a large-scale learned retrieval system using a two-tower architecture to improve content recommendations for over 500 million monthly active users. The system replaced traditional heuristic approaches with an embedding-based retrieval system learned from user engagement data. The implementation includes automatic retraining capabilities and careful version synchronization between model artifacts. The system achieved significant success, becoming one of the top-performing candidate generators with the highest user coverage and ranking among the top three in save rates.

Large-Scale LLM Batch Processing Platform for Millions of Prompts

Instacart

Instacart faced challenges processing millions of LLM calls required by various teams for tasks like catalog data cleaning, item enrichment, fulfillment routing, and search relevance improvements. Real-time LLM APIs couldn't handle this scale effectively, leading to rate limiting issues and high costs. To solve this, Instacart built Maple, a centralized service that automates large-scale LLM batch processing by handling batching, encoding/decoding, file management, retries, and cost tracking. Maple integrates with external LLM providers through batch APIs and an internal AI Gateway, achieving up to 50% cost savings compared to real-time calls while enabling teams to process millions of prompts reliably without building custom infrastructure.

Large-Scale Personalization and Product Knowledge Graph Enhancement Through LLM Integration

DoorDash

DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.

Large-Scale Semantic Search Platform for Food Delivery

Uber

Uber Eats built a production-grade semantic search platform to improve discovery across restaurants, grocery, and retail items by addressing limitations of traditional lexical search. The solution leverages LLM-based embeddings (using Qwen as the backbone), a two-tower architecture with Matryoshka Representation Learning, and Apache Lucene Plus for indexing. Through careful optimization of ANN parameters, quantization strategies, and embedding dimensions, the team achieved significant cost reductions (34% latency reduction, 17% CPU savings, 50% storage reduction) while maintaining high recall (>0.95). The system features automated biweekly model updates with blue/green deployment, comprehensive validation gates, and serving-time reliability checks to ensure production stability at global scale.

Large-Scale Tax AI Assistant Implementation for TurboTax

Intuit

Intuit built a comprehensive LLM-powered AI assistant system called Intuit Assist for TurboTax to help millions of customers understand their tax situations, deductions, and refunds. The system processes 44 million tax returns annually and uses a hybrid approach combining Claude and GPT models for both static tax explanations and dynamic Q&A, supported by RAG systems, fine-tuning, and extensive evaluation frameworks with human tax experts. The implementation includes proprietary platform GenOS with safety guardrails, orchestration capabilities, and multi-phase evaluation systems to ensure accuracy in the highly regulated tax domain.

Large-Scale Video Content Processing with Multimodal LLMs on AWS Inferentia2

ByteDance

ByteDance implemented multimodal LLMs for video understanding at massive scale, processing billions of videos daily for content moderation and understanding. By deploying their models on AWS Inferentia2 chips across multiple regions, they achieved 50% cost reduction compared to standard EC2 instances while maintaining high performance. The solution combined tensor parallelism, static batching, and model quantization techniques to optimize throughput and latency.

Legacy PDF Document Processing with LLM

Five Sigma

The given text appears to be a PDF document with binary/encoded content that needs to be processed and analyzed. The case involves handling PDF streams, filters, and document structure, which could benefit from LLM-based processing for content extraction and understanding.

Leveraging LangSmith for Debugging Tools & Actions in Production LLM Applications

Mendable

Mendable.ai enhanced their enterprise AI assistant platform with Tools & Actions capabilities, enabling automated tasks and API interactions. They faced challenges with debugging and observability of agent behaviors in production. By implementing LangSmith, they successfully debugged agent decision processes, optimized prompts, improved tool schema generation, and built evaluation datasets, resulting in a more reliable and efficient system that has already achieved $1.3 million in savings for a major tech company client.

Leveraging Vector Embeddings for Financial Fraud Detection

NICE Actimize

NICE Actimize, a leader in financial fraud prevention, implemented a scalable approach using vector embeddings to enhance their fraud detection capabilities. They developed a pipeline that converts tabular transaction data into meaningful text representations, then transforms them into vector embeddings using RoBERTa variants. This approach allows them to capture semantic similarities between transactions while maintaining high performance requirements for real-time fraud detection.

LLM Feature Extraction for Content Categorization and Search Query Understanding

Canva

Canva implemented LLMs as a feature extraction method for two key use cases: search query categorization and content page categorization. By replacing traditional ML classifiers with LLM-based approaches, they achieved higher accuracy, reduced development time from weeks to days, and lowered operational costs from $100/month to under $5/month for query categorization. For content categorization, LLM embeddings outperformed traditional methods in terms of balance, completion, and coherence metrics while simplifying the feature extraction process.

LLM Integration for Customer Support Automation and Enhancement

Airbnb

Airbnb implemented AI text generation models across three key customer support areas: content recommendation, real-time agent assistance, and chatbot paraphrasing. They leveraged large language models with prompt engineering to encode domain knowledge from historical support data, resulting in significant improvements in content relevance, agent efficiency, and user engagement. The implementation included innovative approaches to data preparation, model training with DeepSpeed, and careful prompt design to overcome common challenges like generic responses.

LLM Validation and Testing at Scale: GitLab's Comprehensive Model Evaluation Framework

Gitlab

GitLab developed a robust framework for validating and testing LLMs at scale for their GitLab Duo AI features. They created a Centralized Evaluation Framework (CEF) that uses thousands of prompts across multiple use cases to assess model performance. The process involves creating a comprehensive prompt library, establishing baseline model performance, iterative feature development, and continuous validation using metrics like Cosine Similarity Score and LLM Judge, ensuring consistent improvement while maintaining quality across all use cases.

LLM-Assisted Personalization Framework for Multi-Vertical Retail Discovery

DoorDash

DoorDash developed an LLM-assisted personalization framework to help customers discover products across their expanding catalog of hundreds of thousands of SKUs spanning multiple verticals including grocery, convenience, alcohol, retail, flowers, and gifting. The solution combines traditional machine learning approaches like two-tower embedding models and multi-task learning rankers with LLM capabilities for semantic understanding, collection generation, query rewriting, and knowledge graph augmentation. The framework balances three core consumer value dimensions—familiarity (showing relevant favorites), affordability (optimizing for price sensitivity and deals), and novelty (introducing new complementary products)—across the entire personalization stack from retrieval to ranking to presentation. While specific quantitative results are not provided, the case study presents this as a production system deployed across multiple discovery surfaces including category pages, checkout aisles, personalized carousels, and search.

LLM-Powered 3D Model Generation for 3D Printing

Build Great AI

Build Great AI developed a prototype application that leverages multiple LLM models to generate 3D printable models from text descriptions. The system uses various models including LLaMA 3.1, GPT-4, and Claude 3.5 to generate OpenSCAD code, which is then converted to STL files for 3D printing. The solution demonstrates rapid prototyping capabilities, reducing design time from hours to minutes, while handling the challenges of LLMs' spatial reasoning limitations through multiple simultaneous generations and iterative refinement.

LLM-Powered Data Classification System for Enterprise-Scale Metadata Generation

Grab

Grab developed an automated data classification system using LLMs to replace manual tagging of sensitive data across their PetaByte-scale data infrastructure. They built an orchestration service called Gemini that integrates GPT-3.5 to classify database columns and generate metadata tags, significantly reducing manual effort in data governance. The system successfully processed over 20,000 data entities within a month of deployment, with 80% user satisfaction and minimal need for tag corrections.

LLM-Powered Data Labeling Quality Assurance System

Uber

Uber AI Solutions developed a production LLM-based quality assurance system called Requirement Adherence to improve data labeling accuracy for their enterprise clients. The system addresses the costly and time-consuming problem of post-labeling rework by identifying quality issues during the labeling process itself. It works in two phases: first extracting atomic rules from client Standard Operating Procedure (SOP) documents using LLMs with reflection capabilities, then performing real-time validation during the labeling process by routing different rule types to appropriately-sized models with optimization techniques like prefix caching. This approach resulted in an 80% reduction in required audits, significantly improving timelines and reducing costs while maintaining data privacy through stateless, privacy-preserving LLM calls.

LLM-Powered In-Tool Quality Validation for Data Labeling

Uber

Uber AI Solutions developed a Requirement Adherence system to address quality issues in data labeling workflows, which traditionally relied on post-labeling checks that resulted in costly rework and delays. The solution uses LLMs in a two-phase approach: first extracting atomic rules from Standard Operating Procedure (SOP) documents and categorizing them by complexity, then performing real-time validation during the labeling process within their uLabel tool. By routing different rule types to appropriate LLM models (non-reasoning models for deterministic checks, reasoning models for subjective checks) and leveraging techniques like prefix caching and parallel execution, the system achieved an 80% reduction in required audits while maintaining data privacy through stateless, privacy-preserving LLM calls.

LLM-Powered Personalized Music Recommendations and AI DJ Commentary

Spotify

Spotify implemented LLMs to enhance their recommendation system by providing contextualized explanations for music recommendations and powering their AI DJ feature. They adapted Meta's Llama models through careful domain adaptation, human-in-the-loop training, and multi-task fine-tuning. The implementation resulted in up to 4x higher user engagement for recommendations with explanations, and a 14% improvement in Spotify-specific tasks compared to baseline Llama performance. The system was deployed at scale using vLLM for efficient serving and inference.

LLM-Powered Product Attribute Extraction from Unstructured Marketplace Data

Etsy

Etsy faced the challenge of understanding and categorizing over 100 million unique, handmade items listed by 5 million sellers, where most product information existed only as unstructured text and images rather than structured attributes. The company deployed large language models to extract product attributes at scale from listing titles, descriptions, and photos, transforming unstructured data into structured attributes that could power search filters and product comparisons. The implementation increased complete attribute coverage from 31% to 91% in target categories, improved engagement with search filters, and increased overall post-click conversion rates, while establishing robust evaluation frameworks using both human-annotated ground truth and LLM-generated silver labels.

LLM-Powered Real Estate Search and Agent Matching

Zillow

Zillow's StreetEasy platform developed two LLM-powered features in 2024 to enhance the real estate experience for New York City users. The first feature, "Instant Answers," uses pre-generated AI responses to address frequently asked property questions, reducing user frustration and improving efficiency on listing pages where shoppers spend less than 61 seconds. The second feature, "Easy as PIE," creates personalized introductions between home buyers and agents by generating AI-powered bio summaries and highlighting relevant agent attributes based on deal history and user preferences. Both features were designed with cost-effectiveness, scalability, and ethical considerations in mind, leveraging techniques like BERTopic for topic modeling, chain-of-thought prompting to prevent hallucinations, and Fair Housing guardrails to ensure compliance. The implementation demonstrated the importance of data quality, human oversight, cross-functional collaboration, and iterative development in deploying production LLM systems.

LLM-Powered Relevance Assessment for Search Results

Pinterest

Pinterest Search faced significant limitations in measuring search relevance due to the high cost and low availability of human annotations, which resulted in large minimum detectable effects (MDEs) that could only identify significant topline metric movements. To address this, they fine-tuned open-source multilingual LLMs on human-annotated data to predict relevance scores on a 5-level scale, then deployed these models to evaluate ranking results across A/B experiments. This approach reduced labeling costs dramatically, enabled stratified query sampling designs, and achieved an order of magnitude reduction in MDEs (from 1.3-1.5% down to ≤0.25%), while maintaining strong alignment with human labels (73.7% exact match, 91.7% within 1 point deviation) and enabling rapid evaluation of 150,000 rows within 30 minutes on a single GPU.

LLM-Powered Search Relevance Re-Ranking System

LeBonCoin

leboncoin, France's largest second-hand marketplace, implemented a neural re-ranking system using large language models to improve search relevance across their 60 million classified ads. The system uses a two-tower architecture with separate Ad and Query encoders based on fine-tuned LLMs, achieving up to 5% improvement in click and contact rates and 10% improvement in user experience KPIs while maintaining strict latency requirements for their high-throughput search system.

LLM-Powered Security Incident Response and Automation

Agoda

Agoda, a global travel platform processing sensitive data at scale, faced operational bottlenecks in security incident response due to high alert volumes, manual phishing email reviews, and time-consuming incident documentation. The security team implemented three LLM-powered workflows: automated triage for Level 1-2 security alerts using RAG to retrieve historical context, autonomous phishing email classification responding in under 25 seconds, and multi-source incident report generation reducing drafting time from 5-7 hours to 10 minutes. The solutions achieved 97%+ alignment with human analysts for alert triage, 99% precision in phishing classification with no false negatives, and 95% factual accuracy in report generation, while significantly reducing analyst workload and response times.

LLM-Powered Upskilling Assistant in Steel Manufacturing

Gerdau

Gerdau, a major steel manufacturer, implemented an LLM-based assistant to support employee re/upskilling as part of their broader digital transformation initiative. This development came after transitioning to the Databricks Data Intelligence Platform to solve data infrastructure challenges, which enabled them to explore advanced AI applications. The platform consolidation resulted in a 40% cost reduction in data processing and allowed them to onboard 300 new global data users while creating an environment conducive to AI innovation.

LLM-Powered Voice Assistant for Restaurant Operations and Personalized Alcohol Recommendations

Doordash

DoorDash implemented two major LLM-powered features during their 2025 summer intern program: a voice AI assistant for verifying restaurant hours and personalized alcohol recommendations with carousel generation. The voice assistant replaced rigid touch-tone phone systems with natural language conversations, allowing merchants to specify detailed hours information in advance while maintaining backward compatibility with legacy infrastructure through factory patterns and feature flags. The alcohol recommendation system leveraged LLMs to generate personalized product suggestions and engaging carousel titles using chain-of-thought prompting and a two-stage generation pipeline. Both systems were integrated into production using DoorDash's existing frameworks, with the voice assistant achieving structured data extraction through prompt engineering and webhook processing, while the recommendations carousel utilized the company's Carousel Serving Framework and Discovery SDK for rapid deployment.

LLMOps Best Practices and Success Patterns Across Multiple Companies

HumanLoop

A comprehensive analysis of successful LLM implementations across multiple companies including Duolingo, GitHub, Fathom, and others, highlighting key patterns in team composition, evaluation strategies, and tooling requirements. The study emphasizes the importance of domain experts in LLMOps, proper evaluation frameworks, and the need for comprehensive logging and debugging tools, showcasing concrete examples of companies achieving significant ROI through proper LLMOps implementation.

Mainframe to Cloud Migration with AI-Powered Code Transformation

Mercedes-Benz

Mercedes-Benz faced the challenge of modernizing their Global Ordering system, a critical mainframe application handling over 5 million lines of code that processes every vehicle order and production request across 150 countries. The company partnered with Capgemini, AWS, and Rocket Software to migrate this system from mainframe to cloud using a hybrid approach: replatforming the majority of the application while using agentic AI (GenRevive tool) to refactor specific components. The most notable success was transforming 1.3 million lines of COBOL code in their pricing service to Java in just a few months, achieving faster performance, reduced mainframe costs, and a successful production deployment with zero incidents at go-live.

Managing Memory and Scaling Issues in Production AI Agent Systems

Gradient Labs

Gradient Labs experienced a series of interconnected production incidents involving their AI agent deployed on Google Cloud Run, starting with memory usage alerts that initially appeared to be memory leaks. The team discovered the root cause was Temporal workflow cache sizing issues causing container crashes, which they resolved by tuning cache parameters. However, this fix inadvertently caused auto-scaling problems that throttled their system's ability to execute activities, leading to increased latency. The incidents highlight the complex interdependencies in production AI systems and the need for careful optimization across all infrastructure layers.

Managing Model Updates and Robustness in Production Voice Assistants

Amazon (Alexa)

At Amazon Alexa, researchers tackled two key challenges in production NLP models: preventing performance degradation on common utterances during model updates and improving model robustness to input variations. They implemented positive congruent training to minimize negative prediction flips between model versions and used T5 models to generate synthetic training data variations, making the system more resilient to slight changes in user commands while maintaining consistent performance.

MCP Marketplace: Scaling AI Agents with Organizational Context

Intuit

Intuit, a global fintech platform, faced challenges scaling AI agents across their organization due to poor discoverability of Model Context Protocol (MCP) services, inconsistent security practices, and complex manual setup requirements. They built an MCP Marketplace, a centralized registry functioning as a package manager for AI capabilities, which standardizes MCP development through automated CI/CD pipelines for producers and provides one-click installation with enterprise-grade security for consumers. The platform leverages gRPC middleware for authentication, token management, and auditing, while collecting usage analytics to track adoption, service latency, and quality metrics, thereby democratizing secure context access across their developer organization.

Mercury: Agentic AI Platform for LLM-Powered Recommendation Systems

eBay

eBay developed Mercury, an internal agentic framework designed to scale LLM-powered recommendation experiences across its massive marketplace of over two billion active listings. The platform addresses the challenge of transforming vast amounts of unstructured data into personalized product recommendations by integrating Retrieval-Augmented Generation (RAG) with a custom Listing Matching Engine that bridges the gap between LLM-generated text outputs and eBay's dynamic inventory. Mercury enables rapid development through reusable, plug-and-play components following object-oriented design principles, while its near-real-time distributed queue-based execution platform handles cost and latency requirements at industrial scale. The system combines multiple retrieval mechanisms, semantic search using embedding models, anomaly detection, and personalized ranking to deliver contextually relevant shopping experiences to hundreds of millions of users.

Migrating from Elasticsearch to Vespa for Large-Scale Search Platform

Vinted

Vinted, a major e-commerce platform, successfully migrated their search infrastructure from Elasticsearch to Vespa to handle their growing scale of 1 billion searchable items. The migration resulted in halving their server count, improving search latency by 2.5x, reducing indexing latency by 3x, and decreasing visibility time for changes from 300 to 5 seconds. The project, completed between May 2023 and April 2024, demonstrated significant improvements in search relevance and operational efficiency through careful architectural planning and phased implementation.

Migrating LLM Fine-tuning Workflows from Slurm to Kubernetes Using Metaflow and Argo

Adept.ai

Adept.ai, building an AI model for computer interaction, faced challenges with complex fine-tuning pipelines running on Slurm. They implemented a migration strategy to Kubernetes using Metaflow and Argo for workflow orchestration, while maintaining existing Slurm workloads through a hybrid approach. This allowed them to improve pipeline management, enable self-service capabilities for data scientists, and establish robust monitoring infrastructure, though complete migration to Kubernetes remains a work in progress.

Migration of Credit AI RAG Application from Multi-Cloud to AWS Bedrock

Octus

Octus, a leading provider of credit market data and analytics, migrated their flagship generative AI product Credit AI from a multi-cloud architecture (OpenAI on Azure and other services on AWS) to a unified AWS architecture using Amazon Bedrock. The migration addressed challenges in scalability, cost, latency, and operational complexity associated with running a production RAG application across multiple clouds. By leveraging Amazon Bedrock's managed services for embeddings, knowledge bases, and LLM inference, along with supporting AWS services like Lambda, S3, OpenSearch, and Textract, Octus achieved a 78% reduction in infrastructure costs, 87% decrease in cost per question, improved document sync times from hours to minutes, and better development velocity while maintaining SOC2 compliance and serving thousands of concurrent users across financial services clients.

Mission-Critical LLM Inference Platform Architecture

Baseten

Baseten has built a production-grade LLM inference platform focusing on three key pillars: model-level performance optimization, horizontal scaling across regions and clouds, and enabling complex multi-model workflows. The platform supports various frameworks including SGLang and TensorRT-LLM, and has been successfully deployed by foundation model companies and enterprises requiring strict latency, compliance, and reliability requirements. A key differentiator is their ability to handle mission-critical inference workloads with sub-400ms latency for complex use cases like AI phone calls.

ML-Powered Interactive Voice Response System for Customer Support

Airbnb

Airbnb transformed their traditional button-based Interactive Voice Response (IVR) system into an intelligent, conversational AI-powered solution that allows customers to describe their issues in natural language. The system combines automated speech recognition, intent detection, LLM-based article retrieval and ranking, and paraphrasing models to understand customer queries and either provide relevant self-service resources via SMS/app notifications or route calls to appropriate agents. This resulted in significant improvements including a reduction in word error rate from 33% to 10%, sub-50ms intent detection latency, increased user engagement with help articles, and reduced dependency on human customer support agents.

MLOps Evolution and LLM Integration at a Major Bank

Barclays

Discussion of MLOps practices and the evolution towards LLM integration at Barclays, focusing on the transition from traditional ML to GenAI workflows while maintaining production stability. The case study highlights the importance of balancing innovation with regulatory requirements in financial services, emphasizing ROI-driven development and the creation of reusable infrastructure components.

MLOps Maturity Levels and Enterprise Implementation Challenges

Various

The case study explores MLOps maturity levels (0-2) in enterprise settings, discussing how organizations progress from manual ML deployments to fully automated systems. It covers the challenges of implementing MLOps across different team personas (data scientists, ML engineers, DevOps), highlighting key considerations around automation, monitoring, compliance, and business value metrics. The study particularly emphasizes the differences between traditional ML and LLM deployments, and how organizations need to adapt their MLOps practices for each.

MLOps Platform for Airline Operations with LLM Integration

LATAM Airlines

LATAM Airlines developed Cosmos, a vendor-agnostic MLOps framework that enables both traditional ML and LLM deployments across their business operations. The framework reduced model deployment time from 3-4 months to less than a week, supporting use cases from fuel efficiency optimization to personalized travel recommendations. The platform demonstrates how a traditional airline can transform into a data-driven organization through effective MLOps practices and careful integration of AI technologies.

Modernizing DevOps with Generative AI: Challenges and Best Practices in Production

Various (Bundesliga, Harness, Trice)

A panel of experts from various organizations discusses the current state and challenges of integrating generative AI into DevOps workflows and production environments. The discussion covers how companies are balancing productivity gains with security concerns, the importance of having proper testing and evaluation frameworks, and strategies for successful adoption of AI tools in production DevOps processes while maintaining code quality and security.

Multi-Agent AI Banking Assistant Using Amazon Bedrock

Bunq

Bunq, Europe's second-largest neobank serving 20 million users, faced challenges delivering consistent, round-the-clock multilingual customer support across multiple time zones while maintaining strict banking security and compliance standards. Traditional support models created frustrating bottlenecks and strained internal resources as users expected instant access to banking functions like transaction disputes, account management, and financial advice. The company built Finn, a proprietary multi-agent generative AI assistant using Amazon Bedrock with Anthropic's Claude models, Amazon ECS for orchestration, DynamoDB for session management, and OpenSearch Serverless for RAG capabilities. The solution evolved from a problematic router-based architecture to a flexible orchestrator pattern where primary agents dynamically invoke specialized agents as tools. Results include handling 97% of support interactions with 82% fully automated, reducing average response times to 47 seconds, translating the app into 38 languages, and deploying the system from concept to production in 3 months with a team of 80 people deploying updates three times daily.

Multi-Agent AI Platform for Customer Experience at Scale

Cisco

Cisco developed an agentic AI platform leveraging LangChain to transform their customer experience operations across a 20,000-person organization managing $26 billion in recurring revenue. The solution combines multiple specialized agents with a supervisor architecture to handle complex workflows across customer adoption, renewals, and support processes. By integrating traditional machine learning models for predictions with LLMs for language processing, they achieved 95% accuracy in risk recommendations and reduced operational time by 20% in just three weeks of limited availability deployment, while automating 60% of their 1.6-1.8 million annual support cases.

Multi-Agent Architecture for Automated Advertising Media Planning

Spotify

Spotify faced a structural problem where multiple advertising buying channels (Direct, Self-Serve, Programmatic) relied on consolidated backend services but implemented fragmented, channel-specific workflow logic, creating duplicated decision-making and technical debt. To address this, they built "Ads AI," a multi-agent system using Google's Agent Development Kit (ADK) and Vertex AI that transforms media planning from a manual 15-30 minute process requiring 20+ form fields into a conversational interface that generates optimized, data-driven media plans in 5-10 seconds using 1-3 natural language messages. The system decomposes media planning into specialized agents (RouterAgent, GoalResolverAgent, AudienceResolverAgent, BudgetAgent, ScheduleAgent, and MediaPlannerAgent) that execute in parallel, leverage historical campaign performance data via function calling tools, and produce recommendations based on cost optimization, delivery rates, and budget matching heuristics.

Multi-Agent Architecture for Automating Commercial Real Estate Development Workflows

Build.inc

Build.inc developed a sophisticated multi-agent system called Dougie to automate complex commercial real estate development workflows, particularly for data center projects. Using LangGraph for orchestration, they implemented a hierarchical system of over 25 specialized agents working in parallel to perform land diligence tasks. The system reduces what traditionally took human consultants four weeks to complete down to 75 minutes, while maintaining high quality and depth of analysis.

Multi-Agent Customer Support Automation Platform for Fintech

Gradient Labs

Gradient Labs, an AI-native startup founded after ChatGPT's release, built a comprehensive customer support automation platform for fintech companies featuring three coordinated AI agents: inbound, outbound, and back office. The company addresses the challenge that traditional customer support automation only handles the "tip of the iceberg" - frontline queries - while missing the complex back-office tasks like fraud disputes and KYC compliance that consume most human agent time. Their solution uses a modular agent architecture with natural language procedures, deterministic skill-based orchestration, multi-layer guardrails for regulatory compliance, and sophisticated state management to handle complex, multi-turn conversations across email, chat, and voice channels. This approach enables end-to-end automation where agents coordinate seamlessly, such as an inbound agent receiving a dispute claim, triggering a back-office agent to process it, and an outbound agent proactively following up with customers for additional information.

Multi-Agent Financial Research and Question Answering System

Yahoo! Finance

Yahoo! Finance built a production-scale financial question answering system using multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock Agent Core and employs a supervisor-subagent pattern where specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at costs of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.

Multi-Agent Property Investment Advisor with Continuous Evaluation

PropHero

PropHero, a property wealth management service, needed an AI-powered advisory system to provide personalized property investment insights for Spanish and Australian consumers. Working with AWS Generative AI Innovation Center, they built a multi-agent conversational AI system using Amazon Bedrock that delivers knowledge-grounded property investment advice through natural language conversations. The solution uses strategically selected foundation models for different agents, implements semantic search with Amazon Bedrock Knowledge Bases, and includes an integrated continuous evaluation system that monitors context relevance, response groundedness, and goal accuracy in real-time. The system achieved 90% goal accuracy, reduced customer service workload by 30%, lowered AI costs by 60% through optimal model selection, and enabled over 50% of users (70% of paid users) to actively engage with the AI advisor.

Multi-Agent System Architecture for Autonomous Recruiting Agents

LinkedIn

LinkedIn developed a multi-agent system called Hiring Assistant to help recruiters work more efficiently, launching in October 2024. The system comprises four specialized agents (intake, sourcing, evaluation, and outreach) coordinated by a supervisor agent, with personalization driven by a preference model trained on recruiter behaviors. The presentation focuses on the operational challenges of scaling from specialized multi-agent systems to truly autonomous agents, addressing critical production issues including memory isolation across users, tool discovery and validation, safety considerations for destructive tool calls, and computational efficiency through complexity classification to route simpler tasks to completion models rather than expensive reasoning models.

Multi-Agent System for Misinformation Detection and Correction at Scale

Meta

This case study presents a sophisticated multi-agent LLM system designed to identify, correct, and find the root causes of misinformation on social media platforms at scale. The solution addresses the limitations of pre-LLM era approaches (content-only features, no real-time information, low precision/recall) by deploying specialized agents including an Indexer (for sourcing authentic data), Extractor (adaptive retrieval and reranking), Classifier (discriminative misinformation categorization), Corrector (reasoning and correction generation), and Verifier (final validation). The system achieves high precision and recall by orchestrating these agents through a centralized coordinator, implementing comprehensive logging, evaluation at both individual agent and system levels, and optimization strategies including model distillation, semantic caching, and adaptive retrieval. The approach prioritizes accuracy over cost and latency given the high stakes of misinformation propagation on platforms.

Multi-Company Panel on Production LLM Deployment Strategies and Small Language Model Optimization

Meta / AWS / NVIDIA / ConverseNow

This panel discussion features leaders from Meta, AWS, NVIDIA, and ConverseNow discussing real-world challenges and solutions for deploying LLMs in production environments. The conversation covers the trade-offs between small and large language models, with ConverseNow sharing their experience building voice AI systems for restaurants that require high accuracy and low latency. Key themes include the importance of fine-tuning small models for production use cases, the convergence of training and inference systems, optimization techniques like quantization and alternative architectures, and the challenges of building reliable, cost-effective inference stacks for mission-critical applications.

Multi-Industry AI Deployment Strategies with Diverse Hardware and Sovereign AI Considerations

AMD / Somite AI / Upstage / Rambler AI

This panel discussion at AWS re:Invent features three companies deploying AI models in production across different industries: Somite AI using machine learning for computational biology and cellular control, Upstage developing sovereign AI with proprietary LLMs and OCR for document extraction in enterprises, and Rambler AI building vision language models for industrial task verification. All three leverage AMD GPU infrastructure (MI300 series) for training and inference, emphasizing the importance of hardware choice, open ecosystems, seamless deployment, and cost-effective scaling. The discussion highlights how smaller, domain-specific models can achieve enterprise ROI where massive frontier models failed, and explores emerging areas like physical AI, world models, and data collection for robotics.

Multi-Label Red Flag Detection System for Fraud Prevention

Feedzai

Feedzai developed ScamAlert, a generative AI-based system that moves beyond traditional binary scam classification to identify specific red flags in suspected fraud attempts. The system addresses the limitations of binary classifiers that only output risk scores without explanation by using multimodal LLMs to analyze screenshots of suspected scams (emails, text messages, listings) and identify observable warning signs like suspicious links, urgency tactics, or unusual communication channels. The team created a comprehensive benchmarking framework to evaluate multiple commercial multimodal models across four dimensions: red flag detection accuracy (precision/recall/F1), instruction adherence, cost, and latency. Their results showed significant performance variations across models, with GPT-5, Gemini 3 Pro, and Gemini 2.5 Pro leading in accuracy, though with notable tradeoffs in cost and latency, while also revealing instruction-following issues in some models that generated hallucinated red flags not in the predefined taxonomy.

Multi-Layered Caching Architecture for AI Metadata Service Scalability

Salesforce

Salesforce faced critical performance and reliability issues with their AI Metadata Service (AIMS), experiencing 400ms P90 latency bottlenecks and system outages during database failures that impacted all AI inference requests including Agentforce. The team implemented a multi-layered caching strategy with L1 client-side caching and L2 service-level caching, reducing metadata retrieval latency from 400ms to sub-millisecond response times and improving end-to-end request latency by 27% while maintaining 65% availability during backend outages.

Multi-Lingual Voice Control System for AGV Management Using Edge LLMs

Addverb

Addverb developed an AI-powered voice control system for AGV (Automated Guided Vehicle) maintenance that enables warehouse workers to communicate with robots in their native language. The system uses a combination of edge-deployed Llama 3 and cloud-based ChatGPT to translate natural language commands from 98 different languages into AGV instructions, significantly reducing maintenance downtime and improving operational efficiency.

Multi-LLM Orchestration for Product Matching at Scale

Mercado Libre

Mercado Libre tackled the classic e-commerce product-matching challenge where sellers create listings with inconsistent titles, attributes, and identifiers, making it difficult to identify identical products across the platform. The team developed a sophisticated multi-LLM orchestration system that evolved from a simple 2-node architecture to a complex 7-node pipeline, incorporating adaptive prompts, context-aware decision-making, and collaborative consensus mechanisms. Through systematic iteration and careful orchestration alongside existing ML models and embedding systems, they achieved human-level performance with 95% precision and over 50% recall at a cost-effective rate of less than $0.001 per request, enabling scalable autonomous product matching across millions of items for critical use cases including pricing, personalization, and inventory optimization.

Multi-LoRA Serving for Agent Performance Analysis at Scale

Convirza

Convirza, facing challenges with their customer service agent evaluation system, transitioned from Longformer models to fine-tuned Llama-3-8b using Predibase's multi-LoRA serving infrastructure. This shift enabled them to process millions of call hours while reducing operational costs by 10x compared to OpenAI, achieving an 8% improvement in F1 scores, and increasing throughput by 80%. The solution allowed them to efficiently serve over 60 performance indicators across thousands of customer interactions daily while maintaining sub-second inference times.

Multi-modal LLM Platform for Catalog Attribute Extraction at Scale

Instacart

Instacart faced significant challenges in extracting structured product attributes (flavor, size, dietary claims, etc.) from millions of SKUs using traditional SQL-based rules and text-only machine learning models. These approaches suffered from low quality, high development overhead, and inability to process image data. To address these limitations, Instacart built PARSE (Product Attribute Recognition System for E-commerce), a self-serve multi-modal LLM platform that enables teams to extract attributes from both text and images with minimal engineering effort. The platform reduced attribute extraction development time from weeks to days, achieved 10% higher recall through multi-modal reasoning compared to text-only approaches, and delivered 95% accuracy on simpler attributes with just one day of effort versus one week with traditional methods.

Multi-Model LLM Orchestration with Rate Limit Management

Bito

Bito, an AI coding assistant startup, faced challenges with API rate limits while scaling their LLM-powered service. They developed a sophisticated load balancing system across multiple LLM providers (OpenAI, Anthropic, Azure) and accounts to handle rate limits and ensure high availability. Their solution includes intelligent model selection based on context size, cost, and performance requirements, while maintaining strict guardrails through prompt engineering.

Multi-node LLM inference scaling using AWS Trainium and vLLM for conversational AI shopping assistant

Rufus

Amazon's Rufus team faced the challenge of deploying increasingly large custom language models for their generative AI shopping assistant serving millions of customers. As model complexity grew beyond single-node memory capacity, they developed a multi-node inference solution using AWS Trainium chips, vLLM, and Amazon ECS. Their solution implements a leader/follower architecture with hybrid parallelism strategies (tensor and data parallelism), network topology-aware placement, and containerized multi-node inference units. This enabled them to successfully deploy across tens of thousands of Trainium chips, supporting Prime Day traffic while delivering the performance and reliability required for production-scale conversational AI.

Multi-Tenant MCP Server Authentication with Redis Session Management

BrainGrid

BrainGrid faced the challenge of transforming their Model Context Protocol (MCP) server from a local development tool into a production-ready, multi-tenant service that could be deployed to customers. The core problem was that serverless platforms like Cloud Run and Vercel don't maintain session state, causing users to re-authenticate repeatedly as instances scaled to zero or requests hit different instances. BrainGrid solved this by implementing a Redis-based session store with AES-256-GCM encryption, OAuth integration via WorkOS, and a fast-path/slow-path authentication pattern that caches validated JWT sessions. The solution reduced authentication overhead from 50-100ms per request to near-instantaneous for cached sessions, eliminated re-authentication fatigue, and enabled the MCP server to scale from single-user to multi-tenant deployment while maintaining security and performance.

Multilingual Content Navigation and Localization System

Intercom

YouTube, a Google company, implements a comprehensive multilingual navigation and localization system for its global platform. The source text appears to be in Dutch, demonstrating the platform's localization capabilities, though insufficient details are provided about the specific LLMOps implementation.

Multimodal Feature Stores and Research-Engineering Collaboration

Runway

Runway, a leader in generative AI for creative tools, developed a novel approach to managing multimodal training data through what they call a "multimodal feature store". This system enables efficient storage and retrieval of diverse data types (video, images, text) along with their computed features and embeddings, facilitating large-scale distributed training while maintaining researcher productivity. The solution addresses challenges in data management, feature computation, and the research-to-production pipeline, while fostering better collaboration between researchers and engineers.

Multimodal LLM-as-a-Judge for Large-Scale Product Retrieval Evaluation

Zalando

Zalando, a major e-commerce platform, faced the challenge of evaluating product retrieval systems at scale across multiple languages and diverse customer queries. Traditional human relevance assessments required substantial time and resources, making large-scale continuous evaluation impractical. The company developed a novel framework leveraging Multimodal Large Language Models (MLLMs) that automatically generate context-specific annotation guidelines and conduct relevance assessments by analyzing both text and images. Evaluated on 20,000 examples, the approach achieved accuracy comparable to human annotators while being up to 1,000 times cheaper and significantly faster (20 minutes versus weeks for humans), enabling continuous monitoring of high-frequency search queries in production and faster identification of areas requiring improvement.

Multimodal RAG Architecture Optimization for Production

Microsoft

Microsoft explored optimizing a production Retrieval-Augmented Generation (RAG) system that incorporates both text and image content to answer domain-specific queries. The team conducted extensive experiments on various aspects of the system including prompt engineering, metadata inclusion, chunk structure, image enrichment strategies, and model selection. Key improvements came from using separate image chunks, implementing a classifier for image relevance, and utilizing GPT-4V for enrichment while using GPT-4o for inference. The resulting system achieved better search precision and more relevant LLM-generated responses while maintaining cost efficiency.

Native Image Generation with Multimodal Context in Gemini 2.5 Flash

Google DeepMind

Google DeepMind released an updated native image generation capability in Gemini 2.5 Flash that represents a significant quality leap over previous versions. The model addresses key production challenges including consistent character rendering across multiple angles, pixel-perfect editing that preserves scene context, and improved text rendering within images. Through interleaved generation, the model can maintain conversation context across multiple editing turns, enabling iterative creative workflows. The team tackled evaluation challenges by combining human preference data with specific technical metrics like text rendering quality, while incorporating real user feedback from social media to create comprehensive benchmarks that drive model improvements.

Natural Language Analytics Assistant Using Amazon Bedrock Agents

Skai

Skai, an omnichannel advertising platform, developed Celeste, an AI agent powered by Amazon Bedrock Agents, to transform how customers access and analyze complex advertising data. The solution addresses the challenge of time-consuming manual report generation (taking days or weeks) by enabling natural language queries that automatically collect data from multiple sources, synthesize insights, and provide actionable recommendations. The implementation reduced report generation time by 50%, case study creation by 75%, and transformed weeks-long processes into minutes while maintaining enterprise-grade security and privacy for sensitive customer data.

Natural Language Query Interface with Production LLM Integration

Honeycomb

Honeycomb implemented a natural language query interface for their observability platform to help users more easily analyze their production data. Rather than creating a chatbot, they focused on a targeted query translation feature using GPT-3.5, achieving a 94% success rate in query generation. The feature led to significant improvements in user activation metrics, with teams using the query assistant being 2-3x more likely to create complex queries and save them to boards.

Natural Language to SQL Query Generation at Scale

Uber

Uber developed QueryGPT to address the time-intensive process of SQL query authoring across its data platform, which handles 1.2 million interactive queries monthly. The system uses large language models, vector databases, and similarity search to generate complex SQL queries from natural language prompts, reducing query authoring time from approximately 10 minutes to 3 minutes. Starting from a hackathon prototype in May 2023, the system evolved through 20+ iterations into a production service featuring workspaces for domain-specific query generation, multiple specialized LLM agents (intent, table, and column pruning), and a comprehensive evaluation framework. The limited release achieved 300 daily active users with 78% reporting significant time savings, representing a major productivity gain particularly for Uber's Operations organization which contributes 36% of all queries.

Network Operations Transformation with GenAI and AIOps

Vodafone

Vodafone implemented a comprehensive AI and GenAI strategy to transform their network operations, focusing on improving customer experience through better network management. They migrated from legacy OSS systems to a cloud-based infrastructure on Google Cloud Platform, integrating over 2 petabytes of network data with commercial and IT data. The initiative includes AI-powered network investment planning, automated incident management, and device analytics, resulting in significant operational efficiency improvements and a planned 50% reduction in OSS tools.

Next-Generation AI-Powered In-Vehicle Assistant with Hybrid Edge-Cloud Architecture

Bosch

Bosch Engineering, in collaboration with AWS, developed a next-generation conversational AI assistant for vehicles that operates through a hybrid edge-cloud architecture to address the limitations of traditional in-car voice assistants. The solution combines on-board AI components for simple queries with cloud-based processing for complex requests, enabling seamless integration with external APIs for services like restaurant booking, charging station management, and vehicle diagnostics. The system was implemented on Bosch's Software-Defined Vehicle (SDV) reference demonstrator platform, demonstrating capabilities ranging from basic vehicle control to sophisticated multi-service orchestration, with ongoing development focused on gradually moving more intelligence to the edge while maintaining robust connectivity fallback mechanisms.

Observability Platform's Journey to Production GenAI Integration

New Relic

New Relic, a major observability platform processing 7 petabytes of data daily, implemented GenAI both internally for developer productivity and externally in their product offerings. They achieved a 15% increase in developer productivity through targeted GenAI implementations, while also developing sophisticated AI monitoring capabilities and natural language interfaces for their customers. Their approach balanced cost, accuracy, and performance through a mix of RAG, multi-model routing, and classical ML techniques.

On-Device Grammar Correction with Sequence-to-Sequence Models

Google

Google Research developed an on-device grammar correction system for Gboard on Pixel 6 that detects and suggests corrections for grammatical errors as users type. The solution addresses the challenge of implementing neural grammar correction within the constraints of mobile devices (limited memory, computational power, and latency requirements) while preserving user privacy by keeping all processing local. The team built a 20MB hybrid Transformer-LSTM model using hard distillation from a cloud-based system, achieving inference on 60 characters in under 22ms on the Pixel 6 CPU, enabling real-time grammar correction for both complete sentences and partial sentence prefixes across English text in nearly any app using Gboard.

On-Device Personalized Lexicon Learning for Mobile Keyboard

Grammarly

Grammarly developed an on-device machine learning model for their iOS keyboard that learns users' personal vocabulary and provides personalized autocorrection suggestions without sending data to the cloud. The challenge was to build a model that could distinguish between valid personal vocabulary and typos while operating within severe mobile constraints (under 5 MB RAM, minimal latency). The solution involved memory-mapped storage, time-based decay functions for vocabulary management, noisy input filtering, and edit-distance-based frequency thresholding to verify new words. Deployed to over 5 million devices, the model demonstrated measurable improvements with decreased rates of reverted suggestions and increased acceptance rates, while maintaining minimal memory footprint and responsive performance.

On-Device Unified Spelling and Grammar Correction Model

Grammarly

Grammarly developed a compact 1B-parameter on-device LLM to provide offline spelling and grammar correction capabilities, addressing the challenge of maintaining writing assistance functionality without internet connectivity. The team selected Llama as the base model, created comprehensive synthetic training data covering diverse writing styles and error types, and applied extensive optimizations including Grouped Query Attention, MLX framework integration for Apple silicon, and 4-bit quantization. The resulting model achieves 210 tokens/second on M2 Mac hardware while maintaining correction quality, demonstrating that multiple specialized models can be consolidated into a single efficient on-device solution that preserves user voice and delivers real-time feedback.

Online Reinforcement Learning for Code Completion at Scale

Cursor

Cursor developed a production LLM system called Cursor Tab that predicts developer actions and suggests code completions across codebases, handling over 400 million requests per day. To address the challenge of noisy suggestions that disrupt developer flow, they implemented an online reinforcement learning approach using policy gradient methods that directly optimizes the model to show suggestions only when acceptance probability exceeds a target threshold. This approach required building infrastructure for rapid model deployment and on-policy data collection with a 1.5-2 hour turnaround cycle. The resulting model achieved a 21% reduction in suggestions shown while simultaneously increasing the accept rate by 28%, demonstrating effective LLMOps practices for continuously improving production models using real-time user feedback.

Open Source vs. Closed Source Agentic Stacks: Panel Discussion on Production Deployment Strategies

Various (Alation, GrottoAI, Nvidia, OLX)

This panel discussion brings together experts from Nvidia, OLX, Alation, and GrottoAI to discuss practical considerations for deploying agentic AI systems in production. The conversation explores when to choose open source versus closed source tooling, the challenges of standardizing agent frameworks across enterprise organizations, and the tradeoffs between abstraction levels in agent orchestration platforms. Key themes include starting with closed source models for rapid prototyping before transitioning to open source for compliance and cost reasons, the importance of observability across heterogeneous agent frameworks, the difficulty of enabling non-technical users to build agents, and the critical difference between internal tooling with lower precision requirements versus customer-facing systems demanding 95%+ accuracy.

Open-Source Protein Structure Prediction and Generative Design Platform for Drug Discovery

Boltz

Boltz, founded by Gabriele Corso and Jeremy Wohlwend, developed an open-source suite of AI models (Boltz-1, Boltz-2, and BoltzGen) for structural biology and protein design, democratizing access to capabilities previously held by proprietary systems like AlphaFold 3. The company addresses the challenge of predicting complex molecular interactions (protein-ligand, protein-protein) and designing novel therapeutic proteins by combining generative diffusion models with specialized equivariant architectures. Their approach achieved validated nanomolar binders for two-thirds of nine previously unseen protein targets, demonstrating genuine generalization beyond training data. The newly launched Boltz Lab platform provides a production-ready infrastructure with optimized GPU kernels running 10x faster than open-source versions, offering agents for protein and small molecule design with collaborative interfaces for medicinal chemists and researchers.

Optimizing Call Center Analytics with Small Language Models and Multi-Adapter Serving

Convirza

Convirza transformed their call center analytics platform from using traditional large language models to implementing small language models (specifically Llama 3B) with adapter-based fine-tuning. By partnering with Predibase, they achieved a 10x cost reduction compared to OpenAI while improving accuracy by 8% and throughput by 80%. The system analyzes millions of calls monthly, extracting hundreds of custom indicators for agent performance and caller behavior, with sub-0.1 second inference times using efficient multi-adapter serving on single GPUs.

Optimizing Cloud Storage Infrastructure for Enterprise AI Platform Operations

H2O.ai

H2O.ai, an enterprise AI platform provider delivering both generative and predictive AI solutions, faced significant challenges with their AWS EBS storage infrastructure that supports model training and AI workloads running on Kubernetes. The company was managing over 2 petabytes of storage with poor utilization rates (around 25%), leading to substantial cloud costs and limited ability to scale efficiently. They implemented Datafi, an autonomous storage management solution that dynamically scales EBS volumes up and down based on actual usage without downtime. The solution integrated seamlessly with their existing Kubernetes, Terraform, and GitOps workflows, ultimately improving storage utilization to 80% and reducing their storage footprint from 2 petabytes to less than 1 petabyte while simultaneously improving performance for customers.

Optimizing Copilot Latency with NVIDIA TensorRT-LLM Integration

Moveworks

Moveworks addressed latency challenges in their enterprise Copilot by implementing NVIDIA's TensorRT-LLM optimization engine. The integration resulted in significant performance improvements, including a 2.3x increase in token processing speed (from 19 to 44 tokens per second), a reduction in average request latency from 3.4 to 1.5 seconds, and nearly 3x faster time to first token. These optimizations enabled more natural conversations and improved resource utilization in production.

Optimizing GPU Memory Usage in LLM Training with Liger-Kernel

LinkedIn

LinkedIn developed Liger-Kernel, a library to optimize GPU performance during LLM training by addressing memory access and per-operation bottlenecks. Using techniques like FlashAttention and operator fusion implemented in Triton, the library achieved a 60% reduction in memory usage, 20% improvement in multi-GPU training throughput, and a 3x reduction in end-to-end training time.

Optimizing LLM Server Startup Times for Preemptable GPU Infrastructure

Replit

Replit faced challenges with running LLM inference on expensive GPU infrastructure and implemented a solution using preemptable cloud GPUs to reduce costs by two-thirds. The key challenge was reducing server startup time from 18 minutes to under 2 minutes to handle preemption events, which they achieved through container optimization, GKE image streaming, and improved model loading processes.

Optimizing LLM Training with Efficient GPU Kernels

LinkedIn

LinkedIn developed and open-sourced LIER (LinkedIn Efficient and Reusable) kernels to address the fundamental challenge of memory consumption in LLM training. By optimizing core operations like layer normalization, rotary position encoding, and activation functions, they achieved up to 3-4x reduction in memory allocation and 20% throughput improvements for large models. The solution, implemented using Python and Triton, focuses on minimizing data movement between GPU memory and compute units, making LLM training faster and more cost-effective.

Optimizing LLM Training with Triton Kernels and Infrastructure Stack

LinkedIn

LinkedIn introduced Liger-Kernel, an open-source library addressing GPU efficiency challenges in LLM training. The solution combines efficient Triton kernels with a flexible API design, integrated into a comprehensive training infrastructure stack. The implementation achieved significant improvements, including 20% better training throughput and 60% reduced memory usage for popular models like Llama, Gemma, and Qwen, while maintaining compatibility with mainstream training frameworks and distributed training systems.

Optimizing Medical Record Processing with Prompt Caching at Scale

Care Access

Care Access, a global health services and clinical research organization, faced significant operational challenges when processing 300-500+ medical records daily for their health screening program. Each medical record required multiple LLM-based analyses through Amazon Bedrock, but the approach of reprocessing substantial portions of medical data for each separate analysis question led to high costs and slower processing times. By implementing Amazon Bedrock's prompt caching feature—caching the static medical record content while varying only the analysis questions—Care Access achieved an 86% reduction in data processing costs (7x decrease) and 66% faster processing times (3x speedup), saving 4-8+ hours of processing time daily. This optimization enabled the organization to scale their health screening program efficiently while maintaining strict HIPAA compliance and privacy standards, allowing them to connect more participants with personalized health resources and clinical trial opportunities.

Optimizing Production Vision Pipelines for Planet Image Generation

Prem AI

At Prem AI, they tackled the challenge of generating realistic ethereal planet images at scale with specific constraints like aspect ratio and controllable parameters. The solution involved fine-tuning Stable Diffusion XL with a curated high-quality dataset, implementing custom upscaling pipelines, and optimizing performance through various techniques including LoRA fusion, model quantization, and efficient serving frameworks like Ray Serve.

Optimizing RAG Latency Through Model Racing and Self-Hosted Infrastructure

ElevenLabs

ElevenLabs faced significant latency challenges in their production RAG system, where query rewriting accounted for over 80% of RAG latency due to reliance on a single externally-hosted LLM. They redesigned their architecture to implement model racing, where multiple models (including self-hosted Qwen 3-4B and 3-30B-A3B models) process queries in parallel, with the first valid response winning. This approach reduced median RAG latency from 326ms to 155ms (a 50% improvement), while also improving system resilience by providing fallbacks during provider outages and reducing dependency on external services.

Optimizing RAG-based Search Results for Production: A Journey from POC to Production

Statista

Statista, a global data platform, developed and optimized a RAG-based AI search system to enhance their platform's search capabilities. Working with Urial Labs and Talent Formation, they transformed a basic prototype into a production-ready system that improved search quality by 140%, reduced costs by 65%, and decreased latency by 10%. The resulting Research AI product has seen growing adoption among paying customers and demonstrates superior performance compared to general-purpose LLMs for domain-specific queries.

Optimizing Security Threat Investigation with Multi-Model LLM Strategy

Trellix

Trellix implemented an AI-powered security threat investigation system using multiple foundation models on Amazon Bedrock to automate and enhance their security analysis workflow. By strategically combining Amazon Nova Micro with Anthropic's Claude Sonnet, they achieved 3x faster inference speeds and nearly 100x lower costs while maintaining investigation quality through a multi-pass approach with smaller models. The system uses RAG architecture with Amazon OpenSearch Service to process billions of security events and provide automated risk scoring.

Optimizing Text-to-SQL Pipeline Using Agent Experiments

IDInsight

Ask-a-Metric developed a WhatsApp-based AI data analyst that converts natural language questions to SQL queries. They evolved from a simple sequential pipeline to testing an agent-based approach using CrewAI, ultimately creating a hybrid "pseudo-agent" pipeline that combined the best aspects of both approaches. While the agent-based system achieved high accuracy, its high costs and slow response times led to the development of an optimized pipeline that maintained accuracy while reducing query response time to under 15 seconds and costs to less than $0.02 per query.

Optimizing vLLM for High-Throughput Embedding Inference at Scale

Snowflake

Snowflake faced performance bottlenecks when scaling embedding models for their Cortex AI platform, which processes trillions of tokens monthly. Through profiling vLLM, they identified CPU-bound inefficiencies in tokenization and serialization that left GPUs underutilized. They implemented three key optimizations: encoding embedding vectors as little-endian bytes for faster serialization, disaggregating tokenization and inference into a pipeline, and running multiple model replicas on single GPUs. These improvements delivered 16x throughput gains for short sequences and 4.2x for long sequences, while reducing costs by 16x and achieving 3x throughput improvement in production.

Overcoming LLM Production Deployment Challenges

Neeva

A comprehensive analysis of the challenges and solutions in deploying LLMs to production, presented by a machine learning expert from Neeva. The presentation covers both infrastructural challenges (speed, cost, API reliability, evaluation) and output-related challenges (format variability, reproducibility, trust and safety), along with practical solutions and strategies for successful LLM deployment, emphasizing the importance of starting with non-critical workflows and planning for scale.

Panel Discussion on Building Production LLM Applications

Various

A panel discussion featuring experts from Various companies discussing key aspects of building production LLM applications. The discussion covers critical topics including hallucination management, prompt engineering, evaluation frameworks, cost considerations, and model selection. Panelists share practical experiences and insights on deploying LLMs in production, highlighting the importance of continuous feedback loops, evaluation metrics, and the trade-offs between open source and commercial LLMs.

Panel Discussion on LLMOps Challenges: Model Selection, Ethics, and Production Deployment

Google, Databricks,

A panel discussion featuring leaders from various AI companies discussing the challenges and solutions in deploying LLMs in production. Key topics included model selection criteria, cost optimization, ethical considerations, and architectural decisions. The discussion highlighted practical experiences from companies like Interact.ai's healthcare deployment, Inflection AI's emotionally intelligent models, and insights from Google and Databricks on responsible AI deployment and tooling.

Panel Discussion: Best Practices for LLMs in Production

Various

A panel of industry experts from companies including Titan ML, YLabs, and Outer Bounds discuss best practices for deploying LLMs in production. They cover key challenges including prototyping, evaluation, observability, hardware constraints, and the importance of iteration. The discussion emphasizes practical advice for teams moving from prototype to production, highlighting the need for proper evaluation metrics, user feedback, and robust infrastructure.

PerfInsights: AI-Powered Performance Optimization for Go Services

Uber

Uber developed PerfInsights to address the unsustainable compute costs of their Go services, where the top 10 services alone accounted for multi-million dollars in monthly compute spend. The solution combines runtime profiling with GenAI-powered static analysis to automatically detect performance antipatterns in Go code, validate findings through LLM juries and rule-based checking (LLMCheck), and generate optimization recommendations. Results include a 93% reduction in time required to detect and fix performance issues (from 14.5 hours to 1 hour), over 80% reduction in false positives, hundreds of merged optimization diffs, and a 33.5% reduction in detected antipatterns over four months, translating to approximately 3,800 hours of engineering time saved annually.

Pivoting from GPU Infrastructure to Building an AI-Powered Development Environment

Windsurf

Windsurf began as a GPU virtualization company but pivoted in 2022 when they recognized the transformative potential of large language models. They developed an AI-powered development environment that evolved from a VS Code extension to a full-fledged IDE, incorporating advanced code understanding and generation capabilities. The product now serves hundreds of thousands of daily active users, including major enterprises, and has achieved significant success in automating software development tasks while maintaining high precision through sophisticated evaluation systems.

Platform-Centric AI-Assisted Code Generation with Context-Aware Systems

Intuit

Intuit developed a platform-centric approach to AI-assisted code generation to improve developer productivity across its 8,000+ engineering organization serving 100M customers. While off-the-shelf IDE extensions initially showed promise, they lacked awareness of Intuit-specific APIs, architectural conventions, and compliance requirements, leading to declining usage. Intuit's solution involved creating "golden repositories" containing curated, high-quality code examples that embed organizational context into AI code generation systems through context-enriched query pipelines. This approach enabled vendor-agnostic AI integration while ensuring generated code aligns with Intuit's standards. Results included 58% of AI-generated tests used without modification, 56% faster PR merge times, 3× faster backend code generation, and over 10× improvement in frontend generation tasks.

Post-Training and Production LLM Systems at Scale

OpenAI

This case study explores OpenAI's approach to post-training and deploying large language models in production environments, featuring insights from a post-training researcher working on reasoning models. The discussion covers the operational complexities of reinforcement learning from human feedback at scale, the evolution from non-thinking to thinking models, and production challenges including model routing, context window optimization, token efficiency improvements, and interruptability features. Key developments include the shopping model release, improvements from GPT-4.1 to GPT-5.1, and the operational realities of managing complex RL training runs with multiple grading setups and infrastructure components that require constant monitoring and debugging.

Practical Lessons Learned from Building and Deploying GenAI Applications

Bolbeck

A comprehensive overview of lessons learned from building GenAI applications over 1.5 years, focusing on the complexities and challenges of deploying LLMs in production. The presentation covers key aspects of LLMOps including model selection, hosting options, ensuring response accuracy, cost considerations, and the importance of observability in AI applications. Special attention is given to the emerging role of AI agents and the critical balance between model capability and operational costs.

Privacy-Preserving University Chatbot with LiteLLM Proxy for Multi-Model Governance and Cost Control

Unnamed private university

A private university sought to implement a privacy-preserving chatbot accessible to students and employees with requirements for model flexibility, potential self-hosting, and budget control. The solution leveraged LiteLLM's proxy server as an OpenAI-compatible gateway to manage multiple LLM providers, implement automatic cost tracking and budgeting per user/team, handle load balancing across model instances, and provide a unified API. While the system successfully delivered basic cost control and multi-provider support, the implementation revealed limitations in handling complex custom budgeting requirements, provider-specific features, and stability issues with newer features, requiring workarounds and custom implementations for advanced use cases.

Production Agents: Real-world Implementations of LLM-powered Autonomous Systems

Various

A panel discussion featuring three practitioners implementing LLM-powered agents in production: Sam's personal assistant with real-time feedback and router agents, Div's browser automation system Melton with reliability and monitoring features, and Devin's GitHub repository assistant that helps with code understanding and feature requests. Each presenter shared their architecture choices, testing strategies, and approaches to handling challenges like latency, reliability, and model selection in production environments.

Production Agents: Routing, Testing and Browser Automation Case Studies

Various

Three practitioners share their experiences deploying LLM agents in production: Sam discusses building a personal assistant with real-time user feedback and router agents, Div presents a browser automation assistant called Milton that can control web applications, and Devin explores using LLMs to help engineers with non-coding tasks by navigating codebases. Each case study highlights different approaches to routing between agents, handling latency, testing strategies, and model selection for production deployment.

Production AI Agents for Accounting Automation: Engineering Process Daemons at Scale

Digits

Digits, an AI-native accounting platform, shares their experience running AI agents in production for over 2 years, addressing real-world challenges in deploying LLM-based systems. The team reframes "agents" as "process daemons" to set appropriate expectations and details their implementation across three use cases: vendor data enrichment, client onboarding, and complex query handling. Their solution emphasizes building lightweight custom infrastructure over dependency-heavy frameworks, reusing existing APIs as agent tools, implementing comprehensive observability with OpenTelemetry, and establishing robust guardrails. The approach has enabled reliable automation while maintaining transparency, security, and performance through careful engineering rather than relying on framework abstractions.

Production AI Agents with Dynamic Planning and Reactive Evaluation

Hex

Hex successfully implemented AI agents in production for data science notebooks by developing a unique approach to agent orchestration. They solved key challenges around planning, tool usage, and latency by constraining agent capabilities, building a reactive DAG structure, and optimizing context windows. Their success came from iteratively developing individual capabilities before combining them into agents, keeping humans in the loop, and maintaining tight feedback cycles with users.

Production AI Deployment: Lessons from Real-World Agentic AI Systems

Databricks / Various

This case study presents lessons learned from deploying generative AI applications in production, with a specific focus on Flo Health's implementation of a women's health chatbot on the Databricks platform. The presentation addresses common failure points in GenAI projects including poor constraint definition, over-reliance on LLM autonomy, and insufficient engineering discipline. The solution emphasizes deterministic system architecture over autonomous agents, comprehensive observability and tracing, rigorous evaluation frameworks using LLM judges, and proper DevOps practices. Results demonstrate that successful production deployments require treating agentic AI as modular system architectures following established software engineering principles rather than monolithic applications, with particular emphasis on cost tracking, quality monitoring, and end-to-end deployment pipelines.

Production Deployment Challenges and Infrastructure Gaps for Multi-Agent AI Systems

GetOnStack

GetOnStack's team deployed a multi-agent LLM system for market data research that initially cost $127 weekly but escalated to $47,000 over four weeks due to an infinite conversation loop between agents running undetected for 11 days. This experience exposed critical gaps in production infrastructure for multi-agent systems using Agent-to-Agent (A2A) communication and Anthropic's Model Context Protocol (MCP). In response, the company spent six weeks building comprehensive production infrastructure including message queues, monitoring, cost controls, and safeguards. GetOnStack is now developing a platform to provide one-command deployment and production-ready infrastructure specifically designed for multi-agent systems, aiming to help other teams avoid similar costly production failures.

Production GenAI for User Safety and Enhanced Matching Experience

Tinder

Tinder implemented two production GenAI applications to enhance user safety and experience: a username detection system using fine-tuned Mistral 7B to identify social media handles in user bios with near-perfect recall, and a personalized match explanation feature using fine-tuned Llama 3.1 8B to help users understand why recommended profiles are relevant. Both systems required sophisticated LLMOps infrastructure including multi-model serving with LoRA adapters, GPU optimization, extensive monitoring, and iterative fine-tuning processes to achieve production-ready performance at scale.

Production LLM Implementation for Customer Support Response Generation

Stripe

Stripe implemented a large language model system to help support agents answer customer questions more efficiently. They developed a sequential framework that combined fine-tuned models for question filtering, topic classification, and response generation. While the system achieved good accuracy in offline testing, they discovered challenges with agent adoption and the importance of monitoring online metrics. Key learnings included breaking down complex problems into manageable ML steps, prioritizing online feedback mechanisms, and maintaining high-quality training data.

Production LLM Systems at Scale - Lessons from Financial Services, Legal Tech, and ML Infrastructure

Nubank, Harvey AI, Galileo and Convirza

A panel discussion featuring leaders from Nubank, Harvey AI, Galileo, and Convirza discussing their experiences implementing LLMs in production. The discussion covered key challenges and solutions around model evaluation, cost optimization, latency requirements, and the transition from large proprietary models to smaller fine-tuned models. Participants shared insights on modularizing LLM applications, implementing human feedback loops, and balancing the tradeoffs between model size, cost, and performance in production environments.

Production Monitoring and Issue Discovery for AI Agents

Raindrop

Raindrop's CTO Ben presents a comprehensive framework for building reliable AI agents in production, addressing the challenge that traditional offline evaluations cannot capture the full complexity of real-world user behavior. The core problem is that AI agents fail in subtle ways without concrete errors, making issues difficult to detect and fix. Raindrop's solution centers on a "discover, track, and fix" loop that combines explicit signals like thumbs up/down with implicit signals detected semantically in conversations, such as user frustration, task failures, and agent forgetfulness. By clustering these signals with user intents and tracking them over time, teams can identify the most impactful issues and systematically improve their agents. The approach emphasizes experimentation and production monitoring over purely offline testing, drawing parallels to how traditional software engineering shifted from extensive QA to tools like Sentry for error monitoring.

Production Vector Search and Retrieval System Optimization at Scale

Superlinked

SuperLinked, a company focused on vector search infrastructure, shares production insights from deploying information retrieval systems for e-commerce and enterprise knowledge management with indexes up to 2 terabytes. The presentation addresses challenges in relevance, latency, and cost optimization when deploying vector search systems at scale. Key solutions include avoiding vector pooling/averaging, implementing late interaction models, fine-tuning embeddings for domain-specific needs, combining sparse and dense representations, leveraging graph embeddings, and using template-based query generation instead of unconstrained text-to-SQL. Results demonstrate 5%+ precision improvements through targeted fine-tuning, significant latency reductions through proper database selection and query optimization, and improved relevance through multi-encoder architectures that combine text, graph, and metadata signals.

Production-Scale Generative AI Infrastructure for Game Art Creation

Playtika

Playtika, a gaming company, built an internal generative AI platform to accelerate art production for their game studios with the goal of reducing art production time by 50%. The solution involved creating a comprehensive infrastructure for fine-tuning and deploying diffusion models (Stable Diffusion 1.5, then SDXL) at scale, supporting text-to-image, image-to-image, and inpainting capabilities. The platform evolved from using DreamBooth fine-tuning with separate model deployments to LoRA adapters with SDXL, enabling efficient model switching and GPU utilization. Through optimization techniques including OneFlow acceleration framework (achieving 40% latency reduction), FP16 quantization, NVIDIA MIG partitioning, and careful infrastructure design, they built a cost-efficient system serving multiple game studios while maintaining quality and minimizing inference latency.

Production-Scale NLP Suggestion System with Real-Time Text Processing

Grammarly

Grammarly built a sophisticated production system for delivering writing suggestions to 30 million users daily. The company developed an extensible operational transformation protocol using Delta format to represent text changes, user edits, and AI-generated suggestions in a unified manner. The system addresses critical challenges in managing ML-generated suggestions at scale: maintaining suggestion relevance as users edit text in real-time, rebasing suggestion positions according to ongoing edits without waiting for backend updates, and applying multiple suggestions simultaneously without UI freezing. The architecture includes a Suggestions Repository, Delta Manager for rebasing operations, and Highlights Manager, all working together to ensure suggestions remain accurate and applicable as document state changes dynamically.

Production-Scale RAG System for Real-Time News Processing and Analysis

Emergent Methods

Emergent Methods built a production-scale RAG system processing over 1 million news articles daily, using a microservices architecture to deliver real-time news analysis and context engineering. The system combines multiple open-source tools including Quadrant for vector search, VLM for GPU optimization, and their own Flow.app for orchestration, addressing challenges in news freshness, multilingual processing, and hallucination prevention while maintaining low latency and high availability.

Productionizing Generative AI Applications: From Exploration to Scale

LinkedIn

A LinkedIn product manager shares insights on bringing LLMs to production, focusing on their implementation of various generative AI features across the platform. The case study covers the complete lifecycle from idea exploration to production deployment, highlighting key considerations in prompt engineering, GPU resource management, and evaluation frameworks. The presentation emphasizes practical approaches to building trust-worthy AI products while maintaining scalability and user focus.

Productionizing LLM-Powered Data Governance with LangChain and LangSmith

Grab

Grab enhanced their LLM-powered data governance system (Metasense V2) by improving model performance and operational efficiency. The team tackled challenges in data classification by splitting complex tasks, optimizing prompts, and implementing LangChain and LangSmith frameworks. These improvements led to reduced misclassification rates, better collaboration between teams, and streamlined prompt experimentation and deployment processes while maintaining robust monitoring and safety measures.

RAG-Based Dasher Support Automation with LLM Guardrails and Quality Monitoring

Doordash

DoorDash developed an LLM-based chatbot system to automate support for Dashers (delivery contractors) who encounter issues during deliveries. The existing flow-based automated support system could only handle a limited subset of issues, and while a knowledge base existed, it was difficult to navigate, time-consuming to parse, and only available in English. The solution involved implementing a RAG (Retrieval Augmented Generation) system that retrieves relevant information from knowledge base articles and generates contextually appropriate responses. To address LLM challenges including hallucinations, context summarization accuracy, language consistency, and latency, DoorDash built three key systems: an LLM Guardrail for real-time response validation, an LLM Judge for quality monitoring and evaluation, and a quality improvement pipeline. The system now autonomously assists thousands of Dashers daily, reducing hallucinations by 90% and compliance issues by 99%, while allowing human agents to focus on more complex support scenarios.

Rapid Development and Deployment of Enterprise LLM Features Through Centralized LLM Service Architecture

PagerDuty

PagerDuty successfully developed and deployed multiple GenAI features in just two months by implementing a centralized LLM API service architecture. They created AI-powered features including runbook generation, status updates, postmortem reports, and an AI assistant, while addressing challenges of rapid development with new technology. Their solution included establishing clear processes, role definitions, and a centralized LLM service with robust security, monitoring, and evaluation frameworks.

Rapid Prototyping and Scaling AI Applications Using Open Source Models

Hassan El Mghari

Hassan El Mghari, a developer relations leader at Together AI, demonstrates how to build and scale AI applications to millions of users using open source models and a simplified architecture. Through building approximately 40 AI apps over four years (averaging one per month), he developed a streamlined approach that emphasizes simplicity, rapid iteration, and leveraging the latest open source models. His applications, including commit message generators, text-to-app builders, and real-time image generators, have collectively served millions of users and generated tens of millions of outputs, proving that simple architectures with single API calls can achieve significant scale when combined with good UI design and viral sharing mechanics.

Real-Time Access Control and Credit System for High-Scale LLM Products

OpenAI

OpenAI encountered significant scaling challenges with Codex and Sora as rapid user adoption pushed usage beyond expected limits, creating frustrating experiences when users hit rate limits. To address this, they built an in-house real-time access engine that seamlessly blends rate limits with a credit-based pay-as-you-go system, enabling users to continue working without hard stops. The solution involved creating a distributed usage and balance system with provably correct billing, real-time decision-making, idempotent credit debits, and comprehensive audit trails that maintain user trust while ensuring fair access and system performance at scale.

Real-time AI Agent Assistance in Contact Center Operations

US Bank

US Bank implemented a generative AI solution to enhance their contact center operations by providing real-time assistance to agents handling customer calls. The system uses Amazon Q in Connect and Amazon Bedrock with Anthropic's Claude model to automatically transcribe conversations, identify customer intents, and provide relevant knowledge base recommendations to agents in real-time. While still in production pilot phase with limited scope, the solution addresses key challenges including reducing manual knowledge base searches, improving call handling times, decreasing call transfers, and automating post-call documentation through conversation summarization.

Real-Time AI Chief of Staff for Product Teams

Earmark

Earmark built a productivity suite for product teams that transforms meeting conversations into finished work in real-time, addressing the problem of endless context-switching and manual follow-up work that plagues modern product development. Founded by Mark Barb and Sandon, who both came from the product management SaaS space, Earmark uses live transcription and multiple parallel AI agents to generate product specs, tickets, summaries, and other artifacts during meetings rather than after them. The company pivoted from an Apple Vision Pro communication training tool to a web-based real-time meeting assistant after discovering through 60 customer interviews that few people actually prepare for presentations. With 78% of survey respondents saying they'd be "super bummed" if the product disappeared, Earmark has achieved strong product-market fit by focusing specifically on product managers, engineering leaders, and adjacent roles who spend most of their time in back-to-back meetings with different audiences and deliverables.

Real-time Data Streaming Architecture for AI Customer Support

Clari

A fictional airline case study demonstrates how shifting from batch processing to real-time data streaming transformed their AI customer support system. By implementing a shift-left data architecture using Kafka and Flink, they eliminated data silos and delayed processing, enabling their AI agents to access up-to-date customer information across all channels. This resulted in improved customer satisfaction, reduced latency, and decreased operational costs while enabling their AI system to provide more accurate and contextual responses.

Real-Time Generative AI for Immersive Theater Performance

University of California Los Angeles

The University of California Los Angeles (UCLA) Office of Advanced Research Computing (OARC) partnered with UCLA's Center for Research and Engineering in Media and Performance (REMAP) to build an AI-powered system for an immersive production of the musical "Xanadu." The system enabled up to 80 concurrent audience members and performers to create sketches on mobile phones, which were processed in near real-time (under 2 minutes) through AWS generative AI services to produce 2D images and 3D meshes displayed on large LED screens during live performances. Using a serverless-first architecture with Amazon SageMaker AI endpoints, Amazon Bedrock foundation models, and AWS Lambda orchestration, the system successfully supported 7 performances in May 2025 with approximately 500 total audience members, demonstrating that cloud-based generative AI can reliably power interactive live entertainment experiences.

Real-Time Multilingual Chat Translation at Scale

Roblox

Roblox deployed a unified transformer-based translation LLM to enable real-time chat translation across all combinations of 16 supported languages for over 70 million daily active users. The company built a custom ~1 billion parameter model using pretraining on open source and proprietary data, then distilled it down to fewer than 650 million parameters to achieve approximately 100 millisecond latency while handling over 5,000 chats per second. The solution leverages a mixture-of-experts architecture, custom translation quality estimation models, back translation techniques for low-resource language pairs, and comprehensive integration with trust and safety systems to deliver contextually appropriate translations that understand Roblox-specific slang and terminology.

Real-time Question-Answering System with Two-Stage LLM Architecture for Sales Content Recommendations

Microsoft

Microsoft developed a real-time question-answering system for their MSX Sales Copilot to help sellers quickly find and share relevant sales content from their Seismic repository. The solution uses a two-stage architecture combining bi-encoder retrieval with cross-encoder re-ranking, operating on document metadata since direct content access wasn't available. The system was successfully deployed in production with strict latency requirements (few seconds response time) and received positive feedback from sellers with relevancy ratings of 3.7/5.

Rebuilding a Production Chatbot with Direct API Access and Multi-Agent Architecture

Langchain

LangChain rebuilt their public documentation chatbot after discovering their support engineers preferred using their own internal workflow over the existing tool. The original chatbot used traditional vector embedding retrieval, which suffered from fragmented context, constant reindexing, and vague citations. The solution involved building two distinct architectures: a fast CreateAgent for simple documentation queries delivering sub-15-second responses, and a Deep Agent with specialized subgraphs for complex queries requiring codebase analysis. The new approach replaced vector embeddings with direct API access to structured content (Mintlify for docs, Pylon for knowledge base, and ripgrep for codebase search), enabling the agent to search iteratively like a human. Results included dramatically faster response times, precise citations with line numbers, elimination of reindexing overhead, and internal adoption by support engineers for complex troubleshooting.

Reinforcement Learning for Code Generation and Agent-Based Development Tools

Cursor

This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.

Revamping Query Understanding with LLMs in E-commerce Search

Instacart

Instacart transformed their query understanding (QU) system from multiple independent traditional ML models to a unified LLM-based approach to better handle long-tail, specific, and creatively-phrased search queries. The solution employed a layered strategy combining retrieval-augmented generation (RAG) for context engineering, post-processing guardrails, and fine-tuning of smaller models (Llama-3-8B) on proprietary data. The production system achieved significant improvements including 95%+ query rewrite coverage with 90%+ precision, 6% reduction in scroll depth for tail queries, 50% reduction in complaints for poor tail query results, and sub-300ms latency through optimizations like adapter merging, H100 GPU upgrades, and autoscaling.

RoBERTa for Large-Scale Merchant Classification

Square

Square developed and deployed a RoBERTa-based merchant classification system to accurately categorize millions of merchants across their platform. The system replaced unreliable self-selection methods with an ML approach that combines business names, self-selected information, and transaction data to achieve a 30% improvement in accuracy. The solution runs daily predictions at scale using distributed GPU infrastructure and has become central to Square's business metrics and strategic decision-making.

Running LLM Agents in Production for Accounting Automation

Digits

Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.

Scalable Intelligent Document Processing with Multi-Tenant Serverless Architecture

Ricoh

Ricoh USA faced significant scalability challenges in their healthcare document processing operations, where each new customer implementation required 40-60 hours of custom engineering work involving unique prompt engineering, model fine-tuning, and integration testing. To address anticipated sevenfold growth in document volume (from 10,000 to 70,000 documents monthly), Ricoh partnered with AWS to implement the GenAI IDP Accelerator using a serverless architecture combining Amazon Textract for OCR and Amazon Bedrock foundation models for intelligent classification and extraction. The solution reduced customer onboarding time from 4-6 weeks to 2-3 days, decreased engineering hours per deployment by over 90% (from ~80 hours to <5 hours), and created a reusable, multi-tenant framework that maintains strict healthcare compliance standards (HITRUST, HIPAA, SOC 2) while enabling effective human-in-the-loop workflows through confidence scoring mechanisms.

Scaling a High-Traffic LLM Chat Application to 30,000 Messages Per Second

Character.ai

Character.ai scaled their open-domain conversational AI platform from 300 to over 30,000 generations per second within 18 months, becoming the third most-used generative AI application globally. They tackled unique engineering challenges around data volume, cost optimization, and connection management while maintaining performance. Their solution involved custom model architectures, efficient GPU caching strategies, and innovative prompt management tools, all while balancing performance, latency, and cost considerations at scale.

Scaling Agentic AI for Digital Accessibility and Content Intelligence

Siteimprove

Siteimprove, a SaaS platform provider for digital accessibility, analytics, SEO, and content strategy, embarked on a journey from generative AI to production-scale agentic AI systems. The company faced the challenge of processing up to 100 million pages per month for accessibility compliance while maintaining trust, speed, and adoption. By leveraging AWS Bedrock, Amazon Nova models, and developing a custom AI accelerator architecture, Siteimprove built a multi-agent system supporting batch processing, conversational remediation, and contextual image analysis. The solution achieved 75% cost reduction on certain workloads, enabled autonomous multi-agent orchestration across accessibility, analytics, SEO, and content domains, and was recognized as a leader in Forrester's digital accessibility platforms assessment. The implementation demonstrated how systematic progression through human-in-the-loop, human-on-the-loop, and autonomous stages can bridge the prototype-to-production chasm while delivering measurable business value.

Scaling Agentic AI Systems for Real Estate Due Diligence: Managing Prompt Tax at Production Scale

Orbital

Orbital, a real estate technology company, developed an agentic AI system called Orbital Co-pilot to automate legal due diligence for property transactions. The system processes hundreds of pages of legal documents to extract key information traditionally done manually by lawyers. Over 18 months, they scaled from zero to processing 20 billion tokens monthly and achieved multiple seven figures in annual recurring revenue. The presentation focuses on their concept of "prompt tax" - the hidden costs and complexities of continuously upgrading AI models in production, including prompt migration, regression risks, and the operational challenges of shipping at the AI frontier.

Scaling AI Agents to Production: A Blueprint for Autonomous Customer Service

Cox Automotive

Cox Automotive, a dominant player in the automotive software industry with visibility into 5.1 trillion vehicle insights, faced the challenge of moving AI agents from prototype to production at scale. In response to an aggressive 5-week deadline set in summer 2024, the company launched five agentic AI products using Amazon Bedrock Agent Core and the Strands framework. The flagship product was a fully automated virtual assistant for dealership customer conversations that operates autonomously after hours without human oversight. By establishing foundational infrastructure with Agent Core, implementing comprehensive red teaming practices, designing both hard and soft guardrails, automating evaluation with LLM-as-judge techniques, and setting circuit breakers for cost and conversation limits, Cox Automotive successfully deployed three products to production beta, with dealers reporting that customers receive timely responses both during business hours and after hours.

Scaling AI Image Animation System with Optimized Latency and Traffic Management

Meta

Meta developed and deployed an AI-powered image animation feature that needed to serve billions of users efficiently. They tackled this challenge through a comprehensive optimization strategy including floating-point precision reduction, temporal-attention improvements, DPM-Solver implementation, and innovative distillation techniques. The system was further enhanced with sophisticated traffic management and load balancing solutions, resulting in a highly efficient, globally scalable service with minimal latency and failure rates.

Scaling AI Infrastructure for Legal AI Applications at Enterprise Scale

Harvey

Harvey, a legal AI platform company, developed a comprehensive AI infrastructure system to handle millions of daily requests across multiple AI models for legal document processing and analysis. The company built a centralized Python library that manages model deployments, implements load balancing, quota management, and real-time monitoring to ensure reliability and performance. Their solution includes intelligent model endpoint selection, distributed rate limiting using Redis-backed token bucket algorithms, a proxy service for developer access, and comprehensive observability tools, enabling them to process billions of prompt tokens while maintaining high availability and seamless scaling for their legal AI products.

Scaling AI Infrastructure: From Training to Inference at Meta

Meta

Meta shares their journey in scaling AI infrastructure to support massive LLM training and inference operations. The company faced challenges in scaling from 256 GPUs to over 100,000 GPUs in just two years, with plans to reach over a million GPUs by year-end. They developed solutions for distributed training, efficient inference, and infrastructure optimization, including new approaches to data center design, power management, and GPU resource utilization. Key innovations include the development of a virtual machine service for secure code execution, improvements in distributed inference, and novel approaches to reducing model hallucinations through RAG.

Scaling AI Infrastructure: Managing Data Movement and Placement on Meta's Global Backbone Network

Meta

Meta faced significant challenges when AI workload demands on their global backbone network grew over 100% year-over-year starting in 2022. The case study explores how Meta adapted their infrastructure to handle AI-specific challenges around data replication, placement, and freshness requirements across their network of 25 data centers and 85 points of presence. They implemented solutions including optimizing data placement strategies, improving caching mechanisms, and working across compute, storage, and network teams to "bend the demand curve" while expanding network capacity to meet AI workload needs.

Scaling AI Infrastructure: Network Architecture and Communication Optimization at Microsoft

Meta

Microsoft's AI infrastructure team tackled the challenges of scaling large language models across massive GPU clusters by optimizing network topology, routing, and communication libraries. They developed innovative approaches including rail-optimized cluster designs, smart communication libraries like TAL and MSL, and intelligent validation frameworks like SuperBench, enabling reliable training across hundreds of thousands of GPUs while achieving top rankings in ML performance benchmarks.

Scaling AI Network Infrastructure for Large Language Model Training at 100K+ GPU Scale

Meta

Meta's network engineers Rohit Puri and Henny present the evolution of Meta's AI network infrastructure designed to support large-scale generative AI training, specifically for LLaMA models. The case study covers the journey from a 24K GPU cluster used for LLaMA 3 training to a 100K+ GPU multi-building cluster for LLaMA 4, highlighting the architectural decisions, networking challenges, and operational solutions needed to maintain performance and reliability at unprecedented scale. The presentation details technical challenges including network congestion, priority flow control issues, buffer management, and firmware inconsistencies that emerged during production deployment, along with the engineering solutions implemented to resolve these issues while maintaining model training performance.

Scaling AI Systems for Unstructured Data Processing: Logical Data Models and Embedding Optimization

CoActive AI

CoActive AI addresses the challenge of processing unstructured data at scale through AI systems. They identified two key lessons: the importance of logical data models in bridging the gap between data storage and AI processing, and the strategic use of embeddings for cost-effective AI operations. Their solution involves creating data+AI hybrid teams to resolve impedance mismatches and optimizing embedding computations to reduce redundant processing, ultimately enabling more efficient and scalable AI operations.

Scaling AI-Assisted Coding Infrastructure: From Auto-Complete to Global Deployment

Cursor

Cursor, an AI-assisted coding platform, scaled their infrastructure from handling basic code completion to processing 100 million model calls per day across a global deployment. They faced and overcame significant challenges in database management, model inference scaling, and indexing systems. The case study details their journey through major incidents, including a database crisis that led to a complete infrastructure refactor, and their innovative solutions for handling high-scale AI model inference across multiple providers while maintaining service reliability.

Scaling AI-Assisted Developer Tools and Agentic Workflows at Scale

Slack

Slack's Developer Experience team embarked on a multi-year journey to integrate generative AI into their internal development workflows, moving from experimental prototypes to production-grade AI assistants and agentic systems. Starting with Amazon SageMaker for initial experimentation, they transitioned to Amazon Bedrock for simplified infrastructure management, achieving a 98% cost reduction. The team rolled out AI coding assistants using Anthropic's Claude Code and Cursor integrated with Bedrock, resulting in 99% developer adoption and a 25% increase in pull request throughput. They then evolved their internal knowledge bot (Buddybot) into a sophisticated multi-agent system handling over 5,000 escalation requests monthly, using AWS Strands as an orchestration framework with Claude Code sub-agents, Temporal for workflow durability, and MCP servers for standardized tool access. The implementation demonstrates a pragmatic approach to LLMOps, prioritizing incremental deployment, security compliance (FedRAMP), observability through OpenTelemetry, and maintaining model agnosticism while scaling to millions of tokens per minute.

Scaling AI-Generated Image Animation with Optimized Deployment Strategies

Meta

Meta tackled the challenge of deploying an AI-powered image animation feature at massive scale, requiring optimization of both model performance and infrastructure. Through a combination of model optimizations including halving floating-point precision, improving temporal-attention expansion, and leveraging DPM-Solver, along with sophisticated traffic management and deployment strategies, they successfully deployed a system capable of serving billions of users while maintaining low latency and high reliability.

Scaling AI-Powered Code Generation in Browser and Enterprise Environments

Qodo / Stackblitz

The case study examines two companies' approaches to deploying LLMs for code generation at scale: Stackblitz's Bolt.new achieving over $8M ARR in 2 months with their browser-based development environment, and Qodo's enterprise-focused solution handling complex deployment scenarios across 96 different configurations. Both companies demonstrate different approaches to productionizing LLMs, with Bolt.new focusing on simplified web app development for non-developers and Qodo targeting enterprise testing and code review workflows.

Scaling AI-Powered File Understanding with Efficient Embedding and LLM Architecture

Dropbox

Dropbox implemented AI-powered file understanding capabilities for previews on the web, enabling summarization and Q&A features across multiple file types. They built a scalable architecture using their Riviera framework for text extraction and embeddings, implemented k-means clustering for efficient summarization, and developed an intelligent chunk selection system for Q&A. The system achieved significant improvements with a 93% reduction in cost-per-summary, 64% reduction in cost-per-query, and latency improvements from 115s to 4s for summaries and 25s to 5s for queries.

Scaling an AI-Powered Conversational Shopping Assistant to 250 Million Users

Rufus

Amazon built Rufus, an AI-powered shopping assistant that serves over 250 million customers with conversational shopping experiences. Initially launched using a custom in-house LLM specialized for shopping queries, the team later adopted Amazon Bedrock to accelerate development velocity by 6x, enabling rapid integration of state-of-the-art foundation models including Amazon Nova and Anthropic's Claude Sonnet. This multi-model approach combined with agentic capabilities like tool use, web grounding, and features such as price tracking and auto-buy resulted in monthly user growth of 140% year-over-year, interaction growth of 210%, and a 60% increase in purchase completion rates for customers using Rufus.

Scaling an AI-Powered Search and Research Assistant from Prototype to Production

Perplexity AI

Perplexity AI evolved from an internal tool for answering SQL and enterprise questions to a full-fledged AI-powered search and research assistant. The company iteratively developed their product through various stages - from Slack and Discord bots to a web interface - while tackling challenges in search relevance, model selection, latency optimization, and cost management. They successfully implemented a hybrid approach using fine-tuned GPT models and their own LLaMA-based models, achieving superior performance metrics in both citation accuracy and perceived utility compared to competitors.

Scaling an MCP Server for Error Monitoring to 60 Million Monthly Requests

Sentry

Sentry, an error monitoring platform, built a Model Context Protocol (MCP) server to improve the workflow where developers would copy error details from Sentry's UI and paste them into AI coding assistants like Cursor. The MCP server provides direct integration with 10-15 tools, including retrieving issue details and triggering automated fix attempts through Sentry's AI agent. The implementation scaled from 30 million to 60 million requests per month, with over 5,000 organizations using it. The company learned critical lessons about treating MCP servers as production services, implementing comprehensive observability, managing context pollution, and taking responsibility for agent behavior through careful prompt engineering and tool description design.

Scaling and Operating Large Language Models at the Frontier

Anthropic

This case study examines Anthropic's journey in scaling and operating large language models, focusing on their transition from GPT-3 era training to current state-of-the-art systems like Claude. The company successfully tackled challenges in distributed computing, model safety, and operational reliability while growing 10x in revenue. Key innovations include their approach to constitutional AI, advanced evaluation frameworks, and sophisticated MLOps practices that enable running massive training operations with hundreds of team members.

Scaling and Optimizing Self-Hosted LLMs for Developer Documentation

Various

A tech company needed to improve their developer documentation accessibility and understanding. They implemented a self-hosted LLM solution using retrieval augmented generation (RAG), with guard rails for content safety. The team optimized performance using vLLM for faster inference and Ray Serve for horizontal scaling, achieving significant improvements in latency and throughput while maintaining cost efficiency. The solution helped developers better understand and adopt the company's products while keeping proprietary information secure.

Scaling Audio Content Generation with LLMs and TTS for Language Learning

Duolingo

Duolingo tackled the challenge of scaling their DuoRadio feature, a podcast-like audio learning experience, by implementing an AI-driven content generation pipeline. They transformed a labor-intensive manual process into an automated system using LLMs for script generation and evaluation, coupled with Text-to-Speech technology. This allowed them to expand from 300 to 15,000+ episodes across 25+ language courses in under six months, while reducing costs by 99% and growing daily active users from 100K to 5.5M.

Scaling Chatbot Platform with Hybrid LLM and Custom Model Approach

Voiceflow

Voiceflow, a chatbot and voice assistant platform, integrated large language models into their existing infrastructure while maintaining custom language models for specific tasks. They used OpenAI's API for generative features but kept their custom NLU model for intent/entity detection due to superior performance and cost-effectiveness. The company implemented extensive testing frameworks, prompt engineering, and error handling while dealing with challenges like latency variations and JSON formatting issues.

Scaling Content Production and Fan Engagement with Gen AI

Bundesliga

Bundesliga (DFL), Germany's premier soccer league, deployed multiple Gen AI solutions to address two key challenges: scaling content production for over 1 billion global fans across 200 countries, and enhancing personalized fan engagement to reduce "second screen chaos" during live matches. The organization implemented three main production-scale solutions: automated match report generation that saves editors 90% of their time, AI-powered story creation from existing articles that reduces production time by 80%, and on-demand video localization that cuts processing time by 75% while reducing costs by 3.5x. Additionally, they developed MatchMade, an AI-powered fan companion featuring dynamic text-to-SQL workflows and proactive content nudging. By leveraging Amazon Nova for cost-performance optimization alongside other models like Anthropic's Claude, Bundesliga achieved a 70% cost reduction in image assignment tasks, 35% cost reduction through dynamic routing, and scaled personalized content delivery by 5x per user while serving over 100,000 fans in production.

Scaling Customer Support AI Chatbot to Production with Multiple LLM Providers

Intercom

Intercom developed Fin, an AI customer support chatbot that resolves up to 86% of conversations instantly. They faced challenges scaling from proof-of-concept to production, particularly around reliability and cost management. The team successfully improved their system from 99% to 99.9%+ reliability by implementing cross-region inference, strategic use of streaming, and multiple model fallbacks while using Amazon Bedrock and other LLM providers. The solution has processed over 13 million conversations for 4,000+ customers with most achieving over 50% automated resolution rates.

Scaling Customer Support, Compliance, and Developer Productivity with Gen AI

Coinbase

Coinbase, a cryptocurrency exchange serving millions of users across 100+ countries, faced challenges scaling customer support amid volatile market conditions, managing complex compliance investigations, and improving developer productivity. They built a comprehensive Gen AI platform integrating multiple LLMs through standardized interfaces (OpenAI API, Model Context Protocol) on AWS Bedrock to address these challenges. Their solution includes AI-powered chatbots handling 65% of customer contacts automatically (saving ~5 million employee hours annually), compliance investigation tools that synthesize data from multiple sources to accelerate case resolution, and developer productivity tools where 40% of daily code is now AI-generated or influenced. The implementation uses a multi-layered agentic architecture with RAG, guardrails, memory systems, and human-in-the-loop workflows, resulting in significant cost savings, faster resolution times, and improved quality across all three domains.

Scaling Email Content Extraction Using LLMs in Production

Yahoo

Yahoo Mail faced challenges with their existing ML-based email content extraction system, hitting a coverage ceiling of 80% for major senders while struggling with long-tail senders and slow time-to-market for model updates. They implemented a new solution using Google Cloud's Vertex AI and LLMs, achieving 94% coverage for standard domains and 99% for tail domains, with 51% increase in extraction richness and 16% reduction in tracking API errors. The implementation required careful consideration of hybrid infrastructure, cost management, and privacy compliance while processing billions of daily messages.

Scaling Enterprise RAG with Advanced Vector Search Migration

Danswer

Danswer, an enterprise search solution, migrated their core search infrastructure to Vespa to overcome limitations in their previous vector database setup. The migration enabled them to better handle team-specific terminology, implement custom boost and decay functions, and support multiple vector embeddings per document while maintaining performance at scale. The solution improved search accuracy and resource efficiency for their RAG-based enterprise search product.

Scaling Finance Operations with Agentic AI in a High-Growth EV Manufacturer

Lucid Motors

Lucid Motors, a software-defined electric vehicle manufacturer, partnered with PWC and AWS to implement agentic AI solutions across their finance organization to prepare for massive growth with the launch of their mid-size vehicle platform. The company developed 14 proof-of-concept use cases in just 10 weeks, spanning demand forecasting, investor analytics, treasury, accounting, and internal audit functions. By leveraging AWS Bedrock and PWC's Agent OS orchestration layer, along with access to diverse data sources across SAP, Redshift, and Salesforce, Lucid is transforming finance from a traditional reporting function into a strategic competitive advantage that provides real-time predictive analytics and enables data-driven decision making at sapphire speed.

Scaling Financial Software with GenAI and Production ML

Ramp

Ramp, a financial technology company, has integrated AI and ML throughout their operations, from their core financial products to their sales and customer service. They evolved from traditional ML use cases like fraud detection and underwriting to more advanced generative AI applications. Their Ramp Intelligence suite now includes features like automated price comparison, expense categorization, and an experimental AI agent that can guide users through the platform's interface. The company has achieved significant productivity gains, with their sales development representatives booking 3-4x more meetings than competitors through AI augmentation.

Scaling GenAI Applications with vLLM for High-Throughput LLM Serving

LinkedIn

LinkedIn adopted vLLM, an open-source LLM inference framework, to power over 50 GenAI use cases including LinkedIn Hiring Assistant and AI Job Search, running on thousands of hosts across their platform. The company faced challenges in deploying LLMs at scale with low latency and high throughput requirements, particularly for applications requiring complex reasoning and structured outputs. By leveraging vLLM's PagedAttention technology and implementing a five-phase evolution strategy—from offline mode to a modular, OpenAI-compatible architecture—LinkedIn achieved significant performance improvements including ~10% TPS gains and GPU savings of over 60 units for certain workloads, while maintaining sub-600ms p95 latency for thousands of QPS in production applications.

Scaling Generative AI Features to Millions of Users with Infrastructure Optimization and Quality Evaluation

Slack

Slack faced significant challenges in scaling their generative AI features (Slack AI) to millions of daily active users while maintaining security, cost efficiency, and quality. The company needed to move from a limited, provisioned infrastructure to a more flexible system that could handle massive scale (1-5 billion messages weekly) while meeting strict compliance requirements. By migrating from SageMaker to Amazon Bedrock and implementing sophisticated experimentation frameworks with LLM judges and automated metrics, Slack achieved over 90% reduction in infrastructure costs (exceeding $20 million in savings), 90% reduction in cost-to-serve per monthly active user, 5x increase in scale, and 15-30% improvements in user satisfaction across features—all while maintaining quality and enabling experimentation with over 15 different LLMs in production.

Scaling Generative AI for Manufacturing Operations with RAG and Multi-Model Architecture

Georgia-Pacific

Georgia-Pacific, a forest products manufacturing company with 30,000+ employees and 140+ facilities, deployed generative AI to address critical knowledge transfer challenges as experienced workers retire and new employees struggle with complex equipment. The company developed an "Operator Assistant" chatbot using AWS Bedrock, RAG architecture, and vector databases to provide real-time troubleshooting guidance to factory operators. Starting with a 6-8 week MVP deployment in December 2023, they scaled to 45 use cases across multiple facilities within 7-8 months, serving 500+ users daily with improved operational efficiency and reduced waste.

Scaling Generative AI in Gaming: From Safety to Creation Tools

Roblox

Roblox has implemented a comprehensive suite of generative AI features across their gaming platform, addressing challenges in content moderation, code assistance, and creative tools. Starting with safety features using transformer models for text and voice moderation, they expanded to developer tools including AI code assistance, material generation, and specialized texture creation. The company releases new AI features weekly, emphasizing rapid iteration and public testing, while maintaining a balance between automation and creator control. Their approach combines proprietary solutions with open-source contributions, demonstrating successful large-scale deployment of AI in a production gaming environment serving 70 million daily active users.

Scaling Image Generation to 100M New Users in One Week

OpenAI

OpenAI's launch of ChatGPT Images faced unprecedented scale, attracting 100 million new users generating 700 million images in the first week. The engineering team had to rapidly adapt their synchronous image generation system to an asynchronous one while handling production load, implementing system isolation, and managing resource constraints. Despite the massive scale and technical challenges, they maintained service availability by prioritizing access over latency and successfully scaled their infrastructure.

Scaling LLM and ML Models to 300M Monthly Requests with Self-Hosting

StoryGraph

StoryGraph, a book recommendation platform, successfully scaled their AI/ML infrastructure to handle 300M monthly requests by transitioning from cloud services to self-hosted solutions. The company implemented multiple custom ML models, including book recommendations, similar users, and a large language model, while maintaining data privacy and reducing costs significantly compared to using cloud APIs. Through innovative self-hosting approaches and careful infrastructure optimization, they managed to scale their operations despite being a small team, though not without facing significant challenges during high-traffic periods.

Scaling LLM Inference Infrastructure at Meta: From Model Runner to Production Platform

Meta

Meta's AI infrastructure team developed a comprehensive LLM serving platform to support Meta AI, smart glasses, and internal ML workflows including RLHF processing hundreds of millions of examples. The team addressed the fundamental challenges of LLM inference through a four-stage approach: building efficient model runners with continuous batching and KV caching, optimizing hardware utilization through distributed inference techniques like tensor and pipeline parallelism, implementing production-grade features including disaggregated prefill/decode services and hierarchical caching systems, and scaling to handle multiple deployments with sophisticated allocation and cost optimization. The solution demonstrates the complexity of productionizing LLMs, requiring deep integration across modeling, systems, and product teams to achieve acceptable latency and cost efficiency at scale.

Scaling LLM Inference to Serve 400M+ Monthly Search Queries

Perplexity

Perplexity AI scaled their LLM-powered search engine to handle over 435 million queries monthly by implementing a sophisticated inference architecture using NVIDIA H100 GPUs, Triton Inference Server, and TensorRT-LLM. Their solution involved serving 20+ AI models simultaneously, implementing intelligent load balancing, and using tensor parallelism across GPU pods. This resulted in significant cost savings - approximately $1 million annually compared to using third-party LLM APIs - while maintaining strict service-level agreements for latency and performance.

Scaling LLM Infrastructure: Building and Operating 24K GPU Clusters for LLaMA Training

Meta

Meta faced the challenge of scaling their AI infrastructure from training smaller recommendation models to massive LLM training jobs like LLaMA 3. They built two 24K GPU clusters (one with RoCE, another with InfiniBand) to handle the unprecedented scale of computation required for training models with thousands of GPUs running for months. Through full-stack optimizations across hardware, networking, and software layers, they achieved 95% training efficiency for the LLaMA 3 70B model, while dealing with challenges in hardware reliability, thermal management, network topology, and collective communication operations.

Scaling LLM Training and Inference with FP8 Precision

DeepL

DeepL needed to scale their Language AI capabilities while maintaining low latency for production inference and handling increasing request volumes. The company transitioned from BFloat16 (BF16) to 8-bit floating point (FP8) precision for both training and inference of their large language models, leveraging NVIDIA H100 GPUs' native FP8 support through Transformer Engine for training and TensorRT-LLM for inference. This approach accelerated model training by 50% (achieving 67% Model FLOPS utilization), enabled training of larger models with more parameters, doubled inference throughput at equivalent latency levels, and delivered translation quality improvements of 1.4x for European languages and 1.7x for complex language pairs like English-Japanese, all while maintaining comparable training quality to BF16 precision.

Scaling LLM-Based Ranking Systems with Prefill-Only Optimization

LinkedIn

LinkedIn faced significant performance challenges when deploying LLM-based ranking systems for AI Job Search and AI People Search, where models needed to score hundreds of items per query within strict latency SLAs (sub-500ms P99). The ranking workload differs fundamentally from text generation—it requires only the prefill phase to score candidates, not iterative token generation. LinkedIn optimized SGLang, an open-source LLM serving system, through four optimization stages: implementing comprehensive batching (tokenization and batch preservation), creating a scoring-only fast path that eliminates unnecessary decode loops and CPU-GPU synchronization, introducing in-batch prefix caching to reuse shared query context, and addressing Python runtime bottlenecks through multi-process architecture. These optimizations delivered 2-3x throughput improvements on H100 GPUs while maintaining P99 latency under 500ms, enabling production-scale LLM ranking for millions of members.

Scaling LLMs for Product Knowledge and Search in E-commerce

Doordash

Doordash leverages LLMs to enhance their product knowledge graph and search capabilities as they expand into new verticals beyond food delivery. They employ LLM-assisted annotations for attribute extraction, use RAG for generating training data, and implement LLM-based systems for detecting catalog inaccuracies and understanding search intent. The solution includes distributed computing frameworks, model optimization techniques, and careful consideration of latency and throughput requirements for production deployment.

Scaling Meta AI's Feed Deep Dive from Launch to Product-Market Fit

Meta

Meta launched Feed Deep Dive as an AI-powered feature on Facebook in April 2024 to address information-seeking and context enrichment needs when users encounter posts they want to learn more about. The challenge was scaling from launch to product-market fit while maintaining high-quality responses at Meta scale, dealing with LLM hallucinations and refusals, and providing more value than users would get from simply scrolling Facebook Feed. Meta's solution involved evolving from traditional orchestration to agentic models with planning, tool calling, and reflection capabilities; implementing auto-judges for online quality evaluation; using smart caching strategies focused on high-traffic posts; and leveraging ML-based user cohort targeting to show the feature to users who derived the most value. The results included achieving product-market fit through improved quality and engagement, with the team now moving toward monetization and expanded use cases.

Scaling Multimedia Search with Metadata-First Indexing and On-Demand Preview Generation

Dropbox

Dropbox Dash faced the challenge of enabling fast, accurate search across multimedia content (images, videos, audio) that typically lacks meaningful metadata and requires significantly more compute and storage resources than text documents. The team built a scalable multimedia search solution by implementing metadata-first indexing (extracting lightweight features like file paths, titles, and EXIF data), just-in-time preview generation to minimize upfront costs, location-aware query logic with reverse geocoding, and intelligent caching strategies. This infrastructure leveraged Dropbox's existing Riviera compute framework and preview services, enabling parallel processing and reducing latency while balancing cost with user value. The result is a system that makes visual content as searchable as text documents within the Dash universal search product.

Scaling Network Infrastructure to Support AI Workload Growth at Hyperscale

Meta

Meta's network engineering team faced an unprecedented challenge when AI workload demands required accelerating their backbone network scaling plans from 2028 to 2024-2025, necessitating a 10x capacity increase. They addressed this through three key techniques: pre-building scalable data center metro architectures with ring topologies, platform scaling through both vendor-dependent improvements (larger chassis, faster interfaces) and internal innovations (adding backbone planes, multiple devices per plane), and IP-optical integration using coherent transceiver technology that reduced power consumption by 80-90% while dramatically improving space efficiency. Additionally, they developed specialized AI backbone solutions for connecting geographically distributed clusters within 3-100km ranges using different fiber and optical technologies based on distance requirements.

Scaling Open-Ended Customer Service Analysis with Foundation Models

MaestroQA

MaestroQA enhanced their customer service quality assurance platform by integrating Amazon Bedrock to analyze millions of customer interactions at scale. They implemented a solution that allows customers to ask open-ended questions about their service interactions, enabling sophisticated analysis beyond traditional keyword-based approaches. The system successfully processes high volumes of transcripts across multiple regions while maintaining low latency, leading to improved compliance detection and customer sentiment analysis for their clients across various industries.

Scaling Parallel Agent Operations with LangChain and LangSmith Monitoring

Paradigm

Paradigm (YC24) built an AI-powered spreadsheet platform that runs thousands of parallel agents for data processing tasks. They utilized LangChain for rapid agent development and iteration, while leveraging LangSmith for comprehensive monitoring, operational insights, and usage-based pricing optimization. This enabled them to build task-specific agents for schema generation, sheet naming, task planning, and contact lookup while maintaining high performance and cost efficiency.

Scaling Privacy Infrastructure for GenAI Product Innovation

Meta

Meta addresses the challenge of maintaining user privacy while deploying GenAI-powered products at scale, using their AI glasses as a primary example. The company developed Privacy Aware Infrastructure (PAI), which integrates data lineage tracking, automated policy enforcement, and comprehensive observability across their entire technology stack. This infrastructure automatically tracks how user data flows through systems—from initial collection through sensor inputs, web processing, LLM inference calls, data warehousing, to model training—enabling Meta to enforce privacy controls programmatically while accelerating product development. The solution allows engineering teams to innovate rapidly with GenAI capabilities while maintaining auditable, verifiable privacy guarantees across thousands of microservices and products globally.

Scaling Product Categorization from Manual Tagging to LLM-Based Classification

GetYourGuide

GetYourGuide, a global marketplace for travel experiences, evolved their product categorization system from manual tagging to an LLM-based solution to handle 250,000 products across 600 categories. The company progressed through rule-based systems and semantic NLP models before settling on a hybrid approach using OpenAI's GPT-4-mini with structured outputs, combined with embedding-based ranking and batch processing with early stopping. This solution processes one product-category pair at a time, incorporating reasoning and confidence fields to improve decision quality. The implementation resulted in significant improvements: Matthew's Correlation Coefficient increased substantially, 50 previously excluded categories were reintroduced, 295 new categories were enabled, and A/B testing showed a 1.3% increase in conversion rate, improved quote rate, and reduced bounce rate.

Scaling Product Categorization with Batch Inference and Prompt Engineering

GoDaddy

GoDaddy sought to improve their product categorization system that was using Meta Llama 2 for generating categories for 6 million products but faced issues with incomplete/mislabeled categories and high costs. They implemented a new solution using Amazon Bedrock's batch inference capabilities with Claude and Llama 2 models, achieving 97% category coverage (exceeding their 90% target), 80% faster processing time, and 8% cost reduction while maintaining high quality categorization as verified by subject matter experts.

Scaling Recommender Systems with Vector Database Infrastructure

Farfetch

Farfetch implemented a scalable recommender system using Vespa as a vector database to serve real-time personalized recommendations across multiple online retailers. The system processes user-product interactions and features through matrix operations to generate recommendations, achieving sub-100ms latency requirements while maintaining scalability. The solution cleverly handles sparse matrices and shape mismatching challenges through optimized data storage and computation strategies.

Scaling Search Query Understanding with LLMs: From POC to Production

Yelp

Yelp implemented LLMs to enhance their search query understanding capabilities, focusing on query segmentation and review highlights. They followed a systematic approach from ideation to production, using a combination of GPT-4 for initial development, creating fine-tuned smaller models for scale, and implementing caching strategies for head queries. The solution successfully improved search relevance and user engagement, while managing costs and latency through careful architectural decisions and gradual rollout strategies.

Scaling Self-Hosted LLMs with GPU Optimization and Load Testing

Fuzzy Labs

Fuzzy Labs helped a tech company improve their developer documentation and tooling experience by implementing a self-hosted LLM system using Mistral-7B. They tackled performance challenges through systematic load testing with Locust, optimized inference latency using vLLM's paged attention, and achieved horizontal scaling with Ray Serve. The solution improved response times from 11 seconds to 3 seconds and enabled handling of concurrent users while efficiently managing GPU resources.

Scaling Trust and Safety Using LLMs at Tinder

Tinder

Tinder implemented a comprehensive LLM-based trust and safety system to combat various forms of harmful content at scale. The solution involves fine-tuning open-source LLMs using LoRA (Low-Rank Adaptation) for different types of violation detection, from spam to hate speech. Using the Lorax framework, they can efficiently serve multiple fine-tuned models on a single GPU, achieving real-time inference with high precision and recall while maintaining cost-effectiveness. The system demonstrates superior generalization capabilities against adversarial behavior compared to traditional ML approaches.

Scaling Vector Search Infrastructure for AI-Powered Workspace Search

Notion

Notion scaled their vector search infrastructure supporting Notion AI Q&A from launch in November 2023 through early 2026, achieving a 10x increase in capacity while reducing costs by 90%. The problem involved onboarding millions of workspaces to their AI-powered semantic search feature while managing rapidly growing infrastructure costs. Their solution involved migrating from dedicated pod-based vector databases to serverless architectures, switching to turbopuffer as their vector database provider, implementing intelligent page state caching to avoid redundant embeddings, and transitioning to Ray on Anyscale for both embeddings generation and serving. The results included clearing a multi-million workspace waitlist, reducing vector database costs by 60%, cutting embeddings infrastructure costs by over 90%, and improving query latency from 70-100ms to 50-70ms while supporting 15x growth in active workspaces.

Scaling Vector Search: Multi-Tier Storage and GPU Acceleration for Production Vector Databases

Zilliz

Zilliz, the company behind the open-source Milvus vector database, shares their approach to scaling vector search to handle billions of vectors. They employ a multi-tier storage architecture spanning from GPU memory to object storage, enabling flexible trade-offs between performance, cost, and data freshness. The system uses GPU acceleration for both index building and search, implements real-time search through a buffer strategy, and handles distributed consistency challenges at scale.

Scaling Voice AI with GPU-Accelerated Infrastructure

ElevenLabs

ElevenLabs developed a high-performance voice AI platform for voice cloning and multilingual speech synthesis, leveraging Google Cloud's GKE and NVIDIA GPUs for scalable deployment. They implemented GPU optimization strategies including multi-instance GPUs and time-sharing to improve utilization and reduce costs, while successfully serving 600 hours of generated audio for every hour of real time across 29 languages.

Self-Hosting DeepSeek-R1 Models on AWS: A Cost-Benefit Analysis

LiftOff

LiftOff LLC explored deploying open-source DeepSeek-R1 models (1.5B, 7B, 8B, 16B parameters) on AWS EC2 GPU instances to evaluate their viability as alternatives to paid AI services like ChatGPT. While technically successful in deployment using Docker, Ollama, and OpenWeb UI, the operational costs significantly exceeded expectations, with a single g5g.2xlarge instance costing $414/month compared to ChatGPT Plus at $20/user/month. The experiment revealed that smaller models lacked production-quality responses, while larger models faced memory limitations, performance degradation with longer contexts, and stability issues, concluding that self-hosting isn't cost-effective at startup scale.

Semantic Caching for E-commerce Search Optimization

Walmart

Walmart implemented semantic caching to enhance their e-commerce search functionality, moving beyond traditional exact-match caching to understand query intent and meaning. The system achieved unexpectedly high cache hit rates of around 50% for tail queries (compared to anticipated 10-20%), while handling the challenges of latency and cost optimization in a production environment. The solution enables more relevant product recommendations and improves the overall customer search experience.

Semantic Data Processing at Scale with AI-Powered Query Optimization

DocETL

Shreyaa Shankar presents DocETL, an open-source system for semantic data processing that addresses the challenges of running LLM-powered operators at scale over unstructured data. The system tackles two major problems: how to make semantic operator pipelines scalable and cost-effective through novel query optimization techniques, and how to make them steerable through specialized user interfaces. DocETL introduces rewrite directives that decompose complex tasks and data to improve accuracy and reduce costs, achieving up to 86% cost reduction while maintaining target accuracy. The companion tool Doc Wrangler provides an interactive interface for iteratively authoring and debugging these pipelines. Real-world applications include public defenders analyzing court transcripts for racial bias and medical analysts extracting information from doctor-patient conversations, demonstrating significant accuracy improvements (2x in some cases) compared to baseline approaches.

Semantic Relevance Evaluation and Enhancement Framework for E-commerce Search

Etsy

Etsy's Search Relevance team developed a comprehensive Semantic Relevance Evaluation and Enhancement Framework to address the limitations of engagement-based search models that favored popular listings over semantically relevant ones. The solution employs a three-tier cascaded distillation approach: starting with human-curated "golden" labels, scaling with an LLM annotator (o3 model) to generate training data, fine-tuning a teacher model (Qwen 3 VL 4B) for efficient large-scale evaluation, and distilling to a lightweight BERT-based student model for real-time production inference. The framework integrates semantic relevance signals into search through filtering, feature enrichment, loss weighting, and relevance boosting. Between August and October 2025, the percentage of fully relevant listings increased from 58% to 62%, demonstrating measurable improvements in aligning search results with buyer intent while addressing the cold-start problem for smaller sellers.

Smart Ticket Routing and Support Agent Copilot using LLMs

Adyen

Adyen, a global financial technology platform, implemented LLM-powered solutions to improve their support team's efficiency. They developed a smart ticket routing system and a support agent copilot using LangChain, deployed in a Kubernetes environment. The solution resulted in more accurate ticket routing and faster response times through automated document retrieval and answer suggestions, while maintaining flexibility to switch between different LLM models.

Specialized Language Models for Contact Center Transformation

Accenture

Accenture partnered with Databricks to transform a client's customer contact center by implementing specialized language models (SLMs) that go beyond simple prompt engineering. The client faced challenges with high call volumes, impersonal service, and missed revenue opportunities. Using Databricks' MLOps platform and GPU infrastructure, they developed and deployed fine-tuned language models that understand industry-specific context, cultural nuances, and brand styles, resulting in improved customer experience and operational efficiency. The solution includes real-time monitoring and multimodal capabilities, setting a new standard for AI-driven customer service operations.

Specialized Text Editing LLM Development through Instruction Tuning

Grammarly

Grammarly developed CoEdIT, a specialized text editing LLM that outperforms larger models while being up to 60 times smaller. Through targeted instruction tuning on a carefully curated dataset of text editing tasks, they created models ranging from 770M to 11B parameters that achieved state-of-the-art performance on multiple editing benchmarks, outperforming models like GPT-3-Edit (175B parameters) and ChatGPT in both automated and human evaluations.

State of Production Machine Learning and LLMOps in 2024

Zalando

A comprehensive overview of the current state and challenges of production machine learning and LLMOps, covering key areas including motivations, industry trends, technological developments, and organizational changes. The presentation highlights the evolution from model-centric to data-centric approaches, the importance of metadata management, and the growing focus on security and monitoring in ML systems.

Strategic Framework for Generative AI Implementation in Food Delivery Platform

Doordash

DoorDash outlines a comprehensive strategy for implementing Generative AI across five key areas: customer assistance, interactive discovery, personalized content generation, information extraction, and employee productivity enhancement. The company aims to revolutionize its delivery platform while maintaining strong considerations for data privacy and security, focusing on practical applications ranging from automated cart building to SQL query generation.

Streamlining Background Check Classification with Fine-tuned Small Language Models

Checkr

Checkr tackled the challenge of classifying complex background check records by implementing a fine-tuned small language model (SLM) solution. They moved from using GPT-4 to fine-tuning Llama-2 models on Predibase, achieving 90% accuracy for their most challenging cases while reducing costs by 5x and improving response times to 0.15 seconds. This solution helped automate their background check adjudication process, particularly for the 2% of complex cases that required classification into 230 distinct categories.

Streamlining Custom LLM Deployment with Serverless Infrastructure

Salesforce

Salesforce's AI platform team faced operational challenges deploying customized large language models (fine-tuned versions of Llama, Qwen, and Mistral) for their Agentforce agentic AI applications. The deployment process was time-consuming, requiring months of optimization for instance families, serving engines, and configurations, while also proving expensive due to GPU capacity reservations for peak usage. By adopting Amazon Bedrock Custom Model Import, Salesforce integrated a unified API for model deployment that minimized infrastructure management while maintaining backward compatibility with existing endpoints. The results included a 30% reduction in deployment time, up to 40% cost savings through pay-per-use pricing, and maintained scalability without sacrificing performance.

Streamlining Legislative Analysis Model Deployment with MLOps

FiscalNote

FiscalNote, facing challenges in deploying and updating their legislative analysis ML models efficiently, transformed their MLOps pipeline using Databricks' MLflow and Model Serving. This shift enabled them to reduce deployment time and increase model deployment frequency by 3x, while improving their ability to provide timely legislative insights to clients through better model management and deployment practices.

Supervised Fine-Tuning for AI-Powered Travel Recommendations

Booking.com

Booking.com built an AI Trip Planner to handle unstructured, natural language queries from travelers seeking personalized recommendations. The challenge was combining LLMs' ability to understand conversational requests with years of structured behavioral data (searches, clicks, bookings). Instead of relying solely on prompt engineering with external APIs, they used supervised fine-tuning on open-source LLMs with parameter-efficient methods. This approach delivered superior recommendation metrics while achieving 3x faster inference compared to prompt-based solutions, while maintaining data privacy and security by keeping all processing internal.

Text-to-SQL AI Agent for Democratizing Data Access in Slack

Salesforce

Salesforce built Horizon Agent, an internal text-to-SQL Slack agent, to address a data access gap where engineers and data scientists spent dozens of hours weekly writing custom SQL queries for non-technical users. The solution combines Large Language Models with Retrieval-Augmented Generation (RAG) to allow users to ask natural language questions in Slack and receive SQL queries, answers, and explanations within seconds. After launching in Early Access in August 2024 and reaching General Availability in January 2025, the system freed technologists from routine query work and enabled non-technical users to self-serve data insights in minutes instead of waiting hours or days, transforming the role of technical staff from data gatekeepers to guides.

The Hidden Complexities of Building Production LLM Features: Lessons from Honeycomb's Query Assistant

Honeycomb

Honeycomb shares candid insights from building Query Assistant, their natural language to query interface, revealing the complex reality behind LLM-powered product development. Key challenges included managing context window limitations with large schemas, dealing with LLM latency (2-15+ seconds per query), navigating prompt engineering without established best practices, balancing correctness with usefulness, addressing prompt injection vulnerabilities, and handling legal/compliance requirements. The article emphasizes that successful LLM implementation requires treating models as feature engines rather than standalone products, and argues that early access programs often fail to reveal real-world implementation challenges.

Tool Masking for Enterprise Agentic AI Systems at Scale

Databook

Databook, which automates sales processes for large tech companies like Microsoft, Salesforce, and AWS, faced challenges running reliable agentic AI workflows at enterprise scale. The primary problem was that connecting services through Model Context Protocol (MCP) exposed entire APIs to LLMs, polluting execution with irrelevant data, increasing tokens and costs, and reducing reliability through "choice entropy." Their solution involved implementing "tool masks"—a configuration layer between agents and tool handlers that filters and reshapes input/output schemas, customizes tool interfaces per agent context, and enables prompt engineering of tools themselves. This approach resulted in cleaner, faster, more reliable agents with reduced costs, better self-correction capabilities, and the ability to rapidly adapt to customer requirements without code deployments.

Training a 70B Japanese Large Language Model with Amazon SageMaker HyperPod

Institute of Science Tokyo

The Institute of Science Tokyo successfully developed Llama 3.3 Swallow, a 70-billion-parameter large language model with enhanced Japanese capabilities, using Amazon SageMaker HyperPod infrastructure. The project involved continual pre-training from Meta's Llama 3.3 70B model using 314 billion tokens of primarily Japanese training data over 16 days across 256 H100 GPUs. The resulting model demonstrates superior performance compared to GPT-4o-mini and other leading models on Japanese language benchmarks, showcasing effective distributed training techniques including 4D parallelism, asynchronous checkpointing, and comprehensive monitoring systems that enabled efficient large-scale model training in production.

Training and Deploying AI Coding Agents at Scale with GPT-5 Codex

OpenAI

OpenAI's Bill and Brian discuss their work on GPT-5 Codex and Codex Max, AI coding agents designed for production use. The team focused on training models with specific "personalities" optimized for pair programming, including traits like communication, planning, and self-checking behaviors. They trained separate model lines: Codex models optimized specifically for their agent harness with strong opinions about tool use (particularly terminal tools), and mainline GPT-5 models that are more general and steerable across different tooling environments. The result is a coding agent that OpenAI employees trust for production work, with approximately 50% of OpenAI staff using it daily, and some engineers like Brian claiming they haven't written code by hand in months. The team emphasizes the shift toward shipping complete agents rather than just models, with abstractions moving upward to enable developers to build on top of pre-configured agentic systems.

Training and Deploying Compliant Multilingual Foundation Models

Dynamo

Dynamo, an AI company focused on secure and compliant AI solutions, developed an 8-billion parameter multilingual LLM using Databricks Mosaic AI Training platform. They successfully trained the model in just 10 days, achieving a 20% speedup in training compared to competitors. The model was designed to support enterprise-grade AI systems with built-in security guardrails, compliance checks, and multilingual capabilities for various industry applications.

Training and Deploying GPT-4.5: Scaling Challenges and System Design at the Frontier

OpenAI

OpenAI's development and training of GPT-4.5 represents a significant milestone in large-scale LLM deployment, featuring a two-year development cycle and unprecedented infrastructure scaling challenges. The team aimed to create a model 10x smarter than GPT-4, requiring intensive collaboration between ML and systems teams, sophisticated planning, and novel solutions to handle training across massive GPU clusters. The project succeeded in achieving its goals while revealing important insights about data efficiency, system design, and the relationship between model scale and intelligence.

Training and Deploying MPT: Lessons Learned in Large Scale LLM Development

MosaicML

MosaicML developed and open-sourced MPT, a family of large language models including 7B and 30B parameter versions, demonstrating that high-quality LLMs could be trained for significantly lower costs than commonly believed (under $250,000 for 7B model). They built a complete training platform handling data processing, distributed training, and model deployment at scale, while documenting key lessons around planning, experimentation, data quality, and operational best practices for production LLM development.

Transforming a Voice Assistant from Scripted Commands to Generative AI Conversation at Scale

AWS (Alexa)

AWS (Alexa) faced the challenge of evolving their voice assistant from scripted, command-based interactions to natural, generative AI-powered conversations while serving over 600 million devices and maintaining complete backward compatibility with existing integrations. The team completely rearchitected Alexa using large language models (LLMs) to create Alexa Plus, which supports conversational interactions, complex multi-step planning, and real-world action execution. Through extensive experimentation with prompt engineering, multi-model architectures, speculative execution, prompt caching, API refactoring, and fine-tuning, they achieved the necessary balance between accuracy, latency (sub-2-second responses), determinism, and model flexibility required for a production voice assistant serving hundreds of millions of users daily.

Transforming Agent and Customer Experience with Generative AI in Health Insurance

nib

nib, an Australian health insurance provider covering approximately 2 million people, transformed both customer and agent experiences using AWS generative AI capabilities. The company faced challenges around contact center efficiency, agent onboarding time, and customer service scalability. Their solution involved deploying a conversational AI chatbot called "Nibby" built on Amazon Lex, implementing call summarization using large language models to reduce after-call work, creating an internal knowledge-based GPT application for agents, and developing intelligent document processing for claims. These initiatives resulted in approximately 60% chat deflection, $22 million in savings from Nibby alone, and a reported 50% reduction in after-call work time through automated call summaries, while significantly improving agent onboarding and overall customer experience.

Two-Stage Fine-Tuning of Language Models for Hyperlocal Food Search

Swiggy

Swiggy, a major food delivery platform in India, implemented a novel two-stage fine-tuning approach for language models to improve search relevance in their hyperlocal food delivery service. They first performed unsupervised fine-tuning using historical search queries and order data, followed by supervised fine-tuning with manually curated query-item pairs. The solution leverages TSDAE and Multiple Negatives Ranking Loss approaches, achieving superior search relevance metrics compared to baseline models while meeting strict latency requirements of 100ms.

UI/UX Design Considerations for Production GenAI Chatbots

Elastic

Elastic's Field Engineering team developed a customer support chatbot, focusing on crucial UI/UX design considerations for production deployment. The case study details how they tackled challenges including streaming response handling, timeout management, context awareness, and user engagement through carefully designed animations. The team created a custom chat interface using their EUI component library, implementing innovative solutions for handling long-running LLM requests and managing multiple types of contextual information in a user-friendly way.

Unified Healthcare Data Platform with LLMOps Integration

Doctolib

Doctolib is transforming their healthcare data platform from a reporting-focused system to an AI-enabled unified platform. The company is implementing a comprehensive LLMOps infrastructure as part of their new architecture, including features for model training, inference, and GenAI assistance for data exploration. The platform aims to support both traditional analytics and advanced AI capabilities while ensuring security, governance, and scalability for healthcare data.

User Journey Identification Using LLMs for Personalized Recommendations

Pinterest

Pinterest sought to evolve from a simple content recommendation platform to an inspiration-to-realization platform by understanding users' underlying, long-term goals through identifying "user journeys" - sequences of interactions centered on particular interests and intents. To address the challenge of limited training data, Pinterest built a hybrid system that dynamically extracts keywords from user activities, performs hierarchical clustering to identify journey candidates, and then applies specialized models for journey ranking, stage prediction, naming, and expansion. The team leveraged pretrained foundation models and increasingly incorporated LLMs for tasks like journey naming, expansion, and relevance evaluation. Initial experiments with journey-aware notifications demonstrated substantial improvements, including an 88% higher email click rate and 32% higher push open rate compared to interest-based notifications, along with a 23% increase in positive user feedback.

Using Evaluation Systems and Inference-Time Scaling for Beautiful, Scannable QR Code Generation

Modal

Modal's engineering team tackled the challenge of generating aesthetically pleasing QR codes that consistently scan by implementing comprehensive evaluation systems and inference-time compute scaling. The team developed automated evaluation pipelines that measured both scan rate and aesthetic quality, using human judgment alignment to validate their metrics. They applied inference-time compute scaling by generating multiple QR codes in parallel and selecting the best candidates, achieving a 95% scan rate service-level objective while maintaining aesthetic quality and returning results in under 20 seconds.

Using LLMs to Combat Health Insurance Claim Denials

Fight Health Insurance

Fight Health Insurance is an open-source project that uses fine-tuned large language models to help people appeal denied health insurance claims in the United States. The system processes denial letters, extracts relevant information, and generates appeal letters based on training data from independent medical review boards. The project addresses the widespread problem of insurance claim denials by automating the complex and time-consuming process of crafting effective appeals, making it accessible to individuals who lack the resources or knowledge to navigate the appeals process themselves. The tool is available both as an open-source Python package and as a free hosted service, though the sustainability model is still being developed.

Using LLMs to Enhance Search Discovery and Recommendations

Instacart

Instacart integrated LLMs into their search stack to enhance product discovery and user engagement. They developed two content generation techniques: a basic approach using LLM prompting and an advanced approach incorporating domain-specific knowledge from query understanding models and historical data. The system generates complementary and substitute product recommendations, with content generated offline and served through a sophisticated pipeline. The implementation resulted in significant improvements in user engagement and revenue, while addressing challenges in content quality, ranking, and evaluation.

Variable Aggression Code Autocomplete with Fine-Tuned LLMs

Windsurf

Windsurf developed Tab v2, an AI-powered code autocomplete system that addresses the challenge of balancing prediction frequency, accuracy, and code length in developer tooling. The team reimagined their LLM-based autocomplete by focusing on total keystrokes saved rather than just acceptance rate, implementing extensive context engineering to reduce prompt length by 76%, and using reinforcement learning to train models with different "aggression" levels. The result was a 54% average increase in characters per prediction and 25-75% more accepted code, with user-selectable aggression parameters allowing developers to customize behavior based on personal preferences.

Video Super-Resolution at Scale for Ads and Generative AI Content

Meta

Meta's Media Foundation team deployed AI-powered video super-resolution (VSR) models at massive scale to enhance video quality across their ecosystem, processing over 1 billion daily video uploads. The problem addressed was the prevalence of low-quality videos from poor camera quality, cross-platform uploads, and legacy content that degraded user experience. The solution involved deploying multiple VSR models—both CPU-based (using Intel's RVSR SDK) and GPU-based—to upscale and enhance video quality for ads and generative AI features like Meta Restyle. Through extensive subjective evaluation with thousands of human raters, Meta identified effective quality metrics (VMAF-UQ), determined which videos would benefit most from VSR, and successfully deployed the technology while managing GPU resource constraints and ensuring quality improvements aligned with user preferences.

Vision Language Models for Large-Scale Product Classification and Understanding

Shopify

Shopify evolved their product classification system from basic categorization to an advanced AI-driven framework using Vision Language Models (VLMs) integrated with a comprehensive product taxonomy. The system processes over 30 million predictions daily, combining VLMs with structured taxonomy to provide accurate product categorization, attribute extraction, and metadata generation. This has resulted in an 85% merchant acceptance rate of predicted categories and doubled the hierarchical precision and recall compared to previous approaches.

Voice AI Agent Development and Production Challenges

Various (Canonical, Prosus, DeepMind)

Panel discussion with experts from various companies exploring the challenges and solutions in deploying voice AI agents in production. The discussion covers key aspects of voice AI development including real-time response handling, emotional intelligence, cultural adaptation, and user retention. Experts shared experiences from e-commerce, healthcare, and tech sectors, highlighting the importance of proper testing, prompt engineering, and understanding user interaction patterns for successful voice AI deployments.