ZenML

LLMOps Tag: error_handling

764 tools with this tag

← Back to LLMOps Database

Common industries

View all industries →

Abstractive Conversation Summarization for Google Chat Spaces

Google

Google deployed an abstractive summarization system to automatically generate conversation summaries in Google Chat Spaces to address information overload from unread messages, particularly in hybrid work environments. The solution leveraged the Pegasus transformer model fine-tuned on a custom ForumSum dataset of forum conversations, then distilled into a hybrid transformer-encoder/RNN-decoder architecture for lower latency. The system surfaces summaries through cards when users enter Spaces with unread messages, with quality controls including heuristics for triggering, detection of low-quality summaries, and ephemeral caching of pre-generated summaries to reduce latency, ultimately delivering production value to premium Google Workspace business customers.

Accelerating Drug Development with AI-Powered Clinical Trial Transformation

Novartis

Novartis partnered with AWS Professional Services and Accenture to modernize their drug development infrastructure and integrate AI across clinical trials with the ambitious goal of reducing trial development cycles by at least six months. The initiative involved building a next-generation GXP-compliant data platform on AWS that consolidates fragmented data from multiple domains, implements data mesh architecture with self-service capabilities, and enables AI use cases including protocol generation and an intelligent decision system (digital twin). Early results from the patient safety domain showed 72% query speed improvements, 60% storage cost reduction, and 160+ hours of manual work eliminated. The protocol generation use case achieved 83-87% acceleration in producing compliant protocols, demonstrating significant progress toward their goal of bringing life-saving medicines to patients faster.

Advanced Agent Monitoring and Debugging with LangSmith Integration

Replit

Replit integrated LangSmith with their complex agent workflows built on LangGraph to solve critical LLM observability challenges. The implementation addressed three key areas: handling large-scale traces from complex agent interactions, enabling within-trace search capabilities for efficient debugging, and introducing thread view functionality for monitoring human-in-the-loop workflows. These improvements significantly enhanced their ability to debug and optimize their AI agent system while enabling better human-AI collaboration.

Advanced Fine-Tuning Techniques for Multi-Agent Orchestration at Scale

Amazon

Amazon teams faced challenges in deploying high-stakes LLM applications across healthcare, engineering, and e-commerce domains where basic prompt engineering and RAG approaches proved insufficient. Through systematic application of advanced fine-tuning techniques including Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and cutting-edge reasoning optimizations like Group-based Reinforcement Learning from Policy Optimization (GRPO) and Direct Advantage Policy Optimization (DAPO), three Amazon business units achieved production-grade results: Amazon Pharmacy reduced dangerous medication errors by 33%, Amazon Global Engineering Services achieved 80% human effort reduction in inspection reviews, and Amazon A+ Content improved quality assessment accuracy from 77% to 96%. These outcomes demonstrate that approximately one in four high-stakes enterprise applications require advanced fine-tuning beyond standard techniques to achieve necessary performance levels in production environments.

Advanced Prompt Engineering Techniques for Production LLM Applications

Instacart

Instacart shares their experience implementing various prompt engineering techniques to improve LLM performance in production applications. The article details both traditional and novel approaches including Chain of Thought, ReAct, Room for Thought, Monte Carlo brainstorming, Self Correction, Classifying with logit bias, and Puppetry. These techniques were developed and tested while building internal productivity tools like Ava and Ask Instacart, demonstrating practical ways to enhance LLM reliability and output quality in production environments.

Advancing Patient Experience and Business Operations Analytics with Generative AI in Healthcare

Huron

Huron Consulting Group implemented generative AI solutions to transform healthcare analytics across patient experience and business operations. The consulting firm faced challenges with analyzing unstructured data from patient rounding sessions and revenue cycle management notes, which previously required manual review and resulted in delayed interventions due to the 3-4 month lag in traditional HCAHPS survey feedback. Using AWS services including Amazon Bedrock with the Nova LLM model, Redshift, and S3, Huron built sentiment analysis capabilities that automatically process survey responses, staff interactions, and financial operation notes. The solution achieved 90% accuracy in sentiment classification (up from 75% initially) and now processes over 10,000 notes per week automatically, enabling real-time identification of patient dissatisfaction, revenue opportunities, and staff coaching needs that directly impact hospital funding and operational efficiency.

Adversarial Grammatical Error Correction at Scale for Writing Assistance

Grammarly

Grammarly, a leading AI-powered writing assistant, tackled the challenge of improving grammatical error correction (GEC) by moving beyond traditional neural machine translation approaches that optimize n-gram metrics but sometimes produce semantically inconsistent corrections. The team developed a novel generative adversarial network (GAN) framework where a sequence-to-sequence generator produces grammatical corrections, and a sentence-pair discriminator evaluates whether the generated correction is the most appropriate rewrite for the given input sentence. Through adversarial training with policy gradients, the discriminator provides task-specific rewards to the generator, enabling better distributional alignment between generated and human corrections. Experiments showed that adversarially trained models (both RNN-based and transformer-based) consistently outperformed their standard counterparts on GEC benchmarks, striking a better balance between grammatical correctness, semantic preservation, and natural phrasing while serving millions of users in production.

Agent Testing and Evaluation Using Autonomous Vehicle Simulation Principles

Coval

Coval addresses the challenge of testing and evaluating autonomous AI agents by applying lessons learned from self-driving car testing. The company proposes moving away from static, manual testing towards probabilistic evaluation with dynamic scenarios, drawing parallels between autonomous vehicles and AI agents in terms of system architecture, error handling, and reliability requirements. Their solution enables systematic testing of agents through simulation at different layers, measuring performance against human benchmarks, and implementing robust fallback mechanisms.

Agent-Based Workflow Automation in Spreadsheets for Non-Technical Users

Otto

Otto, founded by Suli Omar, addresses the challenge of making AI agents accessible to non-technical users by embedding agent workflows directly into spreadsheet interfaces. The company transforms unstructured data processing tasks into spreadsheet-based workflows where each cell acts as an autonomous agent capable of executing tasks, waiting for dependencies, and outputting structured results. By leveraging the familiar spreadsheet UX instead of traditional chatbot interfaces, Otto enables finance teams, accountants, and other business users to harness agent capabilities without requiring technical expertise. The solution involves sophisticated model selection across three tiers (workhorse, middle-tier, and heavy reasoning models) to optimize cost and performance, continuous evaluation through customer usage patterns, and iterative model testing to maintain service quality as new LLM capabilities emerge.

Agentic AI Architecture for Investment Management Platform

Blackrock

BlackRock implemented Aladdin Copilot, an AI-powered assistant embedded across their proprietary investment management platform that serves over 11 trillion in assets under management. The system uses a supervised agentic architecture built on LangChain and LangGraph, with GPT-4 function calling for orchestration, to help users navigate complex financial workflows and democratize access to investment insights. The solution addresses the challenge of making hundreds of domain-specific APIs accessible through natural language queries while maintaining strict guardrails for responsible AI use in financial services, resulting in increased productivity and more intuitive user experiences across their global client base.

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

Agentic AI for Automated Absence Reporting and Shift Management at Airport Operations

Manchester Airports Group

Manchester Airports Group (MAG) implemented an agentic AI solution to automate unplanned absence reporting and shift management across their three UK airports handling over 1,000 flights daily. The problem involved complex, non-deterministic workflows requiring coordination across multiple systems, with different processes at each airport and high operational costs from overtime payments when staff couldn't make shifts. MAG built a multi-agent system using Amazon Bedrock Agent Core with both text-to-text and speech-to-speech interfaces, allowing employees to report absences conversationally while the system automatically authenticated users, classified absence types, updated HR and rostering systems, and notified relevant managers. The solution achieved 99% consistency in absence reporting (standardizing previously variable processes) and reduced recording time by 90%, with measurable cost reductions in overtime payments and third-party service fees.

Agentic AI for Cloud Migration and Application Modernization at Scale

Commonwealth Bank of Australia

Commonwealth Bank of Australia (CBA) partnered with AWS ProServe to modernize legacy Windows 2012 applications and migrate them to cloud at scale. Facing challenges with time-consuming manual processes, missing documentation, and significant technical debt, CBA developed "Lumos," an internal multi-agent AI platform that orchestrates the entire modernization lifecycleโ€”from application analysis and design through code transformation, testing, deployment, and operations. By integrating AI agents with deterministic engines and AWS services (Bedrock, ECS, OpenSearch, etc.), CBA increased their modernization velocity from 10 applications per year to 20-30 applications per quarter, while maintaining security, compliance, and quality standards through human-in-the-loop validation and multi-agent review processes.

Agentic AI for Legal Research: Building Deep Research in Westlaw and CoCounsel

Thomson Reuters

Thomson Reuters Labs developed Deep Research, an agentic AI system integrated into Westlaw Advantage and CoCounsel that conducts legal research with the sophistication of a practicing attorney. The system addresses the limitation of traditional RAG-based tools by autonomously planning multi-step research strategies, executing searches in parallel, selecting appropriate tools, adapting based on findings, and applying stopping criteria. Deep Research leverages specialized document-type agents, maintains memory across sessions, integrates Westlaw features as modular building blocks, and employs rigorous evaluation frameworks. The system reportedly takes about 10 minutes for comprehensive analyses and includes verification tools with inline citations, KeyCite flags, and highlighted excerpts to enable lawyers to quickly validate AI-generated insights.

Agentic AI Framework for Mainframe Modernization at Scale

Western Union / Unum

Western Union and Unum partnered with AWS and Accenture/Pega to modernize their mainframe-based legacy systems using AWS Transform, an agentic AI service designed for large-scale migration and modernization. Western Union aimed to modernize its 35-year-old money order platform to support growth targets and improve back-office operations, while Unum sought to streamline Colonial Life claims processing. The solution leveraged composable agentic AI frameworks where multiple specialized agents (AWS Transform agents, Accenture industry knowledge agents, and Pega Blueprint agents) worked together through orchestration layers. Results included converting 2.5 million lines of COBOL code in approximately 1.5 hours, reducing project timelines from 3+ months to 6 weeks for Western Union, and achieving a complete COBOL-to-cloud migration with testable applications in 3 months for Unum (compared to previous 7-year, $25 million estimates), while eliminating 7,000 annual manual hours in claims management.

Agentic AI Platform for Clinical Development and Commercial Operations in Pharmaceutical Drug Development

AstraZeneca

AstraZeneca partnered with AWS to deploy agentic AI systems across their clinical development and commercial operations to accelerate their goal of delivering 20 new medicines by 2030. The company built two major production systems: a Development Assistant serving over 1,000 users across 21 countries that integrates 16 data products with 9 agents to enable natural language queries across clinical trials, regulatory submissions, patient safety, and quality domains; and an AZ Brain commercial platform that uses 500+ AI models and agents to provide precision insights for patient identification, HCP engagement, and content generation. The implementation reduced time-to-market for various workflows from months to weeks, with field teams using the commercial assistant generating 2x more prescriptions, and reimbursement dossier authoring timelines dramatically shortened through automated agent workflows.

Agentic AI System for Construction Industry Tender Management and Quote Generation

Tendos AI

Tendos AI built an agentic AI platform to automate the tendering and quoting process for manufacturers in the construction industry. The system addresses the massive inefficiency in back-office workflows where manufacturers receive customer requests via email with attachments, manually extract information, match products, and generate quotes. Their multi-agent LLM system automatically categorizes incoming requests, extracts entities from documents up to thousands of pages, matches products from complex catalogs using semantic understanding, and generates detailed quotes for human review. Starting with a narrow focus on radiators with a single design partner, they iteratively expanded to support full workflows across multiple product categories, employing sophisticated agentic architectures with planning patterns, review agents, and extensive evaluation frameworks at each pipeline step.

Agentic AI Systems for Legal, Tax, and Compliance Workflows

Thomson Reuters

Thomson Reuters evolved their AI assistant strategy from helpfulness-focused tools to productive agentic systems that make judgments and produce output in high-stakes legal, tax, and compliance environments. They developed a framework treating agency as adjustable dials (autonomy, context, memory, coordination) rather than binary states, enabling them to decompose legacy applications into tools that AI agents can leverage. Their solutions include end-to-end tax return generation from source documents and comprehensive legal research systems that utilize their 1.5+ terabytes of proprietary content, with rigorous evaluation processes to handle the inherent variability in expert human judgment.

Agentic Search for Multi-Source Legal Research Intelligence

Harvey

Harvey, a legal AI platform, faced the challenge of enabling complex, multi-source legal research that mirrors how lawyers actually workโ€”iteratively searching across case law, statutes, internal documents, and other sources. Traditional one-shot retrieval systems couldn't handle queries requiring reasoning about what information to gather, where to find it, and when sufficient context was obtained. Harvey implemented an agentic search system based on the ReAct paradigm that dynamically selects knowledge sources, performs iterative retrieval, evaluates completeness, and synthesizes citation-backed responses. Through a privacy-preserving evaluation process involving legal experts creating synthetic queries and systematic offline testing, they improved tool selection precision from near zero to 0.8-0.9 and enabled complex queries to scale from single tool calls to 3-10 retrieval operations as needed, raising baseline query quality across their Assistant product and powering their Deep Research feature.

Agentic Security Principles for AI-Powered Development Tools

Github

GitHub outlines the security principles and threat model they developed for their hosted agentic AI products, particularly GitHub Copilot coding agent. The company addresses three primary security concerns: data exfiltration through internet-connected agents, impersonation and action attribution, and prompt injection attacks. Their solution involves implementing six core security rules: ensuring all context is visible to users, firewalling agent network access, limiting access to sensitive information, preventing irreversible state changes without human approval, consistently attributing actions to both initiator and agent, and only gathering context from authorized users. These principles aim to balance the enhanced functionality of agentic AI with the increased security risks that come with more autonomous systems.

Agentic Workflow Automation for Financial Operations

Ramp

Ramp, a finance automation platform serving over 50,000 customers, built a comprehensive suite of AI agents to automate manual financial workflows including expense policy enforcement, accounting classification, and invoice processing. The company evolved from building hundreds of isolated agents to consolidating around a single agent framework with thousands of skills, unified through a conversational interface called Omnichat. Their Policy Agent product, which uses LLMs to interpret and enforce expense policies written in natural language, demonstrates significant production deployment challenges and solutions including iterative development starting with simple use cases, extensive evaluation frameworks, human-in-the-loop labeling sessions, and careful context engineering. Additionally, Ramp built an internal coding agent called Ramp Inspect that now accounts for over 50% of production PRs merged weekly, illustrating how AI infrastructure investments enable broader organizational productivity gains.

AI Agent Automation of Security Operations Center Analysis

Doppel

Doppel implemented an AI agent using OpenAI's o1 model to automate the analysis of potential security threats in their Security Operations Center (SOC). The system processes over 10 million websites, social media accounts, and mobile apps daily to identify phishing attacks. Through a combination of initial expert knowledge transfer and training on historical decisions, the AI agent achieved human-level performance, reducing SOC workloads by 30% within 30 days while maintaining lower false-positive rates than human analysts.

AI Agent Development and Evaluation Platform for Insurance Underwriting

Snorkel

Snorkel developed a comprehensive benchmark dataset and evaluation framework for AI agents in commercial insurance underwriting, working with Chartered Property and Casualty Underwriters (CPCUs) to create realistic scenarios for small business insurance applications. The system leverages LangGraph and Model Context Protocol to build ReAct agents capable of multi-tool reasoning, database querying, and user interaction. Evaluation across multiple frontier models revealed significant challenges in tool use accuracy (36% error rate), hallucination issues where models introduced domain knowledge not present in guidelines, and substantial variance in performance across different underwriting tasks, with accuracy ranging from single digits to 80% depending on the model and task complexity.

AI Agent for Automated Feature Flag Removal

Duolingo

Duolingo developed an AI agent to automate the removal of feature flags from their codebase, addressing the common engineering problem of technical debt accumulation from abandoned flags. The solution leverages Anthropic's Codex CLI running on Temporal workflow orchestration, allowing engineers to initiate automated code cleanup through an internal self-service UI. The agent clones repositories, uses AI to identify and remove obsolete feature flags across Python and Kotlin codebases, and automatically creates pull requests assigned to the requesting engineer. The tool was developed rapidlyโ€”moving from prototype to production in approximately one weekโ€”and serves as a foundation pattern for future autonomous coding agents at Duolingo.

AI Agent for Automated Root Cause Analysis in Production Systems

Cleric

Cleric developed an AI agent system to automatically diagnose and root cause production alerts by analyzing observability data, logs, and system metrics. The agent operates asynchronously, investigating alerts when they fire in systems like PagerDuty or Slack, planning and executing diagnostic tasks through API calls, and reasoning about findings to distill information into actionable root causes. The system faces significant challenges around ground truth validation, user feedback loops, and the need to minimize human intervention while maintaining high accuracy across diverse infrastructure environments.

AI Agent for Customer Service Order Management and Training

RHI Magnesita

RHI Magnesita, facing $3 million in annual losses due to human errors in order processing, implemented an AI agent to assist their Customer Service Representatives (CSRs). The solution, developed with IT-Tomatic, focuses on error reduction, standardization of processes, and enhanced training. The AI system serves as an operating system for CSRs, consolidating information from multiple sources and providing intelligent validation of orders. Early results show improved training efficiency, standardized processes, and the transformation of entry-level CSR positions into hybrid analyst roles.

AI Agent for Self-Service Business Intelligence with Text-to-SQL

BGL

BGL, a provider of self-managed superannuation fund administration solutions serving over 12,700 businesses, faced challenges with data analysis where business users relied on data teams for queries, creating bottlenecks, and traditional text-to-SQL solutions produced inconsistent results. BGL built a production-ready AI agent using Claude Agent SDK hosted on Amazon Bedrock AgentCore that allows business users to retrieve analytics insights through natural language queries. The solution combines a strong data foundation using Amazon Athena and dbt for data transformation with an AI agent that interprets natural language, generates SQL queries, and processes results using code execution. The implementation uses modular knowledge architecture with CLAUDE.md for project context and SKILL.md files for product-specific domain expertise, while AgentCore provides stateful execution sessions with security isolation. This democratized data access for over 200 employees, enabling product managers, compliance teams, and customer success managers to self-serve analytics without SQL knowledge or data team dependencies.

AI Agent System for Automated Security Investigation and Alert Triage

Slack

Slack's Security Engineering team developed an AI agent system to automate the investigation of security alerts from their event ingestion pipeline that handles billions of events daily. The solution evolved from a single-prompt prototype to a multi-agent architecture with specialized personas (Director, domain Experts, and a Critic) that work together through structured output tasks to investigate security incidents. The system uses a "knowledge pyramid" approach where information flows upward from token-intensive data gathering to high-level decision making, allowing strategic use of different model tiers. Results include transformed on-call workflows from manual evidence gathering to supervision of agent teams, interactive verifiable reports, and emergent discovery capabilities where agents spontaneously identified security issues beyond the original alert scope, such as discovering credential exposures during unrelated investigations.

AI Agent-Powered Compliance Review Automation for Financial Services

Stripe

Stripe developed an AI agent-based solution to address the growing complexity and resource intensity of compliance reviews in financial services, where enterprises spend over $206 billion annually on financial crime operations. The company implemented ReAct agents powered by Amazon Bedrock to automate the investigative and research portions of Enhanced Due Diligence (EDD) reviews while keeping human analysts in the decision-making loop. By decomposing complex compliance workflows into bite-sized tasks orchestrated through a directed acyclic graph (DAG), the agents perform autonomous investigations across multiple data sources and jurisdictions. The solution achieved a 96% helpfulness rating from reviewers and reduced average handling time by 26%, enabling compliance teams to scale without linearly increasing headcount while maintaining complete auditability for regulatory requirements.

AI Agents and Intelligent Observability for DevOps Modernization

HRS Group / Netflix / Harness

This panel discussion brings together engineering leaders from HRS Group, Netflix, and Harness to explore how AI is transforming DevOps and SRE practices. The panelists address the challenge of teams spending excessive time on reactive monitoring, alert triage, and incident response, often wading through thousands of logs and ambiguous signals. The solution involves integrating AI agents and generative models into CI/CD pipelines, observability workflows, and incident management to enable predictive analysis, intelligent rollouts, automated summarization, and faster root cause analysis. Results include dramatically reduced mean time to resolution (from hours to minutes), elimination of low-level toil, improved context-aware decision making, and the ability to move from reactive monitoring to proactive, machine-speed remediation while maintaining human accountability for critical business decisions.

AI Agents for Automated Product Quality Testing and Bug Detection

Coinbase

Coinbase developed an AI-powered QA agent (qa-ai-agent) to dramatically scale their product testing efforts and improve quality assurance. The system addresses the challenge of maintaining high product quality standards while reducing manual testing overhead and costs. The AI agent processes natural language testing requests, uses visual and textual data to execute tests, and leverages LLM reasoning to identify issues. Results showed the agent detected 300% more bugs than human testers in the same timeframe, achieved 75% accuracy (compared to 80% for human testers), enabled new test creation in 15 minutes versus hours, and reduced costs by 86% compared to traditional manual testing, with the goal of replacing 75% of manual testing with AI-driven automation.

AI Agents for Data Labeling and Infrastructure Maintenance at Scale

Plaid

Plaid, a financial data connectivity platform, developed two internal AI agents to address operational challenges at scale. The AI Annotator agent automates the labeling of financial transaction data for machine learning model training, achieving over 95% human alignment while dramatically reducing annotation costs and time. The Fix My Connection agent proactively detects and repairs bank integration issues, having enabled over 2 million successful logins and reduced average repair time by 90%. These agents represent Plaid's strategic use of LLMs to improve data quality, maintain reliability across thousands of financial institution connections, and enhance their core product experiences.

AI Agents in Production: Multi-Enterprise Implementation Strategies

Canva / KPMG / Autodesk / Lightspeed

This comprehensive case study examines how multiple enterprises (Autodesk, KPMG, Canva, and Lightspeed) are deploying AI agents in production to transform their go-to-market operations. The companies faced challenges around scaling AI from proof-of-concept to production, managing agent quality and accuracy, and driving adoption across diverse teams. Using the Relevance AI platform, these organizations built multi-agent systems for use cases including personalized marketing automation, customer outreach, account research, data enrichment, and sales enablement. Results include significant time savings (tasks taking hours reduced to minutes), improved pipeline generation, increased engagement rates, faster customer onboarding, and the successful scaling of AI agents across multiple departments while maintaining data security and compliance standards.

AI Assistant for Global Customer Service Automation

Klarna

Klarna implemented an OpenAI-powered AI assistant for customer service that successfully handled two-thirds of all customer service chats within its first month of global deployment. The system processes 2.3 million conversations, matches human agent satisfaction scores, reduces repeat inquiries by 25%, and cuts resolution time from 11 to 2 minutes, while operating in 23 markets with support for over 35 languages, projected to deliver $40 million in profit improvement for 2024.

AI Assistant Integration for Manufacturing Execution System (MES)

42Q

42Q, a cloud-based Manufacturing Execution System (MES) provider, implemented an intelligent chatbot named Arthur to address the complexity of their system and improve user experience. The solution uses RAG and AWS Bedrock to combine documentation, training videos, and live production data, enabling users to query system functionality and real-time manufacturing data in natural language. The implementation showed significant improvements in user response times and system understanding, while maintaining data security within AWS infrastructure.

AI Data Analyst with Multi-Stage LLM Architecture for Enterprise Data Discovery

Delivery Hero

The BADA team at Woowa Brothers (part of Delivery Hero) developed QueryAnswerBird (QAB), an LLM-based agentic system to improve employee data literacy across the organization. The problem addressed was that employees with varying levels of data expertise struggled to discover, understand, and utilize the company's vast internal data resources, including structured tables and unstructured log data. The solution involved building a multi-layered architecture with question understanding (Router Supervisor) and information acquisition stages, implementing various features including query/table explanation, syntax verification, table/column guidance, and log data utilization. Through two rounds of beta testing with data analysts, engineers, and product managers, the team iteratively refined the system to handle diverse question types beyond simple Text-to-SQL, ultimately creating a comprehensive data discovery platform that integrates with existing tools like Data Catalog and Log Checker to provide contextualized answers and improve organizational productivity.

AI Error Summarizer Implementation: A Tiger Team Approach

CircleCI

CircleCI's engineering team formed a tiger team to explore AI integration possibilities, ultimately developing an AI error summarizer feature. The team spent 6-7 weeks on discovery, including extensive stakeholder interviews and technical exploration, before implementing a relatively simple but effective LLM-based solution that summarizes build errors for users. The case demonstrates how companies can successfully approach AI integration through focused exploration and iterative development, emphasizing that valuable AI features don't necessarily require complex implementations.

AI Lab: A Pre-Production Framework for ML Performance Testing and Optimization

Meta

Meta developed AI Lab, a pre-production framework for continuously testing and optimizing machine learning workflows, with a focus on minimizing Time to First Batch (TTFB). The system enables both proactive improvements and automatic regression prevention for ML infrastructure changes. Using AI Lab, Meta was able to achieve up to 40% reduction in TTFB through the implementation of the Python Cinder runtime, while ensuring no regressions occurred during the rollout process.

AI Managed Services and Agent Operations at Enterprise Scale

PriceWaterhouseCooper

PriceWaterhouseCooper (PWC) addresses the challenge of deploying and maintaining AI systems in production through their managed services practice focused on data analytics and AI. The organization has developed frameworks for deploying AI agents in enterprise environments, particularly in healthcare and back-office operations, using their Agent OS framework built on Python. Their approach emphasizes process standardization, human-in-the-loop validation, continuous model tuning, and comprehensive measurement through evaluations to ensure sustainable AI operations at scale. Results include successful deployments in healthcare pre-authorization processes and the establishment of specialized AI managed services teams comprising MLOps engineers and data scientists who continuously optimize production models.

AI Sales Representatives for Inbound Lead Conversion

ShowMe

ShowMe builds AI sales representatives that function as digital teammates for companies selling primarily through inbound channels. The company was founded in April 2025 after the co-founders identified a critical problem at their previous company: website visitors weren't converting to customers unless engaged directly by human sales representatives, but scaling human engagement was too expensive for unqualified leads. ShowMe's solution involves multi-agent voice and video systems that can conduct sales calls, share screens, demo products, qualify leads, and orchestrate follow-up actions across multiple channels. The AI agents use sophisticated prompt engineering, RAG-based knowledge bases, and workflow orchestration to guide prospects through the sales funnel, ultimately creating qualified meetings or closing contracts directly while reducing the need for human sales intervention by approximately 70%.

AI SRE Agents for Production System Diagnostics

Cleric

Cleric is developing an AI Site Reliability Engineering (SRE) agent system that helps diagnose and troubleshoot production system issues. The system uses knowledge graphs to map relationships between system components, background scanning to maintain system awareness, and confidence scoring to minimize alert fatigue. The solution aims to reduce the burden on human engineers by efficiently narrowing down problem spaces and providing actionable insights, while maintaining strict security controls and read-only access to production systems.

AI SRE System with Continuous Learning for Production Issue Investigation

Cleric AI

Cleric AI developed an AI-powered SRE system that automatically investigates production issues using existing observability tools and infrastructure. They implemented continuous learning capabilities using LangSmith to compare different investigation strategies, track investigation paths, and aggregate performance metrics. The system learns from user feedback and generalizes successful investigation patterns across deployments while maintaining strict privacy controls and data anonymization.

AI-Assisted Activity Onboarding in Travel Marketplace

GetYourGuide

GetYourGuide faced challenges with their lengthy 16-step activity creation process, where suppliers spent up to an hour manually entering content that often had quality issues, leading to traveler confusion and lower conversion rates. They implemented a generative AI solution that allows activity providers to paste existing content and automatically generates descriptions and fills structured fields across 8 key onboarding steps. After an initial failed experiment due to UX confusion and measurement challenges, they iterated with improved UI/UX design and developed a novel permutation testing framework. The second rollout successfully increased activity completion rates, improved content quality, and reduced onboarding time to as little as 14 minutes, ultimately achieving positive impacts on both supplier efficiency and traveler engagement metrics.

AI-Assisted Database Debugging Platform at Scale

Databricks

Databricks built an agentic AI platform to help engineers debug thousands of OLTP database instances across hundreds of regions on AWS, Azure, and GCP. The platform addresses the problem of fragmented tooling and dispersed expertise by unifying metrics, logs, and operational workflows into a single intelligent interface with a chat assistant. The solution reduced debugging time by up to 90%, enabled new engineers to start investigations in under 5 minutes, and has achieved company-wide adoption, fundamentally changing how engineers interact with their infrastructure.

AI-Assisted Product Attribute Extraction for E-commerce Content Creation

Zalando

Zalando developed a Content Creation Copilot to automate product attribute extraction during the onboarding process, addressing data quality issues and time-to-market delays. The manual content enrichment process previously accounted for 25% of production timelines with error rates that needed improvement. By implementing an LLM-based solution using OpenAI's GPT models (initially GPT-4 Turbo, later GPT-4o) with custom prompt engineering and a translation layer for Zalando-specific attribute codes, the system now enriches approximately 50,000 attributes weekly with 75% accuracy. The solution integrates multiple AI services through an aggregator architecture, auto-suggests attributes in the content creation workflow, and allows copywriters to maintain final decision authority while significantly improving efficiency and data coverage.

AI-Assisted Root Cause Analysis System for Incident Response

Meta

Meta developed an AI-assisted root cause analysis system to streamline incident investigations in their large-scale systems. The system combines heuristic-based retrieval with LLM-based ranking to identify potential root causes of incidents. Using a fine-tuned Llama 2 model and a novel ranking approach, the system achieves 42% accuracy in identifying root causes for investigations at creation time in their web monorepo, significantly reducing the investigation time and helping responders make better decisions.

AI-Augmented Code Review System for Large-Scale Software Development

Uber

Uber developed uReview, an AI-powered code review platform to address the challenges of traditional peer reviews at scale, including reviewer overload from increasing code volume and difficulty identifying subtle bugs and security issues. The system uses a modular, multi-stage GenAI architecture with prompt-chaining to break down code review into four sub-tasks: comment generation, filtering, validation, and deduplication. Currently analyzing over 90% of Uber's ~65,000 weekly code diffs, uReview achieves a 75% usefulness rating from engineers and sees 65% of its comments addressed, demonstrating significant adoption and effectiveness in production.

AI-Driven Clinical Trial Transformation with Next-Generation Data Platform

Novartis

Novartis embarked on a comprehensive data and AI modernization journey to accelerate drug development by at least 6 months per clinical trial. The company partnered with AWS Professional Services and Accenture to build a next-generation, GXP-compliant data platform that integrates fragmented data across multiple domains (including patient safety, medical imaging, and regulatory data), enabling both operational AI use cases and ambitious moonshot projects like a digital twin for clinical trial simulation. The initial implementation with the patient safety domain achieved significant results: 16 data pipelines processing 17 terabytes of data, 72% faster query speeds, 60% storage cost reduction, and over 160 hours of manual work eliminated, while protocol generation use cases demonstrated 83-87% acceleration in generating compliance-acceptable protocols.

AI-Driven Digital Twins for Industrial Infrastructure Optimization

Geminus

Geminus addresses the challenge of optimizing large industrial machinery operations by combining traditional ML models with high-fidelity simulations to create fast, trustworthy digital twins. Their solution reduces model development time from 24 months to just days, while building operator trust through probabilistic approaches and uncertainty bounds. The system provides optimization advice through existing control systems, ensuring safety and reliability while significantly improving machine performance.

AI-Driven Incident Response and Automated Remediation for Digital Media Platform

iHeart

iHeart Media, serving 250 million monthly users across broadcast radio, digital streaming, and podcasting platforms, faced significant operational challenges with incident response requiring engineers to navigate multiple monitoring systems, VPNs, and dashboards during critical 3 AM outages. The company implemented a multi-agent AI system using AWS Bedrock Agent Core and the Strands AI framework to automate incident triage, root cause analysis, and remediation. The solution reduced triage response time dramatically (from minutes of manual investigation to 30-60 seconds), improved operational efficiency by eliminating repetitive manual tasks, and enabled knowledge preservation across incidents while maintaining 24/7 uptime requirements for their infrastructure handling 5-7 billion requests per month.

AI-Driven Security Posture Management Platform

LinkedIn

LinkedIn developed the Security Posture Platform (SPP) to enhance their security infrastructure management, incorporating an AI-powered interface called SPP AI. The platform streamlines security data analysis and vulnerability management across their distributed systems. By leveraging large language models and a comprehensive knowledge graph, the system improved vulnerability response speed by 150% and increased digital infrastructure coverage by 155%. The solution combines natural language querying capabilities with sophisticated data integration and automated decision-making to provide real-time security insights.

AI-Enhanced Body Camera and Digital Evidence Management in Law Enforcement

An Garda Siochanna

An Garda Siochanna implemented a comprehensive digital transformation initiative focusing on body-worn cameras and digital evidence management, incorporating AI and cloud technologies. The project involved deploying 15,000+ mobile devices, implementing three different body camera systems across different regions, and developing a cloud-based digital evidence management system. While current legislation limits AI usage to basic functionalities, proposed legislation aims to enable advanced AI capabilities for video analysis, object recognition, and automated report generation, all while maintaining human oversight and privacy considerations.

AI-Powered .NET Application Modernization at Scale

Thomson Reuters

Thomson Reuters faced the challenge of modernizing over 400 legacy .NET Framework applications comprising more than 500 million lines of code, which were running on costly Windows servers and slowing down innovation. By adopting AWS Transform for .NET during its beta phase, the company leveraged agentic AI capabilities powered by Amazon Bedrock LLMs with deep .NET expertise to automate the analysis, dependency mapping, code transformation, and validation process. This approach accelerated their modernization from months of planning to weeks of execution, enabling them to transform over 1.5 million lines of code per month while running 10 parallel modernization projects. The solution not only promised substantial cost savings by migrating to Linux containers and Graviton instances but also freed developers from maintaining legacy systems to focus on delivering customer value.

AI-Powered Account Planning System for Sales Process Optimization

AWS

AWS developed Account Plan Pulse, a generative AI solution built on Amazon Bedrock, to address the increasing complexity and manual overhead in their sales account planning process. The system automates the evaluation of customer account plans across 10 business-critical categories, generates actionable insights, and provides structured summaries to improve collaboration. The implementation resulted in a 37% improvement in plan quality year-over-year and a 52% reduction in the time required to complete, review, and approve plans, while helping sales teams focus more on strategic customer engagements rather than manual review processes.

AI-Powered Accounting Automation Using Claude and Amazon Bedrock

FloQast

FloQast developed an AI-powered accounting transformation solution to automate complex transaction matching and document annotation workflows using Anthropic's Claude 3 on Amazon Bedrock. The system combines document processing capabilities like Amazon Textract with LLM-based automation through Amazon Bedrock Agents to streamline reconciliation processes and audit workflows. The solution achieved significant efficiency gains, including 38% reduction in reconciliation time and 23% decrease in audit process duration.

AI-Powered Artwork Quality Moderation and Streaming Quality Management at Scale

Amazon Prime Video

Amazon Prime Video faced challenges in manually reviewing artwork from content partners and monitoring streaming quality for millions of concurrent viewers across 240+ countries. To address these issues, they developed two AI-powered solutions: (1) an automated artwork quality moderation system using multimodal LLMs to detect defects like safe zone violations, mature content, and text legibility issues, reducing manual review by 88% and evaluation time from days to under an hour; and (2) an agentic AI system for detecting, localizing, and mitigating streaming quality issues in real-time without manual intervention. Both solutions leveraged Amazon Bedrock, Strands agents framework, and iterative evaluation loops to achieve high precision while operating at massive scale.

AI-Powered Automated GraphQL Schema Cleanup

Whatnot

Whatnot, a livestream shopping platform, faced significant technical debt in their GraphQL schema with over 2,600 unused fields accumulated from deprecated features and old endpoints. Manual cleanup was time-consuming and risky, requiring 1-2 hours per field and deep domain knowledge. The engineering team built an AI subagent integrated into a GitHub Action that automatically identifies unused fields through traffic analysis and generates pull requests to safely remove them. The agent follows the same process an engineer wouldโ€”removing schema fields, resolvers, dead code, and updating testsโ€”but operates autonomously in the background. Running daily at $1-3 per execution, the system has successfully removed 24 of approximately 200 unused root fields with minimal human intervention, requiring edits to only three PRs, transforming schema maintenance from a neglected one-time project into an ongoing automated process.

AI-Powered Automated Issue Resolution Achieving State-of-the-Art Performance on SWE-bench

Trae

Trae developed an AI engineering system that achieved 70.6% accuracy on the SWE-bench Verified benchmark, setting a new state-of-the-art record for automated software issue resolution. The solution combines multiple large language models (Claude 3.7, Gemini 2.5 Pro, and OpenAI o4-mini) in a sophisticated multi-stage pipeline featuring generation, filtering, and voting mechanisms. The system uses specialized agents including a Coder agent for patch generation, a Tester agent for regression testing, and a Selector agent that employs both syntax-based voting and multi-selection voting to identify the best solution from multiple candidate patches.

AI-Powered Autonomous Infrastructure Monitoring and Self-Healing System

Railway

This case study presents a proof-of-concept system for autonomous infrastructure monitoring and self-healing using AI coding agents. The presenter demonstrates a workflow that automatically detects issues in deployed services on Railway (memory leaks, slow database queries, high error rates), analyzes metrics and logs using LLMs to generate diagnostic plans, and then deploys OpenCodeโ€”an open-source AI coding agentโ€”to automatically create pull requests with fixes. The system leverages durable workflows via Inngest for reliability, combines multiple data sources (CPU/memory metrics, HTTP metrics, logs), and uses LLMs to analyze infrastructure health and generate remediation plans. While presented as a demo/concept, the approach showcases how LLMs can move from alerting engineers to autonomously proposing code-level fixes for production issues.

AI-Powered Autonomous Threat Analysis for Cybersecurity at Scale

Amazon

Amazon developed Autonomous Threat Analysis (ATA), a production security system that uses agentic AI and adversarial multiagent reinforcement learning to enhance cybersecurity defenses at scale. The system deploys red-team and blue-team AI agents in isolated test environments to simulate adversary techniques and automatically generate improved detection rules. ATA reduces the security testing cycle from weeks to approximately four hours (96% time reduction), successfully generates threat variations (such as 37 Python reverse shell variants), and achieves perfect precision and recall (1.00/1.00) for improved detection rules while maintaining human oversight for production deployment.

AI-Powered Betting Assistant for Sports Wagering Platform

FanDuel

FanDuel, America's leading sportsbook platform handling over 16.6 million bets during Super Bowl Sunday 2025, developed AAI (an AI-powered betting assistant) to address friction in the customer betting journey. Previously, customers would leave the FanDuel app to research bets on external platforms, often getting distracted and missing betting opportunities. Working with AWS's Generative AI Innovation Center, FanDuel built an in-app conversational assistant using Amazon Bedrock that guides customers through research, discovery, bet construction, and execution entirely within their platform. The solution reduced bet construction time from hours to seconds (particularly for complex parlays), improved customer engagement, and was rolled out incrementally across states and sports using a rigorous evaluation framework with thousands of test cases to ensure accuracy and responsible gaming safeguards.

AI-Powered Call Center Agents for Healthcare Operations

HeyRevia

HeyRevia has developed an AI call center solution specifically for healthcare operations, where over 30% of operations run on phone calls. Their system uses AI agents to handle complex healthcare-related calls, including insurance verifications, prior authorizations, and claims processing. The solution incorporates real-time audio processing, context understanding, and sophisticated planning capabilities to achieve performance that reportedly exceeds human operators while maintaining compliance with healthcare regulations.

AI-Powered Chatbot Automation with Hybrid NLU and LLM Approach

Scotiabank

Scotiabank developed a hybrid chatbot system combining traditional NLU with modern LLM capabilities to handle customer service inquiries. They created an innovative "AI for AI" approach using three ML models (nicknamed Luigi, Eva, and Peach) to automate the review and improvement of chatbot responses, resulting in 80% time savings in the review process. The system includes LLM-powered conversation summarization to help human agents quickly understand customer contexts, marking the bank's first production use of generative AI features.

AI-Powered Chief of Staff: Scaling Agent Architecture from Monolith to Distributed System

Outropy

Outropy initially built an AI-powered Chief of Staff for engineering leaders that attracted 10,000 users within a year. The system evolved from a simple Slack bot to a sophisticated multi-agent architecture handling complex workflows across team tools. They tackled challenges in agent memory management, event processing, and scaling, ultimately transitioning from a monolithic architecture to a distributed system using Temporal for workflow management while maintaining production reliability.

AI-Powered Client Services Assistant for Post-Trade Services

London Stock Exchange Group

London Stock Exchange Group developed a client services assistant application using Amazon Q Business to enhance their post-trade customer support. The solution leverages RAG techniques to provide accurate and quick responses to complex member queries by accessing internal documents and public rulebooks. The system includes a robust validation process using Claude v2 to ensure response accuracy against a golden answer dataset, delivering responses within seconds and improving both customer experience and staff productivity.

AI-Powered Clinical Documentation and Data Infrastructure for Point-of-Care Transformation

Veradigm

Veradigm, a healthcare IT company, partnered with AWS to integrate generative AI into their Practice Fusion electronic health record (EHR) system to address clinician burnout caused by excessive documentation tasks. The solution leverages AWS HealthScribe for autonomous AI scribing that generates clinical notes from patient-clinician conversations, and AWS HealthLake as a FHIR-based data foundation to provide patient context at scale. The implementation resulted in clinicians saving approximately 2 hours per day on charting, 65% of users requiring no training to adopt the technology, and high satisfaction with note quality. The system processes 60 million patient visits annually and enables ambient documentation that allows clinicians to focus on patient care rather than typing, with a clear path toward zero-edit note generation.

AI-Powered Clinical Documentation with Multi-Region Healthcare Compliance

Heidi Health

Heidi Health developed an ambient AI scribe to reduce the administrative burden on healthcare clinicians by automatically generating clinical notes from patient consultations. The company faced significant LLMOps challenges including building confidence in non-deterministic AI outputs through "clinicians in the loop" evaluation processes, scaling clinical validation beyond small teams using synthetic data generation and LLM-as-judge approaches, and managing global expansion across regions with different data sovereignty requirements, model availability constraints, and regulatory compliance needs. Their solution involved standardizing infrastructure-as-code deployments across AWS regions, using a hybrid approach of Amazon Bedrock for immediate availability and EKS for self-hosted model control, and integrating clinical ambassadors in each region to validate medical accuracy and local practice patterns. The platform now serves over 370,000 clinicians processing 10 million consultations per month globally.

AI-Powered Clinical Outcome Assessment Review Using Generative AI

Clario

Clario, a clinical trials endpoint data provider, developed an AI-powered solution to automate the analysis of Clinical Outcome Assessment (COA) interviews in clinical trials for psychosis, anxiety, and mood disorders. The traditional approach of manually reviewing audio-video recordings was time-consuming, logistically complex, and introduced variability that could compromise trial reliability. Using Amazon Bedrock and other AWS services, Clario built a system that performs speaker diarization, multi-lingual transcription, semantic search, and agentic AI-powered quality review to evaluate interviews against standardized criteria. The solution demonstrates potential for reducing manual review effort by over 90%, providing 100% data coverage versus subset sampling, and decreasing review turnaround time from weeks to hours, while maintaining regulatory compliance and improving data quality for submissions.

AI-Powered Clinical Trial Software Configuration Automation

Clario

Clario, a leading provider of endpoint data solutions for clinical trials, faced significant challenges with their manual software configuration process, which involved extracting data from multiple sources including PDF forms, study databases, and standardized protocols. The manual process was time-consuming, prone to transcription errors, and created version control challenges. To address this, Clario developed the Genie AI Service powered by Amazon Bedrock using Anthropic's Claude 3.7 Sonnet, orchestrated through Amazon ECS. The solution automates data extraction from transmittal forms, centralizes information from multiple sources, provides an interactive review dashboard for validation, and automatically generates Software Configuration Specification documents and XML configurations for their medical imaging software. This has reduced study configuration execution time while improving quality, minimizing transcription errors, and allowing teams to focus on higher-value activities like study design optimization.

AI-Powered Co-pilot System for Digital Sales Agents

Wayfair

Wayfair developed an AI-powered Agent Co-pilot system to assist their digital sales agents during customer interactions. The system uses LLMs to provide contextually relevant chat response recommendations by considering product information, company policies, and conversation history. Initial test results showed a 10% reduction in handle time, improving customer service efficiency while maintaining quality interactions.

AI-Powered Code Review and Pull Request Automation for Developer Compliance

GitHub

GitHub explored how generative AI could transform compliance in software development by automating foundational components like separation of duties and code reviews. The company developed GitHub Copilot for Pull Requests, which uses AI to automatically generate pull request descriptions based on code changes and provide AI-assisted code review suggestions. This approach aims to maintain compliance requirements while keeping developers in the flow, reducing manual overhead for both development and audit teams, and enabling separation of duties through automated, objective code analysis rather than purely human-based processes.

AI-Powered Code Review Platform at Scale

Uber

Uber developed uReview, an AI-powered code review platform, to address the challenge of reviewing over 65,000 code changes weekly across six monorepos. Traditional peer reviews were becoming overwhelmed by the volume of code and struggled to consistently catch subtle bugs, security issues, and best practice violations. The solution employs a modular, multi-stage GenAI system using prompt chaining with multiple specialized assistants (Standard, Best Practices, and AppSec) that generate, filter, validate, and deduplicate code review comments. The system achieves a 75% usefulness rating from engineers, with 65% of comments being addressed, outperforming human reviewers (51% address rate), and saves approximately 1,500 developer hours weekly across Uber's engineering organization.

AI-Powered Code Review Platform Using Abstract Syntax Trees and LLM Context

Baz

Baz is building an AI code review agent that addresses the challenge of understanding complex codebases at scale. The platform combines Abstract Syntax Trees (AST) with LLM semantic understanding to provide automated code reviews that go beyond traditional static analysis. By integrating context from multiple sources including code structure, Jira/Linear tickets, CI logs, and deployment patterns, Baz aims to replicate the knowledge of a staff engineer who understands not just the code but the entire business context. The solution has evolved from basic reviews to catching performance issues and schema changes, with customers using it to review code generated by AI coding assistants like Cursor and Codex.

AI-Powered Compliance Investigation Agents for Enhanced Due Diligence

Stripe

Stripe developed an LLM-powered AI research agent system to address the scalability challenges of enhanced due diligence (EDD) compliance reviews in financial services. The manual review process was resource-intensive, with compliance analysts spending significant time navigating fragmented data sources across different jurisdictions rather than performing high-value analysis. Stripe built a React-based agent system using Amazon Bedrock that orchestrates autonomous investigations across multiple data sources, pre-fetches analysis before reviewers open cases, and provides comprehensive audit trails. The solution maintains human oversight for final decision-making while enabling agents to handle data gathering and initial research. This resulted in a 26% reduction in average handling time for compliance reviews, with agents achieving 96% helpfulness ratings from reviewers, allowing Stripe to scale compliance operations alongside explosive business growth without proportionally increasing headcount.

AI-Powered Contact Center Copilot: From Research to Enterprise-Scale Production

Cresta / OpenAI

Cresta, founded in 2017 by Stanford PhD students with OpenAI research experience, developed an AI copilot system for contact center agents that provides real-time suggestions during customer conversations. The company tackled the challenge of transforming academic NLP and reinforcement learning research into production-grade enterprise software by building domain-specific models fine-tuned on customer conversation data. Starting with Intuit as their first customer through an unconventional internship arrangement, they demonstrated measurable ROI through A/B testing, showing improved conversion rates and agent productivity. The solution evolved from custom LSTM and transformer models to leveraging pre-trained foundation models like GPT-3/4 with fine-tuning, ultimately serving Fortune 500 customers across telecommunications, airlines, and banking with demonstrated value including a pilot generating $100 million in incremental revenue.

AI-Powered Contact Center Transformation for Student Support Services

Anthology

Anthology, an education technology company operating a BPO for higher education institutions, transformed their traditional contact center infrastructure to an AI-first, cloud-based solution using Amazon Connect. Facing challenges with seasonal spikes requiring doubling their workforce (from 1,000 to 2,000+ agents during peak periods), homegrown legacy systems, and reliability issues causing 12 unplanned outages during busy months, they migrated to AWS to handle 8 million annual student interactions. The implementation, which went live in July 2024 just before their peak back-to-school period, resulted in 50% reduction in wait times, 14-point increase in response accuracy, 10% reduction in agent attrition, and improved system reliability (reducing unplanned outages from 12 to 2 during peak months). The solution leverages AI virtual agents for handling repetitive queries, agent assist capabilities with real-time guidance, and automated quality assurance enabling 100% interaction review compared to the previous 1%.

AI-Powered Contact Center Transformation with Amazon Connect

Traeger

Traeger Grills transformed their customer experience operations from a legacy contact center with poor performance metrics (35% CSAT, 30% first contact resolution) into a modern AI-powered system built on Amazon Connect. The company implemented generative AI capabilities for automated case note generation, email composition, and chatbot interactions while building a "single pane of glass" agent experience using Amazon Connect Cases. This eliminated their legacy CRM, reduced new hire training time by 40%, improved agent satisfaction, and enabled seamless integration of their acquired Meater thermometer brand. The implementation leveraged AI to handle non-value-added work while keeping human agents focused on building emotional connections with customers in the "Traeger Hood" community, demonstrating a shift from cost center to profit center thinking.

AI-Powered Content Curation for Financial Crime Detection

LSEG

London Stock Exchange Group (LSEG) Risk Intelligence modernized its WorldCheck platformโ€”a global database used by financial institutions to screen for high-risk individuals, politically exposed persons (PEPs), and adverse mediaโ€”by implementing generative AI to accelerate data curation. The platform processes thousands of news sources in 60+ languages to help 10,000+ customers combat financial crime including fraud, money laundering, and terrorism financing. By adopting a maturity-based approach that progressed from simple prompt-only implementations to agent orchestration with human-in-the-loop validation, LSEG reduced content curation time from hours to minutes while maintaining accuracy and regulatory compliance. The solution leverages AWS Bedrock for LLM operations, incorporating summarization, entity extraction, classification, RAG for cross-referencing articles, and multi-agent orchestration, all while keeping human analysts at critical decision points to ensure trust and regulatory adherence.

AI-Powered Content Generation and Shot Commentary System for Live Golf Tournament Coverage

PGA Tour

The PGA Tour faced the challenge of engaging fans with golf content across multiple tournaments running nearly every week of the year, generating meaningful content from 31,000+ shots per tournament across 156 players, and maintaining relevance during non-tournament days. They implemented an agentic AI system using AWS Bedrock that generates up to 800 articles per week across eight different content types (betting profiles, tournament previews, player recaps, round recaps, purse breakdowns, etc.) and a real-time shot commentary system that provides contextual narration for live tournament play. The solution achieved 95% cost reduction (generating articles at $0.25 each), enabled content publication within 5-10 minutes of live events, resulted in billions of annual page views for AI-generated content, and became their highest-engaged content on non-tournament days while maintaining brand voice and factual accuracy through multi-agent validation workflows.

AI-Powered Content Moderation at Platform Scale

Roblox

Roblox moderates billions of pieces of user-generated content daily across 28 languages using a sophisticated AI-driven system that combines large transformer-based models with human oversight. The platform processes an average of 6.1 billion chat messages and 1.1 million hours of voice communication per day, requiring ML models that can make moderation decisions in milliseconds. The system achieves over 750,000 requests per second for text filtering, with specialized models for different violation types (PII, profanity, hate speech). The solution integrates GPU-based serving infrastructure, model quantization and distillation for efficiency, real-time feedback mechanisms that reduce violations by 5-6%, and continuous model improvement through diverse data sampling strategies including synthetic data generation via LLMs, uncertainty sampling, and AI-assisted red teaming.

AI-Powered Content Moderation at Scale: SafeChat Platform

DoorDash

DoorDash developed SafeChat, an AI-powered content moderation system to handle millions of daily messages, hundreds of thousands of images, and voice calls exchanged between delivery drivers (Dashers) and customers. The platform employs a multi-layered architecture that evolved from using three external LLMs to a more efficient two-layer approach combining an internally trained model with a precise external LLM, processing text, images, and voice communications in real-time. Since launch, SafeChat has achieved a 50% reduction in low to medium-severity safety incidents while maintaining low latency (under 300ms for most messages) and cost-effectiveness by intelligently routing only 0.2% of content to expensive, high-precision models.

AI-Powered Content Understanding and Ad Targeting Platform

Dotdash

Dotdash Meredith, a major digital publisher, developed an AI-powered system called Decipher that understands user intent from content consumption to deliver more relevant advertising. Through a strategic partnership with OpenAI, they enhanced their content understanding capabilities and expanded their targeting platform across the premium web. The system outperforms traditional cookie-based targeting while maintaining user privacy, proving that high-quality content combined with AI can drive better business outcomes.

AI-Powered CRM Insights with RAG and Text-to-SQL

TP ICAP

TP ICAP faced the challenge of extracting actionable insights from tens of thousands of vendor meeting notes stored in their Salesforce CRM system, where business users spent hours manually searching through records. Using Amazon Bedrock, their Innovation Lab built ClientIQ, a production-ready solution that combines Retrieval Augmented Generation (RAG) and text-to-SQL approaches to transform hours of manual analysis into seconds. The solution uses Amazon Bedrock Knowledge Bases for unstructured data queries, automated evaluations for quality assurance, and maintains enterprise-grade security through permission-based access controls. Since launch with 20 initial users, ClientIQ has driven a 75% reduction in time spent on research tasks and improved insight quality with more comprehensive and contextual information being surfaced.

AI-Powered Customer Conversation Analytics at Scale

GoDaddy

GoDaddy faced the challenge of extracting actionable insights from over 100,000 daily customer service transcripts, which were previously analyzed through limited manual review that couldn't surface systemic issues or emerging problems quickly enough. To address this, they developed Lighthouse, an internal AI analytics platform that uses large language models, prompt engineering, and lexical search to automatically analyze massive volumes of unstructured customer interaction data. The platform successfully processes the full daily volume of 100,000+ transcripts in approximately 80 minutes, enabling teams to identify pain points and operational issues within hours instead of weeks, as demonstrated in a real case where they quickly detected and resolved a spike in customer calls caused by a malfunctioning link before it escalated into a major service disruption.

AI-Powered Customer Feedback Analysis Using RAG and LLMs in Product Analytics

Meta

Meta's Reality Labs developed a self-service AI tool powered by their open-source Llama 4 LLM to analyze customer feedback for their Quest VR headsets and Ray-Ban Meta products. The challenge was that customer feedback dataโ€”from reviews, bug reports, surveys, and social mediaโ€”was underutilized due to noise, bias, and lack of structure. By building a comprehensive feedback repository from internal and external sources and implementing a Retrieval Augmented Generation (RAG) system with embedding-based similarity search, Meta created a production system that transforms qualitative feedback into actionable insights. The tool is being used for bug deduplication, internal testing summaries, and strategic planning, enabling the company to bridge quantitative metrics with qualitative customer insights and dramatically reduce manual analysis time from hours to minutes.

AI-Powered Customer Segmentation with Natural Language Interface

Klaviyo

Klaviyo, a customer data platform serving 130,000 customers, launched Segments AI in November 2023 to address two key problems: inexperienced users struggling to express customer segments through traditional UI, and experienced users spending excessive time building repetitive complex segments. The solution uses OpenAI's LLMs combined with prompt chaining and few-shot learning techniques to transform natural language descriptions into structured segment definitions adhering to Klaviyo's JSON schema. The team tackled the significant challenge of validating non-deterministic LLM outputs by combining automated LLM-based evaluation with hand-designed test cases, ultimately deploying a production system that required ongoing maintenance due to the stochastic nature of generative AI outputs.

AI-Powered Customer Service Agent for Healthcare Navigation

Alan

Alan, a healthcare company supporting 1 million members, built AI agents to help members navigate complex healthcare questions and processes. The company transitioned from traditional workflows to playbook-based agent architectures, implementing a multi-agent system with classification and specialized agents (particularly for claims handling) that uses a ReAct loop for tool calling. The solution achieved 30-35% automation of customer service questions with quality comparable to human care experts, with 60% of reimbursements processed in under 5 minutes. Critical to their success was building custom orchestration frameworks and extensive internal tooling that empowered domain experts (customer service operators) to configure, debug, and maintain agents without engineering bottlenecks.

AI-Powered Customer Service and Call Center Transformation with Multi-Agent Systems

Fastweb / Vodafone

Fastweb / Vodafone, a major European telecommunications provider serving 9.5 million customers in Italy, transformed their customer service operations by building two AI agent systems to address the limitations of traditional customer support. They developed Super TOBi, a customer-facing agentic chatbot system, and Super Agent, an internal tool that empowers call center consultants with real-time diagnostics and guidance. Built on LangGraph and LangChain with Neo4j knowledge graphs and monitored through LangSmith, the solution achieved a 90% correctness rate, 82% resolution rate, 5.2/7 Customer Effort Score for Super TOBi, and over 86% One-Call Resolution rate for Super Agent, delivering faster response times and higher customer satisfaction while reducing agent workload.

AI-Powered Customer Support Automation for Global Transportation Service

Lime

Lime, a global micromobility company, implemented Forethought's AI solutions to scale their customer support operations. They faced challenges with manual ticket handling, language barriers, and lack of prioritization for critical cases. By implementing AI-powered automation tools including Solve for automated responses and Triage for intelligent routing, they achieved 27% case automation, 98% automatic ticket tagging, and reduced response times by 77%, while supporting multiple languages and handling 1.7 million tickets annually.

AI-Powered Developer Tools for Code Quality and Test Generation

Uber

Uber's developer platform team built AI-powered developer tools using LangGraph to improve code quality and automate test generation for their 5,000 engineers. Their approach focuses on three pillars: targeted product development for developer workflows, cross-cutting AI primitives, and intentional technology transfer. The team developed Validator, an IDE-integrated tool that flags best practices violations and security issues with automatic fixes, and AutoCover, which generates comprehensive test suites with coverage validation. These tools demonstrate the successful deployment of multi-agent systems in production, achieving measurable improvements including thousands of daily fix interactions, 10% increase in developer platform coverage, and 21,000 developer hours saved through automated test generation.

AI-Powered Digital Co-Workers for Customer Support and Business Process Automation

Neople

Neople, a European startup founded almost three years ago, has developed AI-powered "digital co-workers" (called Neeles) primarily targeting customer success and service teams in e-commerce companies across Europe. The problem they address is the repetitive, high-volume work that customer service agents face, which reduces job satisfaction and efficiency. Their solution evolved from providing AI-generated response suggestions to human agents, to fully automated ticket responses, to executing actions across multiple systems, and finally to enabling non-technical users to build custom workflows conversationally. The system now serves approximately 200 customers, with AI agents handling repetitive tasks autonomously while human agents focus on complex cases. Results include dramatic improvements in first response rates (from 10% to 70% in some cases), reduced resolution times, and expanded use cases beyond customer service into finance, operations, and marketing departments.

AI-Powered Ecommerce Content Optimization Platform

Pattern

Pattern developed Content Brief, an AI-driven tool that processes over 38 trillion ecommerce data points to optimize product listings across multiple marketplaces. Using Amazon Bedrock and other AWS services, the system analyzes consumer behavior, content performance, and competitive data to provide actionable insights for product content optimization. In one case study, their solution helped Select Brands achieve a 21% month-over-month revenue increase and 14.5% traffic improvement through optimized product listings.

AI-Powered Email Search Assistant with Advanced Cognitive Architecture

Superhuman

Superhuman developed Ask AI to solve the challenge of inefficient email and calendar searching, where users spent up to 35 minutes weekly trying to recall exact phrases and sender names. They evolved from a single-prompt RAG system to a sophisticated cognitive architecture with parallel processing for query classification and metadata extraction. The solution achieved sub-2-second response times and reduced user search time by 14% (5 minutes per week), while maintaining high accuracy through careful prompt engineering and systematic evaluation.

AI-Powered Engineering Team Management and Code Review Platform

Entelligence

Entelligence addresses the challenges of managing large engineering teams by providing AI agents that handle code reviews, documentation maintenance, and team performance analytics. The platform combines LLM-based code analysis with learning from team feedback to provide contextually appropriate reviews, while maintaining up-to-date documentation and offering insights into engineering productivity beyond traditional metrics like lines of code.

AI-Powered Epilepsy Diagnosis Platform Reducing Diagnostic Time Through Multimodal Data Processing

Australian Epilepsy Project

The Australian Epilepsy Project (AEP) developed a cloud-based precision medicine platform on AWS that integrates multimodal patient data (MRI scans, neuropsychological assessments, genetic data, and medical histories) to support epilepsy diagnosis and treatment planning. The platform leverages various AI/ML techniques including machine learning models for automated brain region analysis, large language models for medical text processing through RAG approaches, and generative AI for patient summaries. This resulted in a 70% reduction in diagnosis time for language area mapping prior to surgery, 10% higher lesion detection rates, and improved patient outcomes including 9% better work productivity and 8% reduction in seizures over two years.

AI-Powered Fax Processing Automation for Healthcare Referrals

Providence

Providence Health System automated the processing of over 40 million annual faxes using GenAI and MLflow on Databricks to transform manual referral workflows into real-time automated triage. The system combines OCR with GPT-4.0 models to extract referral data from diverse document formats and integrates seamlessly with Epic EHR systems, eliminating months-long backlogs and freeing clinical staff to focus on patient care across 1,000+ clinics.

AI-Powered Financial Assistant for Automated Expense Management

Brex

Brex developed an AI-powered financial assistant to automate expense management workflows, addressing the pain points of manual data entry, policy compliance, and approval bottlenecks that plague traditional finance operations. Using Amazon Bedrock with Claude models, they built a comprehensive system that automatically processes expenses, generates compliant documentation, and provides real-time policy guidance. The solution achieved 75% automation of expense workflows, saving hundreds of thousands of hours monthly across customers while improving compliance rates from 70% to the mid-90s, demonstrating how LLMs can transform enterprise financial operations when properly integrated with existing business processes.

AI-Powered Fraud Detection in E-commerce Using AWS Fraud Detector

Awaze

E-commerce companies face significant fraud challenges, with UK e-commerce fraud reaching ยฃ1 billion stolen in 2024 despite preventing ยฃ1.5 billion. The speaker describes implementing AWS Fraud Detector, a fully managed machine learning service, to detect various fraud types including promo abuse, credit card chargeback fraud, account hijacking, and triangulation fraud. The solution uses historical labeled data to build predictive models that score orders between 0-1000 based on fraud likelihood, requiring human review for GDPR compliance. The implementation covers evaluation strategies focusing on true positives and false positives, feature engineering including geolocation enrichment, deployment options via SageMaker or Lambda, and continuous improvement through model retraining at different frequencies depending on fraud trend velocity.

AI-Powered Fraud Detection Using Mixture of Experts and Federated Learning

Feedzai

Feedzai developed TrustScore, an AI-powered fraud detection system that addresses the limitations of traditional rule-based and custom AI models in financial crime detection. The solution leverages a Mixture of Experts (MoE) architecture combined with federated learning to aggregate fraud intelligence from across Feedzai's network of financial institutions processing $8.02T in yearly transactions. Unlike traditional systems that require months of historical data and constant manual updates, TrustScore provides a zero-day, ready-to-use solution that continuously adapts to emerging fraud patterns while maintaining strict data privacy. Real-world deployments have demonstrated significant improvements in fraud detection rates and reductions in false positives compared to traditional out-of-the-box rule systems.

AI-Powered Healthcare: Building Reliable Care Agents in Production

Sword Health

Sword Health, a digital health company specializing in remote physical therapy, developed Phoenix, an AI care agent that provides personalized support to patients during and after rehabilitation sessions while acting as a co-pilot for physical therapists. The company faced challenges deploying LLMs in a highly regulated healthcare environment, requiring robust guardrails, evaluation frameworks, and human oversight. Through iterative development focusing on prompt engineering, RAG for domain knowledge, comprehensive evaluation systems combining human and LLM-based ratings, and continuous data monitoring, Sword Health successfully shipped AI-powered features that improve care accessibility and efficiency while maintaining clinical safety through human-in-the-loop validation for all clinical decisions.

AI-Powered Help Desk for Accounts Payable Automation

Xelix

Xelix developed an AI-enabled help desk system to automate responses to vendor inquiries for accounts payable teams who often receive over 1,000 emails daily. The solution uses a multi-stage pipeline that classifies incoming emails, enriches them with vendor and invoice data from ERP systems, and generates contextual responses using LLMs. The system handles invoice status inquiries, payment reminders, and statement reconciliation requests, with confidence scoring to indicate response reliability. By pre-generating responses and surfacing relevant financial data, the platform reduces average handling time for tickets while maintaining human oversight through a review-and-send workflow, enabling AP teams to process high volumes of vendor communications more efficiently.

AI-Powered Hormonal Health Platform Built in 8 Weeks

FemmFlo

FemmFlo, a women's health tech startup, developed an LLM-powered platform to address the massive data gap in women's hormonal health, where millions of women wait over seven years for accurate diagnoses. Working with Millio AI and leveraging AWS services, they built a full MVP in just eight weeks that integrates hormonal tracking, lab diagnostics, mental health support, and personalized care recommendations through an AI agent named Gabby. The platform was designed for rapid deployment with beta users, lab integrations, and partnerships, specifically targeting underserved women with culturally relevant, localized healthcare guidance. The solution uses AWS Bedrock agents, API Gateway, DynamoDB, S3, and other managed services to deliver a scalable, cost-effective system that translates complex lab results into actionable health insights while maintaining clinical rigor through a controlled testing environment.

AI-Powered Hybrid Approach for Large-Scale Test Migration from Enzyme to React Testing Library

Slack

Slack faced the challenge of migrating 15,500 Enzyme test cases to React Testing Library to enable upgrading to React 18, an effort estimated at over 10,000 engineering hours across 150+ developers. The team developed an innovative hybrid approach combining Abstract Syntax Tree (AST) transformations with Large Language Models (LLMs), specifically Claude 2.1, to automate the conversion process. The solution involved a sophisticated pipeline that collected context including DOM trees, performed partial AST conversions with annotations, and leveraged LLMs to handle complex cases that traditional codemods couldn't address. This hybrid approach achieved an 80% success rate for automated conversions and saved developers 22% of their migration time, ultimately enabling the complete migration by May 2024.

AI-Powered Incident Response System with Multi-Agent Investigation

Incident.io

Incident.io developed an AI SRE product to automate incident investigation and response for tech companies. The product uses a multi-agent system to analyze incidents by searching through GitHub pull requests, Slack messages, historical incidents, logs, metrics, and traces to build hypotheses about root causes. When incidents occur, the system automatically creates investigations that run parallel searches, generate findings, formulate hypotheses, ask clarifying questions through sub-agents, and present actionable reports in Slack within 1-2 minutes. The system demonstrates significant value by reducing mean time to detection and resolution while providing continuous ambient monitoring throughout the incident lifecycle, working collaboratively with human responders.

AI-Powered Insurance Claims Chatbot with Continuous Feedback Loop

Allianz

Allianz Benelux tackled their complex insurance claims process by implementing an AI-powered chatbot using Landbot. The system processed over 92,000 unique search terms, categorized insurance products, and implemented a real-time feedback loop with Slack and Trello integration. The solution achieved 90% positive ratings from 18,000+ customers while significantly simplifying the claims process and improving operational efficiency.

AI-Powered Lesson Generation System for Language Learning

Duolingo

Duolingo implemented an LLM-based system to accelerate their lesson creation process, enabling their teaching experts to generate language learning content more efficiently. The system uses carefully crafted prompts that combine fixed rules and variable parameters to generate exercises that meet specific educational requirements. This has resulted in faster course development, allowing Duolingo to expand their course offerings and deliver more advanced content while maintaining quality through human expert oversight.

AI-Powered Marketing Compliance Automation System

Remitly

Remitly, a global financial services company operating in 170 countries, developed an AI-based system to streamline their marketing compliance review process. The system analyzes marketing content against regulatory guidelines and internal policies, providing real-time feedback to marketers before legal review. The initial implementation focused on English text content, achieving 95% accuracy and 97% recall in identifying compliance issues, reducing the back-and-forth between marketing and legal teams, and significantly improving time-to-market for marketing materials.

AI-Powered Marketing Compliance Monitoring at Scale

PerformLine

PerformLine, a marketing compliance platform, needed to efficiently process complex product pages containing multiple overlapping products for compliance checks. They developed a serverless, event-driven architecture using Amazon Bedrock with Amazon Nova models to parse and extract contextual information from millions of web pages daily. The solution implemented prompt engineering with multi-pass inference, achieving a 15% reduction in human evaluation workload and over 50% reduction in analyst workload through intelligent content deduplication and change detection, while processing an estimated 1.5-2 million pages daily to extract 400,000-500,000 products for compliance review.

AI-Powered Marketing Intelligence Platform Accelerates Industry Analysis

CLICKFORCE

CLICKFORCE, a digital advertising leader in Taiwan, faced challenges with generic AI outputs, disconnected internal datasets, and labor-intensive analysis processes that took two to six weeks to complete industry reports. The company built Lumos, an AI-powered marketing analysis platform using Amazon Bedrock Agents for contextualized reasoning, Amazon SageMaker for Text-to-SQL fine-tuning, Amazon OpenSearch for vector embeddings, and AWS Glue for data integration. The solution reduced industry analysis time from weeks to under one hour, achieved a 47% reduction in operational costs, and enabled multiple stakeholder groups to independently generate insights without centralized analyst teams.

AI-Powered Medical Content Review and Revision at Scale

Flo Health

Flo Health, a leading women's health app, partnered with AWS Generative AI Innovation Center to develop MACROS (Medical Automated Content Review and Revision Optimization Solution), an AI-powered system for verifying and maintaining the accuracy of thousands of medical articles. The solution uses Amazon Bedrock foundation models to automatically review medical content against established guidelines, identify outdated or inaccurate information, and propose evidence-based revisions while maintaining Flo's editorial style. The proof of concept achieved 80% accuracy and over 90% recall in identifying content requiring updates, significantly reduced processing time from hours to minutes per guideline, and demonstrated more consistent application of medical guidelines compared to manual reviews while reducing the workload on medical experts.

AI-Powered Merchant Classification Correction Agent

Ramp

Ramp built an AI agent to automatically fix incorrect merchant classifications that were previously causing customer frustration and requiring hours of manual intervention from support, finance, and engineering teams. The solution uses a large language model backed by embeddings and OLAP queries, multimodal retrieval augmented generation (RAG) with receipt image analysis, and carefully constructed guardrails to validate and process user-submitted correction requests. The agent now handles nearly 100% of requests (compared to less than 3% previously handled manually) in under 10 seconds with a 99% improvement rate according to LLM-based evaluation, saving both customer time and substantial operational costs.

AI-Powered Natural Language Flight Search Implementation

Alaska Airlines

Alaska Airlines implemented a natural language destination search system powered by Google Cloud's Gemini LLM to transform their flight booking experience. The system moves beyond traditional flight search by allowing customers to describe their desired travel experience in natural language, considering multiple constraints and preferences simultaneously. The solution integrates Gemini with Alaska Airlines' existing flight data and customer information, ensuring recommendations are grounded in actual available flights and pricing.

AI-Powered Network Operations Assistant with Multi-Agent RAG Architecture

Swisscom

Swisscom, Switzerland's leading telecommunications provider, developed a Network Assistant using Amazon Bedrock to address the challenge of network engineers spending over 10% of their time manually gathering and analyzing data from multiple sources. The solution implements a multi-agent RAG architecture with specialized agents for documentation management and calculations, combined with an ETL pipeline using AWS services. The system is projected to reduce routine data retrieval and analysis time by 10%, saving approximately 200 hours per engineer annually while maintaining strict data security and sovereignty requirements for the telecommunications sector.

AI-Powered Neurosurgery: From Brain Tumor Classification to Surgical Planning

Cedars Sinai

Cedars Sinai and various academic institutions have implemented AI and machine learning solutions to improve neurosurgical outcomes across multiple areas. The applications include brain tumor classification using CNNs achieving 95% accuracy (surpassing traditional radiologists), hematoma prediction and management using graph neural networks with 80%+ accuracy, and AI-assisted surgical planning and intraoperative guidance. The implementations demonstrate significant improvements in patient outcomes while highlighting the importance of balanced innovation with appropriate regulatory oversight.

AI-Powered On-Call Assistant for Airflow Pipeline Debugging

Wix

Wix developed AirBot, an AI-powered Slack agent to address the operational burden of managing over 3,500 Apache Airflow pipelines processing 4 billion daily HTTP transactions across a 7 petabyte data lake. The traditional manual debugging process required engineers to act as "human error parsers," navigating multiple distributed systems (Airflow, Spark, Kubernetes) and spending approximately 45 minutes per incident to identify root causes. AirBot leverages LLMs (GPT-4o Mini and Claude 4.5 Opus) in a Chain of Thought architecture to automatically investigate failures, generate diagnostic reports, create pull requests with fixes, and route alerts to appropriate team owners. The system achieved measurable impact by saving approximately 675 engineering hours per month (equivalent to 4 full-time engineers), generating 180 candidate pull requests with a 15% fully automated fix rate, and reducing debugging time by at least 15 minutes per incident while maintaining cost efficiency at $0.30 per AI interaction.

AI-Powered Performance Optimization System for Go Code

Uber

Uber developed PerfInsights, a production system that combines runtime profiling data with generative AI to automatically detect performance antipatterns in Go services and recommend optimizations. The system addresses the challenge of expensive manual performance tuning by using LLMs to analyze the most CPU-intensive functions identified through profiling, applying sophisticated prompt engineering and validation techniques including LLM juries and rule-based checkers to reduce false positives from over 80% to the low teens. This has resulted in hundreds of merged optimization diffs, significant engineering time savings (93% reduction from 14.5 hours to 1 hour per issue), and measurable compute cost reductions across Uber's Go services.

AI-Powered PLC Code Generation for Industrial Automation

Wipro PARI

Wipro PARI, a global automation company, partnered with AWS and ShellKode to develop an AI-powered solution that transforms the manual process of generating Programmable Logic Controller (PLC) ladder text code from complex process requirements. Using Amazon Bedrock with Anthropic's Claude models, advanced prompt engineering techniques, and custom validation logic, the system reduces PLC code generation time from 3-4 days to approximately 10 minutes per requirement while achieving up to 85% code accuracy. The solution automates validation against IEC 61131-3 industry standards, handles complex state management and transition logic, and provides a user-friendly interface for industrial engineers, resulting in 5,000 work-hours saved across projects and enabling Wipro PARI to win key automotive clients.

AI-Powered Postmortem Analysis for Site Reliability Engineering

Zalando

Zalando developed an LLM-powered pipeline to analyze thousands of incident postmortems accumulated over two years, transforming them from static documents into actionable strategic insights. The traditional human-centric approach to postmortem analysis was unable to scale to the volume of incidents, requiring 15-20 minutes per document and making it impossible to identify systemic patterns across the organization. Their solution involved building a multi-stage LLM pipeline that summarizes, classifies, analyzes, and identifies patterns across incidents, with a particular focus on datastore technologies (Postgres, DynamoDB, ElastiCache, S3, and Elasticsearch). Despite challenges with hallucinations and surface attribution errors, the system reduced analysis time from days to hours, achieved 3x productivity gains, and uncovered critical investment opportunities such as automated change validation that prevented 25% of subsequent datastore incidents.

AI-Powered Real Estate Transaction Newsworthiness Detection System

The Globe and Mail

A collaboration between journalists and technologists from multiple news organizations (Hearst, Gannett, The Globe and Mail, and E24) developed an AI system to automatically detect newsworthy real estate transactions. The system combines anomaly detection, LLM-based analysis, and human feedback to identify significant property transactions, with a particular focus on celebrity involvement and price anomalies. Early results showed promise with few-shot prompting, and the system successfully identified several newsworthy transactions that might have otherwise been missed by traditional reporting methods.

AI-Powered Real-Time Content Moderation with Prevalence Measurement

Pinterest

Pinterest built a real-time AI-assisted system to measure the prevalence of policy-violating contentโ€”the percentage of daily views that went to harmful contentโ€”to address the limitations of relying solely on user reports. The company developed a workflow combining ML-assisted impression-weighted sampling with multimodal LLM labeling to process daily samples at scale. This approach reduced labeling turnaround time by 15x compared to human-only review while maintaining comparable decision quality, enabling continuous monitoring across multiple policy areas, faster intervention testing, and proactive risk detection that was previously impossible with infrequent manual studies.

AI-Powered Regression Testing with Natural Language Test Case Generation

Duolingo

Duolingo's QA team faced significant challenges with manual regression testing that consumed substantial bandwidth each week, requiring multiple team members several hours to validate releases against their highly iterative product with numerous A/B tests and feature variants. To address this, they partnered with MobileBoost in 2024 to implement GPT Driver, an AI-powered testing tool that accepts natural language instructions and executes them on virtual devices. By reframing test cases from prescriptive step-by-step instructions to goal-oriented prompts (e.g., "Progress through screens until you see XYZ"), they enabled the system to adapt to changing UIs and feature variations while maintaining test reliability. The solution reduced manual regression testing workflows by 70%, allowing QA team members to shift from hours of manual execution to minutes of reviewing recorded test runs, thereby freeing the team to focus on higher-value activities like bug fixes and new feature testing.

AI-Powered Root Cause Analysis Assistant for Race Day Operations

Formula 1

Formula 1 developed an AI-driven root cause analysis assistant using Amazon Bedrock to streamline issue resolution during race events. The solution reduced troubleshooting time from weeks to minutes by enabling engineers to query system issues using natural language, automatically checking system health, and providing remediation recommendations. The implementation combines ETL pipelines, RAG, and agentic capabilities to process logs and interact with internal systems, resulting in an 86% reduction in end-to-end resolution time.

AI-Powered Sales Intelligence and Go-to-Market Orchestration Platform

Clay

Clay is a creative sales and marketing platform that helps companies execute go-to-market strategies by turning unstructured data about companies and people into actionable insights. The platform addresses the challenge of finding unique competitive advantages in sales ("go-to-market alpha") by integrating with over 150 data providers and using LLM-powered agents to research prospects, enrich data, and automate outreach. Their flagship agent "Claygent" performs web research to extract custom data points that aren't available in traditional sales databases, while their newer "Navigator" agent can interact with web forms and complex websites. Clay has achieved significant scale, crossing one billion agent runs and targeting two billion runs annually, while maintaining a philosophy that data will be imperfect and building tools for rapid iteration, validation, and trust-building through features like session replay.

AI-Powered Security Operations Center with Agentic AI for Threat Detection and Response

Trellix

Trellix, in partnership with AWS, developed an AI-powered Security Operations Center (SOC) using agentic AI to address the challenge of overwhelming security alerts that human analysts cannot effectively process. The solution leverages AWS Bedrock with multiple models (Amazon Nova for classification, Claude Sonnet for analysis) to automatically investigate security alerts, correlate data across multiple sources, and provide detailed threat assessments. The system uses a multi-agent architecture where AI agents autonomously select tools, gather context from various security platforms, and generate comprehensive incident reports, significantly reducing the burden on human analysts while improving threat detection accuracy.

AI-Powered Self-Remediation Loop for Large-Scale Kubernetes Operations

Salesforce

Salesforce's Hyperforce Kubernetes platform team manages over 1,400 clusters scaling millions of pods, facing significant operational challenges with engineers spending over 1,000 hours monthly on support tasks. They developed a multi-agent AI-powered self-remediation loop built on AWS Bedrock's multi-agent collaboration framework, integrating with their existing monitoring and automation tools (Prometheus, K8sGPT, Argo CD, and custom tools like Sloop and Periscope). The solution features a manager AI agent that orchestrates multiple specialized worker agents to retrieve telemetry data, perform root cause analysis using RAG-augmented runbooks, and execute safe remediation actions with human-in-the-loop approval via Slack. The implementation achieved a 30% improvement in troubleshooting time and saved approximately 150 hours per month in operational toil, with plans to expand capabilities using knowledge graphs and advanced anomaly detection.

AI-Powered Slack Conversation Summarization System

Salesforce

Salesforce AI Research developed AI Summarist, a conversational AI-powered tool to address information overload in Slack workspaces. The system uses state-of-the-art AI to automatically summarize conversations, channels, and threads, helping users manage their information consumption based on work preferences. The solution processes messages through Slack's API, disentangles conversations, and generates concise summaries while maintaining data privacy by not storing any summarized content.

AI-Powered Sleep Coach for CBTI Protocol Delivery

Rest

Rest, a company that evolved from developing a podcast player app, built an AI sleep coach to help people solve chronic sleep problems through an 8-week protocol based on Cognitive Behavioral Therapy for Insomnia (CBTI). The problem they identified was that while CBTI is clinically proven to be effective for 80% of people with insomnia, it typically costs thousands of dollars, requires specialized practitioners who have year-long waitlists, and isn't accessible to most people. Rest's solution uses voice-first AI agents powered by OpenAI's GPT-4 and integrated with Vapi for voice capabilities, creating daily check-ins where the AI coaches users through the CBTI protocol with personalized guidance based on their sleep logs, behavioral patterns, and personal context stored in a custom memory system. The product evolved iteratively from a text-based chatbot to a sophisticated voice agent with RAG for knowledge retrieval, dynamic agenda generation tailored to each user's program stage and recent sleep data, and multi-layered memory systems that track user context over time. The company now logs hundreds of hours of voice conversations monthly with users preferring voice interactions for the intimacy and ease it provides in discussing sleep challenges.

AI-Powered SNAP Benefits Notice Interpretation System

Propel

Propel developed an AI system to help SNAP (food stamp) recipients better understand official notices they receive. The system uses LLMs to analyze notice content and provide clear explanations of importance and required actions. The prototype successfully interprets complex government communications and provides simplified, actionable guidance while maintaining high safety standards for this sensitive use case.

AI-Powered SRE Agent for Production Infrastructure Management

Cleric AI

Cleric Ai addresses the growing complexity of production infrastructure management by developing an AI-powered agent that acts as a team member for SRE and DevOps teams. The system autonomously monitors infrastructure, investigates issues, and provides confident diagnoses through a reasoning engine that leverages existing observability tools and maintains a knowledge graph of infrastructure relationships. The solution aims to reduce engineer workload by automating investigation workflows and providing clear, actionable insights.

AI-Powered Supply Chain Visibility and ETA Prediction System

Toyota / IBM

Toyota partnered with IBM and AWS to develop an AI-powered supply chain visibility platform that addresses the automotive industry's challenges with delivery prediction accuracy and customer transparency. The system uses machine learning models (XGBoost, AdaBoost, random forest) for time series forecasting and regression to predict estimated time of arrival (ETA) for vehicles throughout their journey from manufacturing to dealer delivery. The solution integrates real-time event streaming, feature engineering with Amazon SageMaker, and batch inference every four hours to provide near real-time predictions. Additionally, the team implemented an agentic AI chatbot using AWS Bedrock to enable natural language queries about vehicle status. The platform provides customers and dealers with visibility into vehicle journeys through a "pizza tracker" style interface, improving customer satisfaction and enabling proactive delay management.

AI-Powered Sustainable Fishing with LLM-Enhanced Domain Knowledge Integration

Furuno

Furuno, a marine electronics company known for inventing the first fish finder in 1948, is addressing sustainable fishing challenges by combining traditional fishermen's knowledge with AI and LLMs. They've developed an ensemble model approach that combines image recognition, classification models, and a unique knowledge model enhanced by LLMs to help identify fish species and make better fishing decisions. The system is being deployed as a $300 monthly subscription service, with initial promising results in improving fishing efficiency while promoting sustainability.

AI-Powered Text Message-Based Healthcare Treatment Management System

Stride

Stride developed an AI-powered text message-based healthcare treatment management system for Aila Science to assist patients through self-administered telemedicine regimens, particularly for early pregnancy loss treatment. The system replaced manual human operators with LLM-powered agents that can interpret patient responses, provide medically-approved guidance, schedule messages, and escalate complex situations to human reviewers. The solution achieved approximately 10x capacity improvement while maintaining treatment quality and safety through a hybrid human-in-the-loop approach.

AI-Powered Tour Guide for Financial Platform Navigation

Ramp

Ramp developed an AI-powered Tour Guide agent to help users navigate their financial operations platform more effectively. The solution guides users through complex tasks by taking control of cursor movements while providing step-by-step explanations. Using an iterative action-taking approach and optimized prompt engineering, the Tour Guide increases user productivity and platform accessibility while maintaining user trust through transparent human-agent collaboration.

AI-Powered Trade Assistant for Equities Trading Workflows

Jefferies Equities

Jefferies Equities, a full-service investment bank, developed an AI Trade Assistant on Amazon Bedrock to address challenges faced by their front-office traders who struggled to access and analyze millions of daily trades stored across multiple fragmented data sources. The solution leverages LLMs (specifically Amazon Titan embeddings model) to enable traders to query trading data using natural language, automatically generating SQL queries and visualizations through a conversational interface integrated into their existing business intelligence platform. In a beta rollout to 50 users across sales and trading operations, the system delivered an 80% reduction in time spent on routine analytical tasks, high adoption rates, and reduced technical burden on IT teams while democratizing data access across trading desks.

AI-Powered Transformation of AWS Support for Mission-Critical Workloads

Whoop

AWS Support transformed from a reactive firefighting model to a proactive AI-augmented support system to handle the increasing complexity of cloud operations. The transformation involved building autonomous agents, context-aware systems, and structured workflows powered by Amazon Bedrock and Connect to provide faster incident response and proactive guidance. WHOOP, a health wearables company, utilized AWS's new Unified Operations offering to successfully launch two new hardware products with 10x mobile traffic and 200x e-commerce traffic scaling, achieving 100% availability in May 2025 and reducing critical case response times from 8 minutes to under 2.5 minutes, ultimately improving quarterly availability from 99.85% to 99.95%.

AI-Powered Travel Assistant for Rail and Coach Platform

Trainline

Trainline, the world's leading rail and coach ticketing platform serving 27 million customers across 40 countries, developed an AI-powered travel assistant to address underserved customer needs during the travel experience. The company identified that while they excelled at selling tickets, customers lacked support during their journeys when disruptions occurred or they had questions about their travel. They built an agentic AI system using LLMs that could answer diverse customer questions ranging from refund requests to real-time train information to unusual queries like bringing pets or motorbikes on trains. The solution went from concept to production in five months, launching in February 2025, and now handles over 300,000 conversations monthly. The system uses a central orchestrator with multiple tools including RAG with 700,000 pages of curated content, real-time train data APIs, terms and conditions lookups, and automated refund capabilities, all protected by multiple layers of guardrails to ensure safety and factual accuracy.

AI-Powered Vehicle Information Platform for Dealership Sales Support

Toyota

Toyota Motor North America (TMNA) and Toyota Connected built a generative AI platform to help dealership sales staff and customers access accurate vehicle information in real-time. The problem was that customers often arrived at dealerships highly informed from internet research, while sales staff lacked quick access to detailed vehicle specifications, trim options, and pricing. The solution evolved from a custom RAG-based system (v1) using Amazon Bedrock, SageMaker, and OpenSearch to retrieve information from official Toyota data sources, to a planned agentic platform (v2) using Amazon Bedrock AgentCore with Strands agents and MCP servers. The v1 system achieved over 7,000 interactions per month across Toyota's dealer network, with citation-backed responses and legal compliance built in, while v2 aims to enable more dynamic actions like checking local vehicle availability.

AI-Powered Voice Agents for Proactive Hotel Payment Verification

Perk

Perk, a business travel management platform, faced a critical problem where virtual credit cards sent to hotels sometimes weren't charged before guest arrival, leading to catastrophic check-in experiences for exhausted travelers. To prevent this, their customer care team was making approximately 10,000 proactive phone calls per week to hotels. The team built an AI voice agent system that autonomously calls hotels to verify and request payment processing. Starting with a rapid prototype using Make.com, they iterated through extensive prompt engineering, call structure refinement, and comprehensive evaluation frameworks. The solution now successfully handles tens of thousands of calls weekly across multiple languages (English, German), matching or exceeding human performance while dramatically reducing manual workload and uncovering additional operational insights through systematic call classification.

Architecture and Production Patterns of Autonomous Coding Agents

Anthropic

This talk explores the architecture and production implementation patterns behind modern autonomous coding agents like Claude Code, Cursor, and others, presented by Jared from Prompt Layer. The speaker examines why coding agents have recently become effective, arguing that the key innovation is a simple while-loop architecture with tool calling, combined with improved models, rather than complex DAGs or RAG systems. The presentation covers implementation details including tool design (particularly bash as the universal adapter), context management strategies, sandboxing approaches, and evaluation methodologies. The speaker's company, Prompt Layer, has reorganized their engineering practices around Claude Code, establishing a rule that any task completable in under an hour using the agent should be done immediately, demonstrating practical production adoption and measurable productivity gains.

Architecture Patterns for Production AI Systems: Lessons from Building and Failing with Generative AI Products

Outropy

Phil Calรงado shares a post-mortem analysis of Outropy, a failed AI productivity startup that served thousands of users, revealing why most AI products struggle in production. Despite having superior technology compared to competitors like Salesforce's Slack AI, Outropy failed commercially but provided valuable insights into building production AI systems. Calรงado argues that successful AI products require treating agents as objects and workflows as data pipelines, applying traditional software engineering principles rather than falling into "Twitter-driven development" or purely data science approaches.

Auto-Moderating Car Dealer Reviews with GenAI

Edmunds

Edmunds transformed their dealer review moderation process from a manual system taking up to 72 hours to an automated GenAI solution using GPT-4 through Databricks Model Serving. The solution processes over 300 daily dealer quality-of-service reviews, reducing moderation time from days to minutes and requiring only two moderators instead of a larger team. The implementation included careful prompt engineering and integration with Databricks Unity Catalog for improved data governance.

Automated Carrier Claims Management Using AI Agents

FIEGE

FIEGE, a major German logistics provider, implemented an AI agent system to handle carrier claims processing end-to-end, launched in September 2024. The system automatically processes claims from initial email receipt through resolution, handling multiple languages and document types. By implementing a controlled approach with sandboxed generative AI and templated responses, the system successfully processes 70-90% of claims automatically, resulting in eight-digit cost savings while maintaining high accuracy and reliability.

Automated Data Journalism Platform Using LLMs for Real-time News Generation

Realtime

Realtime built an automated data journalism platform that uses LLMs to generate news stories from continuously updated datasets and news articles. The system processes raw data sources, performs statistical analysis, and employs GPT-4 Turbo to generate contextual summaries and headlines. The platform successfully automates routine data journalism tasks while maintaining transparency about AI usage and implementing safeguards against common LLM pitfalls.

Automated Email Triage System Using Amazon Bedrock Flows

Parameta

Parameta Solutions, a financial data services provider, transformed their client email processing system from a manual workflow to an automated solution using Amazon Bedrock Flows. The system intelligently processes technical support queries by classifying emails, extracting relevant entities, validating information, and generating appropriate responses. This transformation reduced resolution times from weeks to days while maintaining high accuracy and operational control, achieved within a two-week implementation period.

Automated Evaluation Framework for LLM-Powered Features

Slack

Slack's machine learning team developed a comprehensive evaluation framework for their LLM-powered features, including message summarization and natural language search. They implemented a three-tiered evaluation approach using golden sets, validation sets, and A/B testing, combined with automated quality metrics to assess various aspects like hallucination detection and system integration. This framework enabled rapid prototyping and continuous improvement of their generative AI products while maintaining quality standards.

Automated HCC Code Extraction from Clinical Notes Using Healthcare NLP

WVU Medicine

WVU Medicine implemented an automated system for extracting Hierarchical Condition Category (HCC) codes from clinical notes using John Snow Labs' Healthcare NLP models. The system processes radiology notes for upcoming patient appointments, extracts relevant diagnoses, converts them to CPT codes, and then maps them to HCC codes. The solution went live in December 2023 and has processed over 27,000 HCC codes with an 18.4% acceptance rate by providers, positively impacting over 5,000 patients.

Automated LLM Evaluation and Quality Monitoring in Customer Support Analytics

Echo AI

Echo AI, leveraging Log10's platform, developed a system for analyzing customer support interactions at scale using LLMs. They faced the challenge of maintaining accuracy and trust while processing high volumes of customer conversations. The solution combined Echo AI's conversation analysis capabilities with Log10's automated feedback and evaluation system, resulting in a 20-point F1 score improvement in accuracy and the ability to automatically evaluate LLM outputs across various customer-specific use cases.

Automated Medical Literature Review System Using Domain-Specific LLMs

John Snow Labs

John Snow Labs developed a medical chatbot system that automates the traditionally time-consuming process of medical literature review. The solution combines proprietary medical-domain-tuned LLMs with a comprehensive medical research knowledge base, enabling researchers to analyze hundreds of papers in minutes instead of weeks or months. The system includes features for custom knowledge base integration, intelligent data extraction, and automated filtering based on user-defined criteria, while maintaining explainability and citation tracking.

Automated News Analysis and Bias Detection Platform

AskNews

AskNews developed a news analysis platform that processes 500,000 articles daily across multiple languages, using LLMs to extract facts, analyze bias, and identify contradictions between sources. The system employs edge computing with open-source models like Llama for cost-effective processing, builds knowledge graphs for complex querying, and provides programmatic APIs for automated news analysis. The platform helps users understand global perspectives on news topics while maintaining journalistic standards and transparency.

Automated Performance Optimization with GenAI-Powered Code Analysis

Uber

Uber developed PerfInsights to address unsustainable compute costs from inefficient Go services, where traditionally manual performance optimization required deep expertise and days or weeks of effort. The system combines runtime CPU/memory profiling with GenAI-powered static analysis to automatically detect performance antipatterns in Go code, using LLM juries and rule-based validation (LLMCheck) to reduce hallucinations and false positives from over 80% to the low teens. Since deployment, PerfInsights has generated hundreds of merged optimization diffs, reduced antipattern detection time by 93% (from 14.5 hours to under 1 hour per issue), eliminated approximately 3,800 hours of manual engineering effort annually, and achieved a 33.5% reduction in codebase antipatterns over four months while delivering measurable compute cost savings.

Automated Prompt Optimization for Intelligent Text Processing using Amazon Bedrock

Yuewen Group

Yuewen Group, a global online literature platform, transitioned from traditional NLP models to Claude 3.5 Sonnet on Amazon Bedrock for intelligent text processing. Initially facing challenges with unoptimized prompts performing worse than traditional models, they implemented Amazon Bedrock's Prompt Optimization feature to automatically enhance their prompts. This led to significant improvements in accuracy for tasks like character dialogue attribution, achieving 90% accuracy compared to the previous 70% with unoptimized prompts and 80% with traditional NLP models.

Automated Reasoning Checks in Amazon Bedrock Guardrails for Responsible AI Deployment

PwC

PwC and AWS collaborated to develop Automated Reasoning checks in Amazon Bedrock Guardrails to address the challenge of deploying generative AI solutions while maintaining accuracy, security, and compliance in regulated industries. The solution combines mathematical verification with LLM outputs to provide verifiable trust and rapid deployment capabilities. Three key use cases were implemented: EU AI Act compliance for financial services risk management, pharmaceutical content review through the Regulated Content Orchestrator (RCO), and utility outage management for real-time decision support, all demonstrating enhanced accuracy and compliance verification compared to traditional probabilistic methods.

Automated Search Quality Evaluation Using LLMs for Typeahead Suggestions

LinkedIn

LinkedIn developed an automated evaluation system using GPT models served through Azure to assess the quality of their typeahead search suggestions at scale. The system replaced manual human evaluation with automated LLM-based assessment, using carefully engineered prompts and a golden test set. The implementation resulted in faster evaluation cycles (hours instead of weeks) and demonstrated significant improvements in suggestion quality, with one experiment showing a 6.8% absolute improvement in typeahead quality scores.

Automated Sign Language Translation Using Large Language Models

VSL Labs

VSL Labs is developing an automated system for translating English into American Sign Language (ASL) using generative AI models. The solution addresses the significant challenges faced by the deaf community, including limited availability and high costs of human interpreters. Their platform uses a combination of in-house and GPT-4 models to handle text processing, cultural adaptation, and generates precise signing instructions including facial expressions and body movements for realistic avatar-based sign language interpretation.

Automated Software Development Insights and Communication Platform

Blueprint AI

Blueprint AI addresses the challenge of communication and understanding between business and technical teams in software development by leveraging LLMs. The platform automatically analyzes data from various sources like GitHub and Jira, creating intelligent reports that surface relevant insights, track progress, and identify potential blockers. The system provides 24/7 monitoring and context-aware updates, helping teams stay informed about development progress without manual reporting overhead.

Automated Sports Commentary Generation using LLMs

WSC Sport

WSC Sport developed an automated system to generate real-time sports commentary and recaps using LLMs. The system takes game events data and creates coherent, engaging narratives that can be automatically translated into multiple languages and delivered with synthesized voice commentary. The solution reduced production time from 3-4 hours to 1-2 minutes while maintaining high quality and accuracy.

Automated Synopsis Generation Pipeline with Human-in-the-Loop Quality Control

Netflix

Netflix developed an automated pipeline for generating show and movie synopses using LLMs, replacing a highly manual context-gathering process. The system uses Metaflow to orchestrate LLM-based content summarization and synopsis generation, with multiple human feedback loops and automated quality control checks. While maintaining human writers and editors in the process, the system has significantly improved efficiency and enabled the creation of more synopses per title while maintaining quality standards.

Automated Unit Test Improvement Using LLMs for Android Applications

Meta

Meta developed TestGen-LLM, a tool that leverages large language models to automatically improve unit test coverage for Android applications written in Kotlin. The system uses an Assured Offline LLM-Based Software Engineering approach to generate additional test cases while maintaining strict quality controls. When deployed at Meta, particularly for Instagram and Facebook platforms, the tool successfully enhanced 10% of the targeted classes with reliable test improvements that were accepted by engineers for production use.

Automating Code Sample Updates with LLMs for Technical Documentation

Wix

When Wix needed to update over 2,000 code samples in their API reference documentation due to a syntax change, they implemented an LLM-based automation solution instead of manual updates. The team used GPT-4 for code classification and GPT-3.5 Turbo for code conversion, combined with TypeScript compilation for validation. This automated approach reduced what would have been weeks of manual work to a single morning of team involvement, while maintaining high accuracy in the code transformations.

Automating Community Conference Operations with AI Coding Agents

PyCon

A volunteer-run conference organization (PyData/PyConDE) with events serving up to 1,500 attendees faced significant operational overhead in managing tickets, marketing, video production, and community engagement. Over a three-month period, the team experimented with various AI coding agents (Claude, Gemini, Qwen Coder Plus, Codex) to automate tasks including LinkedIn scraping for social media content, automated video cutting using computer vision, ticket management integration, and multi-step workflow automation. The results were mixed: while AI agents proved valuable for well-documented API integration, boilerplate code generation, and specific automation tasks like screenshot capture and video processing, they struggled with multi-step procedural workflows, data normalization, and maintaining code quality without close human oversight. The team concluded that AI agents work best when kept on a "short leash" with narrow use cases, frequent commits, and human validation, delivering time savings for generalist tasks but requiring careful expectation management and not delivering the "10x productivity" improvements often claimed.

Automating Enterprise Workflows with Foundation Models in Healthcare

Various

The researchers present aCLAr (Demonstrate, Execute, Validate framework), a system that uses multimodal foundation models to automate enterprise workflows, particularly in healthcare settings. The system addresses limitations of traditional RPA by enabling passive learning from demonstrations, human-like UI navigation, and self-monitoring capabilities. They successfully demonstrated the system automating a real healthcare workflow in Epic EHR, showing how foundation models can be leveraged for complex enterprise automation without requiring API integration.

Automating Healthcare Documentation and Rule Management with GenAI

Orizon

Orizon, a healthcare tech company, faced challenges with manual code documentation and rule interpretation for their medical billing fraud detection system. They implemented a GenAI solution using Databricks' platform to automate code documentation and rule interpretation, resulting in 63% of tasks being automated and reducing documentation time to under 5 minutes. The solution included fine-tuned Llama2-code and DBRX models deployed through Mosaic AI Model Serving, with strict governance and security measures for protecting sensitive healthcare data.

Automating Healthcare Procedure Code Selection Through Domain-Specific LLM Platform

Hasura / PromptQL

A large public healthcare company specializing in radiology software deployed an AI-powered automation solution to streamline the complex process of procedure code selection during patient appointment scheduling. The traditional manual process took 12-15 minutes per call, requiring operators to navigate complex UIs and select from hundreds of procedure codes that varied by clinic, regulations, and patient circumstances. Using PromptQL's domain-specific LLM platform, non-technical healthcare administrators can now write automation logic in natural language that gets converted into executable code, reducing call times and potentially delivering $50-100 million in business impact through increased efficiency and reduced training costs.

Automating Job Role Extraction Using Prosus AI Assistant in Production

OLX

OLX faced a challenge with unstructured job roles in their job listings platform, making it difficult for users to find relevant positions. They implemented a production solution using Prosus AI Assistant, a GenAI/LLM model, to automatically extract and standardize job roles from job listings. The system processes around 2,000 daily job updates, making approximately 4,000 API calls per day. Initial A/B testing showed positive uplift in most metrics, particularly in scenarios with fewer than 50 search results, though the high operational cost of ~15K per month has led them to consider transitioning to self-hosted models.

Automating Merchant Onboarding with Reinforcement Learning

Doordash

DoorDash faced challenges with menu accuracy during merchant onboarding, where their existing AI system struggled with diverse and messy real-world menu formats. Working with Applied Compute, they developed an automated grading system calibrated to internal expert standards, then used reinforcement learning to train a menu error correction model against this grader as a reward function. The solution achieved a 30% relative reduction in low-quality menus and was rolled out to all USA menu traffic, demonstrating how institutional knowledge can be encoded into automated training signals for production LLM systems.

Automating Post Incident Review Summaries with GPT-4

Canva

Canva implemented GPT-4 chat to automate the summarization of Post Incident Reports (PIRs), addressing inconsistency and workload challenges in their incident review process. The solution involves extracting PIR content from Confluence, preprocessing to remove sensitive data, using carefully crafted prompts with GPT-4 chat for summary generation, and integrating the results with their data warehouse and Jira tickets. The implementation proved successful with most AI-generated summaries requiring no human modification while maintaining high quality and consistency.

Automating Private Credit Deal Analysis with LLMs and RAG

Riskspan

Riskspan, a technology company providing analysis for complex investment asset classes, tackled the challenge of analyzing private credit deals that traditionally required 3-4 weeks of manual document review and Excel modeling. The company built a production GenAI system on AWS using Claude LLM, embeddings, RAG (Retrieval Augmented Generation), and automated code generation to extract information from unstructured documents (PDFs, emails, amendments) and dynamically generate investment waterfall models. The solution reduced deal processing time from 3-4 weeks to 3-5 days, achieved 87% faster customer onboarding, delivered 10x scalability improvement, and reduced per-deal processing costs by 90x to under $50, while enabling the company to address a $9 trillion untapped market opportunity in private credit.

Automating Root Cause Analysis Using Amazon Bedrock Agents

BMW

BMW implemented a generative AI solution using Amazon Bedrock Agents to automate and accelerate root cause analysis (RCA) for cloud incidents in their connected vehicle services. The solution combines architecture analysis, log inspection, metrics monitoring, and infrastructure evaluation tools with a ReAct (Reasoning and Action) framework to identify service disruptions. The automated RCA agent achieved 85% accuracy in identifying root causes, significantly reducing diagnosis times and enabling less experienced engineers to effectively troubleshoot complex issues.

Automating Supplier Ticket Management with LLM Agents

Wayfair

Wayfair developed Wilma, an LLM-based ticket automation system, to automate the manual triage of supplier support tickets in their SupportHub JIRA-based system. The solution uses LangGraph to orchestrate LLM calls and tool interactions for intent classification, language detection, and supplier ID lookup through a ReAct agent with BigQuery access. The system achieved better-than-human performance with 93% accuracy on question type identification (vs. 75% human accuracy), 98% on language detection, and 88% on supplier ID identification, while reducing processing time and allowing associates to focus on higher-value work.

Automating Test Generation with LLMs at Scale

Assembled

Assembled leveraged Large Language Models to automate and streamline their test writing process, resulting in hundreds of saved engineering hours. By developing effective prompting strategies and integrating LLMs into their development workflow, they were able to generate comprehensive test suites in minutes instead of hours, leading to increased test coverage and improved engineering velocity without compromising code quality.

Automating Translation Workflows with LLMs for Post-Editing and Transcreation

TransPerfect

TransPerfect integrated Amazon Bedrock into their GlobalLink translation management system to automate and improve translation workflows. The solution addressed two key challenges: automating post-editing of machine translations and enabling AI-assisted transcreation of creative content. By implementing LLM-powered workflows, they achieved up to 50% cost savings in translation post-editing, 60% productivity gains in transcreation, and up to 80% reduction in project turnaround times while maintaining high quality standards.

Automating Weather Forecast Text Generation Using Fine-Tuned Vision-Language Models

UK MetOffice

The UK Met Office partnered with AWS to automate the generation of the Shipping Forecast, a 100-year-old maritime weather forecast that traditionally required expert meteorologists several hours daily to produce. The solution involved fine-tuning Amazon Nova foundation models (both LLM and vision-language model variants) to convert complex multi-dimensional weather data into structured text forecasts. Within four weeks of prototyping, they achieved 52-62% accuracy using vision-language models and 62% accuracy using text-based LLMs, reducing forecast generation time from hours to under 5 minutes. The project demonstrated scalable architectural patterns for data-to-text conversion tasks involving massive datasets (45GB+ per forecast run) and established frameworks for rapid experimentation with foundation models in production weather services.

Autonomous Codebase Migration at Scale Using LLM-Powered Agents

Spotify

Spotify faced the challenge of maintaining a massive, diverse codebase across thousands of repositories, with developers spending less than one hour per day actually writing code and the rest on maintenance tasks. While they had pre-existing automation through their "fleet management" system that could handle simple migrations like dependency bumps, this approach struggled with the complex "long tail" of edge cases affecting 30% of their codebase. The solution involved building an agentic LLM system that replaces deterministic scripts with AI-powered code generation combined with automated verification loops, enabling unsupervised migrations from prompt to pull request. In the first three months, the system generated over 1,000 merged production PRs, enabling previously impossible large-scale refactors and allowing non-experts to perform complex migrations through natural language prompts rather than writing complicated transformation scripts.

Autonomous Coding Agent Evolution: From Short-Burst to Extended Runtime Operations

Replit

Replit evolved their AI coding agent from V1 (running autonomously for only a couple of minutes) to V2 (running for 10-15 minutes of productive work) through significant rearchitecting and leveraging new frontier models. The company focuses on enabling non-technical users to build complete applications without writing code, emphasizing performance and cost optimization over latency while maintaining comprehensive observability through tools like Langsmith to manage the complexity of production AI agents at scale.

Autonomous Network Operations Using Agentic AI

British Telecom

British Telecom (BT) partnered with AWS to deploy agentic AI systems for autonomous network operations across their 5G standalone mobile network infrastructure serving 30 million subscribers. The initiative addresses major operational challenges including high manual operations costs (up to 20% of revenue), complex failure diagnosis in containerized networks with 20,000 macro sites generating petabytes of data, and difficulties in change impact analysis with 11,000 weekly network changes. The solution leverages AWS Bedrock Agent Core, Amazon SageMaker for multivariate anomaly detection, Amazon Neptune for network topology graphs, and domain-specific community agents for root cause analysis and service impact assessment. Early results focus on cost reduction through automation, improved service level agreements, faster customer impact identification, and enhanced change efficiency, with plans to expand coverage optimization, dynamic network slicing, and further closed-loop automation across all network domains.

Autonomous Observability with AI Agents and Model Context Protocol

Pinterest

Pinterest's observability team faced a fragmented infrastructure challenge where logs, metrics, traces, and change events existed in disconnected silos, predating modern standards like OpenTelemetry. Engineers had to navigate multiple interfaces during incident resolution, increasing mean time to resolution (MTTR) and creating steep learning curves. To address this without a complete infrastructure overhaul, Pinterest developed an MCP (Model Context Protocol) server that acts as a unified interface for AI agents to access all observability data pillars. The centerpiece is "Tricorder Agent," which autonomously gathers relevant information from alerts, generates filtered dashboard links, queries dependencies, and provides root cause hypotheses. Early results show the agent successfully navigating dependency graphs and correlating data across previously disconnected systems, streamlining incident response and reducing the time engineers spend context-switching between tools.

Autonomous Software Development Using Multi-Model LLM System with Advanced Planning and Tool Integration

Factory.ai

Factory.ai has developed Code Droid, an autonomous software development system that leverages multiple LLMs and sophisticated planning capabilities to automate various programming tasks. The system incorporates advanced features like HyperCode for codebase understanding, ByteRank for information retrieval, and multi-model sampling for solution generation. In benchmark testing, Code Droid achieved 19.27% on SWE-bench Full and 31.67% on SWE-bench Lite, demonstrating strong performance in real-world software engineering tasks while maintaining focus on safety and explainability.

Autonomous SRE Agent for Cloud Infrastructure Monitoring Using FastMCP

FuzzyLabs

FuzzyLabs developed an autonomous Site Reliability Engineering (SRE) agent using Anthropic's Model Context Protocol (MCP) with FastMCP to automate the diagnosis of production incidents in cloud-native applications. The agent integrates with Kubernetes, GitHub, and Slack to automatically detect issues, analyze logs, identify root causes in source code, and post diagnostic summaries to development teams. While the proof-of-concept successfully demonstrated end-to-end incident response automation using a custom MCP client with optimizations like tool caching and filtering, the project raises important questions about effectiveness measurement, security boundaries, and cost optimization that require further research.

Avoiding Unearned Complexity in Production LLM Systems

Microsoft

Microsoft's ISE team shares their experiences working with large customers implementing LLM solutions in production, highlighting how premature adoption of complex frameworks like LangChain and multi-agent architectures can lead to maintenance and reliability challenges. They advocate for starting with simpler, more explicit designs before adding complexity, and provide detailed analysis of the security, dependency, and versioning considerations when adopting pre-v1.0 frameworks in production systems.

Background Coding Agents with Strong Feedback Loops for Large-Scale Code Transformations

Spotify

Spotify deployed background coding agents across thousands of software components to automate large-scale code transformations and maintenance tasks, addressing the challenge of ensuring correctness and reliability when agents operate without direct human supervision. The solution centered on implementing strong verification loops consisting of deterministic verifiers (for syntax, building, and testing) and an LLM-as-a-judge component to prevent scope creep. The system successfully generated over 1,500 merged pull requests, with the judge component catching roughly a quarter of problematic changes and enabling course correction in half of those cases, demonstrating that verification loops are essential for predictable agent behavior at scale.

Benchmarking AI Agents for Software Bug Detection and Maintenance Tasks

Bismuth

Bismuth, a startup focused on software agents, developed SM-100, a comprehensive benchmark to evaluate AI agents' capabilities in software maintenance tasks, particularly bug detection and fixing. The benchmark revealed significant limitations in existing popular agents, with most achieving only 7% accuracy in finding complex bugs and exhibiting high false positive rates (90%+). While agents perform well on feature development benchmarks like SWE-bench, they struggle with real-world maintenance tasks that require deep system understanding, cross-file reasoning, and holistic code evaluation. Bismuth's own agent achieved better performance (10 out of 100 bugs found vs. 7 for the next best), demonstrating that targeted improvements in model architecture, prompting strategies, and navigation techniques can enhance bug detection capabilities in production software maintenance scenarios.

Best Practices for Building Production-Grade MCP Servers for AI Agents

Prefect

This case study presents best practices for designing and implementing Model Context Protocol (MCP) servers for AI agents in production environments, addressing the widespread problem of poorly designed MCP servers that fail to account for agent-specific constraints. The speaker, founder and CEO of Prefect Technologies and creator of fastmcp (a widely-adopted framework downloaded 1.5 million times daily), identifies key design principles including outcome-oriented tool design, flattened arguments, comprehensive documentation, token budget management, and ruthless curation. The solution involves treating MCP servers as agent-optimized user interfaces rather than simple REST API wrappers, acknowledging fundamental differences between human and agent capabilities in discovery, iteration, and context management. Results include actionable guidelines that have shaped the MCP ecosystem, with the fastmcp framework becoming the de facto standard for building MCP servers and influencing the official Anthropic SDK design.

Best Practices for Implementing LLMs in High-Stakes Applications

Moonhub

The presentation discusses implementing LLMs in high-stakes use cases, particularly in healthcare and therapy contexts. It addresses key challenges including robustness, controllability, bias, and fairness, while providing practical solutions such as human-in-the-loop processes, task decomposition, prompt engineering, and comprehensive evaluation strategies. The speaker emphasizes the importance of careful consideration when implementing LLMs in sensitive applications and provides a framework for assessment and implementation.

Best Practices for LLM Production Deployments: Evaluation, Prompt Management, and Fine-tuning

HumanLoop

HumanLoop, based on their experience working with companies from startups to large enterprises like Jingo, shares key lessons for successful LLM deployment in production. The talk emphasizes three critical aspects: systematic evaluation frameworks for LLM applications, treating prompts as serious code artifacts requiring proper versioning and collaboration, and leveraging fine-tuning for improved performance and cost efficiency. The presentation uses GitHub Copilot as a case study of successful LLM deployment at scale.

Blueprint for Scalable and Reliable Enterprise LLM Systems

Various

A panel discussion featuring leaders from Bank of America, NVIDIA, Microsoft, and IBM discussing best practices for deploying and scaling LLM systems in enterprise environments. The discussion covers key aspects of LLMOps including business alignment, production deployment, data management, monitoring, and responsible AI considerations. The panelists share insights on the evolution from traditional ML deployments to LLM systems, highlighting unique challenges around testing, governance, and the increasing importance of retrieval and agent-based architectures.

Building a Collaborative Multi-Agent AI Ecosystem for Enterprise Knowledge Access

DoorDash

DoorDash developed an internal agentic AI platform to address the challenge of fragmented knowledge spread across experimentation platforms, metrics hubs, dashboards, wikis, and team communications. The solution evolved from deterministic workflows through single agents to hierarchical deep agents and exploratory agent swarms, built on foundational capabilities including hybrid vector search with RRF-based re-ranking, schema-aware SQL generation with pre-cached examples, multi-stage zero-data query validation, and LLM-as-judge evaluation frameworks. The platform integrates with Slack and Cursor to meet users in their existing workflows, enabling business teams and developers to access complex data and insights without context-switching, democratizing data access across the organization while maintaining rigorous guardrails and provenance tracking.

Building a Commonsense Knowledge Graph for E-commerce Product Recommendations

Amazon

Amazon developed COSMO, a framework that leverages LLMs to build a commonsense knowledge graph for improving product recommendations in e-commerce. The system uses LLMs to generate hypotheses about commonsense relationships from customer interaction data, validates these through human annotation and ML filtering, and uses the resulting knowledge graph to enhance product recommendation models. Tests showed up to 60% improvement in recommendation performance when using the COSMO knowledge graph compared to baseline models.

Building a Complex AI Answer Engine with Multi-Step Reasoning

Perplexity

Perplexity developed Pro Search, an advanced AI answer engine that handles complex, multi-step queries by breaking them down into manageable steps. The system combines careful prompt engineering, step-by-step planning and execution, and an interactive UI to deliver precise answers. The solution resulted in a 50% increase in query search volume, demonstrating its effectiveness in handling complex research questions efficiently.

Building a Comprehensive LLM Evaluation Framework with BrainTrust Integration

Hostinger

Hostinger's AI team developed a systematic approach to LLM evaluation for their chatbots, implementing a framework that combines offline development testing against golden examples with continuous production monitoring. The solution integrates BrainTrust as a third-party tool to automate evaluation workflows, incorporating both automated metrics and human feedback. This framework enables teams to measure improvements, track performance, and identify areas for enhancement through a combination of programmatic testing and user feedback analysis.

Building a Comprehensive LLM Platform for Food Delivery Services

Swiggy

Swiggy implemented various generative AI solutions to enhance their food delivery platform, focusing on catalog enrichment, review summarization, and vendor support. They developed a platformized approach with a middle layer for GenAI capabilities, addressing challenges like hallucination and latency through careful model selection, fine-tuning, and RAG implementations. The initiative showed promising results in improving customer experience and operational efficiency across multiple use cases including image generation, text descriptions, and restaurant partner support.

Building a Context-Aware AI Assistant with RAG for Developer Support

Vectorize

Vectorize, a platform for building RAG pipelines, faced a challenge where users frequently asked questions already answered in their documentation but were reluctant to leave the UI to search for answers. To address this, they built an AI assistant integrated directly into their product interface using RAG technology. The solution leverages their own platform to ingest documentation from multiple sources (docs site, Discord, Intercom), implements context-sensitive retrieval using page topics, employs reranking models to filter irrelevant results, and uses anti-hallucination prompting with Llama 3.1 70B on Groq. The resulting assistant provides users with immediate, contextually relevant answers without requiring them to leave their workflow, while the system continuously improves as new support content and documentation are added.

Building a Conversational AI Agent for Slack Integration

Linear

Linear, a project management tool for product teams, developed an experimental AI agent that operates within Slack to allow users to create issues and query workspace data without leaving their communication platform. The project faced challenges around balancing context provision to the LLM, maintaining conversation continuity, and determining appropriate boundaries between LLM-driven decisions and programmatic logic. The team solved these issues by providing localized context (10 messages) rather than full conversation history, splitting the system early to distinguish between issue creation and data lookup requests, and limiting LLM involvement to tasks it excels at (summarization, title generation) while handling complex business logic programmatically. This approach resulted in higher accuracy for issue creation, faster response times, and improved user satisfaction as the agent could quickly generate well-formed issues that users could then refine manually.

Building a Digital Workforce with Multi-Agent Systems and User-Centric Design

Monday.com

Monday.com built a digital workforce of AI agents to handle their billion annual work tasks, focusing on user experience and trust over pure automation. They developed a multi-agent system using LangGraph that emphasizes user control, preview capabilities, and explainability, achieving 100% month-over-month growth in AI usage. The system includes specialized agents for data retrieval, board actions, and answer composition, with robust fallback mechanisms and evaluation frameworks to handle the 99% of user interactions they can't initially predict.

Building a Digital Workforce with Multi-Agent Systems for Task Automation

Monday.com

Monday.com, a work OS platform processing 1 billion tasks annually, developed a digital workforce using AI agents to automate various work tasks. The company built their agent ecosystem on LangGraph and LangSmith, focusing heavily on user experience design principles including user control over autonomy, preview capabilities, and explainability. Their approach emphasizes trust as the primary adoption barrier rather than technology, implementing guardrails and human-in-the-loop systems to ensure production readiness. The system has shown significant growth with 100% month-over-month increases in AI usage since launch.

Building a Guardrail System for LLM-based Menu Transcription

Doordash

Doordash developed a system to automatically transcribe restaurant menu photos using LLMs, addressing the challenge of maintaining accurate menu information on their delivery platform. Instead of relying solely on LLMs, they created an innovative guardrail framework using traditional machine learning to evaluate transcription quality and determine whether AI or human processing should be used. This hybrid approach allowed them to achieve high accuracy while maintaining efficiency and adaptability to new AI models.

Building a Healthcare Copilot for Biology and Life Science Research

Owkin

Owkin, a company focused on drug discovery and AI for healthcare, developed a copilot system in four months to help biology and life science researchers navigate complex healthcare data and answer scientific questions. The system addresses challenges unique to healthcare including strict regulations, semantic complexity, and data sensitivity by implementing two main tools: a text-to-SQL system that queries structured biological databases (using natural language to SQL translation with Polars), and a RAG-based literature search tool that retrieves relevant information from PubMed's 26 million abstracts. The copilot was deployed for academic researchers with monitoring via LangFuse and OpenTelemetry, though the team faced challenges with evaluation in a domain where questions rarely have binary answers, and noted that frameworks and models change rapidly in the LLM space.

Building a High-Quality Q&A Assistant for Database Research

Airtable

Airtable developed Omni, an AI assistant capable of building custom apps and extracting insights from complex databases containing customer feedback, marketing data, and product information. The challenge was creating a reliable Q&A agent that could overcome LLM limitations like unpredictable reasoning, premature conclusions, and hallucinations when dealing with large table schemas and vague questions. Their solution employed an agentic framework with contextual schema exploration, planning/replanning mechanisms, hybrid search combining keyword and semantic approaches, token-efficient citation systems, and comprehensive evaluation frameworks using both curated test suites and production feedback. This multi-faceted approach enabled them to deliver a production-ready assistant that users could trust, though the post doesn't provide specific quantitative results on accuracy improvements or user adoption metrics.

Building a High-Quality RAG-based Support System with LLM Guardrails and Quality Monitoring

Doordash

Doordash implemented a RAG-based chatbot system to improve their Dasher support automation, replacing a traditional flow-based system. They developed a comprehensive quality control approach combining LLM Guardrail for real-time response verification, LLM Judge for quality monitoring, and an iterative improvement pipeline. The system successfully reduced hallucinations by 90% and severe compliance issues by 99%, while handling thousands of support requests daily and allowing human agents to focus on more complex cases.

Building a Horizontal Enterprise Agent Platform with Infrastructure-First Approach

Dust.tt

Dust.tt evolved from a developer framework competitor to LangChain into a horizontal enterprise platform for deploying AI agents, achieving remarkable 88% daily active user rates in some deployments. The company focuses on building robust infrastructure for agent deployment, maintaining its own integrations with enterprise systems like Notion and Slack, while making agent creation accessible to non-technical users through careful UX design and abstraction of technical complexities.

Building a Hybrid Cloud AI Infrastructure for Large-Scale ML Inference

Roblox

Roblox underwent a three-phase transformation of their AI infrastructure to support rapidly growing ML inference needs across 250+ production models. They built a comprehensive ML platform using Kubeflow, implemented a custom feature store, and developed an ML gateway with vLLM for efficient large language model operations. The system now processes 1.5 billion tokens weekly for their AI Assistant, handles 1 billion daily personalization requests, and manages tens of thousands of CPUs and over a thousand GPUs across hybrid cloud infrastructure.

Building a Low-Latency Global Code Completion Service

Github

Github built Copilot, a global code completion service handling hundreds of millions of daily requests with sub-200ms latency. The system uses a proxy architecture to manage authentication, handle request cancellation, and route traffic to the nearest available LLM model. Key innovations include using HTTP/2 for efficient connection management, implementing a novel request cancellation system, and deploying models across multiple global regions for improved latency and reliability.

Building a Microservices-Based Multi-Agent Platform for Financial Advisors

Prudential

Prudential Financial, in partnership with AWS GenAI Innovation Center, built a scalable multi-agent platform to support 100,000+ financial advisors across insurance and financial services. The system addresses fragmented workflows where advisors previously had to navigate dozens of disconnected IT systems for client engagement, underwriting, product information, and servicing. The solution features an orchestration agent that routes requests to specialized sub-agents (quick quote, forms, product, illustration, book of business) while maintaining context and enforcing governance. The platform-based microservices architecture reduced time-to-value from 6-8 weeks to 3-4 weeks for new agent deployments, enabled cross-business reusability, and provided standardized frameworks for authentication, LLM gateway access, knowledge management, and observability while handling the complexity of scaling multi-agent systems in a regulated financial services environment.

Building a Multi-Agent Healthcare Analytics Assistant with LLM-Powered Natural Language Queries

Komodo Health

Komodo Health, a company with a large database of anonymized American patient medical events, developed an AI assistant over two years to answer complex healthcare analytics queries through natural language. The system evolved from a simple chaining architecture with fine-tuned models to a sophisticated multi-agent system using a supervisor pattern, where an intelligent agent-based supervisor routes queries to either deterministic workflows or sub-agents as needed. The architecture prioritizes trust by ensuring raw database outputs are presented directly to users rather than LLM-generated content, with LLMs primarily handling natural language to structured query conversion and explanations. The production system balances autonomous AI capabilities with control, avoiding the cost and latency issues of pure agentic approaches while maintaining flexibility for unexpected user queries.

Building a Multi-Agent LLM Platform for Customer Service Automation

Deutsche Telekom

Deutsche Telekom developed a comprehensive multi-agent LLM platform to automate customer service across multiple European countries and channels. They built their own agent computing platform called LMOS to manage agent lifecycles, routing, and deployment, moving away from traditional chatbot approaches. The platform successfully handled over 1 million customer queries with an 89% acceptable answer rate and showed 38% better performance compared to vendor solutions in A/B testing.

Building a Multi-Agent Research System for Complex Information Tasks

Anthropic

Anthropic developed a production multi-agent system for their Claude Research feature that uses multiple specialized AI agents working in parallel to conduct complex research tasks across web and enterprise sources. The system employs an orchestrator-worker architecture where a lead agent coordinates and delegates to specialized subagents that operate simultaneously, achieving 90.2% performance improvement over single-agent systems on internal evaluations. The implementation required sophisticated prompt engineering, robust evaluation frameworks, and careful production engineering to handle the stateful, non-deterministic nature of multi-agent interactions at scale.

Building a Multi-Provider GenAI Gateway for Enterprise-Scale LLM Access

Grab

Grab developed an AI Gateway to provide centralized, secure access to multiple GenAI providers (including OpenAI, Azure, AWS Bedrock, and Google VertexAI) for their internal developers. The gateway handles authentication, cost management, auditing, and rate limiting while providing a unified API interface. Since its launch in 2023, it has enabled over 300 unique use cases across the organization, from real-time audio analysis to content moderation, while maintaining security and cost efficiency through centralized management.

Building a Natural Language Agent Builder with Comprehensive LLMOps Practices

Vellum

Vellum, a company that has spent three years building tools for production-grade agent development, launched a beta natural language agent builder that allows users to create agents through conversation rather than drag-and-drop interfaces or code. The speaker shares lessons learned from building this meta-level agent, focusing on tool design, testing strategies, execution monitoring, and user experience considerations. Key insights include the importance of carefully designing tool abstractions from first principles, balancing vibes-based testing with rigorous test suites, storing and analyzing all execution data to iterate on agent performance, and creating enhanced UI/UX by parsing agent outputs into interactive elements beyond simple text responses.

Building a Natural Language Business Intelligence Interface with MCP

Ramp

Ramp built an MCP (Model Context Protocol) server to enable natural language querying of business spend data through their developer API. The initial prototype allowed Claude to generate visualizations and run analyses, but struggled with scale due to context window limitations and high token usage. By pivoting to a SQL-based approach using an in-memory SQLite database with a lightweight ETL pipeline, they enabled Claude to query tens of thousands of transactions efficiently. The solution includes load tools for API data extraction, data transformation capabilities, and query execution tools, allowing users to gain insights into business spend patterns through conversational queries while addressing security concerns through audit logging and OAuth scopes.

Building a Privacy-Preserving LLM Usage Analytics System (Clio)

Anthropic

Anthropic developed Clio, a privacy-preserving system to understand how their LLM Claude is being used in the real world while maintaining strict user privacy. The system uses Claude itself to analyze and cluster conversations, extracting high-level insights without humans ever reading the raw data. This allowed Anthropic to improve their safety evaluations, understand usage patterns across languages and domains, and detect potential misuse - all while maintaining strong privacy guarantees through techniques like minimum cluster sizes and privacy auditing.

Building a Production AI Agent System for Customer Support

Decagon

Decagon has developed a comprehensive AI agent system for customer support that handles multiple communication channels including chat, email, and voice. Their system includes a core AI agent brain, intelligent routing, agent assistance capabilities, and robust testing and monitoring infrastructure. The solution aims to improve traditionally painful customer support experiences by providing consistent, quick responses while maintaining brand voice and safely handling sensitive operations like refunds.

Building a Production AI Translation and Lip-Sync System at Scale

Meta

Meta developed an AI-powered system for automatically translating and lip-syncing video content across multiple languages. The system combines Meta's Seamless universal translator model with custom lip-syncing technology to create natural-looking translated videos while preserving the original speaker's voice characteristics and emotions. The solution includes comprehensive safety measures, complex model orchestration, and handles challenges like background noise and timing alignment. Early alpha testing shows 90% eligibility rates for submitted content and meaningful increases in content impressions due to expanded language accessibility.

Building a Production Coding Agent Model with Speed and Intelligence

Cursor

Cursor developed Composer, a specialized coding agent model designed to balance speed and intelligence for real-world software engineering tasks. The challenge was creating a model that could perform at near-frontier levels while being four times more efficient at token generation than comparable models, moving away from the "airplane Wi-Fi" problem where agents were either too slow for synchronous work or required long async waits. The solution involved extensive reinforcement learning (RL) training in an environment that closely mimicked production, using custom kernels for low-precision training, parallel tool calling capabilities, semantic search with custom embeddings, and a fleet of cloud VMs to simulate the real Cursor IDE environment. The result was a model that performs close to frontier models like GPT-4.5 and Claude Sonnet 3.5 on coding benchmarks while maintaining significantly faster token generation, enabling developers to stay in flow state rather than context-switching during long agent runs.

Building a Production Fantasy Football AI Assistant in 8 Weeks

NFL

The NFL, in collaboration with AWS Generative AI Innovation Center, developed a fantasy football AI assistant for NFL Plus users that went from concept to production in just 8 weeks. Fantasy football managers face overwhelming amounts of data and conflicting expert advice, making roster decisions stressful and time-consuming. The team built an agentic AI system using Amazon Bedrock, Strands Agent framework, and Model Context Protocol (MCP) to provide analyst-grade fantasy advice in under 5 seconds, achieving 90% analyst approval ratings. The system handles complex multi-step reasoning, accesses NFL NextGen Stats data through semantic data layers, and successfully manages peak Sunday traffic loads with zero reported incidents in the first month of 10,000+ questions.

Building a Production MCP Server for AI Assistant Integration

Hugging Face

Hugging Face developed an official Model Context Protocol (MCP) server to enable AI assistants to access their AI model hub and thousands of AI applications through a simple URL. The team faced complex architectural decisions around transport protocols, choosing Streamable HTTP over deprecated SSE transport, and implementing a stateless, direct response configuration for production deployment. The server provides customizable tools for different user types and integrates seamlessly with existing Hugging Face infrastructure including authentication and resource quotas.

Building a Production RAG-based Customer Support Assistant with Elasticsearch

Elastic

Elastic's Field Engineering team developed a customer support chatbot using RAG instead of fine-tuning, leveraging Elasticsearch for document storage and retrieval. They created a knowledge library of over 300,000 documents from technical support articles, product documentation, and blogs, enriched with AI-generated summaries and embeddings using ELSER. The system uses hybrid search combining semantic and BM25 approaches to provide relevant context to the LLM, resulting in more accurate and trustworthy responses.

Building a Production Text-to-SQL Assistant with Multi-Agent Architecture

LinkedIn

LinkedIn developed SQL Bot, an AI-powered assistant integrated within their DARWIN data science platform, to help employees access data insights independently. The system uses a multi-agent architecture built on LangChain and LangGraph, combining retrieval-augmented generation with knowledge graphs and LLM-based ranking and correction systems. The solution has been deployed successfully with hundreds of users across LinkedIn's business verticals, achieving a 95% query accuracy satisfaction rate and demonstrating particular success with its query debugging feature.

Building a Production-Grade GenAI Customer Support Assistant with Comprehensive Observability

Elastic

Elastic developed a customer support chatbot using generative AI and RAG, focusing heavily on production-grade observability practices. They implemented a comprehensive observability strategy using Elastic's own stack, including APM traces, custom dashboards, alerting systems, and detailed monitoring of LLM interactions. The system successfully launched with features like streaming responses, rate limiting, and abuse prevention, while maintaining high reliability through careful monitoring of latency, errors, and usage patterns.

Building a Production-Ready AI Phone Call Assistant with Multi-Modal Processing

RealChar

RealChar is developing an AI assistant that can handle customer service phone calls on behalf of users, addressing the frustration of long wait times and tedious interactions. The system uses a complex architecture combining traditional ML and generative AI, running multiple models in parallel through an event bus system, with fallback mechanisms for reliability. The solution draws inspiration from self-driving car systems, implementing real-time processing of multiple input streams and maintaining millisecond-level observability.

Building a Production-Ready MCP Server for AI Agents to Manage Feature Flags

DevCycle

DevCycle developed an MCP (Model Context Protocol) server to enable AI coding agents to manage feature flags directly within development workflows. The project began as a hackathon proof-of-concept that adapted their existing CLI interface to work with AI agents, allowing natural language interactions for creating flags, investigating incidents, and cleaning up stale features. Through iterative refinement, the team identified key production requirements including clear input schemas, descriptive error handling, tool call pruning, OAuth authentication via Cloudflare Workers, and remote server architecture. The result was a production-ready integration that enables developers to create and manage feature flags without leaving their code editor, with early results showing approximately 3x more users reaching SDK installation compared to their previous onboarding flow.

Building a Production-Ready Multi-Agent Coding Assistant

Replit

Replit developed a coding agent system that helps users create software applications without writing code. The system uses a multi-agent architecture with specialized agents (manager, editor, verifier) and focuses on user engagement rather than full autonomy. The agent achieved hundreds of thousands of production runs and maintains around 90% success rate in tool invocations, using techniques like code-based tool calls, memory management, and state replay for debugging.

Building a Property Management AI Copilot with LangGraph and LangSmith

AppFolio

AppFolio developed Realm-X Assistant, an AI-powered copilot for property management, using LangChain ecosystem tools. By transitioning from LangChain to LangGraph for complex workflow management and leveraging LangSmith for monitoring and debugging, they created a system that helps property managers save over 10 hours per week. The implementation included dynamic few-shot prompting, which improved specific feature performance from 40% to 80%, along with robust testing and evaluation processes to ensure reliability.

Building a Property Question-Answering Chatbot to Replace 8-Hour Email Responses with Instant AI-Powered Answers

Agoda

Agoda, an online travel platform, developed the Property AMA (Ask Me Anything) Bot to address the challenge of users waiting an average of 8 hours for property-related question responses, with only 55% of inquiries receiving answers. The solution leverages ChatGPT integrated with Agoda's Property API to provide instant, accurate answers to property-specific questions through a conversational interface deployed across desktop, mobile web, and native app platforms. The implementation includes sophisticated prompt engineering with input topic guardrails, in-context learning that fetches real-time property data, and a comprehensive evaluation framework using response labeling and A/B testing to continuously improve accuracy and reliability.

Building a RAG-Based Documentation Chatbot: Lessons from Fiddler's LLMOps Journey

Fiddler

Fiddler AI developed a documentation chatbot using OpenAI's GPT-3.5 and Retrieval-Augmented Generation (RAG) to help users find answers in their documentation. The project showcases practical implementation of LLMOps principles including continuous evaluation, monitoring of chatbot responses and user prompts, and iterative improvement of the knowledge base. Through this implementation, they identified and documented key lessons in areas like efficient tool selection, query processing, document management, and hallucination reduction.

Building a RAG-Based Premium Audit Assistant for Insurance Workflows

Verisk

Verisk developed PAAS AI, a generative AI-powered conversational assistant to help premium auditors efficiently search and retrieve information from their vast repository of insurance documentation. Using a RAG architecture built on Amazon Bedrock with Claude, along with ElastiCache, OpenSearch, and custom evaluation frameworks, the system reduced document processing time by 96-98% while maintaining high accuracy. The solution demonstrates effective use of hybrid search, careful data chunking, and comprehensive evaluation metrics to ensure reliable AI-powered customer support.

Building a Resilient Embedding System for Semantic Search

Airtable

Airtable built a production-scale embedding system to enable semantic search across customer data, allowing teams to ask questions like "find past campaigns similar to this one" or "find engineers whose expertise matches this project." The system manages the complete lifecycle of embeddings including generation, storage, consistency tracking, and migrations while handling the challenge of maintaining eventual consistency between their primary in-memory database (MemApp) and a separate vector database. Their approach centers on a flexible "embedding config" abstraction and a reset-based strategy for handling migrations and failures, trading off temporary downtime and regeneration costs for operational simplicity and resilience across diverse scenarios like database migrations, model changes, and data residency requirements.

Building a Scalable AI Feature Evaluation System

Notion

Notion developed an advanced evaluation system for their AI features, transitioning from a manual process using JSONL files to a sophisticated automated workflow powered by Braintrust. This transformation enabled them to improve their testing and deployment of AI features like Q&A and workspace search, resulting in a 10x increase in issue resolution speed, from 3 to 30 issues per day.

Building a Scalable Chatbot Platform with Edge Computing and Multi-Layer Security

Fastmind

Fastmind developed a chatbot builder platform that focuses on scalability, security, and performance. The solution combines edge computing via Cloudflare Workers, multi-layer rate limiting, and a distributed architecture using Next.js, Hono, and Convex. The platform uses Cohere's AI models and implements various security measures to prevent abuse while maintaining cost efficiency for thousands of users.

Building a Scalable Conversational Video Agent with LangGraph and Twelve Labs APIs

Jockey

Jockey is an open-source conversational video agent that leverages LangGraph and Twelve Labs' video understanding APIs to process and analyze video content intelligently. The system evolved from v1.0 to v1.1, transitioning from basic LangChain to a more sophisticated LangGraph architecture, enabling better scalability and precise control over video workflows through a multi-agent system consisting of a Supervisor, Planner, and specialized Workers.

Building a Secure AI Assistant for Visual Effects Artists Using Amazon Bedrock

Untold Studios

Untold Studios developed an AI assistant integrated into Slack to help their visual effects artists access internal resources and tools more efficiently. Using Amazon Bedrock with Claude 3.5 Sonnet and a serverless architecture, they created a natural language interface that handles 120 queries per day, reducing information search time from minutes to seconds while maintaining strict data security. The solution combines RAG capabilities with function calling to access multiple knowledge bases and internal systems, significantly reducing the support team's workload.

Building a Secure and Scalable LLM Gateway for Enterprise GenAI Adoption

Wealthsimple

Wealthsimple developed a comprehensive LLM platform to enable secure and productive use of generative AI across their organization. They started with a basic gateway for audit trails, evolved to include PII redaction, self-hosted models, and RAG capabilities, while focusing on user adoption and security. The platform now serves over half the company with 2,200+ daily messages, demonstrating successful enterprise-wide GenAI adoption while maintaining data security.

Building a Secure and Scalable LLM Gateway for Financial Services

Wealthsimple

Wealthsimple, a Canadian FinTech company, developed a comprehensive LLM platform to securely leverage generative AI while protecting sensitive financial data. They built an LLM gateway with built-in security features, PII redaction, and audit trails, eventually expanding to include self-hosted models, RAG capabilities, and multi-modal inputs. The platform achieved widespread adoption with over 50% of employees using it monthly, leading to improved productivity and operational efficiencies in client service workflows.

Building a Secure Enterprise AI Assistant with RAG and Custom Infrastructure

Hexagon

Hexagon's Asset Lifecycle Intelligence division developed HxGN Alix, an AI-powered digital worker to enhance user interaction with their Enterprise Asset Management products. They implemented a secure solution using AWS services, custom infrastructure, and RAG techniques. The solution successfully balanced security requirements with AI capabilities, deploying models on Amazon EKS with private subnets, implementing robust guardrails, and solving various RAG-related challenges to provide accurate, context-aware responses while maintaining strict data privacy standards.

Building a Structured AI Evaluation Framework for Educational Tools

Coursera

Coursera developed a robust AI evaluation framework to support the deployment of their Coursera Coach chatbot and AI-assisted grading tools. They transitioned from fragmented offline evaluations to a structured four-step approach involving clear evaluation criteria, curated datasets, combined heuristic and model-based scoring, and rapid iteration cycles. This framework resulted in faster development cycles, increased confidence in AI deployments, and measurable improvements in student engagement and course completion rates.

Building a Systematic LLM Evaluation Framework from Scratch

Coda

Coda's journey in developing a robust LLM evaluation framework, evolving from manual playground testing to a comprehensive automated system. The team faced challenges with model upgrades affecting prompt behavior, leading them to create a systematic approach combining automated checks with human oversight. They progressed through multiple phases using different tools (OpenAI Playground, Coda itself, Vellum, and Brain Trust), ultimately achieving scalable evaluation running 500+ automated checks weekly, up from 25 manual evaluations initially.

Building a Systematic SNAP Benefits LLM Evaluation Framework

Propel

Propel is developing a comprehensive evaluation framework for testing how well different LLMs handle SNAP (food stamps) benefit-related queries. The project aims to assess model accuracy, safety, and appropriateness in handling complex policy questions while balancing strict accuracy with practical user needs. They've built a testing infrastructure including a Slackbot called Hydra for comparing multiple LLM outputs, and plan to release their evaluation framework publicly to help improve AI models' performance on SNAP-related tasks.

Building a Tool Calling Platform for LLM Agents

Arcade AI

Arcade AI developed a comprehensive tool calling platform to address key challenges in LLM agent deployments. The platform provides a dedicated runtime for tools separate from orchestration, handles authentication and authorization for agent actions, and enables scalable tool management. It includes three main components: a Tool SDK for easy tool development, an engine for serving APIs, and an actor system for tool execution, making it easier to deploy and manage LLM-powered tools in production.

Building a Universal Search Product with RAG and AI Agents

Dropbox

Dropbox developed Dash, a universal search and knowledge management product that addresses the challenges of fragmented business data across multiple applications and formats. The solution combines retrieval-augmented generation (RAG) and AI agents to provide powerful search capabilities, content summarization, and question-answering features. They implemented a custom Python interpreter for AI agents and developed a sophisticated RAG system that balances latency, quality, and data freshness requirements for enterprise use.

Building a Voice Assistant from Open Source LLMs: A Home Project Case Study

Weights & Biases

A developer built a custom voice assistant similar to Alexa using open-source LLMs, demonstrating the journey from prototype to production-ready system. The project used Whisper for speech recognition and various LLM models (Llama 2, Mistral) running on consumer hardware, with systematic improvements through prompt engineering and fine-tuning to achieve 98% accuracy in command interpretation, showing how iterative improvement and proper evaluation frameworks are crucial for LLM applications.

Building Agentic AI Assistant for Observability Platform

Grafana

Grafana Labs developed an agentic AI assistant integrated into their observability platform to help users query data, create dashboards, troubleshoot issues, and learn the platform. The team started with a hackathon project that ran entirely in the browser, iterating rapidly from a proof-of-concept to a production system. The assistant uses Claude as the primary LLM, implements tool calling with extensive context about Grafana's features, and employs multiple techniques including tool overloading, error feedback loops, and natural language tool responses. The solution enables users to investigate incidents, generate queries across multiple data sources, and modify visualizations through conversational interfaces while maintaining transparency by showing all intermediate steps and data to keep humans in the loop.

Building AI Assist: LLM Integration for E-commerce Product Listings

Mercari

Mercari developed an AI Assist feature to help sellers create better product listings using LLMs. They implemented a two-part system using GPT-4 for offline attribute extraction and GPT-3.5-turbo for real-time title suggestions, conducting both offline and online evaluations to ensure quality. The team focused on practical implementation challenges including prompt engineering, error handling, and addressing LLM output inconsistencies in a production environment.

Building AI Developer Tools Using LangGraph for Large-Scale Software Development

Uber

Uber's developer platform team built a suite of AI-powered developer tools using LangGraph to improve productivity for 5,000 engineers working on hundreds of millions of lines of code. The solution included tools like Validator (for detecting code violations and security issues), AutoCover (for automated test generation), and various other AI assistants. By creating domain-expert agents and reusable primitives, they achieved significant impact including thousands of daily code fixes, 10% improvement in developer platform coverage, and an estimated 21,000 developer hours saved through automated test generation.

Building AI Memory Layers with File-Based Vector Storage and Knowledge Graphs

Cognee

Cognee, a platform that helps AI agents retrieve, reason, and remember with structured context, needed a vector storage solution that could support per-workspace isolation for parallel development and testing without the operational overhead of managing multiple database services. The company implemented LanceDB, a file-based vector database, which enables each developer, user, or test instance to have its own fully independent vector store. This solution, combined with Cognee's Extract-Cognify-Load pipeline that builds knowledge graphs alongside embeddings, allows teams to develop locally with complete isolation and then seamlessly transition to production through Cognee's hosted service (cogwit). The results include faster development cycles due to eliminated shared state conflicts, improved multi-hop reasoning accuracy through graph-aware retrieval, and a simplified path from prototype to production without architectural redesign.

Building AI Products at Stack Overflow: From Conversational Search to Technical Benchmarking

Stack Overflow

Stack Overflow faced a significant disruption when ChatGPT launched in late 2022, as developers began changing their workflows and asking AI tools questions that would traditionally be posted on Stack Overflow. In response, the company formed an "Overflow AI" team to explore how AI could enhance their products and create new revenue streams. The team pursued two main initiatives: first, developing a conversational search feature that evolved through multiple iterations from basic keyword search to semantic search with RAG, ultimately being rolled back due to insufficient accuracy (below 70%) for developer expectations; and second, creating a data licensing business that involved fine-tuning models with Stack Overflow's corpus and developing technical benchmarks to demonstrate improved model performance. The initiatives showcased rapid iteration, customer-focused evaluation methods, and ultimately led to a new revenue stream while strengthening Stack Overflow's position in the AI era.

Building AI-Assisted Development at Scale with Cursor

Cursor

This case study explores how Cursor's solutions team has observed enterprise companies successfully deploying AI-assisted coding in production environments. The problem addressed is helping developers leverage LLMs effectively for coding tasks while avoiding common pitfalls like context window bloat, over-reliance on AI, and hallucinations. The solution involves teaching developers to break down problems into appropriately-sized tasks, maintain clean context windows, use semantic search for brownfield codebases, and build deterministic harnesses around non-deterministic LLM outputs. Results include significant productivity gains when developers learn proper prompt engineering, context management, and maintain responsibility for AI-generated code, with specific improvements like bench scores jumping from 45% to 65% through harness optimization.

Building AI-Native Platforms: Agentic Systems, Infrastructure Evolution, and Production LLM Deployment

Delphi / Seam AI / APIsec

This panel discussion features three AI-native companiesโ€”Delphi (personal AI profiles), Seam AI (sales/marketing automation agents), and APIsec (API security testing)โ€”discussing their journeys building production LLM systems over three years. The companies address infrastructure evolution from single-shot prompting to fully agentic systems, the shift toward serverless and scalable architectures, managing costs at scale (including burning through a trillion OpenAI tokens), balancing deterministic workflows with model autonomy, and measuring ROI through outcome-based metrics rather than traditional productivity gains. Key technical themes include moving away from opinionated architectures to let models reason autonomously, implementing state machines for high-confidence decisions, using tools like Pydantic AI and Logfire for instrumentation, and leveraging Pinecone for vector search at scale.

Building Alfred: Production-Ready Agentic Orchestration Layer for E-commerce

Loblaws

Loblaws Digital, the technology arm of one of Canada's largest retail companies, developed Alfredโ€”a production-ready orchestration layer for running agentic AI workflows across their e-commerce, pharmacy, and loyalty platforms. The system addresses the challenge of moving agent prototypes into production at enterprise scale by providing a reusable template-based architecture built on LangGraph, FastAPI, and Google Cloud Platform components. Alfred enables teams across the organization to quickly deploy conversational commerce applications and agentic workflows (such as recipe-based shopping) while handling critical enterprise requirements including security, privacy, PII masking, observability, and integration with 50+ platform APIs through their Model Context Protocol (MCP) ecosystem.

Building Alyx: An AI Agent for LLM Observability and Debugging

Arize AI

Arize AI built "Alyx," an AI agent embedded in their observability platform to help users debug and optimize their machine learning and LLM applications. The problem they addressed was that their platform had advanced features that required significant expertise to use effectively, with customers needing guidance from solutions architects to extract maximum value. Their solution was to create an AI agent that emulates an expert solutions architect, capable of performing complex debugging workflows, optimizing prompts, generating evaluation templates, and educating users on platform features. Starting in November 2023 with GPT-3.5 and launching at their July 2024 conference, Alyx evolved from a highly structured, on-rails decision tree architecture to a more autonomous agent leveraging modern LLM capabilities. The team used their own platform to build and evaluate Alex, establishing comprehensive evaluation frameworks across multiple levels (tool calls, tasks, sessions, traces) and involving cross-functional stakeholders in defining success criteria.

Building an Agentic AI System for Healthcare Support Using LangGraph

Doctolib

Doctolib developed an agentic AI system called Alfred to handle customer support requests for their healthcare platform. The system uses multiple specialized AI agents powered by LLMs, working together in a directed graph structure using LangGraph. The initial implementation focused on managing calendar access rights, combining RAG for knowledge base integration with careful security measures and human-in-the-loop confirmation for sensitive actions. The system was designed to maintain high customer satisfaction while managing support costs efficiently.

Building an Agentic DevOps Copilot for Infrastructure Automation

Qovery

Qovery developed an agentic DevOps copilot to automate infrastructure tasks and eliminate repetitive DevOps work. The solution evolved through four phases: from basic intent-to-tool mapping, to a dynamic agentic system that plans tool sequences, then adding resilience and recovery mechanisms, and finally incorporating conversation memory. The copilot now handles complex multi-step workflows like deployments, infrastructure optimization, and configuration management, currently using Claude Sonnet 3.7 with plans for self-hosted models and improved performance.

Building an AI Agent for Real Estate with Systematic Evaluation Framework

Rechat

Rechat developed an AI agent to assist real estate agents with tasks like contact management, email marketing, and website creation. Initially struggling with reliability and performance issues using GPT-3.5, they implemented a comprehensive evaluation framework that enabled systematic improvement through unit testing, logging, human review, and fine-tuning. This methodical approach helped them achieve production-ready reliability and handle complex multi-step commands that combine natural language with UI elements.

Building an AI Agent Platform for Enterprise Automation and Collaboration

Abundly.ai

Abundly.ai developed an AI agent platform that enables companies to deploy autonomous AI agents as digital colleagues. The company evolved from experimental hobby projects to a production platform serving multiple industries, addressing challenges in agent lifecycle management, guardrails, context engineering, and human-AI collaboration. The solution encompasses agent creation, monitoring, tool integration, and governance frameworks, with successful deployments in media (SVT journalist agent), investment screening, and business intelligence. Results include 95% time savings in repetitive tasks, improved decision quality through diligent agent behavior, and the ability for non-technical users to create and manage agents through conversational interfaces and dynamic UI generation.

Building an AI Co-Pilot Application: Patterns and Best Practices

Thoughtworks

Thoughtworks built Boba, an experimental AI co-pilot for product strategy and ideation, to learn about building generative AI experiences beyond chat interfaces. The team implemented several key patterns including templated prompts, structured responses, real-time progress streaming, context management, and external knowledge integration. The case study provides detailed insights into practical LLMOps patterns for building production LLM applications with enhanced user experiences.

Building an AI Innovation Team and Platform with Safeguards at Scale

Twilio

Twilio's Emerging Tech and Innovation team tackled the challenge of integrating AI capabilities into their customer engagement platform while maintaining quality and trust. They developed an AI assistance platform that bridges structured and unstructured customer data, implementing a novel approach using a separate "Twilio Alpha" brand to enable rapid iteration while managing customer expectations. The team successfully balanced innovation speed with enterprise requirements through careful team structure, flexible architecture, and open communication practices.

Building an AI Interview Coach for Product Discovery Training

Product Talk

Teresa Torres, a product discovery coach, built an AI-powered interview coach to provide automated feedback to students in her continuous interviewing course. Starting with simple ChatGPT and Claude prototypes, she progressively developed a production system using Replit, Zapier, and eventually AWS Lambda and Step Functions. The system analyzes student interview transcripts against a rubric for story-based interviewing, providing detailed feedback on multiple dimensions including opening questions, scene-setting, timeline building, and redirecting generalizations. Through rigorous evaluation methodology including error analysis, code-based evals, and LLM-as-judge evals, she achieved sufficient quality to deploy the tool to course students. The tool now processes interviews automatically, with continuous monitoring and iteration based on comprehensive evaluation frameworks, and is being scaled through a partnership with Vistily for handling real customer interview data with appropriate SOC 2 compliance.

Building an AI Legal Assistant: From Early Testing to Production Deployment

Casetext

Casetext transformed their legal research platform into an AI-powered legal assistant called Co-Counsel using GPT-4, leading to a $650M acquisition by Thomson Reuters. The company shifted their entire 120-person team to focus on building this AI assistant after early access to GPT-4 showed promising results. Through rigorous testing, prompt engineering, and a test-driven development approach, they created a reliable AI system that could perform complex legal tasks like document review and research that previously took lawyers days to complete. The product achieved rapid market acceptance and true product-market fit within months of launch.

Building an AI Private Banker with Agentic Systems for Customer Service and Financial Operations

Nubank

Nubank, one of Brazil's largest banks serving 120 million users, implemented large-scale LLM systems to create an AI private banker for their customers. They deployed two main applications: a customer service chatbot handling 8.5 million monthly contacts with 60% first-contact resolution through LLMs, and an agentic money transfer system that reduced transaction time from 70 seconds across nine screens to under 30 seconds with over 90% accuracy and less than 0.5% error rate. The implementation leveraged LangChain, LangGraph, and LangSmith for development and evaluation, with a comprehensive four-layer ecosystem including core engines, testing tools, and developer experience platforms. Their evaluation strategy combined offline and online testing with LLM-as-a-judge systems that achieved 79% F1 score compared to 80% human accuracy through iterative prompt engineering and fine-tuning.

Building an AI-Assisted Content Creation Platform for Language Learning

Babbel

Babbel developed an AI-assisted content creation tool to streamline their traditional 35-hour content creation pipeline for language learning materials. The solution integrates LLMs with human expertise through a gradio-based interface, enabling prompt management, content generation, and evaluation while maintaining quality standards. The system successfully reduced content creation time while maintaining high acceptance rates (>85%) from editors.

Building an AI-Native Browser with Integrated LLM Tools and Evaluation Systems

The Browser Company

The Browser Company transitioned from their Arc browser to building Dia, an AI-native browser, requiring a fundamental shift in how they approached product development and LLMOps. The company invested heavily in tooling for rapid prototyping, evaluation systems, and automated prompt optimization using techniques like Jeba (a sample-efficient prompt optimization method). They created a "model behavior" discipline to define and ship desired LLM behaviors, treating it as a craft analogous to product design. Additionally, they built security considerations into the product design from the ground up, particularly addressing prompt injection vulnerabilities through user confirmation workflows. The result was a browser that provides an AI assistant alongside users, personalizing experiences and helping with tasks, while enabling their entire companyโ€”from CEO to strategy team membersโ€”to iterate on AI features.

Building an AI-Powered Help Desk with RAG and Model Evaluation

Vimeo

Vimeo developed a prototype AI help desk chat system that leverages RAG (Retrieval Augmented Generation) to provide accurate customer support responses using their existing Zendesk help center content. The system uses vector embeddings to store and retrieve relevant help articles, integrates with various LLM providers through Langchain, and includes comprehensive testing of different models (Google Vertex AI Chat Bison, GPT-3.5, GPT-4) for performance and cost optimization. The prototype demonstrates successful integration of modern LLMOps practices including prompt engineering, model evaluation, and production-ready architecture considerations.

Building an AI-Powered Interview Coach with Comprehensive Evaluation Framework

Product Talk

Teresa Torres, founder of Product Talk, describes her journey building an AI interview coach over four months to help students in her Continuous Discovery course practice customer interviewing skills. Starting from a position of limited AI engineering experience, she developed a production system that analyzes interview transcripts and provides detailed feedback across four dimensions of interviewing technique. The case study focuses extensively on her implementation of a comprehensive evaluation (eval) framework, including human annotation, code-based assertions, and LLM-as-judge evaluations, to ensure quality and reliability of the AI coach's feedback before deploying it to real students.

Building an AI-Powered Software Development Platform with Multiple LLM Integration

Lovable

Lovable addresses the challenge of making software development accessible to non-programmers by creating an AI-powered platform that converts natural language descriptions into functional applications. The solution integrates multiple LLMs (including OpenAI and Anthropic models) in a carefully orchestrated system that prioritizes speed and reliability over complex agent architectures. The platform has achieved significant success, with over 1,000 projects being built daily and a rapidly growing user base that doubled its paying customers in a recent month.

Building an Asynchronous Event-Driven Agentic Framework for AI-Powered App Building

Airtable

Airtable built a custom agentic framework to power AI features including Omni (conversational app builder) and Field Agents (AI-powered fields). The problem was that early AI capabilities couldn't handle complex tasks requiring dynamic decision-making, data retrieval, or multi-step reasoning. The solution was an asynchronous event-driven state machine architecture with three core components: a context manager for maintaining information, a tool dispatcher for executing predefined actions, and a decision engine (LLM-powered) for autonomous planning. The framework enables agents to reason through complex tasks, self-correct errors, and handle large context windows through trimming and summarization strategies, resulting in production AI agents capable of automating thousands of hours of work.

Building an Autonomous AI Software Engineer with Multi-Turn RL and Codebase Understanding

Devin

Cognition, the company behind Devon (an AI software engineer), addresses the challenge of enabling AI agents to work effectively within large, existing codebases where traditional LLMs struggle with limited context windows and complex dependencies. Their solution involves creating DeepWiki, a continuously-updated interactive knowledge graph and wiki system that indexes codebases using both code and metadata (pull requests, git history, team discussions), combined with Devon Search for deep codebase research, and custom post-training using multi-turn reinforcement learning to optimize models for specific narrow domains. Results include Devon being used by teams worldwide to autonomously go from ticket to pull request, the release of Kevin 32B (an open-source model achieving 91% correctness on CUDA kernel generation, outperforming frontier models like GPT-4), and thousands of open-source projects incorporating DeepWiki into their official documentation.

Building an Evaluation-First Development Strategy for AI Service Agents

Monday

Monday Service built an AI-native Enterprise Service Management platform featuring customizable, role-based AI agents to automate customer service across IT, HR, and Legal departments. The team embedded evaluation into their development cycle from Day 0, creating a dual-layered approach with offline "safety net" evaluations for regression testing and online "monitor" evaluations for real-time production quality. This eval-driven development framework, built on LangGraph agents with LangSmith and Vitest integration, achieved 8.7x faster evaluation feedback loops (from 162 seconds to 18 seconds), comprehensive testing across hundreds of examples in minutes, real-time end-to-end quality monitoring on production traces using multi-turn evaluators, and GitOps-style CI/CD deployment with evaluations managed as version-controlled code.

Building an LLM-Powered Support Response System

Stripe

Stripe developed an LLM-based system to help support agents handle customer inquiries more efficiently by providing relevant response prompts. The solution evolved from a simple GPT implementation to a sophisticated multi-stage framework incorporating fine-tuned models for question validation, topic classification, and response generation. Despite strong offline performance, the team faced challenges with agent adoption and online monitoring, leading to valuable lessons about the importance of UX consideration, online feedback mechanisms, and proper data management in LLM production systems.

Building Analytics Applications with LLMs for E-commerce Review Analysis

Microsoft

The case study explores how Large Language Models (LLMs) can revolutionize e-commerce analytics by analyzing customer product reviews. Traditional methods required training multiple models for different tasks like sentiment analysis and aspect extraction, which was time-consuming and lacked explainability. By implementing OpenAI's LLMs with careful prompt engineering, the solution enables efficient multi-task analysis including sentiment analysis, aspect extraction, and topic clustering while providing better explainability for stakeholders.

Building and Automating Comprehensive LLM Evaluation Framework for SNAP Benefits

Propel

Propel developed a sophisticated evaluation framework for testing and benchmarking LLM performance in handling SNAP (food stamp) benefit inquiries. The company created two distinct evaluation approaches: one for benchmarking current base models on SNAP topics, and another for product development. They implemented automated testing using Promptfoo and developed innovative ways to evaluate model responses, including using AI models as judges for assessing response quality and accessibility.

Building and Debugging Web Automation Agents with LangChain Ecosystem

Airtop

Airtop developed a web automation platform that enables AI agents to interact with websites through natural language commands. They leveraged the LangChain ecosystem (LangChain, LangSmith, and LangGraph) to build flexible agent architectures, integrate multiple LLM models, and implement robust debugging and testing processes. The platform successfully enables structured information extraction and real-time website interactions while maintaining reliability and scalability.

Building and Deploying an AI-Powered Incident Summary Generator

Incident.io

incident.io developed an AI feature to automatically generate and suggest incident summaries using OpenAI's models. The system processes incident updates, Slack conversations, and metadata to create comprehensive summaries that help newcomers get up to speed quickly. The feature achieved a 63% direct acceptance rate, with an additional 26% of suggestions being edited before use, demonstrating strong practical utility in production.

Building and Deploying Production LLM Code Review Agents: Architecture and Best Practices

Ellipsis

Ellipsis developed an AI-powered code review system that uses multiple specialized LLM agents to analyze pull requests and provide feedback. The system employs parallel comment generators, sophisticated filtering pipelines, and advanced code search capabilities backed by vector stores. Their approach emphasizes accuracy over latency, uses extensive evaluation frameworks including LLM-as-judge, and implements robust error handling. The system successfully processes GitHub webhooks and provides automated code reviews with high accuracy and low false positive rates.

Building and Deploying the Codex App: A Multi-Agent AI Development Environment

OpenAI

OpenAI's Codex team developed a dedicated GUI application for AI-powered coding that serves as a command center for multi-agent systems, moving beyond traditional IDE and terminal interfaces. The team addressed the challenge of making AI coding agents accessible to broader audiences while maintaining professional-grade capabilities for software developers. By combining the GPT-5.3 Codex model with agent skills, automations, and a purpose-built interface, they created a production system that enables delegation-based development workflows where users supervise AI agents performing complex coding tasks. The result was over one million downloads in the first week, widespread internal adoption at OpenAI including by research teams, and a strategic shift positioning AI coding tools for mainstream use, culminating in a Super Bowl advertisement.

Building and Evaluating a RAG-based Menopause Information Chatbot

Vira Health

Vira Health developed and evaluated an AI chatbot to provide reliable menopause information using peer-reviewed position statements from The Menopause Society. They implemented a RAG (Retrieval Augmented Generation) architecture using GPT-4, with careful attention to clinical safety and accuracy. The system was evaluated using both AI judges and human clinicians across four criteria: faithfulness, relevance, harmfulness, and clinical correctness, showing promising results in terms of safety and effectiveness while maintaining strict adherence to trusted medical sources.

Building and Evaluating Production AI Agents: From Function Calling to Complex Multi-Agent Systems

Google Deepmind

This case study explores the evolution of LLM-based systems in production through discussions with Raven Kumar from Google DeepMind about building products like Notebook LM, Project Mariner, and working with the Gemini and Gemma model families. The conversation covers the rapid progression from simple function calling to complex agentic systems capable of multi-step reasoning, the critical importance of evaluation harnesses as competitive advantages, and practical considerations around context engineering, tool orchestration, and model selection. Key insights include how model improvements are causing teams to repeatedly rebuild agent architectures, the importance of shipping products quickly to learn from real users, and strategies for evaluating increasingly complex multi-modal agentic systems across different scales from edge devices to cloud-based deployments.

Building and Evaluating Production Voice Agents: From Custom Infrastructure to Platform Solutions

Nomore Engineering

A team explored building a phone agent system for handling doctor appointments in Polish primary care, initially attempting to build their own infrastructure before evaluating existing platforms. They implemented a complex system involving speech-to-text, LLMs, text-to-speech, and conversation orchestration, along with comprehensive testing approaches. After building the complete system, they ultimately decided to use a third-party platform (Vapi.ai) due to the complexities of maintaining their own infrastructure, while gaining valuable insights into voice agent architecture and testing methodologies.

Building and Evolving a Production GenAI Application Stack

LinkedIn

LinkedIn's journey in developing their GenAI application tech stack, transitioning from simple prompt-based solutions to complex conversational agents. The company evolved from Java-based services to a Python-first approach using LangChain, implemented comprehensive prompt management, developed a skill-based task automation framework, and built robust conversational memory infrastructure. This transformation included migrating existing applications while maintaining production stability and enabling both commercial and fine-tuned open-source LLM deployments.

Building and Implementing a Company-Wide GenAI Strategy

Thumbtack

Thumbtack developed and implemented a comprehensive generative AI strategy focusing on three key areas: enhancing their consumer product with LLMs for improved search and data analysis, transforming internal operations through AI-powered business processes, and boosting employee productivity. They established new infrastructure and policies for secure LLM deployment, demonstrated value through early wins in policy violation detection, and successfully drove company-wide adoption through executive sponsorship and careful expectation management.

Building and Managing Production Agents with Testing and Evaluation Infrastructure

Nearpod

Nearpod, an edtech company, implemented a sophisticated agent-based architecture to help teachers generate educational content. They developed a framework for building, testing, and deploying AI agents with robust evaluation capabilities, ensuring 98-100% accuracy while managing costs. The system includes specialized agents for different tasks, an agent registry for reuse across teams, and extensive testing infrastructure to ensure reliable production deployment of non-deterministic systems.

Building and Operating a CLI-Based LLM Coding Assistant

Anthropic

Anthropic developed Claude Code, a CLI-based coding assistant that provides direct access to their Sonnet LLM for software development tasks. The tool started as an internal experiment but gained rapid adoption within Anthropic, leading to its public release. The solution emphasizes simplicity and Unix-like utility design principles, achieving an estimated 2-10x developer productivity improvement for active users while maintaining a pay-as-you-go pricing model averaging $6/day per active user.

Building and Operating an MCP Server for LLM-Powered Cloud Infrastructure Queries

CloudQuery

CloudQuery built a Model Context Protocol (MCP) server in Go to enable Claude and Cursor to directly query their cloud infrastructure database. They encountered significant challenges with LLM tool selection, context window limitations, and non-deterministic behavior. By rewriting tool descriptions to be longer and more domain-specific, renaming tools to better match user intent, implementing schema filtering to reduce token usage by 90%, and embedding recommended multi-tool workflows, they dramatically improved how the LLM engaged with their system. The solution transformed Claude's interaction from hallucinating queries to systematically following a discovery-to-execution pipeline.

Building and Operating Production LLM Agents: Lessons from the Trenches

Ellipsis

A comprehensive analysis of 15 months experience building LLM agents, focusing on the practical aspects of deployment, testing, and monitoring. The case study covers essential components of LLMOps including evaluation pipelines in CI, caching strategies for deterministic and cost-effective testing, and observability requirements. The author details specific challenges with prompt engineering, the importance of thorough logging, and the limitations of existing tools while providing insights into building reliable AI agent systems.

Building and Optimizing a RAG-based Customer Service Chatbot

HDI

HDI, a German insurance company, implemented a RAG-based chatbot system to help customer service agents quickly find and access information across multiple knowledge bases. The system processes complex insurance documents, including tables and multi-column layouts, using various chunking strategies and vector search optimizations. After 120 experiments to optimize performance, the production system now serves 800+ users across multiple business lines, handling 26 queries per second with 88% recall rate and 6ms query latency.

Building and Scaling a Production Generative AI Assistant for Professional Networking

LinkedIn

LinkedIn developed a generative AI-powered experience to enhance job searches and professional content browsing. The system uses a RAG-based architecture with specialized AI agents to handle different query types, integrating with internal APIs and external services. Key challenges included evaluation at scale, API integration, maintaining consistent quality, and managing computational resources while keeping latency low. The team achieved basic functionality quickly but spent significant time optimizing for production-grade reliability.

Building and Scaling AI-Powered Password Detection in Production

Github

Github developed and deployed Copilot secret scanning to detect generic passwords in codebases using AI/LLMs, addressing the limitations of traditional regex-based approaches. The team iteratively improved the system through extensive testing, prompt engineering, and novel resource management techniques, ultimately achieving a 94% reduction in false positives while maintaining high detection accuracy. The solution successfully scaled to handle enterprise workloads through sophisticated capacity management and workload-aware request handling.

Building and Scaling Codex: OpenAI's Production Coding Agent

OpenAI

OpenAI developed Codex, a coding agent that serves as an AI-powered software engineering teammate, addressing the challenge of accelerating software development workflows. The solution combines a specialized coding model (GPT-5.1 Codex Max), a custom API layer with features like context compaction, and an integrated harness that works through IDE extensions and CLI tools using sandboxed execution environments. Since launching and iterating based on user feedback in August, Codex has grown 20x, now serves many trillions of tokens per week, has become the most-served coding model both in first-party use and via API, and has enabled dramatic productivity gains including shipping the Sora Android app (which became the #1 app in the app store) in just 28 days with 2-3 engineers, demonstrating significant acceleration in production software development at scale.

Building and Scaling Conversational Voice AI Agents for Enterprise Go-to-Market

Thoughtly / Gladia

Thoughtly, a voice AI platform founded in late 2023, provides conversational AI agents for enterprise sales and customer support operations. The company orchestrates speech-to-text, large language models, and text-to-speech systems to handle millions of voice calls with sub-second latency requirements. By optimizing every layer of their stackโ€”from telephony providers to LLM inferenceโ€”and implementing sophisticated caching, conditional navigation, and evaluation frameworks, Thoughtly delivers 3x conversion rates over traditional methods and 15x ROI for customers. The platform serves enterprises with HIPAA and SOC 2 compliance while handling both inbound customer support and outbound lead activation at massive scale across multiple languages and regions.

Building and Scaling Enterprise LLMOps Platforms: From Team Topology to Production

Various

A comprehensive overview of how enterprises are implementing LLMOps platforms, drawing from DevOps principles and experiences. The case study explores the evolution from initial AI adoption to scaling across teams, emphasizing the importance of platform teams, enablement, and governance. It highlights the challenges of testing, model management, and developer experience while providing practical insights into building robust AI infrastructure that can support multiple teams within an organization.

Building and Scaling GitHub Copilot: From Prototype to Enterprise AI Coding Assistant

GitHub

GitHub shares the three-year journey of developing GitHub Copilot, an LLM-powered code completion tool, from concept to general availability. The team followed a "find it, nail it, scale it" framework to identify the problem space (helping developers code faster), create a smooth product experience through rapid iteration and A/B testing, and scale to enterprise readiness. Starting with a focused problem of function-level code completion in IDEs, they leveraged OpenAI's LLMs and Microsoft Azure infrastructure, implementing techniques like neighboring tabs processing, caching for consistency, and security filters. Through technical previews and community feedback, they achieved a 55% faster coding speed and 74% reduction in developer frustration, while addressing responsible AI concerns through code reference tools and vulnerability filtering.

Building and Scaling LLM Applications at Discord

Discord

Discord shares their comprehensive approach to building and deploying LLM-powered features, from ideation to production. They detail their process of identifying use cases, defining requirements, prototyping with commercial LLMs, evaluating prompts using AI-assisted evaluation, and ultimately scaling through either hosted or self-hosted solutions. The case study emphasizes practical considerations around latency, quality, safety, and cost optimization while building production LLM applications.

Building and Scaling Production Code Agents: Lessons from Replit

Replit

Replit developed and deployed a production-grade code agent that helps users create and modify code through natural language interaction. The team faced challenges in defining their target audience, detecting failure cases, and implementing comprehensive evaluation systems. They scaled from 3 to 20 engineers working on the agent, developed custom evaluation frameworks, and successfully launched features like rapid build mode that reduced initial application setup time from 7 to 2 minutes. The case study highlights key learnings in agent development, testing, and team scaling in a production environment.

Building and Scaling Production-Ready AI Agents: Lessons from Agent Force

Salesforce

Salesforce introduced Agent Force, a low-code/no-code platform for building, testing, and deploying AI agents in enterprise environments. The case study explores the challenges of moving from proof-of-concept to production, emphasizing the importance of comprehensive testing, evaluation, monitoring, and fine-tuning. Key insights include the need for automated evaluation pipelines, continuous monitoring, and the strategic use of fine-tuning to improve performance while reducing costs.

Building and Sunsetting Ada: An Internal LLM-Powered Chatbot Assistant

Leboncoin

Leboncoin, a French e-commerce platform, built Adaโ€”an internal LLM-powered chatbot assistantโ€”to provide employees with secure access to GenAI capabilities while protecting sensitive data from public LLM services. Starting in late 2023, the project evolved from a general-purpose Claude-based chatbot to a suite of specialized RAG-powered assistants integrated with internal knowledge sources like Confluence, Backstage, and organizational data. Despite achieving strong technical results and valuable learning outcomes around evaluation frameworks, retrieval optimization, and enterprise LLM deployment, the project was phased out in early 2025 in favor of ChatGPT Enterprise with EU data residency, allowing the team to redirect their expertise toward more user-facing use cases while reducing operational overhead.

Building and Testing a Production LLM-Powered Quiz Application

Google

A case study of transforming a traditional trivia quiz application into an LLM-powered system using Google's Vertex AI platform. The team evolved from using static quiz data to leveraging PaLM and later Gemini models for dynamic quiz generation, addressing challenges in prompt engineering, validation, and testing. They achieved significant improvements in quiz accuracy from 70% with Gemini Pro to 91% with Gemini Ultra, while implementing robust validation methods using LLMs themselves to evaluate quiz quality.

Building and Testing Production AI Applications at CircleCI

CircleCI

CircleCI shares their experience building AI-enabled applications like their error summarizer tool, focusing on the challenges of testing and evaluating LLM-powered applications in production. They discuss implementing model-graded evals, handling non-deterministic outputs, managing costs, and building robust testing strategies that balance thoroughness with practicality. The case study provides insights into applying traditional software development practices to AI applications while addressing unique challenges around evaluation, cost management, and scaling.

Building Ask Learn: A Large-Scale RAG-Based Knowledge Service for Azure Documentation

Microsoft

Microsoft's Skilling organization built "Ask Learn," a retrieval-augmented generation (RAG) system that powers AI-driven question-answering capabilities for Microsoft Q&A and serves as ground truth for Microsoft Copilot for Azure. Starting from a 2023 hackathon project, the team evolved a naรฏve RAG implementation into an advanced RAG system featuring sophisticated pre- and post-processing pipelines, continuous content ingestion from Microsoft Learn documentation, vector database management, and comprehensive evaluation frameworks. The system handles massive scale, provides accurate and verifiable answers, and serves multiple use cases including direct question answering, grounding data for other chat handlers, and fallback functionality when the Copilot cannot complete requested tasks.

Building Claude Code: Scaling AI-Powered Development from Terminal Prototype to Production

Anthropic

Anthropic's Boris Churnney, creator of Claude Code, describes the journey from an accidental terminal prototype in September 2024 to a production coding tool used by 70% of startups and responsible for 4% of all public commits globally. Starting as a simple API testing tool, Claude Code evolved through continuous user feedback and rapid iteration, with the entire codebase rewritten every few months to adapt to improving model capabilities. The tool achieved remarkable productivity gains at Anthropic itself, with engineers seeing 70% productivity increases per capita despite team doubling, and total productivity improvements of 150% since launch. The development philosophy centered on building for future model capabilities rather than current ones, anticipating improvements 6 months ahead, and minimizing scaffolding that would become obsolete with each new model release.

Building Effective Agents: Practical Framework and Design Principles

Anthropic

Anthropic presents a practical framework for building production-ready AI agents, addressing the challenge of when and how to deploy agentic systems effectively. The presentation introduces three core principles: selective use of agents for appropriate use cases, maintaining simplicity in design, and adopting the agent's perspective during development. The solution emphasizes a checklist-based approach for evaluating agent suitability considering task complexity, value justification, capability validation, and error costs. Results include successful deployment of coding agents and other domain-specific agents that share a common backbone of environment, tools, and system prompts, demonstrating that simple architectures can deliver sophisticated behavior when properly designed and iterated upon.

Building Enterprise-Grade GenAI Platform with Multi-Cloud Architecture

Coinbase

Coinbase developed CB-GPT, an enterprise GenAI platform, to address the challenges of deploying LLMs at scale across their organization. Initially focused on optimizing cost versus accuracy, they discovered that enterprise-grade LLM deployment requires solving for latency, availability, trust and safety, and adaptability to the rapidly evolving LLM landscape. Their solution was a multi-cloud, multi-LLM platform that provides unified access to models across AWS Bedrock, GCP VertexAI, and Azure, with built-in RAG capabilities, guardrails, semantic caching, and both API and no-code interfaces. The platform now serves dozens of internal use cases and powers customer-facing applications including a conversational chatbot launched in June 2024 serving all US consumers.

Building Enterprise-Ready AI Development Infrastructure from Day One

Windsurf

Codeium's journey in building their AI-powered development tools showcases how investing early in enterprise-ready infrastructure, including containerization, security, and comprehensive deployment options, enabled them to scale from individual developers to large enterprise customers. Their "go slow to go fast" approach in building proprietary infrastructure for code completion, retrieval, and agent-based development culminated in Windsurf IDE, demonstrating how thoughtful early architectural decisions can create a more robust foundation for AI tools in production.

Building Evaluation Frameworks for AI Product Managers: A Workshop on Production LLM Testing

Arize

This workshop, presented by Aman, an AI product manager at Arize, addresses the challenge of shipping reliable AI applications in production by establishing evaluation frameworks specifically designed for product managers. The problem identified is that LLMs inherently hallucinate and are non-deterministic, making traditional software testing approaches insufficient. The solution involves implementing "LLM as a judge" evaluation systems, building comprehensive datasets, running experiments with prompt variations, and establishing human-in-the-loop validation workflows. The approach demonstrates how product managers can move from "vibe coding" to "thrive coding" by using data-driven evaluation methods, prompt playgrounds, and continuous monitoring. Results show that systematic evaluation can catch issues like mismatched tone, missing features, and hallucinations before production deployment, though the workshop candidly acknowledges that evaluations themselves require validation and iteration.

Building Evaluation Systems for AI-Powered Healthcare at Scale

Sword Health

Sword Health developed Phoenix, an AI care specialist that provides clinical support to patients during physical therapy sessions and between appointments. The company addressed the challenge of deploying large language models safely in healthcare by implementing a comprehensive evaluation framework combining offline and online assessments. Their approach includes building diverse evaluation datasets through strategic sampling and synthetic data generation, developing multiple types of evaluators (human-based, code-based, and LLM-as-judge), conducting vibe checks before release, and maintaining continuous monitoring in production through guardrails, A/B testing, manual audits, and automated evaluation of production traces. This eval-driven development process enables iterative improvement, quality assurance, objective model comparison, and cost optimization while ensuring patient safety.

Building Fair Housing Guardrails for Real Estate LLMs: Zillow's Multi-Strategy Approach to Preventing Discrimination

Zillow

Zillow developed a comprehensive Fair Housing compliance system for LLMs in real estate applications, combining three distinct strategies to prevent discriminatory responses: prompt engineering, stop lists, and a custom classifier model. The system addresses critical Fair Housing Act requirements by detecting and preventing responses that could enable steering or discrimination based on protected characteristics. Using a BERT-based classifier trained on carefully curated and augmented datasets, combined with explicit stop lists and prompt engineering, Zillow created a dual-layer protection system that validates both user inputs and model outputs. The approach achieved high recall in detecting non-compliant content while maintaining reasonable precision, demonstrating how domain-specific guardrails can be successfully implemented for LLMs in regulated industries.

Building Fully Autonomous Coding Agents for Non-Technical Users

Replit

Replit developed autonomous coding agents designed specifically for non-technical users, evolving from basic code completion tools to fully autonomous agents capable of running for hours while handling all technical decisions. The company identified that autonomy shouldn't be conflated with long runtimes but rather defined by the agent's ability to make technical decisions without user intervention. Their solution involved three key pillars: leveraging frontier model capabilities, implementing comprehensive autonomous testing using browser automation and Playwright, and sophisticated context management through sub-agent orchestration. The approach reduced context compression needs significantly (from 35 to 45-50 memories per compression), enabled agents to run coherently for extended periods without technical user input, and achieved order-of-magnitude improvements in testing cost and latency compared to computer vision approaches.

Building Gemini Deep Research: An Agentic Research Assistant with Custom-Tuned Models

Google Deepmind

Google DeepMind developed Gemini Deep Research, an AI-powered research assistant that autonomously browses the web for 5-10 minutes to generate comprehensive research reports with citations. The product addresses the challenge of users wanting to go from "zero to 50" on new topics quickly, automating what would typically require opening dozens of browser tabs and hours of manual research. The team solved key technical challenges around agentic planning, transparent UX design with editable research plans, asynchronous orchestration, and post-training custom models (initially Gemini 1.5 Pro, moving toward 2.0 Flash) to reliably perform iterative web search and synthesis. The product launched in December 2024 and has been widely praised as potentially the most useful public-facing AI agent to date, with users reporting it can compress hours or days of research work into minutes.

Building Healthcare-Specific LLM Pipelines for Oncology Patient Timelines

Roche Diagnostics / John Snow Labs

Roche Diagnostics developed an AI-assisted data abstraction solution using healthcare-specific LLMs to extract and structure oncology patient timelines from unstructured clinical notes. The system leverages natural language processing and machine learning to automatically detect medical concepts, focusing particularly on chemotherapy treatment timelines. The solution addresses the challenge of processing diverse, unstructured healthcare data formats while maintaining high accuracy through domain-specific LLMs and carefully engineered prompts.

Building Internal LLM Tools with Security and Privacy Focus

Wealthsimple

Wealthsimple developed an internal LLM Gateway and suite of generative AI tools to enable secure and privacy-preserving use of LLMs across their organization. The gateway includes features like PII redaction, multi-model support, and conversation checkpointing. They achieved significant adoption with over 50% of employees using the tools, primarily for programming support, content generation, and information retrieval. The platform also enabled operational improvements like automated customer support ticket triaging using self-hosted models.

Building LinkedIn's First Production Agent: Hiring Assistant Platform and Architecture

LinkedIn

LinkedIn evolved from simple GPT-based collaborative articles to sophisticated AI coaches and finally to production-ready agents, culminating in their Hiring Assistant product announced in October 2025. The company faced the challenge of moving from conversational assistants with prompt chains to task automation using agent-based architectures that could handle high-scale candidate evaluation while maintaining quality and enabling rapid iteration. They built a comprehensive agent platform with modular sub-agent architecture, centralized prompt management, LLM inference abstraction, messaging-based orchestration for resilience, and a skill registry for dynamic tool discovery. The solution enabled parallel development of agent components, independent quality evaluation, and the ability to serve both enterprise recruiters and SMB customers with variations of the same underlying platform, processing thousands of candidate evaluations at scale while maintaining the flexibility to iterate on product design.

Building Low-Latency Voice AI Agents for Home Services

Elyos AI

Elyos AI built end-to-end voice AI agents for home services companies (plumbers, electricians, HVAC installers) to handle customer calls, emails, and messages 24/7. The company faced challenges achieving human-like conversation latency (targeting sub-400ms response times) while maintaining reliability and accuracy for complex workflows including appointment booking, payment processing, and emergency dispatch. Through careful orchestration, they optimized speech-to-text, LLM, and text-to-speech components, implemented just-in-time context engineering, state machine-based workflows, and parallel monitoring streams to achieve consistent performance with approximately 85% call automation (15% requiring human involvement).

Building Modular and Scalable RAG Systems with Hybrid Batch/Incremental Processing

Bell

Bell developed a sophisticated hybrid RAG (Retrieval Augmented Generation) system combining batch and incremental processing to handle both static and dynamic knowledge bases. The solution addresses challenges in managing constantly changing documentation while maintaining system performance. They created a modular architecture using Apache Beam, Cloud Composer (Airflow), and GCP services, allowing for both scheduled batch updates and real-time document processing. The system has been successfully deployed for multiple use cases including HR policy queries and dynamic Confluence documentation management.

Building Multi-Agent AI Systems for Developer Support and Infrastructure Operations

Electrolux

Electrolux, a Swedish home appliances manufacturer with over 100 years of history, developed "Infra Assistant," an AI-powered multi-agent system to support their internal development teams and reduce bottlenecks in their platform engineering organization. The company faced challenges with their small Site Reliability Engineering (SRE) team being overwhelmed with repetitive support requests via Slack channels. Using Amazon Bedrock agents with both retrieval-augmented generation (RAG) and multi-agent collaboration patterns, they built a sophisticated system that answers questions based on organizational documentation, executes operations via API integrations, and can even troubleshoot cloud infrastructure issues autonomously. The system has proven cost-efficient compared to manual effort, successfully handles repetitive tasks like access management, and provides context-aware responses by accessing multiple organizational knowledge sources, though challenges remain around response latency and achieving consistent accuracy across all interactions.

Building Multi-Agent Systems with MCP and Pydantic AI for Document Processing

Deepsense

Deepsense AI built a multi-agent system for a customer who operates a document processing platform that handles various file types and data sources at scale. The problem was to create both an MCP (Model Context Protocol) server for the platform's internal capabilities and a demonstration multi-agent system that could structure data on demand from documents. Using Pydantic AI as the core agent framework and Anthropic's Claude models, the team developed a solution where users specify goals for document processing, and the system automatically extracts structured information into tables. The implementation involved creating custom MCP servers, integrating with Databricks MCP, and applying 10 key lessons learned around tool design, token optimization, model selection, observability, testing, and security. The result was a modular, scalable system that demonstrates practical patterns for building production-ready agentic applications.

Building Omega: A Multi-Agent Sales Assistant Embedded in Slack

Netguru

Netguru developed Omega, an AI agent designed to support their sales team by automating routine tasks and reinforcing workflow processes directly within Slack. The problem they faced was that as their sales team scaled, key information became scattered across multiple systems (Slack, CRM, call transcripts, shared drives), slowing down coordination and making it difficult to maintain consistency with their Sales Framework 2.0. Omega was built as a modular, multi-agent system using AutoGen for role-based orchestration, deployed on serverless AWS infrastructure (Lambda, Step Functions) with integrations to Google Drive, Apollo, and BlueDot for call transcription. The solution provides context-aware assistance for preparing expert calls, summarizing sales conversations, navigating documentation, generating proposal feature lists, and tracking deal momentumโ€”all within the team's existing Slack workflow, resulting in improved efficiency and process consistency.

Building Product Copilots: Engineering Challenges and Best Practices

Various

A comprehensive study examining the challenges faced by 26 professional software engineers in building AI-powered product copilots. The research reveals significant pain points across the entire engineering process, including prompt engineering difficulties, orchestration challenges, testing limitations, and safety concerns. The study provides insights into the need for better tooling, standardized practices, and integrated workflows for developing AI-first applications.

Building Production Agentic AI Systems for IT Operations and Support Automation

WEX

WEX, a global commerce platform processing over $230 billion in transactions annually, built a production agentic AI system called "Chat GTS" to address their 40,000+ annual IT support requests. The company's Global Technology Services team developed specialized agents using AWS Bedrock and Agent Core Runtime to automate repetitive operational tasks, including network troubleshooting and autonomous EBS volume management. Starting with Q&A capabilities, they evolved into event-driven agents that can autonomously respond to CloudWatch alerts, execute remediation playbooks via SSM documents exposed as MCP tools, and maintain infrastructure drift through automated pull requests. The system went from pilot to production in under 3 months, now serving over 2,000 internal users, with multi-agent architectures handling both user-initiated chat interactions and autonomous incident response workflows.

Building Production Agentic Systems with Platform-Level LLMOps Features

Anthropic

Anthropic's presentation at the AI Engineer conference outlined their platform evolution for building high-performance agentic systems, using Claude Code as the primary example. The company identified three core challenges in production LLM deployments: harnessing model capabilities through API features, managing context windows effectively, and providing secure computational infrastructure for autonomous agent operation. Their solution involved developing platform-level features including extended thinking modes, tool use APIs, Model Context Protocol (MCP) for standardized external system integration, memory management for selective context retrieval, context editing capabilities, and secure code execution environments with container orchestration. The combination of memory tools and context editing demonstrated a 39% performance improvement on internal benchmarks, while their infrastructure solutions enabled Claude Code to run autonomously on web and mobile platforms with session persistence and secure sandboxing.

Building Production AI Agents and Agentic Platforms at Scale

Vercel

This AWS re:Invent 2025 session explores the challenges organizations face moving AI projects from proof-of-concept to production, addressing the statistic that 46% of AI POC projects are canceled before reaching production. AWS Bedrock team members and Vercel's director of AI engineering present a comprehensive framework for production AI systems, focusing on three critical areas: model switching, evaluation, and observability. The session demonstrates how Amazon Bedrock's unified APIs, guardrails, and Agent Core capabilities combined with Vercel's AI SDK and Workflow Development Kit enable rapid development and deployment of durable, production-ready agentic systems. Vercel showcases real-world applications including V0 (an AI-powered prototyping platform), Vercel Agent (an AI code reviewer), and various internal agents deployed across their organization, all powered by Amazon Bedrock infrastructure.

Building Production AI Agents for Enterprise HR, IT, and Finance Platform

Rippling

Rippling, an enterprise platform providing HR, payroll, IT, and finance solutions, has evolved its AI strategy from simple content summarization to building complex production agents that assist administrators and employees across their entire platform. Led by Anker, their head of AI, the company has developed agents that handle payroll troubleshooting, sales briefing automation, interview transcript summarization, and talent performance calibration. They've transitioned from deterministic workflow-based approaches to more flexible deep agent paradigms, leveraging LangChain and LangSmith for development and tracing. The company maintains a dual focus: embedding AI capabilities within their product for customers running businesses on their platform, and deploying AI internally to increase productivity across all teams. Early results show promise in handling complex, context-dependent queries that traditional rule-based systems couldn't address.

Building Production AI Agents with Advanced Testing, Voice Architecture, and Multi-Model Orchestration

Sierra

Sierra, an AI agent platform company, discusses their comprehensive approach to deploying LLMs in production for customer service automation across voice and chat channels. The company addresses fundamental challenges in productionizing AI agents including non-deterministic behavior, latency requirements, and quality assurance through novel solutions like simulation-based testing that runs thousands of parallel test scenarios, speculative execution for voice latency optimization, and constellation-based multi-model orchestration where 10-20 different models handle various aspects of each conversation. Their outcome-based pricing model aligns incentives with customer success, while their hybrid no-code/code platform enables both business and technical teams to collaboratively build, test, and deploy agents. The platform serves large enterprise customers across multiple industries, with agents handling millions of customer interactions in production environments.

Building Production AI Agents with API Platform and Multi-Modal Capabilities

Manus AI

Manus AI demonstrates their production-ready AI agent platform through a technical workshop showcasing their API and application framework. The session covers building complex AI applications including a Slack bot, web applications, browser automation, and invoice processing systems. The platform addresses key production challenges such as infrastructure scaling, sandboxed execution environments, file handling, webhook management, and multi-turn conversations. Through live demonstrations and code walkthroughs, the workshop illustrates how their platform enables developers to build and deploy AI agents that handle millions of daily conversations while providing consistent pricing and functionality across web, mobile, Slack, and API interfaces.

Building Production AI Agents: Lessons from Claude Code and Enterprise Deployments

Anthropic

Anthropic's Applied AI team shares learnings from building and deploying AI agents in production throughout 2024-2025, focusing on their Claude Code product and enterprise customer implementations. The presentation covers the evolution from simple Q&A chatbots and RAG systems to sophisticated agentic architectures that run LLMs in loops with tools. Key technical challenges addressed include context engineering, prompt optimization, tool design, memory management, and handling long-running tasks that exceed context windows. The team transitioned from workflow-based architectures (chained LLM calls with deterministic logic) to agent-based systems where models autonomously use tools to solve open-ended problems, resulting in more robust error handling and the ability to tackle complex tasks like multi-hour coding sessions.

Building Production AI Products: A Framework for Continuous Calibration and Development

OpenAI / Various

AI practitioners Aishwarya Raanti and Kiti Bottom, who have collectively supported over 50 AI product deployments across major tech companies and enterprises, present their framework for successfully building AI products in production. They identify that building AI products differs fundamentally from traditional software due to non-determinism on both input and output sides, and the agency-control tradeoff inherent in autonomous systems. Their solution involves a phased approach called Continuous Calibration Continuous Development (CCCD), which recommends starting with high human control and low AI agency, then gradually increasing autonomy as trust is built through behavior calibration. This iterative methodology, combined with a balanced approach to evaluation metrics and production monitoring, has helped companies avoid common pitfalls like premature full automation, inadequate reliability, and user trust erosion.

Building Production Analytics Agents with Semantic Layer Integration

Wobby

Wobby, a company that helps business teams get insights from their data warehouses in under one minute, shares their journey building production-ready analytics agents over two years. The team developed three specialized agents (Quick, Deep, and Steward) that work with semantic layers to answer business questions. Their solution emphasizes Slack/Teams integration for adoption, building their own semantic layer to encode business logic, preferring prompt-based logic over complex workflows, implementing comprehensive testing strategies beyond just evals, and optimizing for latency through caching and progressive disclosure. The approach led to successful adoption by clients, with analytics agents being actively used in production to handle ad-hoc business intelligence queries.

Building Production Evaluation Systems for GitHub Copilot at Scale

Github

This case study examines the challenges of building evaluation systems for AI products in production, drawing from the author's experience leading the evaluation team at GitHub Copilot serving 100M developers. The problem addressed was the gap between evaluation tooling and developer workflows, as most AI teams consist of engineers rather than data scientists, yet evaluation tools are designed for data science workflows. The solution involved building a comprehensive evaluation stack including automated harnesses for code completion testing, A/B testing infrastructure, and implicit user behavior metrics like acceptance rates. The results showed that while sophisticated evaluation systems are valuable, successful AI products in practice rely heavily on rapid iteration, monitoring in production, and "vibes-based" testing, with the dominant strategy being to ship fast and iterate based on real user feedback rather than extensive offline evaluation.

Building Production LLM Applications with DSPy Framework

AlixPartners

A technical consultant presents a comprehensive workshop on using DSPy, a declarative framework for building modular LLM-powered applications in production. The presenter demonstrates how DSPy enables rapid iteration on LLM applications by treating LLMs as first-class citizens in Python programs, with built-in support for structured outputs, type guarantees, tool calling, and automatic prompt optimization. Through multiple real-world use cases including document classification, contract analysis, time entry correction, and multi-modal processing, the workshop shows how DSPy's core primitivesโ€”signatures, modules, tools, adapters, optimizers, and metricsโ€”allow teams to build production-ready systems that are transferable across models, optimizable without fine-tuning, and maintainable at scale.

Building Production LLM Pipelines for Insurance Risk Assessment and Document Processing

Vouch

Vouch Insurance implemented a production machine learning system using Metaflow to handle risk classification and document processing for their technology-focused insurance business. The system combines traditional data warehousing with LLM-powered predictions, processing structured and unstructured data through hourly pipelines. They built a comprehensive stack that includes data transformation, LLM integration via OpenAI, and a FastAPI service layer with an SDK for easy integration by product engineers.

Building Production Multi-Agent Research Systems with Claude

Anthropic

Anthropic developed a production-grade multi-agent research system for their Claude Research feature that uses multiple LLM agents working in parallel to explore complex topics across web, Google Workspace, and integrated data sources. The system employs an orchestrator-worker pattern where a lead agent coordinates specialized subagents that search and filter information simultaneously, addressing challenges in agent coordination, evaluation, and reliability. Internal evaluations showed the multi-agent approach with Claude Opus 4 and Sonnet 4 outperformed single-agent Claude Opus 4 by 90.2% on research tasks, with token usage explaining 80% of performance variance, though the architecture consumes approximately 15ร— more tokens than standard chat interactions, requiring careful consideration of economic viability and deployment strategies.

Building Production Web Agents for Food Ordering

iFood

A team at Prosus built web agents to help automate food ordering processes across their e-commerce platforms. Rather than relying on APIs, they developed web agents that could interact directly with websites, handling complex tasks like searching, navigating menus, and placing orders. Through iterative development and optimization, they achieved an 80% success rate target for specific e-commerce tasks by implementing a modular architecture that separated planning and execution, combined with various operational modes for different scenarios.

Building Production-Grade Agentic AI Analytics: Lessons from Real-World Deployment

Tellius

Tellius shares hard-won lessons from building their agentic analytics platform that transforms natural language questions into trustworthy SQL-based insights. The core problem addressed is that chat-based analytics requires far more than simple text-to-SQL conversionโ€”it demands deterministic planning, governed semantic layers, ambiguity management, multi-step consistency, transparency, performance engineering, and comprehensive observability. Their solution architecture separates language understanding from execution through typed plan artifacts that validate against schemas and policies before execution, implements clarification workflows for ambiguous queries, maintains plan/result fingerprinting for consistency, provides inline transparency with preambles and lineage, enforces latency budgets across execution hops, and treats feedback as governed policy changes. The result is a production system that achieves determinism, explainability, and sub-second interactive performance while avoiding the common pitfalls that cause 95% of AI pilot failures.

Building Production-Grade AI Agents with Distributed Architecture and Error Recovery

Parcha

Parcha's journey in building enterprise-grade AI Agents for automating compliance and operations workflows, evolving from a simple Langchain-based implementation to a sophisticated distributed system. They overcame challenges in reliability, context management, and error handling by implementing async processing, coordinator-worker patterns, and robust error recovery mechanisms, while maintaining clean context windows and efficient memory management.

Building Production-Grade AI Agents with Guardrails, Context Management, and Security

Portia / Riff / Okta

This panel discussion features founders from Portia AI and Rift.ai (formerly Databutton) discussing the challenges of moving AI agents from proof-of-concept to production. The speakers address critical production concerns including guardrails for agent reliability, context engineering strategies, security and access control challenges, human-in-the-loop patterns, and identity management. They share real-world customer examples ranging from custom furniture makers to enterprise CRM enrichment, emphasizing that while approximately 40% of companies experimenting with AI have agents in production, the journey requires careful attention to trust, security, and supportability. Key solutions include conditional example-based prompting, sandboxed execution environments, role-based access controls, and keeping context windows smaller for better precision rather than utilizing maximum context lengths.

Building Production-Grade AI Agents with Observability, Evaluation, and Insights

Langchain

Langchain discusses the evolution of their LangSmith platform for managing AI agents in production, addressing the challenge of bringing rigor and reliability to deployed LLM applications. The company describes launching two major feature sets: Insights, which automatically discovers patterns and trends in millions of production traces to help teams understand user interactions and agent behavior, and thread-based evaluations, which enable assessment of multi-turn conversations and complete user sessions rather than just individual interactions. These features aim to help teams transition from informal "vibe testing" to more methodical approaches as agents move from initial prototypes to production deployments handling millions of daily traces, with the goal of reducing unknowns and improving reliability in production AI systems.

Building Production-Grade AI Agents: Overcoming Reasoning and Tool Challenges

Kentauros AI

Kentauros AI presents their experience building production-grade AI agents, detailing the challenges in developing agents that can perform complex, open-ended tasks in real-world environments. They identify key challenges in agent reasoning (big brain, little brain, and tool brain problems) and propose solutions through reinforcement learning, generalizable algorithms, and scalable data approaches. Their evolution from G2 to G5 agent architectures demonstrates practical solutions to memory management, task-specific reasoning, and skill modularity.

Building Production-Grade Generative AI Applications with Comprehensive LLMOps

Block (Square)

Block (Square) implemented a comprehensive LLMOps strategy across multiple business units using a combination of retrieval augmentation, fine-tuning, and pre-training approaches. They built a scalable architecture using Databricks' platform that allowed them to manage hundreds of AI endpoints while maintaining operational efficiency, cost control, and quality assurance. The solution enabled them to handle sensitive data securely, optimize model performance, and iterate quickly while maintaining version control and monitoring capabilities.

Building Production-Grade LLM Applications: An Architectural Guide

Github

A comprehensive technical guide on building production LLM applications, covering the five key steps from problem definition to evaluation. The article details essential components including input processing, enrichment tools, and responsible AI implementations, using a practical customer service example to illustrate the architecture and deployment considerations.

Building Production-Grade LLM Evaluation Systems for HR Tech Interview Intelligence

Zebra

Spotted Zebra, an HR tech company building AI-powered hiring software for large enterprises, faced challenges scaling their interview intelligence product when transitioning from slow research-phase development to rapid client-driven iterations. The company developed a comprehensive evaluation framework centered on six key lessons: codifying human judgment through golden examples, versioning prompts systematically, using LLM-as-a-judge for open-ended tasks, building adversarial testing banks, implementing robust API logging, and treating evaluation as a strategic capability. This approach enabled faster development cycles, improved product quality, better client communication around fairness and transparency, and successful compliance certification (ISO 42001), positioning them for EU AI Act requirements.

Building Production-Grade RAG Systems for Financial Document Analysis

Microsoft

Microsoft's team shares their experience implementing a production RAG system for analyzing financial documents, including analyst reports and SEC filings. They tackled complex challenges around metadata extraction, chart/graph analysis, and evaluation methodologies. The system needed to handle tens of thousands of documents, each containing hundreds of pages with tables, graphs, and charts spanning different time periods and fiscal years. Their solution incorporated multi-modal models for image analysis, custom evaluation frameworks, and specialized document processing pipelines.

Building Production-Ready Agentic AI Systems in Financial Services

Fitch Group

Jayeeta Putatunda, Director of AI Center of Excellence at Fitch Group, shares lessons learned from deploying agentic AI systems in the financial services industry. The discussion covers the challenges of moving from proof-of-concept to production, emphasizing the importance of evaluation frameworks, observability, and the "data prep tax" required for reliable AI agent deployments. Key insights include the need to balance autonomous agents with deterministic workflows, implement comprehensive logging at every checkpoint, combine LLMs with traditional predictive models for numerical accuracy, and establish strong business-technical partnerships to define success metrics. The conversation highlights that while agentic frameworks enable powerful capabilities, production success requires careful system design, multi-layered evaluation, human-in-the-loop validation patterns, and a focus on high-ROI use cases rather than chasing the latest model architectures.

Building Production-Ready Agentic Systems with the Claude Developer Platform

Anthropic

Anthropic's Claude Developer Platform team discusses their evolution from a simple API to a comprehensive platform for building autonomous AI agents in production. The conversation covers their philosophy of "unhobbling" models by reducing scaffolding and giving Claude more autonomous decision-making capabilities through tools like web search, code execution, and context management. They introduce the Claude Code SDK as a general-purpose agentic harness that handles the tool-calling loop automatically, making it easier for developers to prototype and deploy agents. The platform addresses key production challenges including prompt caching, context window management, observability for long-running tasks, and agentic memory, with a roadmap focused on higher-order abstractions and self-improving systems.

Building Production-Ready AI Agent Systems: Multi-Agent Orchestration and LLMOps at Scale

Galileo / Crew AI

This podcast discussion between Galileo and Crew AI leadership explores the challenges and solutions for deploying AI agents in production environments at enterprise scale. The conversation covers the technical complexities of multi-agent systems, the need for robust evaluation and observability frameworks, and the emergence of new LLMOps practices specifically designed for non-deterministic agent workflows. Key topics include authentication protocols, custom evaluation metrics, governance frameworks for regulated industries, and the democratization of agent development through no-code platforms.

Building Production-Ready AI Agents and Monitoring Systems

Portkey, Airbyte, Comet

The panel discussion and demo sessions showcase how companies like Portkey, Airbyte, and Comet are tackling the challenges of deploying LLMs and AI agents in production. They address key issues including monitoring, observability, error handling, data movement, and human-in-the-loop processes. The solutions presented range from AI gateways for enterprise deployments to experiment tracking platforms and tools for building reliable AI agents, demonstrating both the challenges and emerging best practices in LLMOps.

Building Production-Ready AI Agents for Internal Workflow Automation

Vercel

Vercel, a web hosting and deployment platform, addressed the challenge of identifying and implementing successful AI agent projects across their organization by focusing on employee pain pointsโ€”specifically repetitive, boring tasks that humans disliked. The company deployed three internal production agents: a lead processing agent that automated sales qualification and research (saving hundreds of days of manual work), an anti-abuse agent that accelerated content moderation decisions by 59%, and a data analyst agent that automated SQL query generation for business intelligence. Their methodology centered on asking employees "What do you hate most about your job?" to identify tasks that were repetitive enough for current AI models to handle reliably while still delivering high business impact.

Building Production-Ready AI Agents: OpenAI Codex CLI Architecture and Agent Loop Design

OpenAI

OpenAI's Codex CLI is a cross-platform software agent that executes reliable code changes on local machines, demonstrating production-grade LLMOps through its sophisticated agent loop architecture. The system orchestrates interactions between users, language models, and tools through an iterative process that manages inference calls, tool execution, and conversation state. Key technical achievements include stateless request handling for Zero Data Retention compliance, strategic prompt caching optimization to achieve linear rather than quadratic performance, automatic context window management through intelligent compaction, and robust handling of multi-turn conversations while maintaining conversation coherence across potentially hundreds of model-tool iterations.

Building Production-Ready AI Analytics Agents Through Advanced Prompt Engineering

Explai

Explai, a company building AI-powered data analytics companions, encountered significant challenges when deploying multi-agent LLM systems for enterprise analytics use cases. Their initial approach of pre-loading agent contexts with extensive domain knowledge, business information, and intermediate results led to context pollution and degraded instruction following at scale. Through iterative learning over two years, they developed three key prompt engineering tactics: reversing the traditional RAG approach by using trigger messages with pull-based document retrieval, writing structured artifacts instead of raw data to context, and allowing agents to generate full executable code in sandboxed environments. These tactics enabled more autonomous agent behavior while maintaining accuracy and reducing context window bloat, ultimately creating a more robust production system for complex, multi-step data analysis workflows.

Building Production-Ready AI Analytics with LLMs: Lessons from Jira Integration

Luna

Luna developed an AI-powered Jira analytics system using GPT-4 and Claude 3.7 to extract actionable insights from complex project management data, helping engineering and product teams track progress, identify risks, and predict delays. Through iterative development, they identified seven critical lessons for building reliable LLM applications in production, including the importance of data quality over prompt engineering, explicit temporal context handling, optimal temperature settings for structured outputs, chain-of-thought reasoning for accuracy, focused constraints to reduce errors, leveraging reasoning models effectively, and addressing the "yes-man" effect where models become overly agreeable rather than critically analytical.

Building Production-Ready Conversational AI Voice Agents: Latency, Voice Quality, and Integration Challenges

Deepgram

Deepgram, a leader in transcription services, shares insights on building effective conversational AI voice agents. The presentation covers critical aspects of implementing voice AI in production, including managing latency requirements (targeting 300ms benchmark), handling end-pointing challenges, ensuring voice quality through proper prosody, and integrating LLMs with speech-to-text and text-to-speech services. The company introduces their new text-to-speech product Aura, designed specifically for conversational AI applications with low latency and natural voice quality.

Building Production-Ready CRM Integration for ChatGPT using Model Context Protocol

Hubspot

HubSpot developed the first third-party CRM connector for ChatGPT using the Model Context Protocol (MCP), creating a remote MCP server that enables 250,000+ businesses to perform deep research through conversational AI without requiring local installations. The solution involved building a homegrown MCP server infrastructure using Java and Dropwizard, implementing OAuth-based user-level permissions, creating a distributed service discovery system for automatic tool registration, and designing a query DSL that allows AI models to generate complex CRM searches through natural language interactions.

Building Production-Ready Customer Support AI Agents: Challenges and Solutions

Gradient Labs

Gradient Labs shares their experience building and deploying AI agents for customer support automation in production. While prototyping with LLMs is relatively straightforward, deploying agents to production introduces complex challenges around state management, knowledge integration, tool usage, and handling race conditions. The company developed a state machine-based architecture with durable execution engines to manage these challenges, successfully handling hundreds of conversations per day with high customer satisfaction.

Building Production-Ready Healthcare AI That Scales With Model Progress

Anterior

This case study examines Anterior's experience building LLM-powered products for healthcare prior authorization over three years. The company faced the challenge of building production systems around rapidly evolving AI capabilities, where approaches designed around current model limitations could quickly become obsolete. Through experimentation with techniques like hierarchical query reasoning, finetuning, domain knowledge injection, and expert review systems, they learned which approaches compound with model progress versus those that compete with it. The result was a framework for "Sour Lesson-pilled" product development that emphasizes building systems that benefit from model improvements rather than being made redundant by them, with key surviving techniques including dynamic domain knowledge injection and scalable expert review infrastructure.

Building Production-Ready LLMs for Automated Code Repair: A Scalable IDE Integration Case Study

Replit

Replit tackled the challenge of automating code repair in their IDE by developing a specialized 7B parameter LLM that integrates directly with their Language Server Protocol (LSP) diagnostics. They created a production-ready system that can automatically fix Python code errors by processing real-time IDE events, operational transformations, and project snapshots. Using DeepSeek-Coder-Instruct-v1.5 as their base model, they implemented a comprehensive data pipeline with serverless verification, structured input/output formats, and GPU-accelerated inference. The system achieved competitive results against much larger models like GPT-4 and Claude-3, with their finetuned 7B model matching or exceeding the performance of these larger models on both academic benchmarks and real-world error fixes. The production system features low-latency inference, load balancing, and real-time code application, demonstrating successful deployment of an LLM system in a high-stakes development environment where speed and accuracy are crucial.

Building Production-Scale Code Completion Tools with Continuous Evaluation and Prompt Engineering

Gitlab

Gitlab's ModelOps team developed a sophisticated code completion system using multiple LLMs, implementing a continuous evaluation and improvement pipeline. The system combines both open-source and third-party LLMs, featuring a comprehensive architecture that includes continuous prompt engineering, evaluation benchmarks, and reinforcement learning to consistently improve code completion accuracy and usefulness for developers.

Building QueryAnswerBird: An AI Data Analyst with Text-to-SQL and RAG

Delivery Hero

Woowa Brothers, part of Delivery Hero, developed QueryAnswerBird (QAB), an LLM-based AI data analyst to address employee challenges with SQL query generation and data literacy. Through a company-wide survey, they identified that 95% of employees used data for work, but over half struggled with SQL due to time constraints or difficulty translating business logic into queries. The solution leveraged RAG, LangChain, and GPT-4 to build a Slack-integrated assistant that automatically generates SQL queries from natural language, interprets queries, validates syntax, and explores tables. After winning first place at an internal hackathon in 2023, a dedicated task force spent six months developing the production system with comprehensive LLMOps practices including A/B testing, monitoring dashboards, API load balancing, GPT caching, and CI/CD deployment, conducting over 500 tests to optimize performance.

Building QueryAnswerBird: An LLM-Powered AI Data Analyst with RAG and Text-to-SQL

Delivery Hero

Woowa Brothers, part of Delivery Hero, developed QueryAnswerBird (QAB), an LLM-based AI data analyst to address the challenge that while 95% of employees used data in their work, over half struggled with SQL proficiency and data extraction reliability. The solution leveraged GPT-4, RAG architecture, LangChain, and comprehensive LLMOps practices to create a Slack-based chatbot that could generate SQL queries from natural language, interpret queries, validate syntax, and provide data discovery features. The development involved building automated unstructured data pipelines with vector stores, implementing multi-chain RAG architecture with router supervisors, establishing LLMOps infrastructure including A/B testing and monitoring dashboards, and conducting over 500 experiments to optimize performance, resulting in a 24/7 accessible service that provides high-quality query responses within 30 seconds to 1 minute.

Building Reliable AI Agent Systems with Effect TypeScript Framework

14.ai

14.ai, an AI-native customer support platform, uses Effect, a TypeScript framework, to manage the complexity of building reliable LLM-powered agent systems that interact directly with end users. The company built a comprehensive architecture using Effect across their entire stack to handle unreliable APIs, non-deterministic model outputs, and complex workflows through strong type guarantees, dependency injection, retry mechanisms, and structured error handling. Their approach enables reliable agent orchestration with fallback strategies between LLM providers, real-time streaming capabilities, and comprehensive testing through dependency injection, resulting in more predictable and resilient AI systems.

Building Reliable AI Agents for Application Development with Multi-Agent Architecture

Replit

Replit developed an AI agent system to help users create applications from scratch, addressing the challenge of blank page syndrome in software development. They implemented a multi-agent architecture with manager, editor, and verifier agents, focusing on reliability and user engagement. The system incorporates advanced prompt engineering techniques, human-in-the-loop workflows, and comprehensive monitoring through LangSmith, resulting in a powerful tool that simplifies application development while maintaining user control and visibility.

Building Reliable AI Agents Through Production Monitoring and Intent Discovery

Raindrop

Raindrop, a monitoring platform for AI products, addresses the challenge of building reliable AI agents in production where traditional offline evaluations fail to capture real-world usage patterns. The company developed a "Sentry for AI products" approach that emphasizes experimentation, production monitoring, and discovering user intents through clustering and signal detection. Their solution combines explicit signals (like thumbs up/down, regenerations) and implicit signals (detecting refusals, task failures, user frustration) to identify issues that don't manifest as traditional software errors. The platform trains custom models to detect issues across production data at scale, enabling teams to discover unknown problems, track their impact on users, and fix them systematically without breaking existing functionality.

Building Reliable AI DevOps Agents: Engineering Practices for Nondeterministic LLM Output

Trunk

Trunk developed an AI DevOps agent to handle root cause analysis (RCA) for test failures in CI pipelines, facing challenges with nondeterministic LLM outputs. They applied traditional software engineering principles adapted for LLMs, including starting with narrow use cases, switching between models (Claude to Gemini) for better tool calling, implementing comprehensive testing with mocked LLM responses, and establishing feedback loops through internal usage and user feedback collection. The approach resulted in a more reliable agent that performs well on specific tasks like analyzing test failures and posting summaries to GitHub PRs.

Building Reliable Background Coding Agents with Verification Loops

Spotify

Spotify developed a background coding agent system to automate large-scale software maintenance across thousands of components, addressing the challenge of ensuring reliable and correct code changes without direct human supervision. The solution centers on implementing strong verification loops consisting of deterministic verifiers (for formatting, building, and testing) and an LLM-as-judge layer to prevent the agent from making out-of-scope changes. After generating over 1,500 pull requests, the system demonstrates that verification loops are essential for maintaining predictability, with the judge layer vetoing approximately 25% of proposed changes and the agent successfully course-correcting about half the time, significantly reducing the risk of functionally incorrect code reaching production.

Building Reliable LLM Workflows in Biotech Research

Moderna

Moderna Therapeutics applies large language models primarily for document reformatting and regulatory submission preparation within their research organization, deliberately avoiding autonomous agents in favor of highly structured workflows. The team, led by Eric Maher in research data science, focuses on automating what they term "intellectual drudgery" - reformatting laboratory records and experiment documentation into regulatory-compliant formats. Their approach prioritizes reliability over novelty, implementing rigorous evaluation processes matched to consequence levels, with particular emphasis on navigating the complex security and permission mapping challenges inherent in regulated biotech environments. The team employs a "non-LLM filter" methodology, only reaching for generative AI after exhausting simpler Python or traditional ML approaches, and leverages serverless infrastructure like Modal and reactive notebooks with Marimo to enable rapid experimentation and deployment.

Building Resilient Multi-Provider AI Agent Infrastructure for Financial Services

Gradient Labs

Gradient Labs built an AI agent that handles customer interactions for financial services companies, requiring high reliability in production. The company architected a sophisticated failover system that spans multiple LLM providers (OpenAI, Anthropic, Google) and hosting platforms (native APIs, Azure, AWS, GCP), enabling both traffic distribution across rate limits and automatic failover during errors, rate limiting, or latency spikes. They use Temporal for durable execution to checkpoint progress across long-running agentic workflows, and have implemented both provider-level and model-level failover strategies with tailored prompts for backup models, ensuring continuous operation even during catastrophic provider outages.

Building Robust Evaluation Systems for Auto-Generated Video Titles

Loom

Loom developed a systematic approach to evaluating and improving their AI-powered video title generation feature. They created a comprehensive evaluation framework combining code-based scorers and LLM-based judges, focusing on specific quality criteria like relevance, conciseness, and engagement. This methodical approach to LLMOps enabled them to ship AI features faster and more confidently while ensuring consistent quality in production.

Building Robust Evaluation Systems for GitHub Copilot

Github

This case study explores how Github developed and evolved their evaluation systems for Copilot, their AI code completion tool. Initially skeptical about the feasibility of code completion, the team built a comprehensive evaluation framework called "harness lib" that tested code completions against actual unit tests from open source repositories. As the product evolved to include chat capabilities, they developed new evaluation approaches including LLM-as-judge for subjective assessments, along with A/B testing and algorithmic evaluations for function calls. This systematic approach to evaluation helped transform Copilot from an experimental project to a robust production system.

Building Robust Legal Document Processing Applications with LLMs

Anzen

The case study explores how Anzen builds robust LLM applications for processing insurance documents in environments where accuracy is critical. They employ a multi-model approach combining specialized models like LayoutLM for document structure analysis with LLMs for content understanding, implement comprehensive monitoring and feedback systems, and use fine-tuned classification models for initial document sorting. Their approach demonstrates how to effectively handle LLM hallucinations and build production-grade systems with high accuracy (99.9% for document classification).

Building Scalable LLM Evaluation Systems for Healthcare Prior Authorization

Anterior

Anterior, a healthcare AI company, developed a scalable evaluation system for their LLM-powered prior authorization decision support tool. They faced the challenge of maintaining accuracy while processing over 100,000 medical decisions daily, where errors could have serious consequences. Their solution combines real-time reference-free evaluation using LLMs as judges with targeted human expert review, achieving an F1 score of 96% while keeping their clinical review team under 10 people, compared to competitors who employ hundreds of nurses.

Building Secure and Private Enterprise LLM Infrastructure

Slack

Slack implemented AI features by developing a secure architecture that ensures customer data privacy and compliance. They used AWS SageMaker to host LLMs in their VPC, implemented RAG instead of fine-tuning models, and maintained strict data access controls. The solution resulted in 90% of AI-adopting users reporting increased productivity while maintaining enterprise-grade security and compliance requirements.

Building Secure Generative AI Applications at Scale: Amazon's Journey from Experimental to Production

Amazon

Amazon faced the challenge of securing generative AI applications as they transitioned from experimental proof-of-concepts to production systems like Rufus (shopping assistant) and internal employee chatbots. The company developed a comprehensive security framework that includes enhanced threat modeling, automated testing through their FAST (Framework for AI Security Testing) system, layered guardrails, and "golden path" templates for secure-by-default deployments. This approach enabled Amazon to deploy customer-facing and internal AI applications while maintaining security, compliance, and reliability standards through continuous monitoring, evaluation, and iterative refinement processes.

Building State-of-the-Art AI Programming Agents with OpenAI's o1 Model

Weights & Biases

Weights & Biases developed an advanced AI programming agent using OpenAI's o1 model that achieved state-of-the-art performance on the SWE-Bench-Verified benchmark, successfully resolving 64.6% of software engineering issues. The solution combines o1 with custom-built tools, including a Python code editor toolset, memory components, and parallel rollouts with crosscheck mechanisms, all developed and evaluated using W&B's Weave toolkit and newly created Eval Studio platform.

Building Trustworthy LLM Agents for Automated Expense Management

Ramp

Ramp developed and deployed a suite of LLM-powered agents to automate expense management workflows, with a particular focus on their "policy agent" that automates expense approvals. The company faced the challenge of building AI systems that finance teams could trust in a domain where low-quality outputs could quickly erode confidence. Their solution emphasized explainable reasoning with citations, built-in uncertainty handling, collaborative context refinement, user-controlled autonomy levels, and comprehensive evaluation frameworks. Since deployment, the policy agent has handled over 65% of expense approvals autonomously, demonstrating that carefully designed LLM systems can deliver significant automation value while maintaining user trust through transparency and control.

Building Trustworthy LLM-Powered Agents for Automated Expense Management

Ramp

Ramp developed a suite of LLM-backed agents to automate expense management processes, focusing on building user trust through transparent reasoning, escape hatches for uncertainty, and collaborative context management. The team addressed the challenge of deploying LLMs in a finance environment where accuracy and trust are critical by implementing clear explanations for decisions, allowing users to control agent autonomy levels, and creating feedback loops for continuous improvement. Their policy agent now handles over 65% of expense approvals automatically while maintaining user confidence through transparent decision-making and the ability to defer to human judgment when uncertain.

Building Uma: In-House AI Research and Custom Fine-Tuning for Marketplace Intelligence

Upwork

Upwork developed Uma, their "mindful AI" assistant, by rejecting off-the-shelf LLM solutions in favor of building custom-trained models using proprietary platform data and in-house AI research. The company hired expert freelancers to create high-quality training datasets, generated synthetic data anchored in real platform interactions, and fine-tuned open-source LLMs specifically for hiring workflows. This approach enabled Uma to handle complex, business-critical tasks including crafting job posts, matching freelancers to opportunities, autonomously coordinating interviews, and evaluating candidates. The strategy resulted in models that substantially outperform generic alternatives on domain-specific tasks while reducing costs by up to 10x and improving reliability in production environments. Uma now operates as an increasingly agentic system that takes meaningful actions across the full hiring lifecycle.

Building Verifiable Retrieval Infrastructure for Agentic Systems

Hornet

Hornet is developing a retrieval engine specifically designed for AI agents, addressing the challenge that their API surface isn't in any LLM's pre-training data and traditional documentation-in-prompt approaches proved insufficient. Their solution centers on making the entire API surface verifiable through three validation layers (syntactic, semantic, and behavioral), structured similarly to code with configuration files that agents can write, edit, and test. This approach enables agents to not only use Hornet but to learn, configure, and optimize retrieval on their own through feedback loops, similar to how coding agents verify output through compilers and tests, ultimately creating self-improving systems where agents can tune their own context retrieval without human intervention.

Building Voice-Enabled AI Assistants with Real-Time Processing

Bee

A detailed exploration of building real-time voice-enabled AI assistants, featuring multiple approaches from different companies and developers. The case study covers how to achieve low-latency voice processing, transcription, and LLM integration for interactive AI assistants. Solutions demonstrated include both commercial services like Deepgram and open-source implementations, with a focus on achieving sub-second latency, high accuracy, and cost-effective deployment.

Challenges and Opportunities in Building Product Copilots: An Industry Interview Study

Microsoft / GitHub

Microsoft and GitHub researchers conducted a comprehensive interview study with 26 professional software engineers across various companies who are building AI-powered product copilotsโ€”conversational agents that assist users with natural language interactions. The study identified significant pain points across the entire engineering lifecycle, including the time-consuming and fragile nature of prompt engineering, difficulties in orchestration and managing multi-turn workflows, the lack of standardized testing and benchmarking approaches, challenges in learning best practices in a rapidly evolving field, and concerns around safety, privacy, and compliance. The research reveals that existing software engineering processes and tools have not yet adapted to the unique challenges of building AI-powered applications, leaving engineers to improvise without established best practices. Through subsequent brainstorming sessions, the researchers collaboratively identified opportunities for improved tooling, including prompt linters, automated benchmark creation, better visibility into model behavior, and more integrated development workflows.

Challenges in Building Enterprise Chatbots with LLMs: A Banking Case Study

Invento Robotics

A bank's attempt to implement a customer support chatbot using GPT-4 and RAG reveals the complexities and challenges of deploying LLMs in production. What was initially estimated as a three-month project struggled to deliver after a year, highlighting key challenges in domain knowledge management, retrieval effectiveness, conversation flow design, state management, latency, and regulatory compliance.

Claude Code Agent Architecture: Single-Threaded Master Loop for Autonomous Coding

Anthropic

Anthropic's Claude Code implements a production-ready autonomous coding agent using a deceptively simple architecture centered around a single-threaded master loop (codenamed nO) enhanced with real-time steering capabilities, comprehensive developer tools, and controlled parallelism through limited sub-agent spawning. The system addresses the complexity of autonomous code generation and editing by prioritizing debuggability and transparency over multi-agent swarms, using a flat message history design with TODO-based planning, diff-based workflows, and robust safety measures including context compression and permission systems. The architecture achieved significant user engagement, requiring Anthropic to implement weekly usage limits due to users running Claude Code continuously, demonstrating the effectiveness of the simple-but-disciplined approach to agentic system design.

Climate Tech Foundation Models for Environmental AI Applications

Various

Climate tech startups are leveraging Amazon SageMaker HyperPod to build specialized foundation models that address critical environmental challenges including weather prediction, sustainable material discovery, ecosystem monitoring, and geological modeling. Companies like Orbital Materials and Hum.AI are training custom models from scratch on massive environmental datasets, achieving significant breakthroughs such as tenfold performance improvements in carbon capture materials and the ability to see underwater from satellite imagery. These startups are moving beyond traditional LLM fine-tuning to create domain-specific models with billions of parameters that process multimodal environmental data including satellite imagery, sensor networks, and atmospheric measurements at scale.

Clinical-Grade Patient Education Agent with LangGraph and LangSmith

Lubu Labs

Lubu Labs built a production AI agent for a digital health platform that helps patients understand their health test results from camera-based scans measuring 30+ vital signs. The system needed to provide plain-language medical explanations, answer follow-up questions conversationally, and route uncertain cases to cliniciansโ€”all while meeting healthcare regulatory requirements. The solution used LangGraph for explicit control flow with confidence-based routing decisions, RAG over a versioned medical knowledge base, and LangSmith for audit-grade observability. Key results included approximately 15% of conversations appropriately triggering human review, an 80% accuracy rate in routing decisions validated by clinicians, a 40% reduction in false positive reviews after threshold tuning, and very low rates of inappropriate clinical advice in production validated through weekly audits.

Collaborative Prompt Engineering Platform for Production LLM Development

LinkedIn

LinkedIn developed a collaborative prompt engineering platform using Jupyter Notebooks to bridge the gap between technical and non-technical teams in developing LLM-powered features. The platform enabled rapid prototyping and testing of prompts, with built-in access to test data and external APIs, leading to successful deployment of features like AccountIQ which reduced company research time from two hours to five minutes. The solution addressed challenges in LLM configuration management, prompt template handling, and cross-functional collaboration while maintaining production-grade quality.

Comprehensive Debugging and Observability Framework for Production Agent AI Systems

DocuSign

The presentation addresses the critical challenge of debugging and maintaining agent AI systems in production environments. While many organizations are eager to implement and scale AI agents, they often hit productivity plateaus due to insufficient tooling and observability. The speaker proposes a comprehensive rubric for assessing AI agent systems' operational maturity, emphasizing the need for complete visibility into environment configurations, system logs, model versioning, prompts, RAG implementations, and fine-tuning pipelines across the entire organization.

Comprehensive LLM Evaluation Framework for Production AI Code Assistants

Github

Github describes their robust evaluation framework for testing and deploying new LLM models in their Copilot product. The team runs over 4,000 offline tests, including automated code quality assessments and chat capability evaluations, before deploying any model changes to production. They use a combination of automated metrics, LLM-based evaluation, and manual testing to assess model performance, quality, and safety across multiple programming languages and frameworks.

Comprehensive Security and Risk Management Framework for Enterprise LLM Deployments

PredictionGuard

PredictionGuard presents a comprehensive framework for addressing key challenges in deploying LLMs securely in enterprise environments. The case study outlines solutions for hallucination detection, supply chain vulnerabilities, server security, data privacy, and prompt injection attacks. Their approach combines traditional security practices with AI-specific safeguards, including the use of factual consistency models, trusted model registries, confidential computing, and specialized filtering layers, all while maintaining reasonable latency and performance.

Contact Center Transformation with AI-Powered Customer Service and Agent Assistance

Canada Life

Canada Life, a leading financial services company serving 14 million customers (one in three Canadians), faced significant contact center challenges including 5-minute average speed to answer, wait times up to 40 minutes, complex routing, high transfer rates, and minimal self-service options. The company migrated 21 business units from a legacy system to Amazon Connect in 7 months, implementing AI capabilities including chatbots, call summarization, voice-to-text, automated authentication, and proficiency-based routing. Results included 94% reduction in wait time, 10% reduction in average handle time, $7.5 million savings in first half of 2025, 92% reduction in average speed to answer (now 18 seconds), 83% chatbot containment rate, and 1900 calls deflected per week. The company plans to expand AI capabilities including conversational AI, agent assist, next best action, and fraud detection, projecting $43 million in cost savings over five years.

Context Engineering and Agent Development at Scale: Building Open Deep Research

LangChain

Lance Martin from LangChain discusses the emerging discipline of "context engineering" through his experience building Open Deep Research, a deep research agent that evolved over a year to become the best-performing open-source solution on Deep Research Bench. The conversation explores how managing context in production agent systemsโ€”particularly across dozens to hundreds of tool callsโ€”presents challenges distinct from simple prompt engineering, requiring techniques like context offloading, summarization, pruning, and multi-agent isolation. Martin's iterative development journey illustrates the "bitter lesson" for AI engineering: structured workflows that work well with current models can become bottlenecks as models improve, requiring engineers to continuously remove structure and embrace more general approaches to capture exponential model improvements.

Context Engineering and Tool Design for Background Coding Agents at Scale

Spotify

Spotify deployed a background coding agent to automate large-scale software maintenance across thousands of repositories, initially experimenting with open-source tools like Goose and Aider before building a custom agentic loop, and ultimately adopting Claude Code with the Anthropic Agent SDK. The primary challenge shifted from building the agent to effective context engineeringโ€”crafting prompts that produce reliable, mergeable pull requests at scale. Through extensive experimentation, Spotify developed prompt engineering principles (tailoring to the agent, stating preconditions, using examples, defining end states through tests) and designed a constrained tool ecosystem (limited bash commands, custom verify tool, git tool) to maintain predictability. The system has successfully merged approximately 50 migrations with thousands of AI-generated pull requests into production, demonstrating that careful prompt design and strategic tool limitation are critical for production LLM deployments in code generation scenarios.

Context Engineering for AI-Assisted Employee Onboarding

Etsy

Etsy explored using prompt engineering as an alternative to fine-tuning for AI-assisted employee onboarding, focusing on Travel & Entertainment policy questions and community forum support. They implemented a RAG-style approach using embeddings-based search to augment prompts with relevant Etsy-specific documents. The system achieved 86% accuracy on T&E policy questions and 72% on community forum queries, with various prompt engineering techniques like chain-of-thought reasoning and source citation helping to mitigate hallucinations and improve reliability.

Context Engineering for Background Coding Agents at Scale

Spotify

Spotify built a background coding agent system to automate large-scale software maintenance and migrations across thousands of repositories. The company initially experimented with open-source agents like Goose and Aider, then built a custom agentic loop, before ultimately adopting Claude Code from Anthropic. The core challenge centered on context engineeringโ€”crafting effective prompts and selecting appropriate tools to enable the agent to reliably generate mergeable pull requests. By developing sophisticated prompt engineering practices and carefully constraining the agent's toolset, Spotify has successfully applied this system to approximately 50 migrations with thousands of merged PRs across hundreds of repositories.

Context Engineering for Production AI Agents at Scale

Manus

Manus, a general AI agent platform, addresses the challenge of context explosion in long-running autonomous agents that can accumulate hundreds of tool calls during typical tasks. The company developed a comprehensive context engineering framework encompassing five key dimensions: context offloading (to file systems and sandbox environments), context reduction (through compaction and summarization), context retrieval (using file-based search tools), context isolation (via multi-agent architectures), and context caching (for KV cache optimization). This approach has been refined through five major refactors since launch in March, with the system supporting typical tasks requiring around 50 tool calls while maintaining model performance and managing token costs effectively through their layered action space architecture.

Context Engineering Strategies for Production AI Agents

Manus

Manus AI developed a production AI agent system that uses context engineering instead of fine-tuning to enable rapid iteration and deployment. The company faced the challenge of building an effective agentic system that could operate reliably at scale while managing complex multi-step tasks. Their solution involved implementing several key strategies including KV-cache optimization, tool masking instead of removal, file system-based context management, attention manipulation through task recitation, and deliberate error preservation for learning. These approaches allowed Manus to achieve faster development cycles, improved cost efficiency, and better agent performance across millions of users while maintaining system stability and scalability.

Context-Aware Item Recommendations Using Hybrid LLM and Embedding-Based Retrieval

DoorDash

DoorDash's Core Consumer ML team developed a GenAI-powered context shopping engine to address the challenge of lost user intent during in-app searches for items like "fresh vegetarian sushi." The traditional search system struggled to preserve specific user context, leading to generic recommendations and decision fatigue. The team implemented a hybrid approach combining embedding-based retrieval (EBR) using FAISS with LLM-based reranking to balance speed and personalization. The solution achieved end-to-end latency of approximately six seconds with store page loads under two seconds, while significantly improving user satisfaction through dynamic, personalized item carousels that maintained user context and preferences. This hybrid architecture proved more practical than pure LLM or deep neural network approaches by optimizing for both performance and cost efficiency.

Converting Natural Language to Structured GraphQL Queries Using LLMs

Cato Networks

Cato Networks implemented a natural language search interface for their SASE management console's events page using Amazon Bedrock's foundation models. They transformed free-text queries into structured GraphQL queries by employing prompt engineering and JSON schema validation, reducing query time from minutes to near-instant while making the system more accessible to new users and non-English speakers. The solution achieved high accuracy with an error rate below 0.05 while maintaining reasonable costs and latency.

Cost-Effective LLM Transaction Categorization for Business Banking

ANNA

ANNA, a UK business banking provider, implemented LLMs to automate transaction categorization for tax and accounting purposes across diverse business types. They achieved this by combining traditional ML with LLMs, particularly focusing on context-aware categorization that understands business-specific nuances. Through strategic optimizations including offline predictions, improved context utilization, and prompt caching, they reduced their LLM costs by 75% while maintaining high accuracy in their AI accountant system.

Deploying Agentic AI for Clinical Trial Protocol Deviation Monitoring

Bayezian Limited

Bayezian Limited deployed a multi-agent AI system to monitor protocol deviations in clinical trials, where traditional manual review processes were time-consuming and error-prone. The system used specialized LLM agents, each responsible for checking specific protocol rules (visit timing, medication use, inclusion criteria, etc.), working on top of a pipeline that processed clinical documents and used FAISS for semantic retrieval of protocol requirements. While the system successfully identified patterns early and improved reviewer efficiency by shifting focus from manual checking to intelligent triage, it encountered significant challenges including handover failures between agents, memory lapses causing coordination breakdowns, and difficulties handling real-world data ambiguities like time windows and exceptions. The team improved performance through structured memory snapshots, flexible prompt engineering, stronger handoff signals, and process tracking, ultimately creating a useful but imperfect system that highlighted the gap between agentic AI theory and production reality.

Deploying Agentic Code Review at Scale with GPT-5 Codex

OpenAI

OpenAI addresses the challenge of verifying AI-generated code at scale by deploying an autonomous code reviewer built on GPT-5-Codex and GPT-5.1-Codex-Max. As autonomous coding systems produce code volumes that exceed human oversight capacity, the risk of severe bugs and vulnerabilities increases. The solution involves training a dedicated agentic code reviewer with repository-wide tool access and code execution capabilities, optimizing for precision over recall to maintain developer trust and minimize false alarms. The system now reviews over 100,000 external PRs daily, with authors making code changes in response to 52.7% of comments internally, demonstrating actionable impact while maintaining a low "alignment tax" on developer workflows.

Deploying AI Agents for Scalable Immigration Automation

Navismart AI

Navismart AI developed a multi-agent AI system to automate complex immigration processes that traditionally required extensive human expertise. The platform addresses challenges including complex sequential workflows, varying regulatory compliance across different countries, and the need for human oversight in high-stakes decisions. Built on a modular microservices architecture with specialized agents handling tasks like document verification, form filling, and compliance checks, the system uses Kubernetes for orchestration and scaling. The solution integrates REST APIs for inter-agent communication, implements end-to-end encryption for security, and maintains human-in-the-loop capabilities for critical decisions. The team started with US immigration processes due to their complexity and is expanding to other countries and domains like education.

Deploying an AI SDR Chatbot for Lead Qualification with Production-Grade Observability

Lubu Labs

Lubu Labs deployed an AI SDR (Sales Development Representative) chatbot for a loyalty platform to qualify inbound leads, answer product questions, and route conversations appropriately. The implementation faced challenges around quality drift on real traffic, debugging complex tool and model interactions, and occasional duplicate CRM actions that could damage revenue operations. The team used LangSmith's tracing, feedback loops, and evaluation workflows to make the system debuggable and production-ready, implementing idempotent tool calls, structured state management with LangGraph, and regression testing against representative conversation datasets to ensure reliable operation.

Deploying Secure AI Agents in Highly Regulated Financial and Gaming Environments

Sicoob / Holland Casino

Two organizations operating in highly regulated industriesโ€”Sicoob, a Brazilian cooperative financial institution, and Holland Casino, a government-mandated Dutch gaming operatorโ€”share their approaches to deploying generative AI workloads while maintaining strict compliance requirements. Sicoob built a scalable infrastructure using Amazon EKS with GPU instances, leveraging open-source tools like Karpenter, KEDA, vLLM, and Open WebUI to run multiple open-source LLMs (Llama, Mistral, DeepSeek, Granite) for code generation, robotic process automation, investment advisory, and document interaction use cases, achieving cost efficiency through spot instances and auto-scaling. Holland Casino took a different path, using Anthropic's Claude models via Amazon Bedrock and developing lightweight AI agents using the Strands framework, later deploying them through Bedrock Agent Core to provide management stakeholders with self-service access to cost, security, and operational insights. Both organizations emphasized the importance of security, governance, compliance frameworks (including ISO 42001 for AI), and responsible AI practices while demonstrating that regulatory requirements need not inhibit AI adoption when proper architectural patterns and AWS services are employed.

Detecting and Mitigating Prompt Injection via Control Characters in ChatGPT

Dropbox

Dropbox's security team discovered that control characters like backspace and carriage return can be used to circumvent prompt constraints in OpenAI's GPT-3.5 and GPT-4 models. By inserting large sequences of these characters, they were able to make the models forget context and instructions, leading to prompt injection vulnerabilities. This research revealed previously undocumented behavior that could be exploited in LLM-powered applications, highlighting the importance of proper input sanitization for secure LLM deployments.

Distributed Agent Systems Architecture for AI Agent Platform

Dust.tt

Dust.tt, an AI agent platform that allows users to build custom AI agents connected to their data and tools, presented their technical approach to building distributed agent systems at scale. The company faced challenges with their original synchronous, stateless architecture when deploying AI agents that could run for extended periods, handle tool orchestration, and maintain state across failures. Their solution involved redesigning their infrastructure around a continuous orchestration loop with versioning systems for idempotency, using Temporal workflows for coordination, and implementing a database-driven communication protocol between agent components. This architecture enables reliable, scalable deployment of AI agents that can handle complex multi-step tasks while surviving infrastructure failures and preventing duplicate actions.

Document Processing Automation with LLMs: Evolution of Evaluation Strategies

Tola Capital / Klarity

Klarity, a document processing automation company, transformed their approach to evaluating LLM systems in production as they moved from traditional ML to generative AI. The company processes over half a million documents for B2B SaaS customers, primarily handling complex financial and accounting workflows. Their journey highlights the challenges and solutions in developing robust evaluation frameworks for LLM-powered systems, particularly focusing on non-deterministic performance, rapid feature development, and the gap between benchmark performance and real-world results.

Dogfooding AI Features in GitLab's Development Workflow

Gitlab

GitLab shares their experience of integrating and testing their AI-powered features suite, GitLab Duo, within their own development workflows. The case study demonstrates how different teams within GitLab leverage AI capabilities for various tasks including code review, documentation, incident response, and feature testing. The implementation has resulted in significant efficiency gains, reduced manual effort, and improved quality across their development processes.

Domain-Native LLM Application for Healthcare Insurance Administration

Anterior

Anterior, a clinician-led healthcare technology company, developed an AI system called Florence to automate medical necessity reviews for health insurance providers covering 50 million lives in the US. The company addressed the "last mile problem" in LLM applications by building an adaptive domain intelligence engine that enables domain experts to continuously improve model performance through systematic failure analysis, domain knowledge injection, and iterative refinement. Through this approach, they achieved 99% accuracy in care request approvals, moving beyond the 95% baseline achieved through model improvements alone.

Domain-Specific LLMs for Drug Discovery Biomarker Identification

BenchSci

BenchSci developed an AI platform for drug discovery that combines domain-specific LLMs with extensive scientific data processing to assist scientists in understanding disease biology. They implemented a RAG architecture that integrates their structured biomedical knowledge base with Google's Med-PaLM model to identify biomarkers in preclinical research, resulting in a reported 40% increase in productivity and reduction in processing time from months to days.

DragonCrawl: Uber's Journey to AI-Powered Mobile Testing Using Small Language Models

Uber

Uber developed DragonCrawl, an innovative AI-powered mobile testing system that uses a small language model (110M parameters) to automate app testing across multiple languages and cities. The system addressed critical challenges in mobile testing, including high maintenance costs and scalability issues across Uber's global operations. Using an MPNet-based architecture with a retriever-ranker approach, DragonCrawl achieved 99%+ stability in production, successfully operated in 85 out of 89 tested cities, and demonstrated remarkable adaptability to UI changes without requiring manual updates. The system proved particularly valuable by blocking ten high-priority bugs from reaching customers while significantly reducing developer maintenance time. Most notably, DragonCrawl exhibited human-like problem-solving behaviors, such as retrying failed operations and implementing creative solutions like app restarts to overcome temporary issues.

Dynamic Knowledge and Instruction RAG System for Production Chatbots

Wix

Wix developed an innovative approach to enhance their AI Site-Chat system by creating a hybrid framework that combines LLMs with traditional machine learning classifiers. They introduced DDKI-RAG (Dynamic Domain Knowledge and Instruction Retrieval-Augmented Generation), which addresses limitations of traditional RAG systems by enabling real-time learning and adaptability based on site owner feedback. The system uses a novel classification approach combining LLMs for feature extraction with CatBoost for final classification, allowing chatbots to continuously improve their responses and incorporate unwritten domain knowledge.

Dynamic Prompt Injection for Reliable AI Agent Behavior

Control Plain

Control Plain addressed the challenge of unreliable AI agent behavior in production environments by developing "intentional prompt injection," a technique that dynamically injects relevant instructions at runtime based on semantic matching rather than bloating system prompts with edge cases. Using an airline customer support agent as their test case, they demonstrated that this approach improved reliability from 80% to 100% success rates on challenging passenger modification scenarios while maintaining clean, maintainable prompts and avoiding "prompt debt."

Email Classification System Using Foundation Models and Prompt Engineering

Travelers Insurance

Travelers Insurance developed an automated email classification system using Amazon Bedrock and Anthropic's Claude models to categorize millions of service request emails into 13 different categories. Through advanced prompt engineering techniques and without model fine-tuning, they achieved 91% classification accuracy, potentially saving tens of thousands of manual processing hours. The system combines email text analysis, PDF processing using Amazon Textract, and foundation model-based classification in a serverless architecture.

End-to-End LLM Observability for RAG-Powered AI Assistant

Splunk

Splunk built an AI Assistant leveraging Retrieval-Augmented Generation (RAG) to answer FAQs using curated public content from .conf24 materials. The system was developed in a hackathon-style sprint using their internal CIRCUIT platform. To operationalize this LLM-powered application at scale, Splunk integrated comprehensive observability across the entire RAG pipelineโ€”from prompt handling and document retrieval to LLM generation and output evaluation. By instrumenting structured logs, creating unified dashboards in Splunk Observability Cloud, and establishing proactive alerts for quality degradation, hallucinations, and cost overruns, they achieved full visibility into response quality, latency, source document reliability, and operational health. This approach enabled rapid iteration, reduced mean time to resolution for quality issues, and established reproducible governance practices for production LLM deployments.

Engineering Principles and Practices for Production LLM Systems

Langchain

This case study captures insights from Lance Martin, ML engineer at Langchain, discussing the evolution from traditional ML to LLM-based systems and the emerging engineering discipline of building production GenAI applications. The discussion covers key challenges including the shift from model training to model orchestration, the need to continuously rearchitect systems as foundation models rapidly improve, and the critical importance of context engineering to manage token usage and prevent context degradation. Solutions explored include workflow versus agent architectures, the three-part context engineering playbook (reduce, offload, isolate), and evaluation strategies that emphasize user feedback and tracing over static benchmarks. Results demonstrate that teams like Manis have rearchitected their systems five times since March 2025, and that simpler approaches with proper observability often outperform complex architectures, with the understanding that today's solutions must be rebuilt as models improve.

Enhancing Healthcare Service Delivery with RAG and LLM-Powered Search

Accolade

Accolade, facing challenges with fragmented healthcare data across multiple platforms, implemented a Retrieval Augmented Generation (RAG) solution using Databricks' DBRX model to improve their internal search capabilities and customer service. By consolidating their data in a lakehouse architecture and leveraging LLMs, they enabled their teams to quickly access accurate information and better understand customer commitments, resulting in improved response times and more personalized care delivery.

Enhancing Memory Retrieval Systems Using LangSmith Testing and Evaluation

New Computer

New Computer improved their AI assistant Dot's memory retrieval system using LangSmith for testing and evaluation. By implementing synthetic data testing, comparison views, and prompt optimization, they achieved 50% higher recall and 40% higher precision in their dynamic memory retrieval system compared to their baseline implementation.

Enterprise Agent Orchestration Platform for Secure LLM Deployment

Airia

This case study explores how Airia developed an orchestration platform to help organizations deploy AI agents in production environments. The problem addressed is the significant complexity and security challenges that prevent businesses from moving beyond prototype AI agents to production-ready systems. The solution involves a comprehensive platform that provides agent building capabilities, security guardrails, evaluation frameworks, red teaming, and authentication controls. Results include successful deployments across multiple industries including hospitality (customer profiling across hotel chains), HR, legal (contract analysis), marketing (personalized content generation), and operations (real-time incident response through automated data aggregation), with customers reporting significant efficiency gains while maintaining enterprise security standards.

Enterprise Agentic AI Deployment: Panel Discussion on Production Realities and Technical Bottlenecks

Various

This panel discussion features leaders from Writer, You.com, Glean, and Google discussing the current state of deploying agentic AI systems in enterprise environments. The panelists address the gap between prototype development (which can now take 90 seconds) and production-ready systems that Fortune 500 companies can rely on. They identify key technical bottlenecks including data quality and governance issues, information retrieval challenges, function calling limitations, security vulnerabilities, and the difficulty of verifying agent actions. The consensus is that while every large enterprise has built some AI agents adding business value, they are far from having 50% of enterprise work handled by AI, with action agents for larger enterprises likely requiring several more years for major adoption.

Enterprise AI Adoption Journey: From Experimentation to Core Operations

Credal

A comprehensive analysis of how enterprises adopt and scale AI/LLM technologies, based on observations from multiple companies. The journey typically progresses through four stages: early experimentation, chat with docs workflows, enterprise search, and core operations integration. The case study explores key challenges including data security, use case discovery, and technical implementation hurdles, while providing insights into critical decisions around build vs. buy, platform selection, and LLM provider strategy.

Enterprise AI Agents with Structured and Unstructured Data Integration

Snowflake

This case study explores the challenges and solutions for deploying AI agents in enterprise environments, focusing on the integration of structured database data with unstructured documents through retrieval augmented generation (RAG). The presentation by Snowflake's Jeff Holland outlines a comprehensive agentic workflow that addresses common enterprise challenges including semantic mapping, ambiguity resolution, data model complexity, and query classification. The solution demonstrates a working prototype with fitness wearable company Whoop, showing how agents can combine sales data, manufacturing data, and forecasting information with unstructured Slack conversations to provide real-time business intelligence and recommendations for product launches.

Enterprise AI Platform Deployment for Multi-Company Productivity Enhancement

Payfit, Alan

This case study presents the deployment of Dust.tt's AI platform across multiple companies including Payfit and Alan, focusing on enterprise-wide productivity improvements through LLM-powered assistants. The companies implemented a comprehensive AI strategy involving both top-down leadership support and bottom-up adoption, creating custom assistants for various workflows including sales processes, customer support, performance reviews, and content generation. The implementation achieved significant productivity gains of approximately 20% across teams, with some specific use cases reaching 50% improvements, while addressing challenges around security, model selection, and user adoption through structured rollout processes and continuous iteration.

Enterprise Data Extraction Evolution from Simple RAG to Multi-Agent Architecture

Box

Box, a B2B unstructured data platform serving Fortune 500 companies, initially built a straightforward LLM-based metadata extraction system that successfully processed 10 million pages but encountered limitations with complex documents, OCR challenges, and scale requirements. They evolved from a simple pre-process-extract-post-process pipeline to a sophisticated multi-agent architecture that intelligently handles document complexity, field grouping, and quality feedback loops, resulting in a more robust and easily evolving system that better serves enterprise customers' diverse document processing needs.

Enterprise Document Data Extraction Using Agentic AI Workflows

Box

Box, an enterprise content platform serving over 115,000 customers including two-thirds of the Fortune 500, transformed their document data extraction capabilities by evolving from simple single-shot LLM prompting to sophisticated agentic AI workflows. Initially successful with basic document extraction using off-the-shelf models like GPT, Box encountered significant challenges when customers demanded extraction from complex 300-page documents with hundreds of fields, multilingual content, and poor OCR quality. The company implemented an agentic architecture using directed graphs that orchestrate multiple AI models, tools for validation and cross-checking, and iterative refinement processes. This approach dramatically improved accuracy and reliability while maintaining the flexibility to handle diverse document types and complex extraction requirements across their enterprise customer base.

Enterprise Infrastructure Challenges for Agentic AI Systems in Production

Various (Meta / Google / Monte Carlo / Azure)

A panel discussion featuring engineers from Meta, Google, Monte Carlo, and Microsoft Azure explores the fundamental infrastructure challenges that arise when deploying autonomous AI agents in production environments. The discussion reveals that agentic workloads differ dramatically from traditional software systems, requiring complete reimagining of reliability, security, networking, and observability approaches. Key challenges include non-deterministic behavior leading to incidents like chatbots selling cars for $1, massive scaling requirements as agents work continuously, and the need for new health checking mechanisms, semantic caching, and comprehensive evaluation frameworks to manage systems where 95% of outcomes are unknown unknowns.

Enterprise LLM Application Development: GitHub Copilot's Journey

Github

GitHub shares their three-year journey of developing and scaling GitHub Copilot, their enterprise-grade AI code completion tool. The case study details their approach through three stages: finding the right problem space, nailing the product experience through rapid iteration and testing, and scaling the solution for enterprise deployment. The result was a successful launch that showed developers coding up to 55% faster and reporting 74% less frustration when coding.

Enterprise LLM Implementation Panel: Lessons from Box, Glean, Tyace, Security AI and Citibank

Various

A panel discussion featuring leaders from multiple enterprises sharing their experiences implementing LLMs in production. The discussion covers key challenges including data privacy, security, cost management, and enterprise integration. Speakers from Box discuss content management challenges, Glean covers enterprise search implementations, Tyace shares content generation experiences, Security AI addresses data safety, and Citibank provides CIO perspective on enterprise-wide AI deployment. The panel emphasizes the importance of proper data governance, security controls, and the need for systematic approach to move from POCs to production.

Enterprise LLMOps Platform with Focus on Model Customization and API Optimization

IBM

IBM's Watson X platform addresses enterprise LLMOps challenges by providing a comprehensive solution for model access, deployment, and customization. The platform offers both open-source and proprietary models, focusing on specialized use cases like banking and insurance, while emphasizing API optimization for LLM interactions and robust evaluation capabilities. The case study highlights how enterprises are implementing LLMOps at scale with particular attention to data security, model evaluation, and efficient API design for LLM consumption.

Enterprise LLMOps: Development, Operations and Security Framework

Cisco

At Cisco, the challenge of integrating LLMs into enterprise-scale applications required developing new DevSecOps workflows and practices. The presentation explores how Cisco approached continuous delivery, monitoring, security, and on-call support for LLM-powered applications, showcasing their end-to-end model for LLMOps in a large enterprise environment.

Enterprise Neural Machine Translation at Scale

DeepL

DeepL, a translation company founded in 2017, has built a successful enterprise-focused business using neural machine translation models to tackle the language barrier problem at scale. The company handles hundreds of thousands of customers by developing specialized neural translation models that balance accuracy and fluency, training them on curated parallel and monolingual corpora while leveraging context injection rather than per-customer fine-tuning for scalability. By building their own GPU infrastructure early on and developing custom frameworks for inference optimization, DeepL maintains a competitive edge over general-purpose LLMs and established players like Google Translate, demonstrating strong product-market fit in high-stakes enterprise use cases where translation quality directly impacts legal compliance, customer experience, and business operations.

Enterprise RAG-Based Virtual Assistant with LLM Evaluation Pipeline

Santalucรญa Seguros

Santalucรญa Seguros implemented a GenAI-based Virtual Assistant to improve customer service and agent productivity in their insurance operations. The solution uses a RAG framework powered by Databricks and Microsoft Azure, incorporating MLflow for LLMOps and Mosaic AI Model Serving for LLM deployment. They developed a sophisticated LLM-based evaluation system that acts as a judge for quality assessment before new releases, ensuring consistent performance and reliability of the virtual assistant.

Enterprise Unstructured Data Quality Management for Production AI Systems

Anomalo

Anomalo addresses the critical challenge of unstructured data quality in enterprise AI deployments by building an automated platform on AWS that processes, validates, and cleanses unstructured documents at scale. The solution automates OCR and text parsing, implements continuous data observability to detect anomalies, enforces governance and compliance policies including PII detection, and leverages Amazon Bedrock for scalable LLM-based document quality analysis. This approach enables enterprises to transform their vast collections of unstructured text data into trusted assets for production AI applications while reducing operational burden, optimizing costs, and maintaining regulatory compliance.

Enterprise-Scale AI-First Translation Platform with Agentic Workflows

Smartling

Smartling operates an enterprise-scale AI-first agentic translation delivery platform serving major corporations like Disney and IBM. The company addresses challenges around automation, centralization, compliance, brand consistency, and handling diverse content types across global markets. Their solution employs multi-step agentic workflows where different model functions validate each other's outputs, combining neural machine translation with large language models, RAG for accessing validated linguistic assets, sophisticated prompting, and automated post-editing for hyper-localization. The platform demonstrates measurable improvements in throughput (from 2,000 to 6,000-7,000 words per day), cost reduction (4-10x cheaper than human translation), and quality approaching 70% human parity for certain language pairs and content types, while maintaining enterprise requirements for repeatability, compliance, and brand voice consistency.

Enterprise-Scale Cloud Event Management with Generative AI for Operational Intelligence

Fidelity Investments

Fidelity Investments faced the challenge of managing massive volumes of AWS health events and support case data across 2,000+ AWS accounts and 5 million resources in their multi-cloud environment. They built CENTS (Cloud Event Notification Transport Service), an event-driven data pipeline that ingests, enriches, routes, and acts on AWS health and support data at scale. Building upon this foundation, they developed and published the MAKI (Machine Augmented Key Insights) framework using Amazon Bedrock, which applies generative AI to analyze support cases and health events, identify trends, provide remediation guidance, and enable agentic workflows for vulnerability detection and automated code fixes. The solution reduced operational costs by 57%, improved stakeholder engagement through targeted notifications, and enabled proactive incident prevention by correlating patterns across their infrastructure.

Enterprise-Scale Deployment of AI Ambient Scribes Across Multiple Healthcare Systems

Memorial Sloan Kettering / McLeod Health / UCLA

This panel discussion features three major healthcare systemsโ€”McLeod Health, Memorial Sloan Kettering Cancer Center, and UCLA Healthโ€”discussing their experiences deploying generative AI-powered ambient clinical documentation (AI scribes) at scale. The organizations faced challenges in vendor evaluation, clinician adoption, and demonstrating ROI while addressing physician burnout and documentation burden. Through rigorous evaluation processes including randomized controlled trials, head-to-head vendor comparisons, and structured pilots, these systems successfully deployed AI scribes to hundreds to thousands of physicians. Results included significant reductions in burnout (20% at UCLA), improved patient satisfaction scores (5-6% increases at McLeod), time savings of 1.5-2 hours per day, and positive financial ROI through improved coding and RVU capture. Key learnings emphasized the importance of robust training, encounter-based pricing models, workflow integration, and managing expectations that AI scribes are not a universal solution for all specialties and clinicians.

Enterprise-Scale Healthcare LLM System for Unified Patient Journeys

John Snow Labs

John Snow Labs developed a comprehensive healthcare LLM system that integrates multimodal medical data (structured, unstructured, FHIR, and images) into unified patient journeys. The system enables natural language querying across millions of patient records while maintaining data privacy and security. It uses specialized healthcare LLMs for information extraction, reasoning, and query understanding, deployed on-premises via Kubernetes. The solution significantly improves clinical decision support accuracy and enables broader access to patient data analytics while outperforming GPT-4 in medical tasks.

Enterprise-Scale LLM Deployment with Licensed Content for Business Intelligence

Factiva

Factiva, a Dow Jones business intelligence platform, implemented a secure, enterprise-scale LLM solution for their content aggregation service. They developed "Smart Summaries" that allows natural language querying across their vast licensed content database of nearly 3 billion articles. The implementation required securing explicit GenAI licensing agreements from thousands of publishers, ensuring proper attribution and royalty tracking, and deploying a secure cloud infrastructure using Google's Gemini model. The solution successfully launched in November 2023 with 4,000 publishers, growing to nearly 5,000 publishers by early 2024.

Enterprise-Scale LLM Deployment with Self-Evolving Models and Graph-Based RAG

Writer

Writer, an enterprise AI company founded in 2020, has evolved from building basic transformer models to delivering full-stack GenAI solutions for Fortune 500 companies. They've developed a comprehensive approach to enterprise LLM deployment that includes their own Palmera model series, graph-based RAG systems, and innovative self-evolving models. Their platform focuses on workflow automation and "action AI" in industries like healthcare and financial services, achieving significant efficiency gains through a hybrid approach that combines both no-code interfaces for business users and developer tools for IT teams.

Enterprise-Scale LLM Integration into CRM Platform

Salesforce

Salesforce developed Einstein GPT, the first generative AI system for CRM, to address customer expectations for faster, personalized responses and automated tasks. The solution integrates LLMs across sales, service, marketing, and development workflows while ensuring data security and trust. The implementation includes features like automated email generation, content creation, code generation, and analytics, all grounded in customer-specific data with human-in-the-loop validation.

Enterprise-Scale LLM Platform with Multi-Model Support and Copilot Customization

Telus

Telus developed Fuel X, an enterprise-scale LLM platform that provides centralized management of multiple AI models and services. The platform enables creation of customized copilots for different use cases, with over 30,000 custom copilots built and 35,000 active users. Key features include flexible model switching, enterprise security, RAG capabilities, and integration with workplace tools like Slack and Google Chat. Results show significant impact, including 46% self-resolution rate for internal support queries and 21% reduction in agent interactions.

Enterprise-Scale Prompt Engineering Toolkit with Lifecycle Management and Production Integration

Uber

Uber developed a comprehensive prompt engineering toolkit to address the challenges of managing and deploying LLMs at scale. The toolkit provides centralized prompt template management, version control, evaluation frameworks, and production deployment capabilities. It includes features for prompt creation, iteration, testing, and monitoring, along with support for both offline batch processing and online serving. The system integrates with their existing infrastructure and supports use cases like rider name validation and support ticket summarization.

Enterprise-Wide LLM Framework for Manufacturing and Knowledge Management

Toyota

Toyota implemented a comprehensive LLMOps framework to address multiple production challenges, including battery manufacturing optimization, equipment maintenance, and knowledge management. The team developed a unified framework combining LangChain and LlamaIndex capabilities, with special attention to data ingestion pipelines, security, and multi-language support. Key applications include Battery Brain for manufacturing expertise, Gear Pal for equipment maintenance, and Project Cura for knowledge management, all showing significant operational improvements including reduced downtime and faster problem resolution.

Enterprise-Wide RAG Implementation with Amazon Q Business

Principal Financial

Principal Financial implemented Amazon Q Business to address challenges with scattered enterprise knowledge and inefficient search capabilities across multiple repositories. The solution integrated QnABot on AWS with Amazon Q Business to enable natural language querying of over 9,000 pages of work instructions. The implementation resulted in 84% accuracy in document retrieval, with 97% of queries receiving positive feedback and users reporting 50% reduction in some workloads. The project demonstrated successful scaling from proof-of-concept to enterprise-wide deployment while maintaining strict governance and security requirements.

Error Handling in LLM Systems

Uber

This case study examines a common scenario in LLM systems where proper error handling and response validation is essential. The "Not Acceptable" error demonstrates the importance of implementing robust error handling mechanisms in production LLM applications to maintain system reliability and user experience.

Eval-Driven Development for AI Applications

Vercel

Vercel presents their approach to building and deploying AI applications through eval-driven development, moving beyond traditional testing methods to handle AI's probabilistic nature. They implement a comprehensive evaluation system combining code-based grading, human feedback, and LLM-based assessments to maintain quality in their v0 product, an AI-powered UI generation tool. This approach creates a positive feedback loop they call the "AI-native flywheel," which continuously improves their AI systems through data collection, model optimization, and user feedback.

Evaluating Long Context Performance in Legal AI Applications

Thomson Reuters

Thomson Reuters details their comprehensive approach to evaluating and deploying long-context LLMs in their legal AI assistant CoCounsel. They developed rigorous testing protocols to assess LLM performance with lengthy legal documents, implementing a multi-LLM strategy rather than relying on a single model. Through extensive benchmarking and testing, they found that using full document context generally outperformed RAG for most document-based legal tasks, leading to strategic decisions about when to use each approach in production.

Evaluating Product Image Integrity in AI-Generated Advertising Content

Microsoft

Microsoft worked with an advertising customer to enable 1:1 ad personalization while ensuring product image integrity in AI-generated content. They developed a comprehensive evaluation system combining template matching, Mean Squared Error (MSE), Peak Signal to Noise Ratio (PSNR), and Cosine Similarity to verify that AI-generated backgrounds didn't alter the original product images. The solution successfully enabled automatic verification of product image fidelity in AI-generated advertising materials.

Evaluation Driven Development for LLM Reliability at Scale

Dosu

Dosu, a company providing an AI teammate for software development and maintenance, implemented Evaluation Driven Development (EDD) to ensure reliability of their LLM-based product. As their system scaled to thousands of repositories, they integrated LangSmith for monitoring and evaluation, enabling them to identify failure modes, maintain quality, and continuously improve their AI assistant's performance through systematic testing and iteration.

Evaluation Patterns for Deep Agents in Production

Langchain

LangChain built and deployed four production applications powered by "Deep Agents" - stateful, long-running AI agents capable of complex tasks including coding, email assistance, and agent building. The challenge was developing comprehensive evaluation strategies for these agents that went beyond traditional LLM evaluation approaches. Their solution involved five key patterns: bespoke test logic for each datapoint with custom assertions, single-step evaluations for validating specific decision points, full agent turn testing for end-to-end behavior, multi-turn conversations with conditional logic to simulate realistic interactions, and proper environment setup with clean, reproducible test conditions. Using LangSmith's Pytest and Vitest integrations, they implemented flexible evaluation frameworks that could assess agent trajectories, final responses, and state artifacts while maintaining fast, debuggable test suites through techniques like API mocking and containerized environments.

Evaluation-Driven LLM Production Workflows with Morgan Stanley and Grab Case Studies

OpenAI

OpenAI's applied evaluation team presented best practices for implementing LLMs in production through two case studies: Morgan Stanley's internal document search system for financial advisors and Grab's computer vision system for Southeast Asian mapping. Both companies started with simple evaluation frameworks using just 5 initial test cases, then progressively scaled their evaluation systems while maintaining CI/CD integration. Morgan Stanley improved their RAG system's document recall from 20% to 80% through iterative evaluation and optimization, while Grab developed sophisticated vision fine-tuning capabilities for recognizing road signs and lane counts in Southeast Asian contexts. The key insight was that effective evaluation systems enable rapid iteration cycles and clear communication between teams and external partners like OpenAI for model improvement.

Evaluations Driven Development for Production LLM Applications

Anaconda

Anaconda developed a systematic approach called Evaluations Driven Development (EDD) to improve their AI coding assistant's performance through continuous testing and refinement. Using their in-house "llm-eval" framework, they achieved dramatic improvements in their assistant's ability to handle Python debugging tasks, increasing success rates from 0-13% to 63-100% across different models and configurations. The case study demonstrates how rigorous evaluation, prompt engineering, and automated testing can significantly enhance LLM application reliability in production.

Evolution from Monolithic to Task-Oriented LLM Pipelines in a Developer Assistant Product

Outropy

The case study details how Outropy evolved their LLM inference pipeline architecture while building an AI-powered assistant for engineering leaders. They started with simple pipelines for daily briefings and context-aware features, but faced challenges with context windows, relevance, and error cascades. The team transitioned from monolithic pipelines to component-oriented design, and finally to task-oriented pipelines using Temporal for workflow management. The product successfully scaled to 10,000 users and expanded from a Slack-only tool to a comprehensive browser extension.

Evolution from Open-Ended LLM Agents to Guided Workflows

Lindy.ai

Lindy.ai evolved from an open-ended LLM agent platform to a more structured workflow-based approach, demonstrating how constraining LLM behavior through visual workflows and rails leads to more reliable and usable AI agents. The company found that by moving away from free-form prompts to guided, step-by-step workflows, they achieved better reliability and user adoption while maintaining the flexibility to handle complex automation tasks like meeting summaries, email processing, and customer support.

Evolution from Task-Specific Models to Multi-Agent Orchestration Platform

AI21

AI21 Labs evolved their production AI systems from task-specific models (2022-2023) to RAG-as-a-Service, and ultimately to Maestro, a multi-agent orchestration platform. The company identified that while general-purpose LLMs demonstrated impressive capabilities, they weren't optimized for specific business use cases that enterprises actually needed, such as contextual question answering and summarization. AI21 developed smaller language models fine-tuned for specific tasks, wrapped them with pre- and post-processing operations (including hallucination filters), and eventually built a comprehensive RAG system when customers struggled to identify relevant context from large document corpora. The Maestro platform emerged to handle complex multi-hop queries by automatically breaking them into subtasks, parallelizing execution, and orchestrating multiple agents and tools, achieving dramatically improved quality with full traceability for enterprise requirements.

Evolution of AI Agents: From Manual Workflows to End-to-End Training

OpenAI

OpenAI's journey in developing agentic products showcases the evolution from manually designed workflows with LLMs to end-to-end trained agents. The company has developed three main agentic products - Deep Research, Operator, and Codeex CLI - each addressing different use cases from web research to code generation. These agents demonstrate how end-to-end training with reinforcement learning enables better error recovery and more natural interaction compared to traditional manually designed workflows.

Evolution of Code Assistant Integration in a Cloud Development Platform

Val Town

Val Town's journey in implementing and evolving code assistance features showcases the challenges and opportunities in productionizing LLMs for code generation. Through iterative improvements and fast-following industry innovations, they progressed from basic ChatGPT integration to sophisticated features including error detection, deployment automation, and multi-file code generation, while addressing key challenges like generation speed and accuracy.

Evolution of Industrial AI: From Traditional ML to Multi-Agent Systems

Hitachi

Hitachi's journey in implementing AI across industrial applications showcases the evolution from traditional machine learning to advanced generative AI solutions. The case study highlights how they transformed from focused applications in maintenance, repair, and operations to a more comprehensive approach integrating LLMs, focusing particularly on reliability, small data scenarios, and domain expertise. Key implementations include repair recommendation systems for fleet management and fault tree extraction from manuals, demonstrating the practical challenges and solutions in industrial AI deployment.

Evolution of ML Model Deployment Infrastructure at Scale

Faire

Faire, a wholesale marketplace, evolved their ML model deployment infrastructure from a monolithic approach to a streamlined platform. Initially struggling with slow deployments, limited testing, and complex workflows across multiple systems, they developed an internal Machine Learning Model Management (MMM) tool that unified model deployment processes. This transformation reduced deployment time from 3+ days to 4 hours, enabled safe deployments with comprehensive testing, and improved observability while supporting various ML workloads including LLMs.

Evolution of ML Platform to Support GenAI Infrastructure

Lyft

Lyft's journey of evolving their ML platform to support GenAI infrastructure, focusing on how they adapted their existing ML serving infrastructure to handle LLMs and built new components for AI operations. The company transitioned from self-hosted models to vendor APIs, implemented comprehensive evaluation frameworks, and developed an AI assistants interface, while maintaining their established ML lifecycle principles. This evolution enabled various use cases including customer support automation and internal productivity tools.

Evolving a Conversational AI Platform for Production LLM Applications

AirBnB

AirBnB evolved their Automation Platform from a static workflow-based conversational AI system to a comprehensive LLM-powered platform. The new version (v2) combines traditional workflows with LLM capabilities, introducing features like Chain of Thought reasoning, robust context management, and a guardrails framework. This hybrid approach allows them to leverage LLM benefits while maintaining control over sensitive operations, ultimately enabling customer support agents to work more efficiently while ensuring safe and reliable AI interactions.

Evolving Agent Architecture Through Model Capability Improvements

Aomni

David from Aomni discusses how their company evolved from building complex agent architectures with multiple guardrails to simpler, more model-centric approaches as LLM capabilities improved. The company provides AI agents for revenue teams, helping automate research and sales workflows while keeping humans in the loop for customer relationships. Their journey demonstrates how LLMOps practices need to continuously adapt as model capabilities expand, leading to removal of scaffolding and simplified architectures.

Evolving GitHub Copilot with LLM Experimentation Across the Developer Lifecycle

GitHub

GitHub details their internal experimentation process with GPT-4 and other large language models to extend GitHub Copilot beyond code completion into multiple stages of the software development lifecycle. The GitHub Next research team received early access to GPT-4 and prototyped numerous AI-powered features including Copilot for Pull Requests, Copilot for Docs, Copilot for CLI, and GitHub Copilot Chat. Through iterative experimentation and internal testing with GitHub employees, the team discovered that user experience design, particularly how AI suggestions are presented and allow for developer control, is as critical as model accuracy for successful adoption. The experiments resulted in technical previews released in March 2023 that demonstrated AI integration across documentation, command-line interfaces, and pull request workflows, with key learnings around making AI outputs predictable, tolerable, steerable, and verifiable.

Evolving LLMOps Architecture for Enterprise Supplier Discovery

Various

A detailed case study of implementing LLMs in a supplier discovery product at Scoutbee, evolving from simple API integration to a sophisticated LLMOps architecture. The team tackled challenges of hallucinations, domain adaptation, and data quality through multiple stages: initial API integration, open-source LLM deployment, RAG implementation, and finally a comprehensive data expansion phase. The result was a production-ready system combining knowledge graphs, Chain of Thought prompting, and custom guardrails to provide reliable supplier discovery capabilities.

Evolving ML Infrastructure for Production Systems: From Traditional ML to LLMs

Doordash

A comprehensive overview of ML infrastructure evolution and LLMOps practices at major tech companies, focusing on Doordash's approach to integrating LLMs alongside traditional ML systems. The discussion covers how ML infrastructure needs to adapt for LLMs, the importance of maintaining guard rails, and strategies for managing errors and hallucinations in production systems, while balancing the trade-offs between traditional ML models and LLMs in production environments.

Evolving Quality Control AI Agents with LangGraph

Rexera

Rexera transformed their real estate transaction quality control process by evolving from single-prompt LLM checks to a sophisticated LangGraph-based solution. The company initially faced challenges with single-prompt LLMs and CrewAI implementations, but by migrating to LangGraph, they achieved significant improvements in accuracy, reducing false positives from 8% to 2% and false negatives from 5% to 2% through more precise control and structured decision paths.

Fact-Centric Legal Document Review with Custom AI Pipeline

Mary Technology

Mary Technology, a Sydney-based legal tech firm, developed a specialized AI platform to automate document review for law firms handling dispute resolution cases. Recognizing that standard large language models (LLMs) with retrieval-augmented generation (RAG) are insufficient for legal work due to their compression nature, lack of training data access for sensitive documents, and inability to handle the nuanced fact extraction required for litigation, Mary built a custom "fact manufacturing pipeline" that treats facts as first-class citizens. This pipeline extracts entities, events, actors, and issues with full explainability and metadata, allowing lawyers to verify information before using downstream AI applications. Deployed across major firms including A&O Shearman, the platform has achieved a 75-85% reduction in document review time and a 96/100 Net Promoter Score.

Fine-tuned LLM for Message Content Moderation and Trust & Safety

Thumbtack

Thumbtack implemented a fine-tuned LLM solution to enhance their message review system for detecting policy violations in customer-professional communications. After experimenting with prompt engineering and finding it insufficient (AUC 0.56), they successfully fine-tuned an LLM model achieving an AUC of 0.93. The production system uses a cost-effective two-tier approach: a CNN model pre-filters messages, with only suspicious ones (20%) processed by the LLM. Using LangChain for deployment, the system has processed tens of millions of messages, improving precision by 3.7x and recall by 1.5x compared to their previous system.

Fine-tuning Custom Embedding Models for Enterprise Search

Glean

Glean implements enterprise search and RAG systems by developing custom embedding models for each customer. They tackle the challenge of heterogeneous enterprise data by using a unified data model and fine-tuning embedding models through continued pre-training and synthetic data generation. Their approach combines traditional search techniques with semantic search, achieving a 20% improvement in search quality over 6 months through continuous learning from user feedback and company-specific language adaptation.

Fine-Tuning LLMs for Multi-Agent Orchestration in Code Generation

Cosine

Cosine, a company building enterprise coding agents, faced the challenge of deploying high-performance AI systems in highly constrained environments including on-premise and air-gapped deployments where large frontier models were not viable. They developed a multi-agent architecture using specialized orchestrator and worker models, leveraging model distillation, supervised fine-tuning, preference optimization, and reinforcement fine-tuning to create smaller models that could match or exceed the performance of much larger models. The result was a 31% performance increase on the SWE-bench Freelancer benchmark, 3X latency improvement, 60% reduction in GPU footprint, and 20% fewer errors in generated code, all while operating on as few as 4 H100 GPUs and maintaining full deployment flexibility across cloud, VPC, and on-premise environments.

Fine-tuning LLMs for Toxic Speech Classification in Gaming

Large Gaming Company

AWS Professional Services helped a major gaming company build an automated toxic speech detection system by fine-tuning Large Language Models. Starting with only 100 labeled samples, they experimented with different BERT-based models and data augmentation techniques, ultimately moving from a two-stage to a single-stage classification approach. The final solution achieved 88% precision and 83% recall while reducing operational complexity and costs compared to the initial proof of concept.

Five Critical Lessons for LLM Production Deployment

Amberflo

A former Apple messaging team lead shares five crucial insights for deploying LLMs in production, based on real-world experience. The presentation covers essential aspects including handling inappropriate queries, managing prompt diversity across different LLM providers, dealing with subtle technical changes that can impact performance, understanding the current limitations of function calling, and the critical importance of data quality in LLM applications.

Forward Deployed Engineering for Enterprise LLM Deployments

OpenAI

OpenAI's Forward Deployed Engineering (FDE) team embeds with enterprise customers to solve high-value problems using LLMs, aiming for production deployments that generate tens of millions to billions in value. The team works on complex use cases across industriesโ€”from wealth management at Morgan Stanley to semiconductor verification and automotive supply chain optimizationโ€”building custom solutions while extracting generalizable patterns that inform OpenAI's product development. Through an "eval-driven development" approach combining LLM capabilities with deterministic guardrails, the FDE team has grown from 2 to 52 engineers in 2025, successfully bridging the gap between AI capabilities and enterprise production requirements while maintaining focus on zero-to-one problem solving rather than long-term consulting engagements.

Forward Deployed Engineering: Bringing Enterprise LLM Applications to Production

OpenAI

OpenAI's Forward Deployed Engineering (FDE) team, led by Colin Jarvis, embeds with enterprise customers to solve high-value problems using LLMs and deliver production-grade AI applications. The team focuses on problems worth tens of millions to billions in value, working with companies across industries including finance (Morgan Stanley), manufacturing (semiconductors, automotive), telecommunications (T-Mobile, Klarna), and others. By deeply understanding customer domains, building evaluation frameworks, implementing guardrails, and iterating with users over months, the FDE team achieves 20-50% efficiency improvements and high adoption rates (98% at Morgan Stanley). The approach emphasizes solving hard, novel problems from zero-to-one, extracting learnings into reusable products and frameworks (like Swarm and Agent Kit), then scaling solutions across the market while maintaining strategic focus on product development over services revenue.

Four Critical Lessons from Building 50+ Global Chatbots: A Practitioner's Guide to Real-World Implementation

Campfire AI

Drawing from experience building over 50 chatbots across five continents, this case study outlines four crucial lessons for successful chatbot implementation. Key insights include treating chatbot projects as AI initiatives rather than traditional IT projects, anticipating out-of-scope queries through "99-intents", organizing intents hierarchically for more natural interactions, planning for unusual user expressions, and eliminating unhelpful "I don't understand" responses. The study emphasizes that successful chatbots require continuous optimization, aiming for 90-95% recognition rates for in-scope questions, while maintaining effective fallback mechanisms for edge cases.

Framework for Evaluating LLM Production Use Cases

Scale Venture Partners

Barak Turovsky, drawing from his experience leading Google Translate and other AI initiatives, presents a framework for evaluating LLM use cases in production. The framework analyzes use cases based on two key dimensions: accuracy requirements and fluency needs, along with consideration of stakes involved. This helps organizations determine which applications are suitable for current LLM deployment versus those that need more development. The framework suggests creative and workplace productivity applications are better immediate fits for LLMs compared to high-stakes information/decision support use cases.

From Mega-Prompts to Production: Lessons Learned Scaling LLMs in Enterprise Customer Support

GoDaddy

GoDaddy has implemented large language models across their customer support infrastructure, particularly in their Digital Care team which handles over 60,000 customer contacts daily through messaging channels. Their journey implementing LLMs for customer support revealed several key operational insights: the need for both broad and task-specific prompts, the importance of structured outputs with proper validation, the challenges of prompt portability across models, the necessity of AI guardrails for safety, handling model latency and reliability issues, the complexity of memory management in conversations, the benefits of adaptive model selection, the nuances of implementing RAG effectively, optimizing data for RAG through techniques like Sparse Priming Representations, and the critical importance of comprehensive testing approaches. Their experience demonstrates both the potential and challenges of operationalizing LLMs in a large-scale enterprise environment.

From Pilot to Profit: Three Enterprise GenAI Case Studies in Manufacturing, Aviation, and Telecommunications

Various

A comprehensive analysis of three enterprise GenAI implementations showcasing the journey from pilot to profit. The cases cover a top 10 automaker's use of GenAI for manufacturing maintenance, an aviation entertainment company's predictive maintenance system, and a telecom provider's sales automation solution. Each case study reveals critical "hidden levers" for successful GenAI deployment: adoption triggers, lean workflows, and revenue accelerators. The analysis demonstrates that while GenAI projects typically cost between $200K to $1M and take 15-18 months to achieve ROI, success requires careful attention to implementation details, user adoption, and business process integration.

From Simple RAG to Multi-Agent Architecture for Document Data Extraction

Box

Box evolved their document data extraction system from a simple single-model approach to a sophisticated multi-agent architecture to handle enterprise-scale unstructured data processing. The initial straightforward approach of preprocessing documents and feeding them to an LLM worked well for basic use cases but failed when customers presented complex challenges like 300-page documents, poor OCR quality, hundreds of extraction fields, and confidence scoring requirements. By redesigning the system using an agentic approach with specialized sub-agents for different tasks, Box achieved better accuracy, easier system evolution, and improved maintainability while processing millions of pages for enterprise customers.

GenAI Agent for Partner-Guest Messaging Automation

Booking.com

Booking.com developed a GenAI agent to assist accommodation partners in responding to guest inquiries more efficiently. The problem was that manual responses through their messaging platform were time-consuming, especially during busy periods, potentially leading to delayed responses and lost bookings. The solution involved building a tool-calling agent using LangGraph and GPT-4 Mini that can suggest relevant template responses, generate custom free-text answers, or abstain from responding when appropriate. The system includes guardrails for PII redaction, retrieval tools using embeddings for template matching, and access to property and reservation data. Early results show the system handles tens of thousands of daily messages, with pilots demonstrating 70% improvement in user satisfaction, reduced follow-up messages, and faster response times.

GenAI Agent for Partner-Guest Messaging in Travel Accommodation

Booking

Booking.com developed a GenAI agent to assist accommodation partners in responding to guest inquiries more efficiently. The problem addressed was the manual effort required by partners to search for and select response templates, particularly during busy periods, which could lead to delayed responses and potential booking cancellations. The solution is a tool-calling agent built with LangGraph and GPT-4 Mini that autonomously decides whether to suggest a predefined template, generate a custom response, or refrain from answering. The system retrieves relevant templates using semantic search with embeddings stored in Weaviate, accesses property and reservation data via GraphQL, and implements guardrails for PII redaction and topic filtering. Deployed as a microservice on Kubernetes with FastAPI, the agent processes tens of thousands of daily messages and achieved a 70% increase in user satisfaction in live pilots, along with reduced follow-up messages and faster response times.

GenAI Governance in Practice: Access Control, Data Quality, and Monitoring for Production LLM Systems

Xomnia

Martin Der, a data scientist at Xomnia, presents practical approaches to GenAI governance addressing the challenge that only 5% of GenAI projects deliver immediate ROI. The talk focuses on three key pillars: access and control (enabling self-service prototyping through tools like Open WebUI while avoiding shadow AI), unstructured data quality (detecting contradictions and redundancies in knowledge bases through similarity search and LLM-based validation), and LLM ops monitoring (implementing tracing platforms like LangFuse and creating dynamic golden datasets for continuous testing). The solutions include deploying Chrome extensions for workflow integration, API gateways for centralized policy enforcement, and developing a knowledge agent called "Genie" for internal use cases across telecom, healthcare, logistics, and maritime industries.

GenAI-Powered Automated Resource Leak Fixing in Java Codebases

Uber

Uber developed FixrLeak, a generative AI-based framework to automate the detection and repair of resource leaks in their Java codebase. Resource leaksโ€”where files, database connections, or streams aren't properly releasedโ€”cause performance degradation and system failures, and while tools like SonarQube detect them, fixing remains manual and error-prone. FixrLeak combines Abstract Syntax Tree (AST) analysis with generative AI (specifically OpenAI ChatGPT-4O) to produce accurate, idiomatic fixes following Java best practices like try-with-resources. When tested on 124 resource leaks in Uber's codebase, FixrLeak successfully automated fixes for 93 out of 102 eligible cases (after filtering out deprecated code and complex inter-procedural leaks), significantly reducing manual effort and improving code quality at scale.

GenAI-Powered Invoice Document Processing and Automation

Uber

Uber faced significant challenges processing a high volume of invoices daily from thousands of global suppliers, with diverse formats, 25+ languages, and varying templates requiring substantial manual intervention. The company developed TextSense, a GenAI-powered document processing platform that leverages OCR, computer vision, and large language models (specifically OpenAI GPT-4 after evaluating multiple options including fine-tuned Llama 2 and Flan T5) to automate invoice data extraction. The solution achieved 90% overall accuracy, reduced manual processing by 2x, cut average handling time by 70%, and delivered 25-30% cost savings compared to manual processes, while providing a scalable, configuration-driven platform adaptable to diverse document types.

GenAI-Powered Work Order Management System POC

NTT Data

An international infrastructure company partnered with NTT Data to evaluate whether GenAI could improve their work order management system that handles 500,000+ annual maintenance requests. The POC focused on automating classification, urgency assessment, and special handling requirements identification. Using a privately hosted LLM with company-specific knowledge base, the solution demonstrated improved accuracy and consistency in work order processing compared to the manual approach, while providing transparent reasoning for classifications.

Generating Production-Ready MCP Servers from OpenAPI Specifications

SpeakEasy

SpeakEasy tackled the challenge of enabling AI agents to interact with existing APIs by developing a tool that automatically generates Model Context Protocol (MCP) servers from OpenAPI documents. The company identified critical issues when generating over 50 production MCP servers for customers, including tool explosion (too many exposed operations), verbose descriptions consuming excessive tokens, complex data formats confusing LLMs, and inadequate access controls. Their solution involved a three-layer optimization approach: pruning OpenAPI documents with custom extensions, building intelligence into the generator to handle complex formats and streaming responses, and providing customization files for precise tool control. The result is production-ready MCP servers that balance LLM context window constraints with functional completeness, using techniques like scope-based access control, automatic data transformation, and optimized descriptions.

Generative AI Contact Center Solution with Amazon Bedrock and Claude

DoorDash

DoorDash implemented a generative AI-powered self-service contact center solution using Amazon Bedrock, Amazon Connect, and Anthropic's Claude to handle hundreds of thousands of daily support calls. The solution leverages RAG with Knowledge Bases for Amazon Bedrock to provide accurate responses to Dasher inquiries, achieving response latency of 2.5 seconds or less. The implementation reduced development time by 50% and increased testing capacity 50x through automated evaluation frameworks.

Generative AI Customer Service Agent Assist with RAG Implementation

Newday

NewDay, a UK financial services company handling 2.5 million customer calls annually, developed NewAssist, a real-time generative AI assistant to help customer service agents quickly find answers from nearly 200 knowledge articles. Starting as a hackathon project, the solution evolved from a voice assistant concept to a chatbot implementation using Amazon Bedrock and Claude 3 Haiku. Through iterative experimentation and custom data processing, the team achieved over 90% accuracy, reducing answer retrieval time from 90 seconds to 4 seconds while maintaining costs under $400 per month using a serverless AWS architecture.

Generative AI for Secondary Manuscript Generation in Life Sciences

Sorcero

Sorcero, a life sciences AI company, addresses the challenge of generating secondary manuscripts (particularly patient-reported outcomes manuscripts) from clinical study reports, a process that traditionally takes months and is costly, inconsistent, and delays patient access to treatments. Their solution uses generative AI to create foundational manuscript drafts within hours from source materials including clinical study reports, statistical analysis plans, and protocols. The system emphasizes trust, traceability, and regulatory compliance through rigorous validation frameworks, industry benchmarks (like CONSORT guidelines), comprehensive audit trails, and human oversight. The approach generates complete manuscripts with proper structure, figures, and tables while ensuring all assertions are traceable to source data, hallucinations are controlled, and industry standards are met.

Generative AI Implementation in Banking Customer Service and Knowledge Management

Various

Multiple banks, including Discover Financial Services, Scotia Bank, and others, share their experiences implementing generative AI in production. The case study focuses particularly on Discover's implementation of gen AI for customer service, where they achieved a 70% reduction in agent search time by using RAG and summarization for procedure documentation. The implementation included careful consideration of risk management, regulatory compliance, and human-in-the-loop validation, with technical writers and agents providing continuous feedback for model improvement.

Generative AI Integration in Financial Crime Detection Platform

NICE Actimize

NICE Actimize implemented generative AI into their financial crime detection platform "Excite" to create an automated machine learning model factory and enhance MLOps capabilities. They developed a system that converts natural language requests into analytical artifacts, helping analysts create aggregations, features, and models more efficiently. The solution includes built-in guardrails and validation pipelines to ensure safe deployment while significantly reducing time to market for analytical solutions.

GitHub Copilot Deployment at Scale: Enhancing Developer Productivity

Mercado Libre

Mercado Libre, Latin America's largest e-commerce platform, implemented GitHub Copilot across their development team of 9,000+ developers to address the need for more efficient development processes. The solution resulted in approximately 50% reduction in code writing time, improved developer satisfaction, and enhanced productivity by automating repetitive tasks. The implementation was part of a broader GitHub Enterprise strategy that includes security features and automated workflows.

Global News Organization's AI-Powered Content Production and Verification System

Reuters

Reuters has implemented a comprehensive AI strategy to enhance its global news operations, focusing on reducing manual work, augmenting content production, and transforming news delivery. The organization developed three key tools: a press release fact extraction system, an AI-integrated CMS called Leon, and a content packaging tool called LAMP. They've also launched the Reuters AI Suite for clients, offering transcription and translation capabilities while maintaining strict ethical guidelines around AI-generated imagery and maintaining journalistic integrity.

GPT Integration for SQL Stored Procedure Optimization in CI/CD Pipeline

Agoda

Agoda integrated GPT into their CI/CD pipeline to automate SQL stored procedure optimization, addressing a significant operational bottleneck where database developers were spending 366 man-days annually on manual optimization tasks. The system provides automated analysis and suggestions for query improvements, index recommendations, and performance optimizations, leading to reduced manual review time and improved merge request processing. While achieving approximately 25% accuracy, the solution demonstrates practical benefits in streamlining database development workflows despite some limitations in handling complex stored procedures.

GPT-4 Visit Notes System

Summer Health

Summer Health successfully deployed GPT-4 to revolutionize pediatric visit note generation, addressing both provider burnout and parent communication challenges. The implementation reduced note-writing time from 10 to 2 minutes per visit (80% reduction) while making medical information more accessible to parents. By carefully considering HIPAA compliance through BAAs and implementing robust clinical review processes, they demonstrated how LLMs can be safely and effectively deployed in healthcare settings. The case study showcases how AI can simultaneously improve healthcare provider efficiency and patient experience, while maintaining high standards of medical accuracy and regulatory compliance.

Graph RAG and Multi-Agent Systems for Legal Case Discovery and Document Analysis

WhyHow

WhyHow.ai, a legal technology company, developed a system that combines graph databases, multi-agent architectures, and retrieval-augmented generation (RAG) to identify class action and mass tort cases before competitors by scraping web data, structuring it into knowledge graphs, and generating personalized reports for law firms. The company claims to find potential cases within 15 minutes compared to the industry standard of 8-9 months, using a pipeline that processes complaints from various online sources, applies lawyer-specific filtering schemas, and generates actionable legal intelligence through automated multi-agent workflows backed by graph-structured knowledge representation.

Hardening AI Agents for E-commerce at Scale: Multi-Company Perspectives on RL Alignment and Reliability

Prosus / Microsoft / Inworld AI / IUD

This panel discussion features experts from Microsoft, Google Cloud, InWorld AI, and Brazilian e-commerce company IUD (Prosus partner) discussing the challenges of deploying reliable AI agents for e-commerce at scale. The panelists share production experiences ranging from Google Cloud's support ticket routing agent that improved policy adherence from 45% to 90% using DPO adapters, to Microsoft's shift away from prompt engineering toward post-training methods for all Copilot models, to InWorld AI's voice agent architecture optimization through cascading models, and IUD's struggles with personalization balance in their multi-channel shopping agent. Key challenges identified include model localization for UI elements, cost efficiency, real-time voice adaptation, and finding the right balance between automation and user control in commerce experiences.

Harness Engineering for Agentic Coding Systems

Langchain

LangChain improved their coding agent (deepagents-cli) from 52.8% to 66.5% on Terminal Bench 2.0, advancing from Top 30 to Top 5 performance, solely through harness engineering without changing the underlying model (gpt-5.2-codex). The solution focused on three key areas: system prompts emphasizing self-verification loops, enhanced tools and context injection to help agents understand their environment, and middleware hooks to detect problematic patterns like doom loops. The approach leveraged LangSmith tracing at scale to identify failure modes and iteratively optimize the harness through automated trace analysis, demonstrating that systematic engineering around the model can yield significant performance improvements in production agentic systems.

Healthcare Conversational AI and Multi-Model Cost Management in Production

Amberflo / Interactly.ai

A panel discussion featuring Interactly.ai's development of conversational AI for healthcare appointment management, and Amberflo's approach to usage tracking and cost management for LLM applications. The case study explores how Interactly.ai handles the challenges of deploying LLMs in healthcare settings with privacy and latency constraints, while Amberflo addresses the complexities of monitoring and billing for multi-model LLM applications in production.

Healthcare NLP Pipeline for HIPAA-Compliant Patient Data De-identification

Dandelion Health

Dandelion Health developed a sophisticated de-identification pipeline for processing sensitive patient healthcare data while maintaining HIPAA compliance. The solution combines John Snow Labs' Healthcare NLP with custom pre- and post-processing steps to identify and transform protected health information (PHI) in free-text patient notes. Their approach includes risk categorization by medical specialty, context-aware processing, and innovative "hiding in plain sight" techniques to achieve high-quality de-identification while preserving data utility for medical research.

Healthcare Patient Journey Analysis Platform with Multimodal LLMs

John Snow Labs

John Snow Labs developed a comprehensive healthcare analytics platform that uses specialized medical LLMs to process and analyze patient data across multiple modalities including unstructured text, structured EHR data, FIR resources, and images. The platform enables healthcare professionals to query patient histories and build cohorts using natural language, while handling complex medical terminology mapping and temporal reasoning. The system runs entirely within the customer's infrastructure for security, uses Kubernetes for deployment, and significantly outperforms GPT-4 on medical tasks while maintaining consistency and explainability in production.

HIPAA-Compliant LLM-Based Chatbot for Pharmacy Customer Service

Amazon

Amazon Pharmacy developed a HIPAA-compliant LLM-based chatbot to help customer service agents quickly retrieve and provide accurate information to patients. The solution uses a Retrieval Augmented Generation (RAG) pattern implemented with Amazon SageMaker JumpStart foundation models, combining embedding-based search and LLM-based response generation. The system includes agent feedback collection for continuous improvement while maintaining security and compliance requirements.

Human-AI Co-Annotation System for Efficient Data Labeling

Appen

Appen developed a hybrid approach combining LLMs with human annotators to address the growing challenges in data annotation for AI models. They implemented a co-annotation engine that uses model uncertainty metrics to efficiently route annotation tasks between LLMs and human annotators. Using GPT-3.5 Turbo for initial annotations and entropy-based confidence scoring, they achieved 87% accuracy while reducing costs by 62% and annotation time by 63% compared to purely human annotation, demonstrating an effective balance between automation and human expertise.

Hybrid AI System for Large-Scale Product Categorization

Walmart

Walmart developed Ghotok, an innovative AI system that combines predictive and generative AI to improve product categorization across their digital platforms. The system addresses the challenge of accurately mapping relationships between product categories and types across 400 million SKUs. Using an ensemble approach with both predictive and generative AI models, along with sophisticated caching and deployment strategies, Ghotok successfully reduces false positives and improves the efficiency of product categorization while maintaining fast response times in production.

Implementing Effective Safety Filters in a Game-Based LLM Application

JOBifAI

JOBifAI, a game leveraging LLMs for interactive gameplay, encountered significant challenges with LLM safety filters in production. The developers implemented a retry-based solution to handle both technical failures and safety filter triggers, achieving a 99% success rate after three retries. However, the experience highlighted fundamental issues with current safety filter implementations, including lack of transparency, inconsistent behavior, and potential cost implications, ultimately limiting the game's development from proof-of-concept to full production.

Implementing Evaluation Framework for MCP Server Tool Selection

Neon

Neon developed a comprehensive evaluation framework to test their Model Context Protocol (MCP) server's ability to correctly use database migration tools. The company faced challenges with LLMs selecting appropriate tools from a large set of 20+ tools, particularly for complex stateful workflows involving database migrations. Their solution involved creating automated evals using Braintrust, implementing "LLM-as-a-judge" scoring techniques, and establishing integrity checks to ensure proper tool usage. Through iterative prompt engineering guided by these evaluations, they improved their tool selection success rate from 60% to 100% without requiring code changes.

Implementing Generative AI in Manufacturing: A Multi-Use Case Study

Accenture

Accenture's Industry X division conducted extensive experiments with generative AI in manufacturing settings throughout 2023. They developed and validated nine key use cases including operations twins, virtual mentors, test case generation, and technical documentation automation. The implementations showed significant efficiency gains (40-50% effort reduction in some cases) while maintaining a human-in-the-loop approach. The study emphasized the importance of using domain-specific data, avoiding generic knowledge management solutions, and implementing multi-agent orchestrated solutions rather than standalone models.

Implementing LLM Fallback Mechanisms for Production Incident Response System

Vespper

When Vespper's incident response system faced an unexpected OpenAI account deactivation, they needed to quickly implement a fallback mechanism to maintain service continuity. Using LiteLLM's fallback feature, they implemented a solution that could automatically switch between different LLM providers. During implementation, they discovered and fixed a bug in LiteLLM's fallback handling, ultimately contributing the fix back to the open-source project while ensuring their production system remained operational.

Implementing LLMOps in Restricted Networks with Long-Running Evaluations

Microsoft

A case study detailing Microsoft's experience implementing LLMOps in a restricted network environment using Azure Machine Learning. The team faced challenges with long-running evaluations (6+ hours) and network restrictions, developing solutions including opt-out mechanisms for lengthy evaluations, implementing Git Flow for controlled releases, and establishing a comprehensive CI/CE/CD pipeline. Their approach balanced the needs of data scientists, engineers, and platform teams while maintaining security and evaluation quality.

Implementing MCP Gateway for Large-Scale LLM Integration Infrastructure

Anthropic

Anthropic faced the challenge of managing an explosion of LLM-powered services and integrations across their organization, leading to duplicated functionality and integration chaos. They solved this by implementing a standardized MCP (Model Context Protocol) gateway that provides a single point of entry for all LLM integrations, handling authentication, credential management, and routing to both internal and external services. This approach reduced engineering overhead, improved security by centralizing credential management, and created a "pit of success" where doing the right thing became the easiest thing to do for their engineering teams.

Implementing Multi-Agent RAG Architecture for Customer Care Automation

Doctolib

Doctolib evolved their customer care system from basic RAG to a sophisticated multi-agent architecture using LangGraph. The system employs a primary assistant for routing and specialized agents for specific tasks, incorporating safety checks and API integrations. While showing promise in automating customer support tasks like managing calendar access rights, they faced challenges with LLM behavior variance, prompt size limitations, and unstructured data handling, highlighting the importance of robust data structuration and API documentation for production deployment.

Implementing Product Comparison and Discovery Features with LLMs at Scale

idealo

idealo, a major European price comparison platform, implemented LLM-powered features to enhance product comparison and discovery. They developed two key applications: an intelligent product comparison tool that extracts and compares relevant attributes from extensive product specifications, and a guided product finder that helps users navigate complex product categories. The company focused on using LLMs as language interfaces rather than knowledge bases, relying on proprietary data to prevent hallucinations. They implemented thorough evaluation frameworks and A/B testing to measure business impact.

Implementing RAG and RagRails for Reliable Conversational AI in Insurance

GEICO

GEICO explored using LLMs for customer service chatbots through a hackathon initiative in 2023. After discovering issues with hallucinations and "overpromising" in their initial implementation, they developed a comprehensive RAG (Retrieval Augmented Generation) solution enhanced with their novel "RagRails" approach. This method successfully reduced incorrect responses from 12 out of 20 to zero in test cases by providing structured guidance within retrieved context, demonstrating how to safely deploy LLMs in a regulated insurance environment.

Improving Error Handling for AI Agents in Production

fewsats

A case study exploring how fewsats improved their domain management AI agents by enhancing error handling in their HTTP SDK. They discovered that while different LLM models (Claude, Llama 3, Replit Agent) could interact with their domain management API, the agents often failed due to incomplete error information. By modifying their SDK to surface complete error details instead of just status codes, they enabled the AI agents to self-correct and handle API errors more effectively, demonstrating the importance of error visibility in production LLM systems.

Improving LLM Accuracy and Evaluation in Enterprise Customer Analytics

Various

Echo.ai and Log10 partnered to solve accuracy and evaluation challenges in deploying LLMs for enterprise customer conversation analysis. Echo.ai's platform analyzes millions of customer conversations using multiple LLMs, while Log10 provides infrastructure for improving LLM accuracy through automated feedback and evaluation. The partnership resulted in a 20-point F1 score increase in accuracy and enabled Echo.ai to successfully deploy large enterprise contracts with improved prompt optimization and model fine-tuning.

Improving LLM Food Tracking Accuracy through Systematic Evaluation and Few-Shot Learning

Taralli

A case study of Taralli's food tracking application that initially used a naive approach with GPT-4-mini for calorie and nutrient estimation, resulting in significant accuracy issues. Through the implementation of systematic evaluation methods, creation of a golden dataset, and optimization using DSPy's BootstrapFewShotWithRandomSearch technique, they improved accuracy from 17% to 76% while maintaining reasonable response times with Gemini 2.5 Flash.

Improving Multilingual Search with Few-Shot LLM Translations

Delivery Hero

Delivery Hero operates across 68 countries and faced significant challenges with multilingual search due to dialectal variations, transliterations, spelling errors, and multiple languages within single markets. Traditional machine translation systems struggled with user intent and contextual nuances, leading to poor search results. The company implemented a solution using Large Language Models (LLMs), specifically Gemini, with few-shot learning to provide context-aware translations that handle regional dialects, correct spelling mistakes, and understand transliterations. By combining LLM-generated translations with Elastic Search and Vector Search in a hybrid approach, they achieved over 90% translation accuracy for restaurant queries and demonstrated positive improvements in user engagement through A/B testing, with the solution being rolled out to their Talabat and Hungerstation brands.

Infrastructure Challenges and Solutions for Agentic AI Systems in Production

Meta / Google / Monte Carlo / Microsoft

A panel discussion featuring experts from Meta, Google, Monte Carlo, and Microsoft examining the fundamental infrastructure challenges that arise when deploying autonomous AI agents in production environments. The discussion covers how agentic workloads differ from traditional software systems, requiring new approaches to networking, load balancing, caching, security, and observability, while highlighting specific challenges like non-deterministic behavior, massive search spaces, and the need for comprehensive evaluation frameworks to ensure reliable and secure AI agent operations at scale.

Infrastructure Noise in Agentic Coding Evaluations

Anthropic

Anthropic discovered that infrastructure configuration alone can produce differences in agentic coding benchmark scores that exceed the typical margins between top models on leaderboards. Through systematic experiments running Terminal-Bench 2.0 across six resource configurations on Google Kubernetes Engine, they found a 6 percentage point gap between the most- and least-resourced setups. The research revealed that while moderate resource headroom (up to 3x specifications) primarily improves infrastructure stability by preventing spurious failures, more generous allocations actively help agents solve problems they couldn't solve before. These findings challenge the notion that small leaderboard differences represent pure model capability measurements and led to recommendations for specifying both guaranteed allocations and hard kill thresholds, calibrating resource bands empirically, and treating resource configuration as a first-class experimental variable in LLMOps practices.

Integrating Gemini for Natural Language Analytics in IoT Fleet Management

Cox 2M

Cox 2M, facing challenges with a lean analytics team and slow insight generation (taking up to a week per request), partnered with Thoughtspot and Google Cloud to implement Gemini-powered natural language analytics. The solution reduced time to insights by 88% while enabling non-technical users to directly query complex IoT and fleet management data using natural language. The implementation includes automated insight generation, change analysis, and natural language processing capabilities.

Intelligent Document Processing for Education Quality Assessment Reports

BQA

BQA, Bahrain's Education and Training Quality Authority, faced challenges with manual review of self-evaluation reports from educational institutions. They implemented a solution using Amazon Bedrock and other AWS services to automate and streamline the analysis of these reports. The system leverages the Amazon Titan Express model for intelligent document processing, combining document analysis, summarization, and compliance checking. The solution achieved 70% accuracy in standards-compliant report generation and reduced evidence analysis time by 30%.

Interactive AI-Powered Chess Tutoring System

Interweb Alchemy

A chess tutoring application that leverages LLMs and traditional chess engines to provide real-time analysis and feedback during gameplay. The system combines GPT-4 mini for move generation with Stockfish for position evaluation, offering features like positional help, outcome analysis, and real-time commentary. The project explores the practical application of different LLM models for chess tutoring, focusing on helping beginners improve their game through interactive feedback and analysis.

Iterative Development Process for Production AI Features

Zapier

Zapier's journey in developing and deploying AI products demonstrates a pragmatic, iterative approach to LLMOps. Their methodology focuses on rapid prototyping with advanced models like GPT-4 Turbo and Claude Opus, followed by quick deployment of initial versions (even with sub-50% accuracy), systematic collection of user feedback, and establishment of comprehensive evaluation frameworks. This approach has enabled them to improve their AI products from sub-50% to over 90% accuracy within 2-3 months, while successfully managing costs and maintaining product quality.

Iterative Prompt Optimization and Model Selection for Nutritional Calorie Estimation

Taralli

Taralli, a calorie tracking application, demonstrates systematic LLM improvement through rigorous evaluation and prompt optimization. The developer addressed the challenge of accurate nutritional estimation by creating a 107-example evaluation dataset, testing multiple prompt optimization techniques (vanilla, few-shot bootstrapping, MIPROv2, and GEPA) across several models (Gemini 2.5 Flash, Gemini 3 Flash, and DeepSeek v3.2). Through this methodical approach, they achieved a 15% accuracy improvement by switching from Gemini 2.5 Flash to Gemini 3 Flash while using a few-shot learning approach with 16 examples, reaching 60% accuracy within a 10% calorie prediction threshold. The system was deployed with fallback model configurations and extended to support fully offline on-device inference for iOS.

Journey Towards Autonomous Network Operations with AI/ML and Dark NOC

BT

BT is undertaking a major transformation of their network operations, moving from traditional telecom engineering to a software-driven approach with the goal of creating an autonomous "Dark NOC" (Network Operations Center). The initiative focuses on handling massive amounts of network data, implementing AI/ML for automated analysis and decision-making, and consolidating numerous specialized tools into a comprehensive intelligent system. The project involves significant organizational change, including upskilling teams and partnering with AWS to build data foundations and AI capabilities for predictive maintenance and autonomous network management.

LangSmith Integration for Automated Feedback and Improved Iteration in SDLC

Factory

Factory AI implemented self-hosted LangSmith to address observability challenges in their SDLC automation platform, particularly for their Code Droid system. By integrating LangSmith with AWS CloudWatch logs and utilizing its Feedback API, they achieved comprehensive LLM pipeline monitoring, automated feedback collection, and streamlined prompt optimization. This resulted in a 2x improvement in iteration speed, 20% reduction in open-to-merge time, and 3x reduction in code churn.

Large Bank LLMOps Implementation: Lessons from Deutsche Bank and Others

Various

A discussion between banking technology leaders about their implementation of generative AI, focusing on practical applications, regulatory challenges, and strategic considerations. Deutsche Bank's CTO and other banking executives share their experiences in implementing gen AI across document processing, risk modeling, research analysis, and compliance use cases, while emphasizing the importance of responsible deployment and regulatory compliance.

Large Language Models for Retail Customer Feedback Analysis

Microsoft

A retail organization was facing challenges in analyzing large volumes of daily customer feedback manually. Microsoft implemented an LLM-based solution using Azure OpenAI to automatically extract themes, sentiments, and competitor comparisons from customer feedback. The system uses carefully engineered prompts and predefined themes to ensure consistent analysis, enabling the creation of actionable insights and reports at various organizational levels.

Large-Scale AI Assistant Deployment with Safety-First Evaluation Approach

Discord

Discord implemented Clyde AI, a chatbot assistant that was deployed to over 200 million users, focusing heavily on safety, security, and evaluation practices. The team developed a comprehensive evaluation framework using simple, deterministic tests and metrics, implemented through their open-source tool PromptFu. They faced unique challenges in preventing harmful content and jailbreaks, leading to innovative solutions in red teaming and risk assessment, while maintaining a balance between casual user interaction and safety constraints.

Large-Scale AI Red Teaming Competition Platform for Production Model Security

HackAPrompt, LearnPrompting

Sandra Fulof from HackAPrompt and LearnPrompting presents a comprehensive case study on developing the first AI red teaming competition platform and educational resources for prompt engineering in production environments. The case study covers the creation of LearnPrompting, an open-source educational platform that trained millions of users worldwide on prompt engineering techniques, and HackAPrompt, which ran the first prompt injection competition collecting 600,000 prompts used by all major AI companies to benchmark and improve their models. The work demonstrates practical challenges in securing LLMs in production, including the development of systematic prompt engineering methodologies, automated evaluation systems, and the discovery that traditional security defenses are ineffective against prompt injection attacks.

Large-Scale Enterprise Copilot Deployment: Lessons from Einstein Copilot Implementation

Salesforce

Salesforce shares their experience deploying Einstein Copilot, their conversational AI assistant for CRM, across their internal organization. The deployment process focused on starting simple with standard actions before adding custom capabilities, implementing comprehensive testing protocols, and establishing clear feedback loops. The rollout began with 100 sellers before expanding to thousands of users, resulting in significant time savings and improved user productivity.

Large-Scale Enterprise Data Platform Migration Using AI and Generative AI Automation

CommBank

Commonwealth Bank of Australia (CBA), Australia's largest bank serving 17.5 million customers, faced the challenge of modernizing decades of rich data spread across hundreds of on-premise source systems that lacked interoperability and couldn't scale for AI workloads. In partnership with HCL Tech and AWS, CBA migrated 61,000 on-premise data pipelines (equivalent to 10 petabytes of data) to an AWS-based data mesh ecosystem in 9 months. The solution leveraged AI and generative AI to transform code, check for errors, and test outputs with 100% accuracy reconciliation, conducting 229,000 tests across the migration. This enabled CBA to establish a federated data architecture called CommBank.data that empowers 40 lines of business with self-service data access while maintaining strict governance, positioning the bank for AI-driven innovation at scale.

Large-Scale LLM Batch Processing Platform for Millions of Prompts

Instacart

Instacart faced challenges processing millions of LLM calls required by various teams for tasks like catalog data cleaning, item enrichment, fulfillment routing, and search relevance improvements. Real-time LLM APIs couldn't handle this scale effectively, leading to rate limiting issues and high costs. To solve this, Instacart built Maple, a centralized service that automates large-scale LLM batch processing by handling batching, encoding/decoding, file management, retries, and cost tracking. Maple integrates with external LLM providers through batch APIs and an internal AI Gateway, achieving up to 50% cost savings compared to real-time calls while enabling teams to process millions of prompts reliably without building custom infrastructure.

Large-Scale Test Framework Migration Using LLMs

AirBnB

AirBnB successfully migrated 3,500 React component test files from Enzyme to React Testing Library (RTL) using LLMs, reducing what was estimated to be an 18-month manual engineering effort to just 6 weeks. Through a combination of systematic automation, retry loops, and context-rich prompts, they achieved a 97% automated migration success rate, with the remaining 3% completed manually using the LLM-generated code as a baseline.

Launching an MCP Server for AI-Powered Debugging and Development

Multiplayer

Multiplayer, a provider of full-stack session recording and debugging tools, launched a Model Context Protocol (MCP) server to connect their platform's engineering context with AI coding agents like Cursor, Claude Code, and Windsurf. The challenge was enabling AI agents to access session recordings, backend server calls, and debugging data to provide contextually-aware assistance for bug fixes and feature development. By designing use-case-driven MCP tools that abstract multiple API calls, Multiplayer created a streamlined integration that has shown good adoption among developers. The gradual rollout to power users revealed best practices such as keeping tools minimal and scoped, focusing on read-only operations for security, and providing human-readable data formats to LLMs.

Legacy PDF Document Processing with LLM

Five Sigma

The given text appears to be a PDF document with binary/encoded content that needs to be processed and analyzed. The case involves handling PDF streams, filters, and document structure, which could benefit from LLM-based processing for content extraction and understanding.

Lessons from Deploying an HR-Aware AI Assistant: Five Key Implementation Insights

Applaud

Applaud shares their experience implementing an AI assistant for HR service delivery, highlighting key challenges and solutions in areas including content management, personalization, testing methodologies, accuracy expectations, and continuous improvement. The case study explores practical solutions to common deployment challenges like content quality control, context-aware responses, testing for infinite possibilities, managing accuracy expectations, and post-deployment optimization.

Lessons from Enterprise LLM Deployment: Cross-functional Teams, Experimentation, and Security

Microsoft

A team of Microsoft engineers share their experiences helping strategic customers implement LLM solutions in production environments. They discuss the importance of cross-functional teams, continuous experimentation, RAG implementation challenges, and security considerations. The presentation emphasizes the need for proper LLMOps practices, including evaluation pipelines, guard rails, and careful attention to potential vulnerabilities like prompt injection and jailbreaking.

Lessons from Red Teaming 100+ Generative AI Products

Microsoft

Microsoft's AI Red Team (AIRT) conducted extensive red teaming operations on over 100 generative AI products to assess their safety and security. The team developed a comprehensive threat model ontology and leveraged both manual and automated testing approaches through their PyRIT framework. Through this process, they identified key lessons about AI system vulnerabilities, the importance of human expertise in red teaming, and the challenges of measuring responsible AI impacts. The findings highlight both traditional security risks and novel AI-specific attack vectors that need to be considered when deploying AI systems in production.

Lessons Learned from Deploying 30+ GenAI Agents in Production

Quic

Quic shares their experience deploying over 30 AI agents across various industries, focusing on customer experience and e-commerce applications. They developed a comprehensive approach to LLMOps that includes careful planning, persona development, RAG implementation, API integration, and robust testing and monitoring systems. The solution achieved 60% resolution of tier-one support issues with higher quality than human agents, while maintaining human involvement for complex cases.

Leveraging LangSmith for Debugging Tools & Actions in Production LLM Applications

Mendable

Mendable.ai enhanced their enterprise AI assistant platform with Tools & Actions capabilities, enabling automated tasks and API interactions. They faced challenges with debugging and observability of agent behaviors in production. By implementing LangSmith, they successfully debugged agent decision processes, optimized prompts, improved tool schema generation, and built evaluation datasets, resulting in a more reliable and efficient system that has already achieved $1.3 million in savings for a major tech company client.

Leveraging NLP and LLMs for Music Industry Royalty Recovery

Love Without Sound

Love Without Sound developed an AI-powered system to help the music industry recover lost royalties due to incorrect metadata and unauthorized usage. The solution combines NLP pipelines for metadata standardization, legal document processing, and is now expanding to include RAG-based querying and audio embedding models. The system processes billions of tracks, operates in real-time, and runs in a fully data-private environment, helping recover millions in revenue for artists.

LLM Evaluation Framework for Financial Crime Report Generation

Sumup

SumUp developed an LLM application to automate the generation of financial crime reports, along with a novel evaluation framework using LLMs as evaluators. The solution addresses the challenges of evaluating unstructured text output by implementing custom benchmark checks and scoring systems. The evaluation framework outperformed traditional NLP metrics and showed strong correlation with human reviewer assessments, while acknowledging and addressing potential LLM evaluator biases.

LLM Feature Extraction for Content Categorization and Search Query Understanding

Canva

Canva implemented LLMs as a feature extraction method for two key use cases: search query categorization and content page categorization. By replacing traditional ML classifiers with LLM-based approaches, they achieved higher accuracy, reduced development time from weeks to days, and lowered operational costs from $100/month to under $5/month for query categorization. For content categorization, LLM embeddings outperformed traditional methods in terms of balance, completion, and coherence metrics while simplifying the feature extraction process.

LLM Observability for Enhanced Audience Segmentation Systems

Acxiom

Acxiom developed an AI-driven audience segmentation system using LLMs but faced challenges in scaling and debugging their solution. By implementing LangSmith, they achieved robust observability for their LangChain-based application, enabling efficient debugging of complex workflows involving multiple LLM calls, improved audience segment creation, and better token usage optimization. The solution successfully handled conversational memory, dynamic updates, and data consistency requirements while scaling to meet growing user demands.

LLM Production Case Studies: Consulting Database Search, Automotive Showroom Assistant, and Banking Development Tools

Globant

A collection of LLM implementation case studies detailing challenges and solutions in various industries. Key cases include: a consulting firm's semantic search implementation for financial data, requiring careful handling of proprietary data and similarity definitions; an automotive company's showroom chatbot facing challenges with data consistency and hallucination control; and a bank's attempt to create a custom code copilot, highlighting the importance of clear requirements and technical understanding in LLM projects.

LLM Security: Discovering and Mitigating Repeated Token Attacks in Production Models

Dropbox

Dropbox's security research team discovered vulnerabilities in OpenAI's GPT-3.5 and GPT-4 models where repeated tokens could trigger model divergence and extract training data. They identified that both single-token and multi-token repetitions could bypass OpenAI's initial security controls, leading to potential data leakage and denial of service risks. The findings were reported to OpenAI, who subsequently implemented improved filtering mechanisms and server-side timeouts to address these vulnerabilities.

LLM Validation and Testing at Scale: GitLab's Comprehensive Model Evaluation Framework

Gitlab

GitLab developed a robust framework for validating and testing LLMs at scale for their GitLab Duo AI features. They created a Centralized Evaluation Framework (CEF) that uses thousands of prompts across multiple use cases to assess model performance. The process involves creating a comprehensive prompt library, establishing baseline model performance, iterative feature development, and continuous validation using metrics like Cosine Similarity Score and LLM Judge, ensuring consistent improvement while maintaining quality across all use cases.

LLM-as-a-Judge Framework for Automated LLM Evaluation at Scale

Booking.com

Booking.com developed a comprehensive framework to evaluate LLM-powered applications at scale using an LLM-as-a-judge approach. The solution addresses the challenge of evaluating generative AI applications where traditional metrics are insufficient and human evaluation is impractical. The framework uses a more powerful LLM to evaluate target LLM outputs based on carefully annotated "golden datasets," enabling continuous monitoring of production GenAI applications. The approach has been successfully deployed across multiple use cases at Booking.com, providing automated evaluation capabilities that significantly reduce the need for human oversight while maintaining evaluation quality.

LLM-Based Dasher Support Automation with RAG and Quality Controls

Doordash

DoorDash implemented an LLM-based chatbot system to improve their Dasher support automation, replacing a traditional flow-based system. The solution uses RAG (Retrieval Augmented Generation) to leverage their knowledge base, along with sophisticated quality control systems including LLM Guardrail for real-time response validation and LLM Judge for quality monitoring. The system successfully handles thousands of support requests daily while achieving a 90% reduction in hallucinations and 99% reduction in compliance issues.

LLM-Driven Developer Experience and Code Migrations at Scale

Uber

Uber's Developer Platform team explored three major initiatives using LLMs in production: a custom IDE coding assistant (which was later abandoned in favor of GitHub Copilot), an AI-powered test generation system called Auto Cover, and an automated Java-to-Kotlin code migration system. The team combined deterministic approaches with LLMs to achieve significant developer productivity gains while maintaining code quality and safety. They found that while pure LLM approaches could be risky, hybrid approaches combining traditional software engineering practices with AI showed promising results.

LLM-Enhanced Topic Modeling System for Qualitative Text Analysis

QualIT

QualIT developed a novel topic modeling system that combines large language models with traditional clustering techniques to analyze qualitative text data more effectively. The system uses LLMs to extract key phrases and employs a two-stage hierarchical clustering approach, demonstrating significant improvements over baseline methods with 70% topic coherence (vs 65% and 57% for benchmarks) and 95.5% topic diversity (vs 85% and 72%). The system includes safeguards against LLM hallucinations and has been validated through human evaluation.

LLM-Enhanced Trust and Safety Platform for E-commerce Content Moderation

Whatnot

Whatnot, a live shopping marketplace, implemented LLMs to enhance their trust and safety operations by moving beyond traditional rule-based systems. They developed a sophisticated system combining LLMs with their existing rule engine to detect scams, moderate content, and enforce platform policies. The system achieved over 95% detection rate of scam attempts with 96% precision by analyzing conversational context and user behavior patterns, while maintaining a human-in-the-loop approach for final decisions.

LLM-Powered 3D Model Generation for 3D Printing

Build Great AI

Build Great AI developed a prototype application that leverages multiple LLM models to generate 3D printable models from text descriptions. The system uses various models including LLaMA 3.1, GPT-4, and Claude 3.5 to generate OpenSCAD code, which is then converted to STL files for 3D printing. The solution demonstrates rapid prototyping capabilities, reducing design time from hours to minutes, while handling the challenges of LLMs' spatial reasoning limitations through multiple simultaneous generations and iterative refinement.

LLM-Powered Analysis of User Bug Reports for Product Quality Improvement

Meta

Meta's Facebook product team faced challenges in analyzing large volumes of unstructured user bug reports at scale using traditional methods. They developed an LLM-based system that classifies user feedback into predefined categories, monitors trends through automated dashboards, and performs root cause analysis to identify product issues. Through iterative prompt engineering and integration with data pipelines, the system successfully detected major outages in real-time, identified less visible bugs that might have been missed, and contributed to reducing overall bug reports by double digits over several months by enabling targeted product improvements and cross-functional collaboration.

LLM-Powered Customer Service Agent Copilot for E-commerce Support

Wayfair

Wayfair developed Wilma, an LLM-based copilot system to assist customer service agents in responding to customer inquiries about product issues. The system uses models like Gemini and GPT to draft contextual messages that agents can review and edit before sending. Through an iterative evolution from a single monolithic prompt to over 40 specialized prompt templates and multiple coordinated LLM calls, Wilma helps agents respond 12% faster while improving policy adherence by 2-5% depending on issue type. The system pulls real-time customer, order, and product data from Wayfair's systems to generate appropriate responses, with particular sophistication in handling complex resolution negotiation scenarios through a multi-LLM routing and analysis framework.

LLM-Powered Data Classification System for Enterprise-Scale Metadata Generation

Grab

Grab developed an automated data classification system using LLMs to replace manual tagging of sensitive data across their PetaByte-scale data infrastructure. They built an orchestration service called Gemini that integrates GPT-3.5 to classify database columns and generate metadata tags, significantly reducing manual effort in data governance. The system successfully processed over 20,000 data entities within a month of deployment, with 80% user satisfaction and minimal need for tag corrections.

LLM-Powered Data Labeling Quality Assurance System

Uber

Uber AI Solutions developed a production LLM-based quality assurance system called Requirement Adherence to improve data labeling accuracy for their enterprise clients. The system addresses the costly and time-consuming problem of post-labeling rework by identifying quality issues during the labeling process itself. It works in two phases: first extracting atomic rules from client Standard Operating Procedure (SOP) documents using LLMs with reflection capabilities, then performing real-time validation during the labeling process by routing different rule types to appropriately-sized models with optimization techniques like prefix caching. This approach resulted in an 80% reduction in required audits, significantly improving timelines and reducing costs while maintaining data privacy through stateless, privacy-preserving LLM calls.

LLM-Powered GraphQL Mock Data Generation for Developer Productivity

Airbnb

Airbnb developed an innovative solution to address the persistent challenge of creating and maintaining realistic GraphQL mock data for testing and prototyping. Engineers traditionally spent significant time manually writing and updating mock data, which would often drift out of sync with evolving queries. Airbnb introduced the @generateMock directive, which combines GraphQL schema validation, product context (including design mockups), and LLMs (specifically Gemini 2.5 Pro) to automatically generate type-safe, realistic mock data. The solution integrates seamlessly into their existing code generation workflow (Niobe CLI), keeping engineers in their local development loops. A companion @respondWithMock directive enables client engineers to prototype features before server implementations are complete. Since deployment, Airbnb engineers have generated and merged over 700 mocks across iOS, Android, and Web platforms, significantly reducing manual effort and accelerating development cycles.

LLM-Powered In-Tool Quality Validation for Data Labeling

Uber

Uber AI Solutions developed a Requirement Adherence system to address quality issues in data labeling workflows, which traditionally relied on post-labeling checks that resulted in costly rework and delays. The solution uses LLMs in a two-phase approach: first extracting atomic rules from Standard Operating Procedure (SOP) documents and categorizing them by complexity, then performing real-time validation during the labeling process within their uLabel tool. By routing different rule types to appropriate LLM models (non-reasoning models for deterministic checks, reasoning models for subjective checks) and leveraging techniques like prefix caching and parallel execution, the system achieved an 80% reduction in required audits while maintaining data privacy through stateless, privacy-preserving LLM calls.

LLM-Powered Investment Document Analysis and Processing

AngelList

AngelList transformed their investment document processing from manual classification to an automated system using LLMs. They initially used AWS Comprehend for news article classification but transitioned to OpenAI's models, which proved more accurate and cost-effective. They built Relay, a product that automatically extracts and organizes investment terms and company updates from documents, achieving 99% accuracy in term extraction while significantly reducing operational costs compared to manual processing.

LLM-Powered Multi-Tool Architecture for Oil & Gas Data Exploration

DXC

DXC developed an AI assistant to accelerate oil and gas data exploration by integrating multiple specialized LLM-powered tools. The solution uses a router to direct queries to specialized tools optimized for different data types including text, tables, and industry-specific formats like LAS files. Built using Anthropic's Claude on Amazon Bedrock, the system includes conversational capabilities and semantic search to help users efficiently analyze complex datasets, reducing exploration time from hours to minutes.

LLM-Powered Mutation Testing for Automated Test Generation

Meta

Meta developed ACH (Automated Compliance Hardening), an LLM-powered system that revolutionizes software testing by combining mutation-guided test generation with large language models. Traditional mutation testing required manual test writing and generated unrealistic faults, creating a labor-intensive process with no guarantees of catching relevant bugs. ACH addresses this by allowing engineers to describe bug concerns in plain text, then automatically generating both realistic code mutations (faults) and the tests needed to catch them. The system has been deployed across Meta's platforms including Facebook Feed, Instagram, Messenger, and WhatsApp, particularly for privacy compliance testing, marking the first large-scale industrial deployment combining LLM-based mutant and test generation with verifiable assurances that generated tests will catch the specified fault types.

LLM-Powered Product Attribute Extraction from Unstructured Marketplace Data

Etsy

Etsy faced the challenge of understanding and categorizing over 100 million unique, handmade items listed by 5 million sellers, where most product information existed only as unstructured text and images rather than structured attributes. The company deployed large language models to extract product attributes at scale from listing titles, descriptions, and photos, transforming unstructured data into structured attributes that could power search filters and product comparisons. The implementation increased complete attribute coverage from 31% to 91% in target categories, improved engagement with search filters, and increased overall post-click conversion rates, while establishing robust evaluation frameworks using both human-annotated ground truth and LLM-generated silver labels.

LLM-Powered Relevance Assessment for Search Results

Pinterest

Pinterest Search faced significant limitations in measuring search relevance due to the high cost and low availability of human annotations, which resulted in large minimum detectable effects (MDEs) that could only identify significant topline metric movements. To address this, they fine-tuned open-source multilingual LLMs on human-annotated data to predict relevance scores on a 5-level scale, then deployed these models to evaluate ranking results across A/B experiments. This approach reduced labeling costs dramatically, enabled stratified query sampling designs, and achieved an order of magnitude reduction in MDEs (from 1.3-1.5% down to โ‰ค0.25%), while maintaining strong alignment with human labels (73.7% exact match, 91.7% within 1 point deviation) and enabling rapid evaluation of 150,000 rows within 30 minutes on a single GPU.

LLM-Powered Voice Assistant for Restaurant Operations and Personalized Alcohol Recommendations

Doordash

DoorDash implemented two major LLM-powered features during their 2025 summer intern program: a voice AI assistant for verifying restaurant hours and personalized alcohol recommendations with carousel generation. The voice assistant replaced rigid touch-tone phone systems with natural language conversations, allowing merchants to specify detailed hours information in advance while maintaining backward compatibility with legacy infrastructure through factory patterns and feature flags. The alcohol recommendation system leveraged LLMs to generate personalized product suggestions and engaging carousel titles using chain-of-thought prompting and a two-stage generation pipeline. Both systems were integrated into production using DoorDash's existing frameworks, with the voice assistant achieving structured data extraction through prompt engineering and webhook processing, while the recommendations carousel utilized the company's Carousel Serving Framework and Discovery SDK for rapid deployment.

LLMOps Best Practices and Success Patterns Across Multiple Companies

HumanLoop

A comprehensive analysis of successful LLM implementations across multiple companies including Duolingo, GitHub, Fathom, and others, highlighting key patterns in team composition, evaluation strategies, and tooling requirements. The study emphasizes the importance of domain experts in LLMOps, proper evaluation frameworks, and the need for comprehensive logging and debugging tools, showcasing concrete examples of companies achieving significant ROI through proper LLMOps implementation.

LLMs for Cloud Incident Management and Root Cause Analysis

Microsoft

Microsoft Research explored using large language models (LLMs) to automate cloud incident management in Microsoft 365 services. The study focused on using GPT-3 and GPT-3.5 models to analyze incident reports and generate recommendations for root cause analysis and mitigation steps. Through rigorous evaluation of over 40,000 incidents across 1000+ services, they found that fine-tuned GPT-3.5 models significantly outperformed other approaches, with over 70% of on-call engineers rating the recommendations as useful (3/5 or better) in production settings.

LLMs for Enhanced Search Retrieval and Query Understanding

Doordash

Doordash implemented an advanced search system using LLMs to better understand and process complex food delivery search queries. They combined LLMs with knowledge graphs for query segmentation and entity linking, using retrieval-augmented generation (RAG) to constrain outputs to their controlled vocabulary. The system improved popular dish carousel trigger rates by 30%, increased whole page relevance by over 2%, and led to higher conversion rates while maintaining high precision in query understanding.

LLMs for Investigative Data Analysis in Journalism

ProPublica

ProPublica utilized LLMs to analyze a large database of National Science Foundation grants that were flagged as "woke" by Senator Ted Cruz's office. The AI helped journalists quickly identify patterns and assess why grants were flagged, while maintaining journalistic integrity through human verification. This approach demonstrated how AI can be used responsibly in journalism to accelerate data analysis while maintaining high standards of accuracy and accountability.

Long-Running Agent Harness for Multi-Context Software Development

Anthropic

Anthropic addressed the challenge of enabling AI coding agents to work effectively across multiple context windows when building complex software projects that span hours or days. The core problem was that agents would lose memory between sessions, leading to incomplete features, duplicated work, or premature project completion. Their solution involved a two-fold agent harness: an initializer agent that sets up structured environments (feature lists, git repositories, progress tracking files) on first run, and a coding agent that makes incremental progress session-by-session while maintaining clean code states. Combined with browser automation testing tools like Puppeteer, this approach enabled Claude to successfully build production-quality web applications through sustained, multi-session work.

Mainframe to Cloud Migration with AI-Powered Code Transformation

Mercedes-Benz

Mercedes-Benz faced the challenge of modernizing their Global Ordering system, a critical mainframe application handling over 5 million lines of code that processes every vehicle order and production request across 150 countries. The company partnered with Capgemini, AWS, and Rocket Software to migrate this system from mainframe to cloud using a hybrid approach: replatforming the majority of the application while using agentic AI (GenRevive tool) to refactor specific components. The most notable success was transforming 1.3 million lines of COBOL code in their pricing service to Java in just a few months, achieving faster performance, reduced mainframe costs, and a successful production deployment with zero incidents at go-live.

Managing Memory and Scaling Issues in Production AI Agent Systems

Gradient Labs

Gradient Labs experienced a series of interconnected production incidents involving their AI agent deployed on Google Cloud Run, starting with memory usage alerts that initially appeared to be memory leaks. The team discovered the root cause was Temporal workflow cache sizing issues causing container crashes, which they resolved by tuning cache parameters. However, this fix inadvertently caused auto-scaling problems that throttled their system's ability to execute activities, leading to increased latency. The incidents highlight the complex interdependencies in production AI systems and the need for careful optimization across all infrastructure layers.

Managing Model Updates and Robustness in Production Voice Assistants

Amazon (Alexa)

At Amazon Alexa, researchers tackled two key challenges in production NLP models: preventing performance degradation on common utterances during model updates and improving model robustness to input variations. They implemented positive congruent training to minimize negative prediction flips between model versions and used T5 models to generate synthetic training data variations, making the system more resilient to slight changes in user commands while maintaining consistent performance.

MCP Marketplace: Scaling AI Agents with Organizational Context

Intuit

Intuit, a global fintech platform, faced challenges scaling AI agents across their organization due to poor discoverability of Model Context Protocol (MCP) services, inconsistent security practices, and complex manual setup requirements. They built an MCP Marketplace, a centralized registry functioning as a package manager for AI capabilities, which standardizes MCP development through automated CI/CD pipelines for producers and provides one-click installation with enterprise-grade security for consumers. The platform leverages gRPC middleware for authentication, token management, and auditing, while collecting usage analytics to track adoption, service latency, and quality metrics, thereby democratizing secure context access across their developer organization.

MCP Protocol Development and Agent AI Foundation Launch

Anthropic / OpenAI / Goose

This podcast transcript covers the one-year journey of the Model Context Protocol (MCP) from its initial launch by Anthropic through to its donation to the newly formed Agent AI Foundation. The discussion explores how MCP evolved from a local-only protocol to support remote servers, authentication, and long-running tasks, addressing the fundamental challenge of connecting AI agents to external tools and data sources in production environments. The case study highlights extensive production usage of MCP both within Anthropic's internal systems and across major technology companies including OpenAI, Microsoft, and Google, demonstrating widespread adoption with millions of requests at scale. The formation of the Agent AI Foundation with founding members including Anthropic, OpenAI, and Block represents a significant industry collaboration to standardize agentic system protocols and ensure neutral governance of critical AI infrastructure.

Medical AI Assistant for Battlefield Care Using LLMs

Johns Hopkins

Johns Hopkins Applied Physics Laboratory (APL) is developing CPG-AI, a conversational AI system using Large Language Models to provide medical guidance to untrained soldiers in battlefield situations. The system interprets clinical practice guidelines and tactical combat casualty care protocols into plain English guidance, leveraging APL's RALF framework for LLM application development. The prototype successfully demonstrates capabilities in condition inference, natural dialogue, and algorithmic care guidance for common battlefield injuries.

Meta's Hardware Reliability Framework for AI Training and Inference at Scale

Meta

Meta addresses the critical challenge of hardware reliability in large-scale AI infrastructure, where hardware faults significantly impact training and inference workloads. The company developed comprehensive detection mechanisms including Fleetscanner, Ripple, and Hardware Sentinel to identify silent data corruptions (SDCs) that can cause training divergence and inference errors without obvious symptoms. Their multi-layered approach combines infrastructure strategies like reductive triage and hyper-checkpointing with stack-level solutions such as gradient clipping and algorithmic fault tolerance, achieving industry-leading reliability for AI operations across thousands of accelerators and globally distributed data centers.

Migration of Credit AI RAG Application from Multi-Cloud to AWS Bedrock

Octus

Octus, a leading provider of credit market data and analytics, migrated their flagship generative AI product Credit AI from a multi-cloud architecture (OpenAI on Azure and other services on AWS) to a unified AWS architecture using Amazon Bedrock. The migration addressed challenges in scalability, cost, latency, and operational complexity associated with running a production RAG application across multiple clouds. By leveraging Amazon Bedrock's managed services for embeddings, knowledge bases, and LLM inference, along with supporting AWS services like Lambda, S3, OpenSearch, and Textract, Octus achieved a 78% reduction in infrastructure costs, 87% decrease in cost per question, improved document sync times from hours to minutes, and better development velocity while maintaining SOC2 compliance and serving thousands of concurrent users across financial services clients.

Mission-Critical LLM Inference Platform Architecture

Baseten

Baseten has built a production-grade LLM inference platform focusing on three key pillars: model-level performance optimization, horizontal scaling across regions and clouds, and enabling complex multi-model workflows. The platform supports various frameworks including SGLang and TensorRT-LLM, and has been successfully deployed by foundation model companies and enterprises requiring strict latency, compliance, and reliability requirements. A key differentiator is their ability to handle mission-critical inference workloads with sub-400ms latency for complex use cases like AI phone calls.

ML-Powered Interactive Voice Response System for Customer Support

Airbnb

Airbnb transformed their traditional button-based Interactive Voice Response (IVR) system into an intelligent, conversational AI-powered solution that allows customers to describe their issues in natural language. The system combines automated speech recognition, intent detection, LLM-based article retrieval and ranking, and paraphrasing models to understand customer queries and either provide relevant self-service resources via SMS/app notifications or route calls to appropriate agents. This resulted in significant improvements including a reduction in word error rate from 33% to 10%, sub-50ms intent detection latency, increased user engagement with help articles, and reduced dependency on human customer support agents.

MLflow's Production-Ready Agent Framework and LLM Tracing

MLflow

MLflow addresses the challenges of moving LLM agents from demo to production by introducing comprehensive tooling for tracing, evaluation, and experiment tracking. The solution includes LLM tracing capabilities to debug black-box agent systems, evaluation tools for retrieval relevance and prompt engineering, and integrations with popular agent frameworks like Autogen and LlamaIndex. This enables organizations to effectively monitor, debug, and improve their LLM-based applications in production environments.

MLOps Evolution and LLM Integration at a Major Bank

Barclays

Discussion of MLOps practices and the evolution towards LLM integration at Barclays, focusing on the transition from traditional ML to GenAI workflows while maintaining production stability. The case study highlights the importance of balancing innovation with regulatory requirements in financial services, emphasizing ROI-driven development and the creation of reusable infrastructure components.

MLOps Maturity Levels and Enterprise Implementation Challenges

Various

The case study explores MLOps maturity levels (0-2) in enterprise settings, discussing how organizations progress from manual ML deployments to fully automated systems. It covers the challenges of implementing MLOps across different team personas (data scientists, ML engineers, DevOps), highlighting key considerations around automation, monitoring, compliance, and business value metrics. The study particularly emphasizes the differences between traditional ML and LLM deployments, and how organizations need to adapt their MLOps practices for each.

Model Context Protocol (MCP) Server for Error Monitoring and AI Observability

Sentry

Sentry developed a Model Context Protocol (MCP) server to enable Large Language Models (LLMs) to access real-time error monitoring and application performance data directly within AI-powered development environments. The solution addresses the challenge of LLMs lacking current context about application issues by providing 16 different tool calls that allow AI assistants to retrieve project information, analyze errors, and even trigger their AI agent Seer for root cause analysis, ultimately enabling more informed debugging and issue resolution workflows within modern development environments.

Model Context Protocol (MCP): A Universal Standard for AI Application Extensions

Anthropic

Anthropic developed the Model Context Protocol (MCP) to solve the challenge of extending AI applications with plugins and external functionality in a standardized way. Inspired by the Language Server Protocol (LSP), MCP provides a universal connector that enables AI applications to interact with various tools, resources, and prompts through a client-server architecture. The protocol has gained significant community adoption and contributions from companies like Shopify, Microsoft, and JetBrains, demonstrating its potential as an open standard for AI application integration.

Modernizing DevOps with Generative AI: Challenges and Best Practices in Production

Various (Bundesliga, Harness, Trice)

A panel of experts from various organizations discusses the current state and challenges of integrating generative AI into DevOps workflows and production environments. The discussion covers how companies are balancing productivity gains with security concerns, the importance of having proper testing and evaluation frameworks, and strategies for successful adoption of AI tools in production DevOps processes while maintaining code quality and security.

Multi-Agent AI Development Assistant for Clinical Trial Data Analysis

AstraZeneca

AstraZeneca developed a "Development Assistant" - an interactive AI agent that enables researchers to query clinical trial data using natural language. The system evolved from a single-agent approach to a multi-agent architecture using Amazon Bedrock, allowing users across different R&D domains to access insights from their 3DP data platform. The solution went from concept to production MVP in six months, addressing the challenge of scaling AI initiatives beyond isolated proof-of-concepts while ensuring proper governance and user adoption through comprehensive change management practices.

Multi-Agent AI System for Automated Test Case Generation in Payment Systems

Amazon AMET Payments

Amazon AMET Payments team developed SAARAM, a multi-agent AI solution using Amazon Bedrock with Claude Sonnet and Strands Agents SDK to automate test case generation for payment features across five Middle Eastern and North African countries. The manual process previously required one week of QA engineer effort per feature, consuming approximately one full-time employee annually. By implementing a human-centric approach that mirrors how experienced testers analyze requirements through specialized agents, the team reduced test case generation time from one week to hours while improving test coverage by 40% and reducing QA effort from 1.0 FTE to 0.2 FTE for validation activities.

Multi-Agent AI Systems for IT Operations and Incident Management

Kolomolo / DeLaval / Arelion

Kolomolo, an AWS advanced partner, implemented two distinct AI-powered solutions for their customers DeLaval (dairy farm equipment manufacturer) and Arelion (global internet infrastructure provider). For DeLaval, they built Unity Ops, a multi-agent system that automates incident response and root cause analysis across 3,000+ connected dairy farms, processing alerts from monitoring systems and generating enriched incident tickets automatically. For Arelion, they developed a hybrid ML/LLM solution to classify and extract critical information from thousands of maintenance notification emails from over 100 vendors, reducing manual classification workload by 80%. Both solutions achieved over 95% accuracy while maintaining cost efficiency through strategic use of classical ML techniques combined with selective LLM invocation, demonstrating significant operational efficiency improvements and enabling engineering teams to focus on higher-value tasks rather than reactive incident management.

Multi-Agent Architecture for Automated Advertising Media Planning

Spotify

Spotify faced a structural problem where multiple advertising buying channels (Direct, Self-Serve, Programmatic) relied on consolidated backend services but implemented fragmented, channel-specific workflow logic, creating duplicated decision-making and technical debt. To address this, they built "Ads AI," a multi-agent system using Google's Agent Development Kit (ADK) and Vertex AI that transforms media planning from a manual 15-30 minute process requiring 20+ form fields into a conversational interface that generates optimized, data-driven media plans in 5-10 seconds using 1-3 natural language messages. The system decomposes media planning into specialized agents (RouterAgent, GoalResolverAgent, AudienceResolverAgent, BudgetAgent, ScheduleAgent, and MediaPlannerAgent) that execute in parallel, leverage historical campaign performance data via function calling tools, and produce recommendations based on cost optimization, delivery rates, and budget matching heuristics.

Multi-Agent Architecture for Automating Commercial Real Estate Development Workflows

Build.inc

Build.inc developed a sophisticated multi-agent system called Dougie to automate complex commercial real estate development workflows, particularly for data center projects. Using LangGraph for orchestration, they implemented a hierarchical system of over 25 specialized agents working in parallel to perform land diligence tasks. The system reduces what traditionally took human consultants four weeks to complete down to 75 minutes, while maintaining high quality and depth of analysis.

Multi-Agent Customer Support Automation Platform for Fintech

Gradient Labs

Gradient Labs, an AI-native startup founded after ChatGPT's release, built a comprehensive customer support automation platform for fintech companies featuring three coordinated AI agents: inbound, outbound, and back office. The company addresses the challenge that traditional customer support automation only handles the "tip of the iceberg" - frontline queries - while missing the complex back-office tasks like fraud disputes and KYC compliance that consume most human agent time. Their solution uses a modular agent architecture with natural language procedures, deterministic skill-based orchestration, multi-layer guardrails for regulatory compliance, and sophisticated state management to handle complex, multi-turn conversations across email, chat, and voice channels. This approach enables end-to-end automation where agents coordinate seamlessly, such as an inbound agent receiving a dispute claim, triggering a back-office agent to process it, and an outbound agent proactively following up with customers for additional information.

Multi-Agent Customer Support System for E-commerce

Minimal

Minimal developed a sophisticated multi-agent customer support system for e-commerce businesses using LangGraph and LangSmith, achieving 80%+ efficiency gains in ticket resolution. Their system combines three specialized agents (Planner, Research, and Tool-Calling) to handle complex support queries, automate responses, and execute order management tasks while maintaining compliance with business protocols. The system successfully automates up to 90% of support tickets, requiring human intervention for only 10% of cases.

Multi-Agent DBT Development Workflow for Data Engineering Consulting

Mammoth Growth

Mammoth Growth, a boutique data consultancy specializing in marketing and customer data, developed a multi-agent AI system to automate DBT development workflows in response to data teams struggling to deliver analytics at the speed of business. The solution employs a team of specialized AI agents (orchestrator, analyst, architect, and analytics engineer) that leverage the DBT Model Context Protocol (MCP) to autonomously write, document, and test production-grade DBT code from detailed specifications. The system enabled the delivery of a complete enterprise-grade data lineage with 15 data models and two gold-layer models in just 3 weeks for a pilot client, compared to an estimated 10 weeks using traditional manual development approaches, while maintaining code quality standards through human-led requirements gathering and mandatory code review before production deployment.

Multi-Agent Financial Research and Question Answering System

Yahoo! Finance

Yahoo! Finance built a production-scale financial question answering system using multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock Agent Core and employs a supervisor-subagent pattern where specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at costs of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.

Multi-Agent Framework for Automated Telecom Change Request Processing

Totogi

Totogi, an AI company serving the telecommunications industry, faced challenges with traditional Business Support Systems (BSS) that required lengthy change request processingโ€”typically taking 7 days and involving costly, specialized engineering talent. To address this, Totogi developed BSS Magic, which combines a comprehensive telco ontology with a multi-agent AI framework powered by Anthropic Claude models on Amazon Bedrock. The solution orchestrates five specialized AI agents (Business Analyst, Technical Architect, Developer, QA, and Tester) through AWS Step Functions and Lambda, automating the entire software development lifecycle from requirements analysis to code generation and testing. In collaboration with the AWS Generative AI Innovation Center, Totogi achieved significant results: reducing change request processing time from 7 days to a few hours, achieving 76% code coverage in automated testing, and delivering production-ready telecom-grade code with minimal human intervention.

Multi-Agent GenAI System for Developer Support and Documentation

Northwestern Mutual

Northwestern Mutual implemented a GenAI-powered developer support system to address challenges with their internal developer support chat system, which suffered from long response times and repetitive basic queries. Using Amazon Bedrock Agents, they developed a multi-agent system that could automatically handle common developer support requests, documentation queries, and user management tasks. The system went from pilot to production in just three months and successfully reduced support engineer workload while maintaining strict compliance with internal security and risk management requirements.

Multi-Agent LLM Systems: Implementation Patterns and Production Case Studies

Nimble Gravity, Hiflylabs

A research study conducted by Nimble Gravity and Hiflylabs examining GenAI adoption patterns across industries, revealing that approximately 28-30% of GenAI projects successfully transition from assessment to production. The study explores various multi-agent LLM architectures and their implementation in production, including orchestrator-based, agent-to-agent, and shared message pool patterns, demonstrating practical applications like automated customer service systems that achieved significant cost savings.

Multi-Agent System Architecture for Autonomous Recruiting Agents

LinkedIn

LinkedIn developed a multi-agent system called Hiring Assistant to help recruiters work more efficiently, launching in October 2024. The system comprises four specialized agents (intake, sourcing, evaluation, and outreach) coordinated by a supervisor agent, with personalization driven by a preference model trained on recruiter behaviors. The presentation focuses on the operational challenges of scaling from specialized multi-agent systems to truly autonomous agents, addressing critical production issues including memory isolation across users, tool discovery and validation, safety considerations for destructive tool calls, and computational efficiency through complexity classification to route simpler tasks to completion models rather than expensive reasoning models.

Multi-Agent System for Prediction Market Resolution Using LangChain and LangGraph

Chaos Labs

Chaos Labs developed Edge AI Oracle, a decentralized multi-agent system built on LangChain and LangGraph for resolving queries in prediction markets. The system utilizes multiple LLM models from providers like OpenAI, Anthropic, and Meta to ensure objective and accurate resolutions. Through a sophisticated workflow of specialized agents including research analysts, web scrapers, and bias analysts, the system processes queries and provides transparent, traceable results with configurable consensus requirements.

Multi-Company Panel Discussion on Enterprise AI and Agentic AI Deployment Challenges

Glean / Deloitte / Docusign

This panel discussion at AWS re:Invent brings together practitioners from Glean, Deloitte, and DocuSign to discuss the practical realities of deploying AI and agentic AI systems in enterprise environments. The panelists explore challenges around organizational complexity, data silos, governance, agent creation and sharing, value measurement, and the tension between autonomous capabilities and human oversight. Key themes include the need for cross-functional collaboration, the importance of security integration from day one, the difficulty of measuring AI-driven productivity gains, and the evolution from individual AI experimentation to governed enterprise-wide agent deployment. The discussion emphasizes that successful AI transformation requires reimagining workflows rather than simply bolting AI onto legacy systems, and that business value should drive technical decisions rather than focusing solely on which LLM model to use.

Multi-Label Red Flag Detection System for Fraud Prevention

Feedzai

Feedzai developed ScamAlert, a generative AI-based system that moves beyond traditional binary scam classification to identify specific red flags in suspected fraud attempts. The system addresses the limitations of binary classifiers that only output risk scores without explanation by using multimodal LLMs to analyze screenshots of suspected scams (emails, text messages, listings) and identify observable warning signs like suspicious links, urgency tactics, or unusual communication channels. The team created a comprehensive benchmarking framework to evaluate multiple commercial multimodal models across four dimensions: red flag detection accuracy (precision/recall/F1), instruction adherence, cost, and latency. Their results showed significant performance variations across models, with GPT-5, Gemini 3 Pro, and Gemini 2.5 Pro leading in accuracy, though with notable tradeoffs in cost and latency, while also revealing instruction-following issues in some models that generated hallucinated red flags not in the predefined taxonomy.

Multi-Layered Caching Architecture for AI Metadata Service Scalability

Salesforce

Salesforce faced critical performance and reliability issues with their AI Metadata Service (AIMS), experiencing 400ms P90 latency bottlenecks and system outages during database failures that impacted all AI inference requests including Agentforce. The team implemented a multi-layered caching strategy with L1 client-side caching and L2 service-level caching, reducing metadata retrieval latency from 400ms to sub-millisecond response times and improving end-to-end request latency by 27% while maintaining 65% availability during backend outages.

Multi-Layered LLM Evaluation Pipeline for Production Content Generation

Treater

Treater developed a comprehensive evaluation pipeline for production LLM workflows that combines deterministic rule-based checks, LLM-based evaluations, automatic rewriting systems, and human edit analysis to ensure high-quality content generation at scale. The system addresses the challenge of maintaining consistent quality in LLM-generated outputs by implementing a multi-layered defense approach that catches errors early, provides interpretable feedback, and continuously improves through human feedback loops, resulting in under 2% failure rates at the deterministic level and measurable improvements in content acceptance rates over time.

Multi-LLM Orchestration for Product Matching at Scale

Mercado Libre

Mercado Libre tackled the classic e-commerce product-matching challenge where sellers create listings with inconsistent titles, attributes, and identifiers, making it difficult to identify identical products across the platform. The team developed a sophisticated multi-LLM orchestration system that evolved from a simple 2-node architecture to a complex 7-node pipeline, incorporating adaptive prompts, context-aware decision-making, and collaborative consensus mechanisms. Through systematic iteration and careful orchestration alongside existing ML models and embedding systems, they achieved human-level performance with 95% precision and over 50% recall at a cost-effective rate of less than $0.001 per request, enabling scalable autonomous product matching across millions of items for critical use cases including pricing, personalization, and inventory optimization.

Multi-Model LLM Orchestration with Rate Limit Management

Bito

Bito, an AI coding assistant startup, faced challenges with API rate limits while scaling their LLM-powered service. They developed a sophisticated load balancing system across multiple LLM providers (OpenAI, Anthropic, Azure) and accounts to handle rate limits and ensure high availability. Their solution includes intelligent model selection based on context size, cost, and performance requirements, while maintaining strict guardrails through prompt engineering.

Multi-Tenant MCP Server Authentication with Redis Session Management

BrainGrid

BrainGrid faced the challenge of transforming their Model Context Protocol (MCP) server from a local development tool into a production-ready, multi-tenant service that could be deployed to customers. The core problem was that serverless platforms like Cloud Run and Vercel don't maintain session state, causing users to re-authenticate repeatedly as instances scaled to zero or requests hit different instances. BrainGrid solved this by implementing a Redis-based session store with AES-256-GCM encryption, OAuth integration via WorkOS, and a fast-path/slow-path authentication pattern that caches validated JWT sessions. The solution reduced authentication overhead from 50-100ms per request to near-instantaneous for cached sessions, eliminated re-authentication fatigue, and enabled the MCP server to scale from single-user to multi-tenant deployment while maintaining security and performance.

Multilingual Content Navigation and Localization System

Intercom

YouTube, a Google company, implements a comprehensive multilingual navigation and localization system for its global platform. The source text appears to be in Dutch, demonstrating the platform's localization capabilities, though insufficient details are provided about the specific LLMOps implementation.

Multimodal Healthcare Data Integration with Specialized LLMs

John Snow Labs

John Snow Labs developed a comprehensive healthcare data integration system that leverages multiple specialized LLMs to unify and analyze patient data from various sources. The system processes structured, unstructured, and semi-structured medical data (including EHR, PDFs, HL7, FHIR) to create complete patient journeys, enabling natural language querying while maintaining consistency, accuracy, and scalability. The solution addresses key healthcare challenges like terminology mapping, date normalization, and data deduplication, all while operating within secure environments and handling millions of patient records.

Multimodal RAG Architecture Optimization for Production

Microsoft

Microsoft explored optimizing a production Retrieval-Augmented Generation (RAG) system that incorporates both text and image content to answer domain-specific queries. The team conducted extensive experiments on various aspects of the system including prompt engineering, metadata inclusion, chunk structure, image enrichment strategies, and model selection. Key improvements came from using separate image chunks, implementing a classifier for image relevance, and utilizing GPT-4V for enrichment while using GPT-4o for inference. The resulting system achieved better search precision and more relevant LLM-generated responses while maintaining cost efficiency.

National-Scale AI Deployment in UK Public Sector: Contact Center Automation and Citizen Information Retrieval

Capita / UK Department of Science

Two UK government organizations, Capita and the Government Digital Service (GDS), deployed large-scale AI solutions to serve millions of citizens. Capita implemented AWS Connect and Amazon Bedrock with Claude to automate contact center operations handling 100,000+ daily interactions, achieving 35% productivity improvements and targeting 95% automation by 2027. GDS launched GOV.UK Chat, the UK's first national-scale RAG implementation using Amazon Bedrock, providing instant access to 850,000+ pages of government content for 67 million citizens. Both organizations prioritized safety, trust, and human oversight while scaling AI solutions to handle millions of interactions with zero tolerance for errors in this high-stakes public sector environment.

Natural Language Analytics Assistant Using Amazon Bedrock Agents

Skai

Skai, an omnichannel advertising platform, developed Celeste, an AI agent powered by Amazon Bedrock Agents, to transform how customers access and analyze complex advertising data. The solution addresses the challenge of time-consuming manual report generation (taking days or weeks) by enabling natural language queries that automatically collect data from multiple sources, synthesize insights, and provide actionable recommendations. The implementation reduced report generation time by 50%, case study creation by 75%, and transformed weeks-long processes into minutes while maintaining enterprise-grade security and privacy for sensitive customer data.

Natural Language Analytics with Snowflake Cortex for Self-Service BI

Gitlab

GitLab implemented conversational analytics using Snowflake Cortex to enable non-technical business users to query structured data using natural language, eliminating the traditional dependency on data analysts and reducing analytics backlog. The solution evolved from a basic proof-of-concept with 60% accuracy to a production system achieving 85-95% accuracy for simple queries and 75% for complex queries, utilizing semantic models, prompt engineering, verified query feedback loops, and role-based access controls. The implementation reduced analytics requests by approximately 50% for some teams, decreased time-to-insight from weeks to seconds, and democratized data access while maintaining enterprise-grade security through Snowflake's native governance features.

Natural Language Interface for Healthcare Data Analytics using LLMs

Aachen Uniklinik / Aurea Software

A UK-based NLQ (Natural Language Query) company developed an AI-powered interface for Aachen Uniklinik to make intensive care unit databases more accessible to healthcare professionals. The system uses a hybrid approach combining vector databases, large language models, and traditional SQL to allow non-technical medical staff to query complex patient data using natural language. The solution includes features for handling dirty data, intent detection, and downstream complication analysis, ultimately improving clinical decision-making processes.

Natural Language Query Interface with Production LLM Integration

Honeycomb

Honeycomb implemented a natural language query interface for their observability platform to help users more easily analyze their production data. Rather than creating a chatbot, they focused on a targeted query translation feature using GPT-3.5, achieving a 94% success rate in query generation. The feature led to significant improvements in user activation metrics, with teams using the query assistant being 2-3x more likely to create complex queries and save them to boards.

Natural Language to SQL Query Generation at Scale

Uber

Uber developed QueryGPT to address the time-intensive process of SQL query authoring across its data platform, which handles 1.2 million interactive queries monthly. The system uses large language models, vector databases, and similarity search to generate complex SQL queries from natural language prompts, reducing query authoring time from approximately 10 minutes to 3 minutes. Starting from a hackathon prototype in May 2023, the system evolved through 20+ iterations into a production service featuring workspaces for domain-specific query generation, multiple specialized LLM agents (intent, table, and column pruning), and a comprehensive evaluation framework. The limited release achieved 300 daily active users with 78% reporting significant time savings, representing a major productivity gain particularly for Uber's Operations organization which contributes 36% of all queries.

Natural Language to SQL System with Production Safeguards for Contact Center Analytics

NICE

NICE implemented a system that allows users to query contact center metadata using natural language, which gets translated to SQL queries. The solution achieves 86% accuracy and includes critical production safeguards like tenant isolation, default time frames, data visualization, and context management for follow-up questions. The system also provides detailed explanations of query interpretations and results to users.

Next-Generation AI-Powered In-Vehicle Assistant with Hybrid Edge-Cloud Architecture

Bosch

Bosch Engineering, in collaboration with AWS, developed a next-generation conversational AI assistant for vehicles that operates through a hybrid edge-cloud architecture to address the limitations of traditional in-car voice assistants. The solution combines on-board AI components for simple queries with cloud-based processing for complex requests, enabling seamless integration with external APIs for services like restaurant booking, charging station management, and vehicle diagnostics. The system was implemented on Bosch's Software-Defined Vehicle (SDV) reference demonstrator platform, demonstrating capabilities ranging from basic vehicle control to sophisticated multi-service orchestration, with ongoing development focused on gradually moving more intelligence to the edge while maintaining robust connectivity fallback mechanisms.

NLP and Machine Learning for Early Sepsis Detection in Neonatal Care

Wroclaw Medical University

Wroclaw Medical University, in collaboration with the Institute of Mother and Child, is developing an AI-powered clinical decision support system to detect and manage sepsis in neonatal intensive care units. The system uses NLP to process unstructured medical records in real-time, combined with machine learning models to identify early sepsis symptoms before they become clinically apparent. Early results suggest the system can reduce diagnosis time from 24 hours to 2 hours while maintaining high sensitivity and specificity, potentially leading to reduced antibiotic usage and improved patient outcomes.

Observability Platform's Journey to Production GenAI Integration

New Relic

New Relic, a major observability platform processing 7 petabytes of data daily, implemented GenAI both internally for developer productivity and externally in their product offerings. They achieved a 15% increase in developer productivity through targeted GenAI implementations, while also developing sophisticated AI monitoring capabilities and natural language interfaces for their customers. Their approach balanced cost, accuracy, and performance through a mix of RAG, multi-model routing, and classical ML techniques.

On-Device Personalized Lexicon Learning for Mobile Keyboard

Grammarly

Grammarly developed an on-device machine learning model for their iOS keyboard that learns users' personal vocabulary and provides personalized autocorrection suggestions without sending data to the cloud. The challenge was to build a model that could distinguish between valid personal vocabulary and typos while operating within severe mobile constraints (under 5 MB RAM, minimal latency). The solution involved memory-mapped storage, time-based decay functions for vocabulary management, noisy input filtering, and edit-distance-based frequency thresholding to verify new words. Deployed to over 5 million devices, the model demonstrated measurable improvements with decreased rates of reverted suggestions and increased acceptance rates, while maintaining minimal memory footprint and responsive performance.

On-Device Unified Spelling and Grammar Correction Model

Grammarly

Grammarly developed a compact 1B-parameter on-device LLM to provide offline spelling and grammar correction capabilities, addressing the challenge of maintaining writing assistance functionality without internet connectivity. The team selected Llama as the base model, created comprehensive synthetic training data covering diverse writing styles and error types, and applied extensive optimizations including Grouped Query Attention, MLX framework integration for Apple silicon, and 4-bit quantization. The resulting model achieves 210 tokens/second on M2 Mac hardware while maintaining correction quality, demonstrating that multiple specialized models can be consolidated into a single efficient on-device solution that preserves user voice and delivers real-time feedback.

Open Source Code Generation Model Release and Production Deployment Considerations

Meta

Meta released Code Llama, a family of specialized large language models for code generation built on top of Llama 2, aiming to assist developers with coding tasks and lower barriers to entry for new programmers. The solution includes multiple model sizes (7B, 13B, 34B, and 70B parameters) with three variants: a foundational code model, a Python-specialized version, and an instruction-tuned variant, all trained on 500B-1T tokens of code and supporting up to 100,000 token contexts. Benchmark testing showed Code Llama 34B achieved 53.7% on HumanEval and 56.2% on MBPP, matching ChatGPT performance while being released under an open license for both research and commercial use, with extensive safety evaluations and red teaming conducted to address responsible AI concerns.

Open Source vs. Closed Source Agentic Stacks: Panel Discussion on Production Deployment Strategies

Various (Alation, GrottoAI, Nvidia, OLX)

This panel discussion brings together experts from Nvidia, OLX, Alation, and GrottoAI to discuss practical considerations for deploying agentic AI systems in production. The conversation explores when to choose open source versus closed source tooling, the challenges of standardizing agent frameworks across enterprise organizations, and the tradeoffs between abstraction levels in agent orchestration platforms. Key themes include starting with closed source models for rapid prototyping before transitioning to open source for compliance and cost reasons, the importance of observability across heterogeneous agent frameworks, the difficulty of enabling non-technical users to build agents, and the critical difference between internal tooling with lower precision requirements versus customer-facing systems demanding 95%+ accuracy.

Optimizing Agent Behavior and Support Operations with LangSmith Testing and Observability

Podium

Podium, a communication platform for small businesses, implemented LangSmith to improve their AI Employee agent's performance and support operations. Through comprehensive testing, dataset curation, and fine-tuning workflows, they achieved a 98.6% F1 score in response quality and reduced engineering intervention needs by 90%. The implementation enabled their Technical Product Specialists to troubleshoot issues independently and improved overall customer satisfaction.

Optimizing Agent Harness for OpenAI Codex Models in Production

Cursor

Cursor, an AI-powered code editor, details their approach to integrating OpenAI's GPT-5.1-Codex-Max model into their production agent harness. The problem involved adapting their existing agent framework to work optimally with Codex's specific training and behavioral patterns, which differed from other frontier models. Their solution included prompt engineering adjustments, tool naming conventions aligned with shell commands, reasoning trace preservation, strategic instructions to bias the model toward autonomous action, and careful message ordering to prevent contradictory instructions. The results demonstrated significant performance improvements, with their experiments showing that dropping reasoning traces caused a 30% performance degradation for Codex, highlighting the critical importance of their implementation decisions.

Optimizing Email Engagement Using LLMs and Rejection Sampling

Nextdoor

Nextdoor developed a novel system to improve email engagement by generating optimized subject lines using a combination of ChatGPT API and a custom reward model. The system uses prompt engineering to generate authentic subject lines without hallucination, and employs rejection sampling with a reward model to select the most engaging options. The solution includes robust engineering components for cost optimization and model performance maintenance, resulting in a 1% lift in sessions and 0.4% increase in Weekly Active Users.

Optimizing Production LLM Chatbot Performance Through Multi-Model Classification

IDIADA

IDIADA developed AIDA, an intelligent chatbot powered by Amazon Bedrock, to assist their workforce with various tasks. To optimize performance, they implemented specialized classification pipelines using different approaches including LLMs, k-NN, SVM, and ANN with embeddings from Amazon Titan and Cohere models. The optimized system achieved 95% accuracy in request routing and drove a 20% increase in team productivity, handling over 1,000 interactions daily.

Optimizing RAG Latency Through Model Racing and Self-Hosted Infrastructure

ElevenLabs

ElevenLabs faced significant latency challenges in their production RAG system, where query rewriting accounted for over 80% of RAG latency due to reliance on a single externally-hosted LLM. They redesigned their architecture to implement model racing, where multiple models (including self-hosted Qwen 3-4B and 3-30B-A3B models) process queries in parallel, with the first valid response winning. This approach reduced median RAG latency from 326ms to 155ms (a 50% improvement), while also improving system resilience by providing fallbacks during provider outages and reducing dependency on external services.

Optimizing RAG Systems: Lessons from Production

AWS GenAIIC

AWS GenAIIC shares comprehensive lessons learned from implementing Retrieval-Augmented Generation (RAG) systems across multiple industries. The case study covers key challenges in RAG implementation and provides detailed solutions for improving retrieval accuracy, managing context, and ensuring response reliability. Solutions include hybrid search techniques, metadata filtering, query rewriting, and advanced prompting strategies to reduce hallucinations.

Optimizing RAG-based Search Results for Production: A Journey from POC to Production

Statista

Statista, a global data platform, developed and optimized a RAG-based AI search system to enhance their platform's search capabilities. Working with Urial Labs and Talent Formation, they transformed a basic prototype into a production-ready system that improved search quality by 140%, reduced costs by 65%, and decreased latency by 10%. The resulting Research AI product has seen growing adoption among paying customers and demonstrates superior performance compared to general-purpose LLMs for domain-specific queries.

Optimizing Security Incident Response with LLMs at Google

Google

Google implemented LLMs to streamline their security incident response workflow, particularly focusing on incident summarization and executive communications. They used structured prompts and careful input processing to generate high-quality summaries while ensuring data privacy and security. The implementation resulted in a 51% reduction in time spent on incident summaries and 53% reduction in executive communication drafting time, while maintaining or improving quality compared to human-written content.

Overcoming LLM Production Deployment Challenges

Neeva

A comprehensive analysis of the challenges and solutions in deploying LLMs to production, presented by a machine learning expert from Neeva. The presentation covers both infrastructural challenges (speed, cost, API reliability, evaluation) and output-related challenges (format variability, reproducibility, trust and safety), along with practical solutions and strategies for successful LLM deployment, emphasizing the importance of starting with non-critical workflows and planning for scale.

Panel Discussion on LLM Evaluation and Production Deployment Best Practices

Various

Industry experts from Gantry, Structured.ie, and NVIDIA discuss the challenges and approaches to evaluating LLMs in production. They cover the transition from traditional ML evaluation to LLM evaluation, emphasizing the importance of domain-specific benchmarks, continuous monitoring, and balancing automated and human evaluation methods. The discussion highlights how LLMs have lowered barriers to entry while creating new challenges in ensuring accuracy and reliability in production deployments.

Panel Discussion on LLMOps Challenges: Model Selection, Ethics, and Production Deployment

Google, Databricks,

A panel discussion featuring leaders from various AI companies discussing the challenges and solutions in deploying LLMs in production. Key topics included model selection criteria, cost optimization, ethical considerations, and architectural decisions. The discussion highlighted practical experiences from companies like Interact.ai's healthcare deployment, Inflection AI's emotionally intelligent models, and insights from Google and Databricks on responsible AI deployment and tooling.

Panel Discussion: Best Practices for LLMs in Production

Various

A panel of industry experts from companies including Titan ML, YLabs, and Outer Bounds discuss best practices for deploying LLMs in production. They cover key challenges including prototyping, evaluation, observability, hardware constraints, and the importance of iteration. The discussion emphasizes practical advice for teams moving from prototype to production, highlighting the need for proper evaluation metrics, user feedback, and robust infrastructure.

Panel Discussion: Scaling Generative AI in Enterprise - Challenges and Best Practices

Various

A panel discussion featuring leaders from Google Cloud AI, Symbol AI, Chain ML, and Deloitte discussing the adoption, scaling, and implementation challenges of generative AI across different industries. The panel explores key considerations around model selection, evaluation frameworks, infrastructure requirements, and organizational readiness while highlighting practical approaches to successful GenAI deployment in production.

Parallel Asynchronous AI Coding Agents for Development Workflows

Google

Google Labs introduced Jules, an asynchronous coding agent designed to execute development tasks in parallel in the background while developers focus on higher-value work. The product addresses the challenge of serial development workflows by enabling developers to spin up multiple cloud-based agents simultaneously to handle tasks like SDK updates, testing, accessibility audits, and feature development. Launched two weeks prior to the presentation, Jules had already generated 40,000 public commits. The demonstration showcased how a developer could parallelize work on a conference schedule website by simultaneously running multiple test framework implementations, adding features like calendar integration and AI summaries, while conducting accessibility and security auditsโ€”all managed through a VM-based cloud infrastructure powered by Gemini 2.5 Pro.

PerfInsights: AI-Powered Performance Optimization for Go Services

Uber

Uber developed PerfInsights to address the unsustainable compute costs of their Go services, where the top 10 services alone accounted for multi-million dollars in monthly compute spend. The solution combines runtime profiling with GenAI-powered static analysis to automatically detect performance antipatterns in Go code, validate findings through LLM juries and rule-based checking (LLMCheck), and generate optimization recommendations. Results include a 93% reduction in time required to detect and fix performance issues (from 14.5 hours to 1 hour), over 80% reduction in false positives, hundreds of merged optimization diffs, and a 33.5% reduction in detected antipatterns over four months, translating to approximately 3,800 hours of engineering time saved annually.

Personalized Meal Plan Generator with LLM-Powered Recommendations

Cherrypick

Cherrypick, a meal planning service, launched an LLM-powered meal generator to create personalized meal plans with natural language explanations for recipe selections. The company faced challenges around cost management, interface design, and output reliability when moving from a traditional rule-based system to an LLM-based approach. By carefully constraining the problem space, avoiding chatbot interfaces in favor of structured interactions, implementing multi-layered evaluation frameworks, and working with rather than against model randomness, they achieved significant improvements: customers changed their plans 30% less and used plans in their baskets 14% more compared to the previous system.

Pitfalls and Best Practices for Production LLM Applications

Humanloop

A comprehensive overview from Human Loop's experience helping hundreds of companies deploy LLMs in production. The talk covers key challenges and solutions around evaluation, prompt management, optimization strategies, and fine-tuning. Major lessons include the importance of objective evaluation, proper prompt management infrastructure, avoiding premature optimization with agents/chains, and leveraging fine-tuning effectively. The presentation emphasizes taking lessons from traditional software engineering while acknowledging the unique needs of LLM applications.

Pivoting from GPU Infrastructure to Building an AI-Powered Development Environment

Windsurf

Windsurf began as a GPU virtualization company but pivoted in 2022 when they recognized the transformative potential of large language models. They developed an AI-powered development environment that evolved from a VS Code extension to a full-fledged IDE, incorporating advanced code understanding and generation capabilities. The product now serves hundreds of thousands of daily active users, including major enterprises, and has achieved significant success in automating software development tasks while maintaining high precision through sophisticated evaluation systems.

Practical Challenges in Building Production RAG Systems

Prolego

A detailed technical discussion between Prolego engineers about the practical challenges of implementing Retrieval Augmented Generation (RAG) systems in production. The conversation covers key challenges including document processing, chunking strategies, embedding techniques, and evaluation methods. The team shares real-world experiences about how RAG implementations differ from tutorial examples, particularly in handling complex document structures and different data formats.

Practical Lessons from Deploying LLMs in Production at Scale

Mercado Libre

Mercado Libre explored multiple production applications of Large Language Models across their e-commerce and technology platform, tackling challenges in knowledge retrieval, documentation generation, and natural language processing. The company implemented a RAG system for developer documentation using Llama Index, automated documentation generation for thousands of database tables, and built natural language input interpretation systems using function calling. Through iterative development, they learned critical lessons about the importance of underlying data quality, prompt engineering iteration, quality assurance for generated outputs, and the necessity of simplifying tasks for LLMs through proper data preprocessing and structured output formats.

Practical Lessons Learned from Building and Deploying GenAI Applications

Bolbeck

A comprehensive overview of lessons learned from building GenAI applications over 1.5 years, focusing on the complexities and challenges of deploying LLMs in production. The presentation covers key aspects of LLMOps including model selection, hosting options, ensuring response accuracy, cost considerations, and the importance of observability in AI applications. Special attention is given to the emerging role of AI agents and the critical balance between model capability and operational costs.

Practical LLM Deployment: From Evaluation to Fine-tuning

Parlance Labs

A comprehensive discussion of LLM deployment challenges and solutions across multiple industries, focusing on practical aspects like evaluation, fine-tuning, and production deployment. The case study covers experiences from GitHub's Copilot development, real estate CRM implementation, and consulting work at Parlance Labs, highlighting the importance of rigorous evaluation, data inspection, and iterative development in LLM deployments.

Pragmatic Product-Led Approach to LLM Integration and Prompt Engineering

LinkedIn

Pan Cha, Senior Product Manager at LinkedIn, shares insights on integrating LLMs into products effectively. He advocates for a pragmatic approach: starting with simple implementations using existing LLM APIs to validate use cases, then iteratively improving through robust prompt engineering and evaluation. The focus is on solving real user problems rather than adding AI for its own sake, with particular attention to managing user trust and implementing proper evaluation frameworks.

Privacy-Preserving LLM Usage Analysis System for Production AI Safety

Anthropic

Anthropic developed Clio, a privacy-preserving analysis system to understand how their Claude AI models are used in production while maintaining strict user privacy. The system performs automated clustering and analysis of conversations to identify usage patterns, detect potential misuse, and improve safety measures. Initial analysis of 1 million conversations revealed insights into usage patterns across different languages and domains, while helping identify both false positives and negatives in their safety systems.

Privacy-Preserving University Chatbot with LiteLLM Proxy for Multi-Model Governance and Cost Control

Unnamed private university

A private university sought to implement a privacy-preserving chatbot accessible to students and employees with requirements for model flexibility, potential self-hosting, and budget control. The solution leveraged LiteLLM's proxy server as an OpenAI-compatible gateway to manage multiple LLM providers, implement automatic cost tracking and budgeting per user/team, handle load balancing across model instances, and provide a unified API. While the system successfully delivered basic cost control and multi-provider support, the implementation revealed limitations in handling complex custom budgeting requirements, provider-specific features, and stability issues with newer features, requiring workarounds and custom implementations for advanced use cases.

Private Equity AI Transformation: Lessons from Portfolio Companies

PwC / Warburg Pincus / Abrigo

This panel discussion featuring executives from PwC, Warburg Pincus, Abrigo (a Carlyle portfolio company), and AWS explores the practical implementation of generative AI and LLMs in production across private equity portfolio companies. The conversation covers the journey from the ChatGPT launch in late 2022 through 2025, addressing real-world challenges including prioritization, talent gaps, data readiness, and organizational alignment. Key themes include starting with high-friction business problems rather than technology-first approaches, the importance of leadership alignment over technical infrastructure, rapid experimentation cycles, and the shift from viewing AI as optional to mandatory in investment diligence. The panelists emphasize practical successes such as credit memo generation, fraud alert summarization, loan workflow optimization, and e-commerce catalog enrichment, while cautioning against over-hyped transformation projects and highlighting the need for organizational cultural change alongside technical implementation.

Production Agents: Real-world Implementations of LLM-powered Autonomous Systems

Various

A panel discussion featuring three practitioners implementing LLM-powered agents in production: Sam's personal assistant with real-time feedback and router agents, Div's browser automation system Melton with reliability and monitoring features, and Devin's GitHub repository assistant that helps with code understanding and feature requests. Each presenter shared their architecture choices, testing strategies, and approaches to handling challenges like latency, reliability, and model selection in production environments.

Production AI Agents for Accounting Automation: Engineering Process Daemons at Scale

Digits

Digits, an AI-native accounting platform, shares their experience running AI agents in production for over 2 years, addressing real-world challenges in deploying LLM-based systems. The team reframes "agents" as "process daemons" to set appropriate expectations and details their implementation across three use cases: vendor data enrichment, client onboarding, and complex query handling. Their solution emphasizes building lightweight custom infrastructure over dependency-heavy frameworks, reusing existing APIs as agent tools, implementing comprehensive observability with OpenTelemetry, and establishing robust guardrails. The approach has enabled reliable automation while maintaining transparency, security, and performance through careful engineering rather than relying on framework abstractions.

Production AI Agents with Dynamic Planning and Reactive Evaluation

Hex

Hex successfully implemented AI agents in production for data science notebooks by developing a unique approach to agent orchestration. They solved key challenges around planning, tool usage, and latency by constraining agent capabilities, building a reactive DAG structure, and optimizing context windows. Their success came from iteratively developing individual capabilities before combining them into agents, keeping humans in the loop, and maintaining tight feedback cycles with users.

Production AI Deployment: Lessons from Real-World Agentic AI Systems

Databricks / Various

This case study presents lessons learned from deploying generative AI applications in production, with a specific focus on Flo Health's implementation of a women's health chatbot on the Databricks platform. The presentation addresses common failure points in GenAI projects including poor constraint definition, over-reliance on LLM autonomy, and insufficient engineering discipline. The solution emphasizes deterministic system architecture over autonomous agents, comprehensive observability and tracing, rigorous evaluation frameworks using LLM judges, and proper DevOps practices. Results demonstrate that successful production deployments require treating agentic AI as modular system architectures following established software engineering principles rather than monolithic applications, with particular emphasis on cost tracking, quality monitoring, and end-to-end deployment pipelines.

Production Deployment Challenges and Infrastructure Gaps for Multi-Agent AI Systems

GetOnStack

GetOnStack's team deployed a multi-agent LLM system for market data research that initially cost $127 weekly but escalated to $47,000 over four weeks due to an infinite conversation loop between agents running undetected for 11 days. This experience exposed critical gaps in production infrastructure for multi-agent systems using Agent-to-Agent (A2A) communication and Anthropic's Model Context Protocol (MCP). In response, the company spent six weeks building comprehensive production infrastructure including message queues, monitoring, cost controls, and safeguards. GetOnStack is now developing a platform to provide one-command deployment and production-ready infrastructure specifically designed for multi-agent systems, aiming to help other teams avoid similar costly production failures.

Production Deployment of Toqan Data Analyst Agent: From Prototype to Production Scale

Toqan

Toqan developed and deployed a data analyst agent that allows users to ask questions in natural language and receive SQL-generated answers with visualizations. The team faced significant challenges transitioning from a working prototype to a production system serving hundreds of users, including behavioral inconsistencies, infinite loops, and unreliable outputs. They solved these issues through four key approaches: implementing deterministic workflows for predictable behaviors, leveraging domain experts for setup and monitoring, building resilient systems to handle edge cases and abuse, and optimizing agent tools to reduce complexity. The result was a stable production system that successfully scaled to serve hundreds of users with improved reliability and user experience.

Production Evolution of an AI-Powered Medical Consultation Assistant

Doctolib

Doctolib developed and deployed an AI-powered consultation assistant for healthcare professionals that combines speech recognition, summarization, and medical content codification. Through a comprehensive approach involving simulated consultations, extensive testing, and careful metrics tracking, they evolved from MVP to production while maintaining high quality standards. The system achieved widespread adoption and positive feedback through iterative improvements based on both explicit and implicit user feedback, combining short-term prompt engineering optimizations with longer-term model and data improvements.

Production GenAI for User Safety and Enhanced Matching Experience

Tinder

Tinder implemented two production GenAI applications to enhance user safety and experience: a username detection system using fine-tuned Mistral 7B to identify social media handles in user bios with near-perfect recall, and a personalized match explanation feature using fine-tuned Llama 3.1 8B to help users understand why recommended profiles are relevant. Both systems required sophisticated LLMOps infrastructure including multi-model serving with LoRA adapters, GPU optimization, extensive monitoring, and iterative fine-tuning processes to achieve production-ready performance at scale.

Production Intent Recognition System for Enterprise Chatbots

FeedYou

FeedYou developed a sophisticated intent recognition system for their enterprise chatbot platform, addressing challenges in handling complex conversational flows and out-of-domain queries. They experimented with different NLP approaches before settling on a modular architecture using NLP.js, implementing hierarchical intent recognition with local and global intents, and integrating generative models for handling edge cases. The system achieved a 72% success rate for local intent matching and effectively handled complex conversational scenarios across multiple customer deployments.

Production Lessons from Building and Deploying AI Agents

Rasgo

Rasgo's journey in building and deploying AI agents for data analysis reveals key insights about production LLM systems. The company developed a platform enabling customers to use standard data analysis agents and build custom agents for specific tasks, with focus on database connectivity and security. Their experience highlights the importance of agent-computer interface design, the critical role of underlying model selection, and the significance of production-ready infrastructure over raw agent capabilities.

Production LLM Implementation for Customer Support Response Generation

Stripe

Stripe implemented a large language model system to help support agents answer customer questions more efficiently. They developed a sequential framework that combined fine-tuned models for question filtering, topic classification, and response generation. While the system achieved good accuracy in offline testing, they discovered challenges with agent adoption and the importance of monitoring online metrics. Key learnings included breaking down complex problems into manageable ML steps, prioritizing online feedback mechanisms, and maintaining high-quality training data.

Production LLM Systems at Scale - Lessons from Financial Services, Legal Tech, and ML Infrastructure

Nubank, Harvey AI, Galileo and Convirza

A panel discussion featuring leaders from Nubank, Harvey AI, Galileo, and Convirza discussing their experiences implementing LLMs in production. The discussion covered key challenges and solutions around model evaluation, cost optimization, latency requirements, and the transition from large proprietary models to smaller fine-tuned models. Participants shared insights on modularizing LLM applications, implementing human feedback loops, and balancing the tradeoffs between model size, cost, and performance in production environments.

Production Monitoring and Issue Discovery for AI Agents

Raindrop

Raindrop's CTO Ben presents a comprehensive framework for building reliable AI agents in production, addressing the challenge that traditional offline evaluations cannot capture the full complexity of real-world user behavior. The core problem is that AI agents fail in subtle ways without concrete errors, making issues difficult to detect and fix. Raindrop's solution centers on a "discover, track, and fix" loop that combines explicit signals like thumbs up/down with implicit signals detected semantically in conversations, such as user frustration, task failures, and agent forgetfulness. By clustering these signals with user intents and tracking them over time, teams can identify the most impactful issues and systematically improve their agents. The approach emphasizes experimentation and production monitoring over purely offline testing, drawing parallels to how traditional software engineering shifted from extensive QA to tools like Sentry for error monitoring.

Production RAG Best Practices: Implementation Lessons at Scale

Kapa.ai

Based on experience with over 100 technical teams including Docker, CircleCI, and Reddit, this case study examines key challenges and solutions in implementing production-grade RAG systems. The analysis covers critical aspects from data curation and refresh pipelines to evaluation frameworks and security practices, highlighting how most RAG implementations fail at the POC stage while providing concrete guidance for successful production deployments.

Production-Ready Agent Behavior: Identity, Intent, and Governance

Oso

Oso, a SaaS company that governs actions in B2B applications, presents a comprehensive framework for productionizing AI agents through three critical stages: prototype to QA, QA to production, and running in production. The company addresses fundamental challenges including agent identity (requiring user, agent, and session context), intent-based tool filtering to prevent unwanted behaviors like prompt injection attacks, and real-time governance mechanisms for monitoring and quarantining misbehaving agents. Using LangChain 1.0 middleware capabilities, Oso demonstrates how to implement deterministic guardrails that wrap both tool calls and model calls, preventing data exfiltration scenarios and ensuring agents only execute actions aligned with user intent. The solution enables security teams and product managers to dynamically control agent behavior in production without code changes, limiting blast radius when agents misbehave.

Production-Ready Question Generation System Using Fine-Tuned T5 Models

Digits

Digits implemented a production system for generating contextual questions for accountants using fine-tuned T5 models. The system helps accountants interact with clients by automatically generating relevant questions about transactions. They addressed key challenges like hallucination and privacy through multiple validation checks, in-house fine-tuning, and comprehensive evaluation metrics. The solution successfully deployed using TensorFlow Extended on Google Cloud Vertex AI with careful attention to training-serving skew and model performance monitoring.

Production-Scale NLP Suggestion System with Real-Time Text Processing

Grammarly

Grammarly built a sophisticated production system for delivering writing suggestions to 30 million users daily. The company developed an extensible operational transformation protocol using Delta format to represent text changes, user edits, and AI-generated suggestions in a unified manner. The system addresses critical challenges in managing ML-generated suggestions at scale: maintaining suggestion relevance as users edit text in real-time, rebasing suggestion positions according to ongoing edits without waiting for backend updates, and applying multiple suggestions simultaneously without UI freezing. The architecture includes a Suggestions Repository, Delta Manager for rebasing operations, and Highlights Manager, all working together to ensure suggestions remain accurate and applicable as document state changes dynamically.

Productionizing Generative AI Applications: From Exploration to Scale

LinkedIn

A LinkedIn product manager shares insights on bringing LLMs to production, focusing on their implementation of various generative AI features across the platform. The case study covers the complete lifecycle from idea exploration to production deployment, highlighting key considerations in prompt engineering, GPU resource management, and evaluation frameworks. The presentation emphasizes practical approaches to building trust-worthy AI products while maintaining scalability and user focus.

Productionizing LLM-Powered Data Governance with LangChain and LangSmith

Grab

Grab enhanced their LLM-powered data governance system (Metasense V2) by improving model performance and operational efficiency. The team tackled challenges in data classification by splitting complex tasks, optimizing prompts, and implementing LangChain and LangSmith frameworks. These improvements led to reduced misclassification rates, better collaboration between teams, and streamlined prompt experimentation and deployment processes while maintaining robust monitoring and safety measures.

Quantitative Framework for Production LLM Evaluation in Security Applications

Elastic

Elastic developed a comprehensive framework for evaluating and improving GenAI features in their security products, including an AI Assistant and Attack Discovery tool. The framework incorporates test scenarios, curated datasets, tracing capabilities using LangGraph and LangSmith, evaluation rubrics, and a scoring mechanism to ensure quantitative measurement of improvements. This systematic approach enabled them to move from manual to automated evaluations while maintaining high quality standards for their production LLM applications.

RAG System for Investment Policy Search and Advisory at RBC

Arcane

RBC developed an internal RAG (Retrieval Augmented Generation) system called Arcane to help financial advisors quickly access and interpret complex investment policies and procedures. The system addresses the challenge of finding relevant information across semi-structured documents, reducing the time specialists spend searching through documentation. The solution combines advanced parsing techniques, vector databases, and LLM-powered generation with a chat interface, while implementing robust evaluation methods to ensure accuracy and prevent hallucinations.

RAG-Based Industry Classification System for Customer Segmentation

Ramp

Ramp faced challenges with inconsistent industry classification across teams using homegrown taxonomies that were inaccurate, too generic, and not auditable. They solved this by building an in-house RAG (Retrieval-Augmented Generation) system that migrated all industry classification to standardized NAICS codes, featuring a two-stage process with embedding-based retrieval and LLM-based selection. The system improved data quality, enabled consistent cross-team communication, and provided interpretable results with full control over the classification process.

RAG-Based Industry Classification System for Financial Services

Ramp

Ramp, a financial services company, replaced their fragmented homegrown industry classification system with a standardized NAICS-based taxonomy powered by an in-house RAG model. The old system relied on stitched-together third-party data and multiple non-auditable sources of truth, leading to inconsistent, overly broad, and sometimes incorrect business categorizations. By building a custom RAG system that combines embeddings-based retrieval with LLM-based re-ranking, Ramp achieved significant improvements in classification accuracy (up to 60% in retrieval metrics and 5-15% in final prediction accuracy), gained full control over the model's behavior and costs, and enabled consistent cross-team usage of industry data for compliance, risk assessment, sales targeting, and product analytics.

RAG-Based System for Climate Finance Document Analysis

ClimateAligned

ClimateAligned, an early-stage startup, developed a RAG-based system to analyze climate-related financial documents and assess their "greenness." Starting with a small team of 2-3 engineers, they built a solution that combines LLMs, hybrid search, and human-in-the-loop processes to achieve 99% accuracy in document analysis. The system reduced analysis time from 2 hours to 20 minutes per company, even with human verification, and successfully evolved from a proof-of-concept to serving their first users while maintaining high accuracy standards.

RAG-powered Decision Intelligence Platform for Manufacturing Knowledge Management

Circuitry.ai

Circuitry.ai addressed the challenge of managing complex product information for manufacturers by developing an AI-powered decision intelligence platform. Using Databricks' infrastructure, they implemented RAG chatbots to process and serve proprietary customer data, resulting in a 60-70% reduction in information search time. The solution integrated Delta Lake for data management, Unity Catalog for governance, and custom knowledge bases with Llama and DBRX models for accurate response generation.

Rapid Development and Deployment of Enterprise LLM Features Through Centralized LLM Service Architecture

PagerDuty

PagerDuty successfully developed and deployed multiple GenAI features in just two months by implementing a centralized LLM API service architecture. They created AI-powered features including runbook generation, status updates, postmortem reports, and an AI assistant, while addressing challenges of rapid development with new technology. Their solution included establishing clear processes, role definitions, and a centralized LLM service with robust security, monitoring, and evaluation frameworks.

Rapid Development of AI-Powered Video Interview Analysis System

Vericant

Vericant, an educational testing company, developed and deployed an AI-powered video interview analysis system in just 30 days. The solution automatically processes 15-minute admission interview videos to generate summaries, key points, and topic analyses, enabling admissions teams to review interviews in 20-30 seconds instead of watching full recordings. The implementation was achieved through iterative prompt engineering and a systematic evaluation framework, without requiring significant engineering resources or programming expertise.

Real-Time Access Control and Credit System for High-Scale LLM Products

OpenAI

OpenAI encountered significant scaling challenges with Codex and Sora as rapid user adoption pushed usage beyond expected limits, creating frustrating experiences when users hit rate limits. To address this, they built an in-house real-time access engine that seamlessly blends rate limits with a credit-based pay-as-you-go system, enabling users to continue working without hard stops. The solution involved creating a distributed usage and balance system with provably correct billing, real-time decision-making, idempotent credit debits, and comprehensive audit trails that maintain user trust while ensuring fair access and system performance at scale.

Real-Time AI Chief of Staff for Product Teams

Earmark

Earmark built a productivity suite for product teams that transforms meeting conversations into finished work in real-time, addressing the problem of endless context-switching and manual follow-up work that plagues modern product development. Founded by Mark Barb and Sandon, who both came from the product management SaaS space, Earmark uses live transcription and multiple parallel AI agents to generate product specs, tickets, summaries, and other artifacts during meetings rather than after them. The company pivoted from an Apple Vision Pro communication training tool to a web-based real-time meeting assistant after discovering through 60 customer interviews that few people actually prepare for presentations. With 78% of survey respondents saying they'd be "super bummed" if the product disappeared, Earmark has achieved strong product-market fit by focusing specifically on product managers, engineering leaders, and adjacent roles who spend most of their time in back-to-back meetings with different audiences and deliverables.

Real-time Data Streaming Architecture for AI Customer Support

Clari

A fictional airline case study demonstrates how shifting from batch processing to real-time data streaming transformed their AI customer support system. By implementing a shift-left data architecture using Kafka and Flink, they eliminated data silos and delayed processing, enabling their AI agents to access up-to-date customer information across all channels. This resulted in improved customer satisfaction, reduced latency, and decreased operational costs while enabling their AI system to provide more accurate and contextual responses.

Real-Time Generative AI for Immersive Theater Performance

University of California Los Angeles

The University of California Los Angeles (UCLA) Office of Advanced Research Computing (OARC) partnered with UCLA's Center for Research and Engineering in Media and Performance (REMAP) to build an AI-powered system for an immersive production of the musical "Xanadu." The system enabled up to 80 concurrent audience members and performers to create sketches on mobile phones, which were processed in near real-time (under 2 minutes) through AWS generative AI services to produce 2D images and 3D meshes displayed on large LED screens during live performances. Using a serverless-first architecture with Amazon SageMaker AI endpoints, Amazon Bedrock foundation models, and AWS Lambda orchestration, the system successfully supported 7 performances in May 2025 with approximately 500 total audience members, demonstrating that cloud-based generative AI can reliably power interactive live entertainment experiences.

Real-Time Multilingual Chat Translation at Scale

Roblox

Roblox deployed a unified transformer-based translation LLM to enable real-time chat translation across all combinations of 16 supported languages for over 70 million daily active users. The company built a custom ~1 billion parameter model using pretraining on open source and proprietary data, then distilled it down to fewer than 650 million parameters to achieve approximately 100 millisecond latency while handling over 5,000 chats per second. The solution leverages a mixture-of-experts architecture, custom translation quality estimation models, back translation techniques for low-resource language pairs, and comprehensive integration with trust and safety systems to deliver contextually appropriate translations that understand Roblox-specific slang and terminology.

Real-World LLM Implementation: RAG, Documentation Generation, and Natural Language Processing at Scale

Mercado Libre

Mercado Libre implemented three major LLM use cases: a RAG-based documentation search system using Llama Index, an automated documentation generation system for thousands of database tables, and a natural language processing system for product information extraction and service booking. The project revealed key insights about LLM limitations, the importance of quality documentation, prompt engineering, and the effective use of function calling for structured outputs.

Rebuilding a Production Chatbot with Direct API Access and Multi-Agent Architecture

Langchain

LangChain rebuilt their public documentation chatbot after discovering their support engineers preferred using their own internal workflow over the existing tool. The original chatbot used traditional vector embedding retrieval, which suffered from fragmented context, constant reindexing, and vague citations. The solution involved building two distinct architectures: a fast CreateAgent for simple documentation queries delivering sub-15-second responses, and a Deep Agent with specialized subgraphs for complex queries requiring codebase analysis. The new approach replaced vector embeddings with direct API access to structured content (Mintlify for docs, Pylon for knowledge base, and ripgrep for codebase search), enabling the agent to search iteratively like a human. Results included dramatically faster response times, precise citations with line numbers, elimination of reindexing overhead, and internal adoption by support engineers for complex troubleshooting.

Reducing False Positives in AI Code Review Agents Through Architecture Refinement

cubic

cubic, an AI-native GitHub platform, developed an AI code review agent that initially suffered from excessive false positives and low-value comments, causing developers to lose trust in the system. Through three major architecture revisions and extensive offline testing, the team implemented explicit reasoning logs, streamlined tooling, and specialized micro-agents instead of a single monolithic agent. These changes resulted in a 51% reduction in false positives without sacrificing recall, significantly improving the agent's precision and usefulness in production code reviews.

Refining Input Guardrails for Safer LLM Applications Through Chain-of-Thought Fine-Tuning

Capital One

Capital One developed enhanced input guardrails to protect LLM-powered conversational assistants from adversarial attacks and malicious inputs. The company used chain-of-thought prompting combined with supervised fine-tuning (SFT) and alignment techniques like Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) to improve the accuracy of LLM-as-a-Judge moderation systems. Testing on four open-source models (Mistral 7B, Mixtral 8x7B, Llama2 13B, and Llama3 8B) showed significant improvements in F1 scores and attack detection rates of over 50%, while maintaining low false positive rates, demonstrating that effective guardrails can be achieved with small training datasets and minimal computational resources.

Responsible AI Implementation for Healthcare Form Automation

WellSky

WellSky, serving over 2,000 hospitals and handling 100 million forms annually, partnered with Google Cloud to address clinical documentation burden and clinician burnout. They developed an AI-powered solution focusing on form automation, implementing a comprehensive responsible AI framework with emphasis on evidence citation, governance, and technical foundations. The project aimed to reduce "pajama time" - where 75% of nurses complete documentation after hours - while ensuring patient safety through careful AI deployment.

Responsible LLM Adoption for Fraud Detection with RAG Architecture

Mastercard

Mastercard successfully implemented LLMs in their fraud detection systems, achieving up to 300% improvement in detection rates. They approached this by focusing on responsible AI adoption, implementing RAG (Retrieval Augmented Generation) architecture to handle their large amounts of unstructured data, and carefully considering access controls and security measures. The case study demonstrates how enterprise-scale LLM deployment requires careful consideration of technical debt, infrastructure scaling, and responsible AI principles.

Retrieval Augmented LLMs for Real-time CRM Account Linking

Schneider Electric

Schneider Electric partnered with AWS Machine Learning Solutions Lab to automate their CRM account linking process using Retrieval Augmented Generation (RAG) with Flan-T5-XXL model. The solution combines LangChain, Google Search API, and SEC-10K data to identify and maintain up-to-date parent-subsidiary relationships between customer accounts, improving accuracy from 55% to 71% through domain-specific prompt engineering.

Running LLM Agents in Production for Accounting Automation

Digits

Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.

Scalable Intelligent Document Processing with Multi-Tenant Serverless Architecture

Ricoh

Ricoh USA faced significant scalability challenges in their healthcare document processing operations, where each new customer implementation required 40-60 hours of custom engineering work involving unique prompt engineering, model fine-tuning, and integration testing. To address anticipated sevenfold growth in document volume (from 10,000 to 70,000 documents monthly), Ricoh partnered with AWS to implement the GenAI IDP Accelerator using a serverless architecture combining Amazon Textract for OCR and Amazon Bedrock foundation models for intelligent classification and extraction. The solution reduced customer onboarding time from 4-6 weeks to 2-3 days, decreased engineering hours per deployment by over 90% (from ~80 hours to <5 hours), and created a reusable, multi-tenant framework that maintains strict healthcare compliance standards (HITRUST, HIPAA, SOC 2) while enabling effective human-in-the-loop workflows through confidence scoring mechanisms.

Scaling a High-Traffic LLM Chat Application to 30,000 Messages Per Second

Character.ai

Character.ai scaled their open-domain conversational AI platform from 300 to over 30,000 generations per second within 18 months, becoming the third most-used generative AI application globally. They tackled unique engineering challenges around data volume, cost optimization, and connection management while maintaining performance. Their solution involved custom model architectures, efficient GPU caching strategies, and innovative prompt management tools, all while balancing performance, latency, and cost considerations at scale.

Scaling Agentic AI for Digital Accessibility and Content Intelligence

Siteimprove

Siteimprove, a SaaS platform provider for digital accessibility, analytics, SEO, and content strategy, embarked on a journey from generative AI to production-scale agentic AI systems. The company faced the challenge of processing up to 100 million pages per month for accessibility compliance while maintaining trust, speed, and adoption. By leveraging AWS Bedrock, Amazon Nova models, and developing a custom AI accelerator architecture, Siteimprove built a multi-agent system supporting batch processing, conversational remediation, and contextual image analysis. The solution achieved 75% cost reduction on certain workloads, enabled autonomous multi-agent orchestration across accessibility, analytics, SEO, and content domains, and was recognized as a leader in Forrester's digital accessibility platforms assessment. The implementation demonstrated how systematic progression through human-in-the-loop, human-on-the-loop, and autonomous stages can bridge the prototype-to-production chasm while delivering measurable business value.

Scaling AI Agents Across Enterprise Sales and Customer Service Operations

Salesforce

Salesforce deployed its Agentforce platform across the entire organization as "Customer Zero," learning critical lessons about agent deployment, testing, data quality, and human-AI collaboration over the course of one year. The company scaled AI agents across sales and customer service operations, with their service agent handling over 1.5 million support requests, the SDR agent generating $1.7 million in new pipeline from dormant leads after working on 43,000+ leads, and agents in Slack saving employees 500,000 hours annually. Early challenges included high "I don't know" response rates (30%), overly restrictive guardrails that prevented legitimate customer interactions, and data inconsistency issues across 650+ data streams, which were addressed through iterative refinement, data governance improvements using Salesforce Data Cloud, and a shift from prescriptive instructions to goal-oriented agent design.

Scaling AI Agents to Production: A Blueprint for Autonomous Customer Service

Cox Automotive

Cox Automotive, a dominant player in the automotive software industry with visibility into 5.1 trillion vehicle insights, faced the challenge of moving AI agents from prototype to production at scale. In response to an aggressive 5-week deadline set in summer 2024, the company launched five agentic AI products using Amazon Bedrock Agent Core and the Strands framework. The flagship product was a fully automated virtual assistant for dealership customer conversations that operates autonomously after hours without human oversight. By establishing foundational infrastructure with Agent Core, implementing comprehensive red teaming practices, designing both hard and soft guardrails, automating evaluation with LLM-as-judge techniques, and setting circuit breakers for cost and conversation limits, Cox Automotive successfully deployed three products to production beta, with dealers reporting that customers receive timely responses both during business hours and after hours.

Scaling AI Coding Agents Through Automated Verification and Specification-Driven Development

Factory AI

Factory AI presents a framework for enabling autonomous software engineering agents to operate at scale within production environments. The core challenge addressed is that most organizations lack sufficient automated validation infrastructure to support reliable AI agent deployment across the software development lifecycle. The proposed solution shifts from traditional specification-based development to verification-driven development, emphasizing the creation of rigorous automated validation criteria including comprehensive testing, opinionated linters, documentation, and continuous feedback loops. By investing in this validation infrastructure, organizations can achieve 5-7x productivity improvements rather than marginal gains, enabling fully autonomous workflows where AI agents can handle tasks from bug filing to production deployment with minimal human intervention.

Scaling AI Image Animation System with Optimized Latency and Traffic Management

Meta

Meta developed and deployed an AI-powered image animation feature that needed to serve billions of users efficiently. They tackled this challenge through a comprehensive optimization strategy including floating-point precision reduction, temporal-attention improvements, DPM-Solver implementation, and innovative distillation techniques. The system was further enhanced with sophisticated traffic management and load balancing solutions, resulting in a highly efficient, globally scalable service with minimal latency and failure rates.

Scaling AI Infrastructure for Legal AI Applications at Enterprise Scale

Harvey

Harvey, a legal AI platform company, developed a comprehensive AI infrastructure system to handle millions of daily requests across multiple AI models for legal document processing and analysis. The company built a centralized Python library that manages model deployments, implements load balancing, quota management, and real-time monitoring to ensure reliability and performance. Their solution includes intelligent model endpoint selection, distributed rate limiting using Redis-backed token bucket algorithms, a proxy service for developer access, and comprehensive observability tools, enabling them to process billions of prompt tokens while maintaining high availability and seamless scaling for their legal AI products.

Scaling AI Infrastructure: From Training to Inference at Meta

Meta

Meta shares their journey in scaling AI infrastructure to support massive LLM training and inference operations. The company faced challenges in scaling from 256 GPUs to over 100,000 GPUs in just two years, with plans to reach over a million GPUs by year-end. They developed solutions for distributed training, efficient inference, and infrastructure optimization, including new approaches to data center design, power management, and GPU resource utilization. Key innovations include the development of a virtual machine service for secure code execution, improvements in distributed inference, and novel approaches to reducing model hallucinations through RAG.

Scaling AI-Assisted Coding Infrastructure: From Auto-Complete to Global Deployment

Cursor

Cursor, an AI-assisted coding platform, scaled their infrastructure from handling basic code completion to processing 100 million model calls per day across a global deployment. They faced and overcame significant challenges in database management, model inference scaling, and indexing systems. The case study details their journey through major incidents, including a database crisis that led to a complete infrastructure refactor, and their innovative solutions for handling high-scale AI model inference across multiple providers while maintaining service reliability.

Scaling AI-Assisted Developer Tools and Agentic Workflows at Scale

Slack

Slack's Developer Experience team embarked on a multi-year journey to integrate generative AI into their internal development workflows, moving from experimental prototypes to production-grade AI assistants and agentic systems. Starting with Amazon SageMaker for initial experimentation, they transitioned to Amazon Bedrock for simplified infrastructure management, achieving a 98% cost reduction. The team rolled out AI coding assistants using Anthropic's Claude Code and Cursor integrated with Bedrock, resulting in 99% developer adoption and a 25% increase in pull request throughput. They then evolved their internal knowledge bot (Buddybot) into a sophisticated multi-agent system handling over 5,000 escalation requests monthly, using AWS Strands as an orchestration framework with Claude Code sub-agents, Temporal for workflow durability, and MCP servers for standardized tool access. The implementation demonstrates a pragmatic approach to LLMOps, prioritizing incremental deployment, security compliance (FedRAMP), observability through OpenTelemetry, and maintaining model agnosticism while scaling to millions of tokens per minute.

Scaling AI-Powered Code Generation in Browser and Enterprise Environments

Qodo / Stackblitz

The case study examines two companies' approaches to deploying LLMs for code generation at scale: Stackblitz's Bolt.new achieving over $8M ARR in 2 months with their browser-based development environment, and Qodo's enterprise-focused solution handling complex deployment scenarios across 96 different configurations. Both companies demonstrate different approaches to productionizing LLMs, with Bolt.new focusing on simplified web app development for non-developers and Qodo targeting enterprise testing and code review workflows.

Scaling AI-Powered Student Support Chatbots Across Campus

UC Santa Barbara

UC Santa Barbara implemented an AI-powered chatbot platform called "Story" (powered by Gravity's Ivy and Ocelot services) to address challenges in student support after COVID-19, particularly helping students navigate campus services and reducing staff workload. Starting with a pilot of five departments in 2022, UCSB scaled to 19 chatbot instances across diverse student services over two and a half years. The implementation resulted in nearly 40,000 conversations, with 30% occurring outside business hours, significantly reducing phone and email volume to departments while enabling staff to focus on more complex student inquiries. The university took a phased cohort approach, training departments in groups over 10-week periods, with student testers providing crucial feedback on language and expectations before launch.

Scaling an Autonomous AI Customer Support Agent from Demo to Production

Intercom

Intercom developed Finn, an autonomous AI customer support agent, evolving it from early prototypes with GPT-3.5 to a production system using GPT-4 and custom architecture. Initially hampered by hallucinations and safety concerns, the system now successfully resolves 58-59% of customer support conversations, up from 25% at launch. The solution combines multiple AI processes including disambiguation, ranking, and summarization, with careful attention to brand voice control and escalation handling.

Scaling an MCP Server for Error Monitoring to 60 Million Monthly Requests

Sentry

Sentry, an error monitoring platform, built a Model Context Protocol (MCP) server to improve the workflow where developers would copy error details from Sentry's UI and paste them into AI coding assistants like Cursor. The MCP server provides direct integration with 10-15 tools, including retrieving issue details and triggering automated fix attempts through Sentry's AI agent. The implementation scaled from 30 million to 60 million requests per month, with over 5,000 organizations using it. The company learned critical lessons about treating MCP servers as production services, implementing comprehensive observability, managing context pollution, and taking responsibility for agent behavior through careful prompt engineering and tool description design.

Scaling and Operating Large Language Models at the Frontier

Anthropic

This case study examines Anthropic's journey in scaling and operating large language models, focusing on their transition from GPT-3 era training to current state-of-the-art systems like Claude. The company successfully tackled challenges in distributed computing, model safety, and operational reliability while growing 10x in revenue. Key innovations include their approach to constitutional AI, advanced evaluation frameworks, and sophisticated MLOps practices that enable running massive training operations with hundreds of team members.

Scaling Audio Content Generation with LLMs and TTS for Language Learning

Duolingo

Duolingo tackled the challenge of scaling their DuoRadio feature, a podcast-like audio learning experience, by implementing an AI-driven content generation pipeline. They transformed a labor-intensive manual process into an automated system using LLMs for script generation and evaluation, coupled with Text-to-Speech technology. This allowed them to expand from 300 to 15,000+ episodes across 25+ language courses in under six months, while reducing costs by 99% and growing daily active users from 100K to 5.5M.

Scaling Chatbot Platform with Hybrid LLM and Custom Model Approach

Voiceflow

Voiceflow, a chatbot and voice assistant platform, integrated large language models into their existing infrastructure while maintaining custom language models for specific tasks. They used OpenAI's API for generative features but kept their custom NLU model for intent/entity detection due to superior performance and cost-effectiveness. The company implemented extensive testing frameworks, prompt engineering, and error handling while dealing with challenges like latency variations and JSON formatting issues.

Scaling Contact Center Operations with AI Agents in Fintech and Travel Industries

Propel Holdings / Xanterra Travel Collection

Propel Holdings (fintech) and Xanterra Travel Collection (travel/hospitality) implemented Cresta's AI agent solutions to address scaling challenges and operational efficiency in their contact centers. Both organizations started with agent assist capabilities before deploying conversational AI agents for chat and voice channels. Propel Holdings needed to support 40% year-over-year growth without proportionally scaling human agents, while Xanterra sought to reduce call volume for routine inquiries and provide 24/7 coverage. Starting with FAQ-based use cases and later integrating APIs for transactional capabilities, both companies achieved significant results: Propel Holdings reached 58% chat containment after API integration, while Xanterra achieved 60-90% containment on chat and 20-30% on voice channels. Within five months, Xanterra deployed 12 AI agents across different properties and channels, demonstrating rapid scaling capability while maintaining customer satisfaction and redeploying human agents to higher-value interactions.

Scaling Content Production and Fan Engagement with Gen AI

Bundesliga

Bundesliga (DFL), Germany's premier soccer league, deployed multiple Gen AI solutions to address two key challenges: scaling content production for over 1 billion global fans across 200 countries, and enhancing personalized fan engagement to reduce "second screen chaos" during live matches. The organization implemented three main production-scale solutions: automated match report generation that saves editors 90% of their time, AI-powered story creation from existing articles that reduces production time by 80%, and on-demand video localization that cuts processing time by 75% while reducing costs by 3.5x. Additionally, they developed MatchMade, an AI-powered fan companion featuring dynamic text-to-SQL workflows and proactive content nudging. By leveraging Amazon Nova for cost-performance optimization alongside other models like Anthropic's Claude, Bundesliga achieved a 70% cost reduction in image assignment tasks, 35% cost reduction through dynamic routing, and scaled personalized content delivery by 5x per user while serving over 100,000 fans in production.

Scaling Customer Support AI Chatbot to Production with Multiple LLM Providers

Intercom

Intercom developed Fin, an AI customer support chatbot that resolves up to 86% of conversations instantly. They faced challenges scaling from proof-of-concept to production, particularly around reliability and cost management. The team successfully improved their system from 99% to 99.9%+ reliability by implementing cross-region inference, strategic use of streaming, and multiple model fallbacks while using Amazon Bedrock and other LLM providers. The solution has processed over 13 million conversations for 4,000+ customers with most achieving over 50% automated resolution rates.

Scaling Customer Support, Compliance, and Developer Productivity with Gen AI

Coinbase

Coinbase, a cryptocurrency exchange serving millions of users across 100+ countries, faced challenges scaling customer support amid volatile market conditions, managing complex compliance investigations, and improving developer productivity. They built a comprehensive Gen AI platform integrating multiple LLMs through standardized interfaces (OpenAI API, Model Context Protocol) on AWS Bedrock to address these challenges. Their solution includes AI-powered chatbots handling 65% of customer contacts automatically (saving ~5 million employee hours annually), compliance investigation tools that synthesize data from multiple sources to accelerate case resolution, and developer productivity tools where 40% of daily code is now AI-generated or influenced. The implementation uses a multi-layered agentic architecture with RAG, guardrails, memory systems, and human-in-the-loop workflows, resulting in significant cost savings, faster resolution times, and improved quality across all three domains.

Scaling Document Processing with LLMs and Human Review

Vendr / Extend

Vendr partnered with Extend to extract structured data from SaaS order forms and contracts using LLMs. They implemented a hybrid approach combining LLM processing with human review to achieve high accuracy in entity recognition and data extraction. The system successfully processed over 100,000 documents, using techniques such as document embeddings for similarity clustering, targeted human review, and robust entity mapping. This allowed Vendr to unlock valuable pricing insights for their customers while maintaining high data quality standards.

Scaling Financial Software with GenAI and Production ML

Ramp

Ramp, a financial technology company, has integrated AI and ML throughout their operations, from their core financial products to their sales and customer service. They evolved from traditional ML use cases like fraud detection and underwriting to more advanced generative AI applications. Their Ramp Intelligence suite now includes features like automated price comparison, expense categorization, and an experimental AI agent that can guide users through the platform's interface. The company has achieved significant productivity gains, with their sales development representatives booking 3-4x more meetings than competitors through AI augmentation.

Scaling Game Content Production with LLMs and Data Augmentation

Ubisoft

Ubisoft leveraged AI21 Labs' LLM capabilities to automate tedious scriptwriting tasks and generate training data for their internal models. By implementing a writer-in-the-loop workflow for NPC dialogue generation and using AI21's models for data augmentation, they successfully scaled their content production while maintaining creative control. The solution included optimized token pricing for extensive prompt experimentation and resulted in significant efficiency gains in their game development process.

Scaling Generative AI in Gaming: From Safety to Creation Tools

Roblox

Roblox has implemented a comprehensive suite of generative AI features across their gaming platform, addressing challenges in content moderation, code assistance, and creative tools. Starting with safety features using transformer models for text and voice moderation, they expanded to developer tools including AI code assistance, material generation, and specialized texture creation. The company releases new AI features weekly, emphasizing rapid iteration and public testing, while maintaining a balance between automation and creator control. Their approach combines proprietary solutions with open-source contributions, demonstrating successful large-scale deployment of AI in a production gaming environment serving 70 million daily active users.

Scaling Image Generation to 100M New Users in One Week

OpenAI

OpenAI's launch of ChatGPT Images faced unprecedented scale, attracting 100 million new users generating 700 million images in the first week. The engineering team had to rapidly adapt their synchronous image generation system to an asynchronous one while handling production load, implementing system isolation, and managing resource constraints. Despite the massive scale and technical challenges, they maintained service availability by prioritizing access over latency and successfully scaled their infrastructure.

Scaling Knowledge Management with LLM-powered Chatbot in Manufacturing

OSRAM

OSRAM, a century-old lighting technology company, faced challenges with preserving institutional knowledge amid workforce transitions and accessing scattered technical documentation across their manufacturing operations. They partnered with Adastra to implement an AI-powered chatbot solution using Amazon Bedrock and Claude, incorporating RAG and hybrid search approaches. The solution achieved over 85% accuracy in its initial deployment, with expectations to exceed 90%, successfully helping workers access critical operational information more efficiently across different departments.

Scaling LLM and ML Models to 300M Monthly Requests with Self-Hosting

StoryGraph

StoryGraph, a book recommendation platform, successfully scaled their AI/ML infrastructure to handle 300M monthly requests by transitioning from cloud services to self-hosted solutions. The company implemented multiple custom ML models, including book recommendations, similar users, and a large language model, while maintaining data privacy and reducing costs significantly compared to using cloud APIs. Through innovative self-hosting approaches and careful infrastructure optimization, they managed to scale their operations despite being a small team, though not without facing significant challenges during high-traffic periods.

Scaling LLM-Powered Financial Insights with Continuous Evaluation

Fintool

Fintool, an AI equity research assistant, faced the challenge of processing massive amounts of financial data (1.5 billion tokens across 70 million document chunks) while maintaining high accuracy and trust for institutional investors. They implemented a comprehensive LLMOps evaluation workflow using Braintrust, combining automated LLM-based evaluation, golden datasets, format validation, and human-in-the-loop oversight to ensure reliable and accurate financial insights at scale.

Scaling Local News Coverage with AI-Powered Newsletter Generation

Patch

Patch transformed its local news coverage by implementing AI-powered newsletter generation, enabling them to expand from 1,100 to 30,000 communities while maintaining quality and trust. The system combines curated local data sources, weather information, event calendars, and social media content, processed through AI to create relevant, community-specific newsletters. This approach resulted in over 400,000 new subscribers and a 93.6% satisfaction rating, while keeping costs manageable and maintaining editorial standards.

Scaling Meta AI's Feed Deep Dive from Launch to Product-Market Fit

Meta

Meta launched Feed Deep Dive as an AI-powered feature on Facebook in April 2024 to address information-seeking and context enrichment needs when users encounter posts they want to learn more about. The challenge was scaling from launch to product-market fit while maintaining high-quality responses at Meta scale, dealing with LLM hallucinations and refusals, and providing more value than users would get from simply scrolling Facebook Feed. Meta's solution involved evolving from traditional orchestration to agentic models with planning, tool calling, and reflection capabilities; implementing auto-judges for online quality evaluation; using smart caching strategies focused on high-traffic posts; and leveraging ML-based user cohort targeting to show the feature to users who derived the most value. The results included achieving product-market fit through improved quality and engagement, with the team now moving toward monetization and expanded use cases.

Scaling ML Annotation Platform with LLMs for Content Classification

Spotify

Spotify needed to generate high-quality training data annotations at massive scale to support ML models covering hundreds of millions of tracks and podcast episodes for tasks like content relations detection and platform policy violation identification. They built a comprehensive annotation platform centered on three pillars: scaling human expertise through tiered workforce structures, implementing flexible annotation tooling with custom interfaces and quality metrics, and establishing robust infrastructure for integration with ML workflows. A key innovation was deploying a configurable LLM-based system running in parallel with human annotators. This approach increased their annotation corpus by 10x while improving annotator productivity by 3x, enabling them to generate millions of annotations and significantly reduce ML model development time.

Scaling Open-Ended Customer Service Analysis with Foundation Models

MaestroQA

MaestroQA enhanced their customer service quality assurance platform by integrating Amazon Bedrock to analyze millions of customer interactions at scale. They implemented a solution that allows customers to ask open-ended questions about their service interactions, enabling sophisticated analysis beyond traditional keyword-based approaches. The system successfully processes high volumes of transcripts across multiple regions while maintaining low latency, leading to improved compliance detection and customer sentiment analysis for their clients across various industries.

Scaling Order Processing Automation Using Modular LLM Architecture

Choco

Choco developed an AI system to automate the order intake process for food and beverage distributors, handling unstructured orders from various channels (email, voicemail, SMS, WhatsApp). By implementing a modular LLM architecture with specialized components for transcription, information extraction, and product matching, along with comprehensive evaluation pipelines and human feedback loops, they achieved over 95% prediction accuracy. One customer reported 60% reduction in manual order entry time and 50% increase in daily order processing capacity without additional staffing.

Scaling Privacy Infrastructure for GenAI Product Innovation

Meta

Meta addresses the challenge of maintaining user privacy while deploying GenAI-powered products at scale, using their AI glasses as a primary example. The company developed Privacy Aware Infrastructure (PAI), which integrates data lineage tracking, automated policy enforcement, and comprehensive observability across their entire technology stack. This infrastructure automatically tracks how user data flows through systemsโ€”from initial collection through sensor inputs, web processing, LLM inference calls, data warehousing, to model trainingโ€”enabling Meta to enforce privacy controls programmatically while accelerating product development. The solution allows engineering teams to innovate rapidly with GenAI capabilities while maintaining auditable, verifiable privacy guarantees across thousands of microservices and products globally.

Scaling Product Categorization from Manual Tagging to LLM-Based Classification

GetYourGuide

GetYourGuide, a global marketplace for travel experiences, evolved their product categorization system from manual tagging to an LLM-based solution to handle 250,000 products across 600 categories. The company progressed through rule-based systems and semantic NLP models before settling on a hybrid approach using OpenAI's GPT-4-mini with structured outputs, combined with embedding-based ranking and batch processing with early stopping. This solution processes one product-category pair at a time, incorporating reasoning and confidence fields to improve decision quality. The implementation resulted in significant improvements: Matthew's Correlation Coefficient increased substantially, 50 previously excluded categories were reintroduced, 295 new categories were enabled, and A/B testing showed a 1.3% increase in conversion rate, improved quote rate, and reduced bounce rate.

Scaling Product Categorization with Batch Inference and Prompt Engineering

GoDaddy

GoDaddy sought to improve their product categorization system that was using Meta Llama 2 for generating categories for 6 million products but faced issues with incomplete/mislabeled categories and high costs. They implemented a new solution using Amazon Bedrock's batch inference capabilities with Claude and Llama 2 models, achieving 97% category coverage (exceeding their 90% target), 80% faster processing time, and 8% cost reduction while maintaining high quality categorization as verified by subject matter experts.

Scaling RAG Accuracy from 49% to 86% in Finance Q&A Assistant

Amazon Finance

Amazon Finance Automation developed a RAG-based Q&A chat assistant using Amazon Bedrock to help analysts quickly retrieve answers to customer queries. Through systematic improvements in document chunking, prompt engineering, and embedding model selection, they increased the accuracy of responses from 49% to 86%, significantly reducing query response times from days to minutes.

Secure Authentication for AI Agents using Model Context Protocol

Arcade

Arcade identified a critical security gap in the Model Context Protocol (MCP) where AI agents needed secure access to third-party APIs like Gmail but lacked proper OAuth 2.0 authentication mechanisms. They developed two solutions: first introducing user interaction capabilities (PR #475), then extending MCP's elicitation framework with URL mode (PR #887) to enable secure OAuth flows while maintaining proper security boundaries between trusted servers and untrusted clients. This work addresses fundamental production deployment challenges for AI agents that need authenticated access to real-world systems.

Security Learnings from LLM Production Deployments

NVIDIA

Based on a year of experience with NVIDIA's product security and AI red team, this case study examines real-world security challenges in LLM deployments, particularly focusing on RAG systems and plugin architectures. The study reveals common vulnerabilities in production LLM systems, including data leakage through RAG, prompt injection risks, and plugin security issues, while providing practical mitigation strategies for each identified threat vector.

Sequence-Tagging Approach to Grammatical Error Correction in Production

Grammarly

Grammarly developed GECToR, a novel grammatical error correction (GEC) system that treats error correction as a sequence-tagging problem rather than the traditional neural machine translation approach. Instead of rewriting entire sentences through encoder-decoder models, GECToR tags individual tokens with custom transformations (like $DELETE, $APPEND, $REPLACE) using a BERT-like encoder with linear layers. This approach achieved state-of-the-art F0.5 scores (65.3 on CoNLL-2014, 72.4 on BEA-2019) while running up to 10 times faster than NMT-based systems, with inference speeds of 0.20-0.40 seconds compared to 0.71-4.35 seconds for transformer-NMT approaches. The system uses iterative correction over multiple passes and custom g-transformations for complex operations like verb conjugation and noun number changes, making it more suitable for real-world production deployment in Grammarly's writing assistant.

Six Principles for Building Production AI Agents

App.build

App.build shared six empirical principles learned from building production AI agents that help overcome common challenges in agentic system development. The principles focus on investing in system prompts with clear instructions, splitting context to manage costs and attention, designing straightforward tools with limited parameters, implementing feedback loops with actor-critic patterns, using LLMs for error analysis, and recognizing that frustrating agent behavior often indicates system design issues rather than model limitations. These guidelines emerged from practical experience in developing software engineering agents and emphasize systematic approaches to building reliable, recoverable agents that fail gracefully.

Source-Grounded LLM Assistant with Multi-Modal Output Capabilities

Google / NotebookLLM

Google's NotebookLM tackles the challenge of making large language models more focused and personalized by introducing source grounding - allowing users to upload their own documents to create a specialized AI assistant. The system combines Gemini 1.5 Pro with sophisticated audio generation to create human-like podcast-style conversations about user content, complete with natural speech patterns and disfluencies. The solution includes built-in safety features, privacy protections through transient context windows, and content watermarking, while enabling users to generate insights from personal documents without contributing to model training data.

SQL Query Agent for Data Democratization

Prosus

Prosus developed a SQL-generating agent called "Token Data Analyst" to help democratize data access across their portfolio companies. The agent serves as a first-line support for data queries, allowing non-technical users to get insights from databases through natural language questions in Slack. The system achieved a 74% reduction in query response time and significantly increased the total number of data insights generated, while maintaining high accuracy through careful prompt engineering and context management.

State of Production Machine Learning and LLMOps in 2024

Zalando

A comprehensive overview of the current state and challenges of production machine learning and LLMOps, covering key areas including motivations, industry trends, technological developments, and organizational changes. The presentation highlights the evolution from model-centric to data-centric approaches, the importance of metadata management, and the growing focus on security and monitoring in ML systems.

Strategic Framework for Generative AI Implementation in Food Delivery Platform

Doordash

DoorDash outlines a comprehensive strategy for implementing Generative AI across five key areas: customer assistance, interactive discovery, personalized content generation, information extraction, and employee productivity enhancement. The company aims to revolutionize its delivery platform while maintaining strong considerations for data privacy and security, focusing on practical applications ranging from automated cart building to SQL query generation.

Strategic LLM Implementation in Chemical Manufacturing with Focus on Documentation and Virtual Agents

Chevron Philips Chemical

Chevron Phillips Chemical is implementing generative AI with a focus on virtual agents and document processing, taking a measured approach to deployment. They formed a cross-functional team including legal, IT security, and data science to educate leadership and identify appropriate use cases. The company is particularly focusing on processing unstructured documents and creating virtual agents for specific topics, while carefully considering bias, testing challenges, and governance in their implementation strategy.

Streamlining Legislative Analysis Model Deployment with MLOps

FiscalNote

FiscalNote, facing challenges in deploying and updating their legislative analysis ML models efficiently, transformed their MLOps pipeline using Databricks' MLflow and Model Serving. This shift enabled them to reduce deployment time and increase model deployment frequency by 3x, while improving their ability to provide timely legislative insights to clients through better model management and deployment practices.

Structured AI Workflow Orchestration for Developer Productivity at Scale

Shopify

Shopify's Augmented Engineering team developed Roast, an open-source workflow orchestration framework that structures AI agents to solve developer productivity challenges like flaky tests and low test coverage. The team discovered that breaking complex AI tasks into discrete, structured steps was essential for reliable performance at scale, leading them to create a convention-over-configuration tool that combines deterministic code execution with AI-powered analysis, enabling reproducible and testable AI workflows that can be version-controlled and integrated into development processes.

Structured LLM Conversations for Language Learning Video Calls

Duolingo

Duolingo implemented an AI-powered video call feature called "Video Call with Lily" that enables language learners to practice speaking with an AI character. The system uses carefully structured prompts, conversational blueprints, and dynamic evaluations to ensure appropriate difficulty levels and natural interactions. The implementation includes memory management to maintain conversation context across sessions and separate processing steps to prevent LLM overload, resulting in a personalized and effective language learning experience.

Structured Workflow Orchestration for Large-Scale Code Operations with Claude

Shopify

Shopify's augmented engineering team developed ROAST, an open-source workflow orchestration tool designed to address challenges of maintaining developer productivity at massive scale (5,000+ repositories, 500,000+ PRs annually, millions of lines of code). The team recognized that while agentic AI tools like Claude Code excel at exploratory tasks, deterministic structured workflows are better suited for predictable, repeatable operations like test generation, coverage optimization, and code migrations. By interleaving Claude Code's non-deterministic agentic capabilities with ROAST's deterministic workflow orchestration, Shopify created a bidirectional system where ROAST can invoke Claude Code as a tool within workflows, and Claude Code can execute ROAST workflows for specific steps. The solution has rapidly gained adoption within Shopify, reaching 500 daily active users and 250,000 requests per second at peak, with developers praising the combination for minimizing instruction complexity at each workflow step and reducing entropy accumulation in multi-step processes.

Systematic AI Application Improvement Through Evaluation-Driven Development

Ragas, Various

This case study presents Ragas' comprehensive approach to improving AI applications through systematic evaluation practices, drawn from their experience working with various enterprises and early-stage startups. The problem addressed is the common challenge of AI engineers making improvements to LLM applications without clear measurement frameworks, leading to ineffective iteration cycles and poor user experiences. The solution involves a structured evaluation methodology encompassing dataset curation, human annotation, LLM-as-judge scaling, error analysis, experimentation, and continuous feedback loops. The results demonstrate that teams can move from subjective "vibe checks" to objective, data-driven improvements that systematically enhance AI application performance and user satisfaction.

Systematic Analysis of Prompt Templates in Production LLM Applications

Uber, Microsoft

The research analyzes real-world prompt templates from open-source LLM-powered applications to understand their structure, composition, and effectiveness. Through analysis of over 2,000 prompt templates from production applications like those from Uber and Microsoft, the study identifies key components, patterns, and best practices for template design. The findings reveal that well-structured templates with specific patterns can significantly improve LLMs' instruction-following abilities, potentially enabling weaker models to achieve performance comparable to more advanced ones.

Systematic Approach to Building Reliable LLM Data Processing Pipelines Through Iterative Development

DocETL

UC Berkeley researchers studied how organizations struggle with building reliable LLM pipelines for unstructured data processing, identifying two critical gaps: data understanding and intent specification. They developed DocETL, a research framework that helps users systematically iterate on LLM pipelines by first understanding failure modes in their data, then clarifying prompt specifications, and finally applying accuracy optimization strategies, moving beyond the common advice of simply "iterate on your prompts."

Systematic LLM Evaluation Framework for Content Generation

Canva

Canva developed a systematic framework for evaluating LLM outputs in their design transformation feature called Magic Switch. The framework focuses on establishing clear success criteria, codifying these into measurable metrics, and using both rule-based and LLM-based evaluators to assess content quality. They implemented a comprehensive evaluation system that measures information preservation, intent alignment, semantic order, tone appropriateness, and format consistency, while also incorporating regression testing principles to ensure prompt improvements don't negatively impact other metrics.

Test-Driven Vibe Development: Integrating Quality Engineering with AI Code Generation

Asos

ASOS, a major e-commerce retailer, developed Test-Driven Vibe Development (TDVD), a novel methodology that combines test-first quality engineering practices with LLM-driven code generation to address the quality and reliability challenges of "vibe coding." The company applied this approach to build an internal stock discrepancy reporting system, using AI agents to generate both tests and code in a structured workflow that prioritizes acceptance test-driven development (ATDD), behavior-driven development (BDD), and test-driven development (TDD). With a team of effectively 2.5 people working part-time, they delivered a full-stack MVP (backend API, Azure Functions, React frontend) in 4 weeksโ€”representing a 7-10x acceleration compared to traditional development estimatesโ€”while maintaining quality through continuous validation against predefined test requirements and catching hallucinations early in the development cycle.

Testing and Evaluation Strategies for AI-Powered Code Editor with Agentic Editing

Zed

Zed, an AI-enabled code editor built from scratch in Rust, implemented comprehensive testing and evaluation strategies to ensure reliable agentic editing capabilities. The company faced the challenge of maintaining their rigorous empirical testing approach while dealing with the non-deterministic nature of LLM outputs. They developed a multi-layered approach combining stochastic testing with deterministic unit tests, addressing issues like streaming edits, XML tag parsing, indentation handling, and escaping behaviors. Through statistical testing methods running hundreds of iterations and setting pass/fail thresholds, they successfully deployed reliable AI-powered code editing features that work effectively with frontier models like Claude 4.

Text-to-SQL AI Agent for Democratizing Data Access in Slack

Salesforce

Salesforce built Horizon Agent, an internal text-to-SQL Slack agent, to address a data access gap where engineers and data scientists spent dozens of hours weekly writing custom SQL queries for non-technical users. The solution combines Large Language Models with Retrieval-Augmented Generation (RAG) to allow users to ask natural language questions in Slack and receive SQL queries, answers, and explanations within seconds. After launching in Early Access in August 2024 and reaching General Availability in January 2025, the system freed technologists from routine query work and enabled non-technical users to self-serve data insights in minutes instead of waiting hours or days, transforming the role of technical staff from data gatekeepers to guides.

Text-to-SQL Solution for Data Democratization in Food Delivery Operations

Swiggy

Swiggy, a food delivery and quick commerce company, developed Hermes, a text-to-SQL solution that enables non-technical users to query company data using natural language through Slack. The problem addressed was the significant time and technical expertise required for teams to access specific business metrics, creating bottlenecks in decision-making. The solution evolved from a basic GPT-3.5 implementation (V1) to a sophisticated RAG-based architecture with GPT-4o (V2) that compartmentalizes business units into "charters" with dedicated metadata and knowledge bases. Results include hundreds of users across the organization answering several thousand queries with average turnaround times under 2 minutes, dramatically improving data accessibility for product managers, data scientists, and analysts while reducing dependency on technical resources.

Text-to-SQL System for Complex Healthcare Database Queries

MSD

MSD collaborated with AWS Generative Innovation Center to implement a text-to-SQL solution using Amazon Bedrock and Anthropic's Claude models to translate natural language queries into SQL for complex healthcare databases. The system addresses challenges like coded columns, non-intuitive naming, and complex medical code lists through custom lookup tools and prompt engineering, significantly reducing query time from hours to minutes while democratizing data access for non-technical staff.

Text-to-SQL System with Structured RAG and Comprehensive Evaluation

ICE / NYSE

ICE/NYSE developed a text-to-SQL application using structured RAG to enable business users to query financial data without needing SQL knowledge. The system leverages Databricks' Mosaic AI stack including Unity Catalog, Vector Search, Foundation Model APIs, and Model Serving. They implemented comprehensive evaluation methods using both syntactic and execution matching, achieving 77% syntactic accuracy and 96% execution match across approximately 50 queries. The system includes continuous improvement through feedback loops and few-shot learning from incorrect queries.

The Hidden Complexities of Building Production LLM Features: Lessons from Honeycomb's Query Assistant

Honeycomb

Honeycomb shares candid insights from building Query Assistant, their natural language to query interface, revealing the complex reality behind LLM-powered product development. Key challenges included managing context window limitations with large schemas, dealing with LLM latency (2-15+ seconds per query), navigating prompt engineering without established best practices, balancing correctness with usefulness, addressing prompt injection vulnerabilities, and handling legal/compliance requirements. The article emphasizes that successful LLM implementation requires treating models as feature engines rather than standalone products, and argues that early access programs often fail to reveal real-world implementation challenges.

Thinking Machines' Tinker: Low-Level Fine-Tuning API for Production LLM Training

Thinking Machines

Thinking Machines, a new AI company founded by former OpenAI researcher John Schulman, has developed Tinker, a low-level fine-tuning API designed to enable sophisticated post-training of language models without requiring teams to manage GPU infrastructure or distributed systems complexity. The product aims to abstract away infrastructure concerns while providing low-level primitives for expressing nearly all post-training algorithms, allowing researchers and companies to build custom models without developing their own training infrastructure. The company plans to release their own models and expand Tinker's capabilities to include multimodal functionality and larger-scale training jobs, while making the platform more accessible to non-experts through higher-level tooling.

Tool Masking for Enterprise Agentic AI Systems at Scale

Databook

Databook, which automates sales processes for large tech companies like Microsoft, Salesforce, and AWS, faced challenges running reliable agentic AI workflows at enterprise scale. The primary problem was that connecting services through Model Context Protocol (MCP) exposed entire APIs to LLMs, polluting execution with irrelevant data, increasing tokens and costs, and reducing reliability through "choice entropy." Their solution involved implementing "tool masks"โ€”a configuration layer between agents and tool handlers that filters and reshapes input/output schemas, customizes tool interfaces per agent context, and enables prompt engineering of tools themselves. This approach resulted in cleaner, faster, more reliable agents with reduced costs, better self-correction capabilities, and the ability to rapidly adapt to customer requirements without code deployments.

Training a 70B Japanese Large Language Model with Amazon SageMaker HyperPod

Institute of Science Tokyo

The Institute of Science Tokyo successfully developed Llama 3.3 Swallow, a 70-billion-parameter large language model with enhanced Japanese capabilities, using Amazon SageMaker HyperPod infrastructure. The project involved continual pre-training from Meta's Llama 3.3 70B model using 314 billion tokens of primarily Japanese training data over 16 days across 256 H100 GPUs. The resulting model demonstrates superior performance compared to GPT-4o-mini and other leading models on Japanese language benchmarks, showcasing effective distributed training techniques including 4D parallelism, asynchronous checkpointing, and comprehensive monitoring systems that enabled efficient large-scale model training in production.

Training and Deploying Advanced Hallucination Detection Models for LLM Evaluation

Patronus AI

Patronus AI addressed the critical challenge of LLM hallucination detection by developing Lynx, a state-of-the-art model trained on their HaluBench dataset. Using Databricks' Mosaic AI infrastructure and LLM Foundry tools, they fine-tuned Llama-3-70B-Instruct to create a model that outperformed both closed and open-source LLMs in hallucination detection tasks, achieving nearly 1% better accuracy than GPT-4 across various evaluation scenarios.

Training and Deploying AI Coding Agents at Scale with GPT-5 Codex

OpenAI

OpenAI's Bill and Brian discuss their work on GPT-5 Codex and Codex Max, AI coding agents designed for production use. The team focused on training models with specific "personalities" optimized for pair programming, including traits like communication, planning, and self-checking behaviors. They trained separate model lines: Codex models optimized specifically for their agent harness with strong opinions about tool use (particularly terminal tools), and mainline GPT-5 models that are more general and steerable across different tooling environments. The result is a coding agent that OpenAI employees trust for production work, with approximately 50% of OpenAI staff using it daily, and some engineers like Brian claiming they haven't written code by hand in months. The team emphasizes the shift toward shipping complete agents rather than just models, with abstractions moving upward to enable developers to build on top of pre-configured agentic systems.

Training and Deploying Compliant Multilingual Foundation Models

Dynamo

Dynamo, an AI company focused on secure and compliant AI solutions, developed an 8-billion parameter multilingual LLM using Databricks Mosaic AI Training platform. They successfully trained the model in just 10 days, achieving a 20% speedup in training compared to competitors. The model was designed to support enterprise-grade AI systems with built-in security guardrails, compliance checks, and multilingual capabilities for various industry applications.

Training and Deploying GPT-4.5: Scaling Challenges and System Design at the Frontier

OpenAI

OpenAI's development and training of GPT-4.5 represents a significant milestone in large-scale LLM deployment, featuring a two-year development cycle and unprecedented infrastructure scaling challenges. The team aimed to create a model 10x smarter than GPT-4, requiring intensive collaboration between ML and systems teams, sophisticated planning, and novel solutions to handle training across massive GPU clusters. The project succeeded in achieving its goals while revealing important insights about data efficiency, system design, and the relationship between model scale and intelligence.

Transforming a Voice Assistant from Scripted Commands to Generative AI Conversation at Scale

AWS (Alexa)

AWS (Alexa) faced the challenge of evolving their voice assistant from scripted, command-based interactions to natural, generative AI-powered conversations while serving over 600 million devices and maintaining complete backward compatibility with existing integrations. The team completely rearchitected Alexa using large language models (LLMs) to create Alexa Plus, which supports conversational interactions, complex multi-step planning, and real-world action execution. Through extensive experimentation with prompt engineering, multi-model architectures, speculative execution, prompt caching, API refactoring, and fine-tuning, they achieved the necessary balance between accuracy, latency (sub-2-second responses), determinism, and model flexibility required for a production voice assistant serving hundreds of millions of users daily.

Transforming HR Operations with AI-Powered Solutions at Scale

Nubank

Nubank, a rapidly growing fintech company with over 8,000 employees across multiple countries, faced challenges in managing HR operations at scale while maintaining employee experience quality. The company deployed multiple AI and LLM-powered solutions to address these challenges: AskNu, a Slack-based AI assistant for instant access to internal information; generative AI for analyzing thousands of open-ended employee feedback comments from engagement surveys; time-series forecasting models for predicting employee turnover; machine learning models for promotion budget planning; and AI quality scoring for optimizing their internal knowledge base (WikiPeople). These initiatives resulted in measurable improvements including 14 percentage point increase in turnover prediction accuracy, faster insights from employee feedback, more accurate promotion forecasting, and enhanced knowledge accessibility across the organization.

Tuning RAG Search for Production Customer Support Chatbot

Elastic

Elastic's Field Engineering team developed and improved a customer support chatbot using RAG and LLMs. They faced challenges with search relevance, particularly around CVE and version-specific queries, and implemented solutions including hybrid search strategies, AI-generated summaries, and query optimization techniques. Their improvements resulted in a 78% increase in search relevance for top-3 results and generated over 300,000 AI summaries for future applications.

UI/UX Design Considerations for Production GenAI Chatbots

Elastic

Elastic's Field Engineering team developed a customer support chatbot, focusing on crucial UI/UX design considerations for production deployment. The case study details how they tackled challenges including streaming response handling, timeout management, context awareness, and user engagement through carefully designed animations. The team created a custom chat interface using their EUI component library, implementing innovative solutions for handling long-running LLM requests and managing multiple types of contextual information in a user-friendly way.

Usability Challenges in Commercial AI Agent Systems: A Study of Industry Aspirations vs. User Realities

Carnegie Mellon

This research study addresses the gap between how AI agents are marketed by the technology industry and how end-users actually experience them in practice. Researchers from Carnegie Mellon conducted a systematic review of 102 commercial AI agent products to understand industry positioning, identifying three core use case categories: orchestration (automating GUI tasks), creation (generating structured documents), and insight (providing analysis and recommendations). They then conducted a usability study with 31 participants attempting representative tasks using popular commercial agents (Operator and Manus), revealing five critical usability barriers: misalignment between agent capabilities and user mental models, premature trust assumptions, inflexible collaboration styles, overwhelming communication overhead, and lack of meta-cognitive abilities. While users generally succeeded at assigned tasks and were impressed with the technology, these barriers significantly impacted the user experience and highlighted the disconnect between marketed capabilities and practical usability.

Using Evaluation Systems and Inference-Time Scaling for Beautiful, Scannable QR Code Generation

Modal

Modal's engineering team tackled the challenge of generating aesthetically pleasing QR codes that consistently scan by implementing comprehensive evaluation systems and inference-time compute scaling. The team developed automated evaluation pipelines that measured both scan rate and aesthetic quality, using human judgment alignment to validate their metrics. They applied inference-time compute scaling by generating multiple QR codes in parallel and selecting the best candidates, achieving a 95% scan rate service-level objective while maintaining aesthetic quality and returning results in under 20 seconds.

Using GenAI to Automatically Fix Java Resource Leaks

Uber

Uber developed FixrLeak, a framework combining generative AI and Abstract Syntax Tree (AST) analysis to automatically detect and fix resource leaks in Java code. The system processes resource leaks identified by SonarQube, analyzes code safety through AST, and uses GPT-4 to generate appropriate fixes. When tested on 124 resource leaks in Uber's codebase, FixrLeak successfully automated fixes for 93 out of 102 eligible cases, significantly reducing manual intervention while maintaining code quality.

Using LLMs to Combat Health Insurance Claim Denials

Fight Health Insurance

Fight Health Insurance is an open-source project that uses fine-tuned large language models to help people appeal denied health insurance claims in the United States. The system processes denial letters, extracts relevant information, and generates appeal letters based on training data from independent medical review boards. The project addresses the widespread problem of insurance claim denials by automating the complex and time-consuming process of crafting effective appeals, making it accessible to individuals who lack the resources or knowledge to navigate the appeals process themselves. The tool is available both as an open-source Python package and as a free hosted service, though the sustainability model is still being developed.

Using LLMs to Enhance Search Discovery and Recommendations

Instacart

Instacart integrated LLMs into their search stack to enhance product discovery and user engagement. They developed two content generation techniques: a basic approach using LLM prompting and an advanced approach incorporating domain-specific knowledge from query understanding models and historical data. The system generates complementary and substitute product recommendations, with content generated offline and served through a sophisticated pipeline. The implementation resulted in significant improvements in user engagement and revenue, while addressing challenges in content quality, ranking, and evaluation.

Using LLMs to Scale Insurance Operations at a Small Company

Anzen

Anzen, a small insurance company with under 20 people, leveraged LLMs to compete with larger insurers by automating their underwriting process. They implemented a document classification system using BERT and AWS Textract for information extraction, achieving 95% accuracy in document classification. They also developed a compliance document review system using sentence embeddings and question-answering models to provide immediate feedback on legal documents like offer letters.

Using RAG to Improve Industry Classification Accuracy

Ramp

Ramp tackled the challenge of inconsistent industry classification by developing an in-house Retrieval-Augmented Generation (RAG) system to migrate from a homegrown taxonomy to standardized NAICS codes. The solution combines embedding-based retrieval with a two-stage LLM classification process, resulting in improved accuracy, better data quality, and more precise customer understanding across teams. The system includes comprehensive logging and monitoring capabilities, allowing for quick iterations and performance improvements.

Using Token Log-Probabilities to Detect and Filter LLM Hallucinations in Customer Support

Gusto

Gusto developed a method to improve the reliability of their LLM-based customer support system by using token log-probabilities as a confidence metric. The approach monitors sequence log-probability scores to identify and filter out potentially hallucinated or low-quality LLM responses. In their case study, they found a 69% relative difference in accuracy between high and low confidence responses, with the highest confidence responses achieving 76% accuracy compared to 45% for the lowest confidence responses.

Voice AI Agent Development and Production Challenges

Various (Canonical, Prosus, DeepMind)

Panel discussion with experts from various companies exploring the challenges and solutions in deploying voice AI agents in production. The discussion covers key aspects of voice AI development including real-time response handling, emotional intelligence, cultural adaptation, and user retention. Experts shared experiences from e-commerce, healthcare, and tech sectors, highlighting the importance of proper testing, prompt engineering, and understanding user interaction patterns for successful voice AI deployments.