ZenML

LLMOps Tag: prompt_engineering

1194 tools with this tag

← Back to LLMOps Database

Common industries

View all industries →

A Practical Blueprint for Evaluating Conversational AI at Scale

Dropbox

Dropbox shares their comprehensive approach to building and evaluating Dropbox Dash, their conversational AI product. The company faced challenges with ad-hoc testing leading to unpredictable regressions where changes to any part of their LLM pipeline—intent classification, retrieval, ranking, prompt construction, or inference—could cause previously correct answers to fail. They developed a systematic evaluation-first methodology treating every experimental change like production code, requiring rigorous testing before merging. Their solution involved curating diverse datasets (both public and internal), defining actionable metrics using LLM-as-judge approaches that outperformed traditional metrics like BLEU and ROUGE, implementing the Braintrust evaluation platform, and automating evaluation throughout the development-to-production pipeline. This resulted in a robust system with layered gates catching regressions early, continuous live-traffic scoring for production monitoring, and a feedback loop for continuous improvement that significantly improved reliability and deployment safety.

Accelerating Drug Development with AI-Powered Clinical Trial Transformation

Novartis

Novartis partnered with AWS Professional Services and Accenture to modernize their drug development infrastructure and integrate AI across clinical trials with the ambitious goal of reducing trial development cycles by at least six months. The initiative involved building a next-generation GXP-compliant data platform on AWS that consolidates fragmented data from multiple domains, implements data mesh architecture with self-service capabilities, and enables AI use cases including protocol generation and an intelligent decision system (digital twin). Early results from the patient safety domain showed 72% query speed improvements, 60% storage cost reduction, and 160+ hours of manual work eliminated. The protocol generation use case achieved 83-87% acceleration in producing compliant protocols, demonstrating significant progress toward their goal of bringing life-saving medicines to patients faster.

Accelerating Game Asset Creation with Fine-Tuned Diffusion Models

Rovio

Rovio, the Finnish gaming company behind Angry Birds, faced challenges in meeting the high demand for game art assets across multiple games and seasonal events, with artists spending significant time on repetitive tasks. The company developed "Beacon Picasso," a suite of generative AI tools powered by fine-tuned diffusion models running on AWS infrastructure (SageMaker, Bedrock, EC2 with GPUs). By training custom models on proprietary Angry Birds art data and building multiple user interfaces tailored to different user needs—from a simple Slackbot to advanced cloud-based workflows—Rovio achieved an 80% reduction in production time for specific use cases like season pass backgrounds, while maintaining brand quality standards and keeping artists in creative control. The solution enabled artists to focus on high-value creative work while AI handled repetitive variations, ultimately doubling content production capacity.

Accelerating SAP S/4HANA Migration and Custom Code Documentation with Generative AI

Axfood / Harman

Two enterprise customers, Axfood (a Swedish grocery retailer) and Harman International (an audio technology company), shared their approaches to using AI and AWS services in conjunction with their SAP environments. Axfood leveraged traditional machine learning for over 100 production forecasting models to optimize inventory, assortment planning, and e-commerce personalization, while also experimenting with generative AI for design tools and employee productivity. Harman International faced a critical challenge during their S/4HANA migration: documenting 30,000 custom ABAP objects that had accumulated over 25 years with poor documentation. Manual documentation by 12 consultants was projected to take 15 months at high cost with inconsistent results. By adopting AWS Bedrock and Amazon Q Developer with Anthropic Claude models, Harman reduced the timeline from 15 months to 2 months, improved speed by 6-7x, cut costs by over 70%, and achieved structured, consistent documentation that was understandable by both business and technical stakeholders.

Advanced Fine-Tuning Techniques for Multi-Agent Orchestration at Scale

Amazon

Amazon teams faced challenges in deploying high-stakes LLM applications across healthcare, engineering, and e-commerce domains where basic prompt engineering and RAG approaches proved insufficient. Through systematic application of advanced fine-tuning techniques including Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and cutting-edge reasoning optimizations like Group-based Reinforcement Learning from Policy Optimization (GRPO) and Direct Advantage Policy Optimization (DAPO), three Amazon business units achieved production-grade results: Amazon Pharmacy reduced dangerous medication errors by 33%, Amazon Global Engineering Services achieved 80% human effort reduction in inspection reviews, and Amazon A+ Content improved quality assessment accuracy from 77% to 96%. These outcomes demonstrate that approximately one in four high-stakes enterprise applications require advanced fine-tuning beyond standard techniques to achieve necessary performance levels in production environments.

Advanced Prompt Engineering Techniques for Production LLM Applications

Instacart

Instacart shares their experience implementing various prompt engineering techniques to improve LLM performance in production applications. The article details both traditional and novel approaches including Chain of Thought, ReAct, Room for Thought, Monte Carlo brainstorming, Self Correction, Classifying with logit bias, and Puppetry. These techniques were developed and tested while building internal productivity tools like Ava and Ask Instacart, demonstrating practical ways to enhance LLM reliability and output quality in production environments.

Advanced RAG Implementation for AI Assistant Response Accuracy

Nippon India Mutual Fund

Nippon India Mutual Fund faced challenges with their AI assistant's accuracy when handling large volumes of documents, experiencing issues with hallucination and poor response quality in their naive RAG implementation. They implemented advanced RAG methods using Amazon Bedrock Knowledge Bases, including semantic chunking, query reformulation, multi-query RAG, and results reranking to improve retrieval accuracy. The solution resulted in over 95% accuracy improvement, 90-95% reduction in hallucinations, and reduced report generation time from 2 days to approximately 10 minutes.

Advancing Patient Experience and Business Operations Analytics with Generative AI in Healthcare

Huron

Huron Consulting Group implemented generative AI solutions to transform healthcare analytics across patient experience and business operations. The consulting firm faced challenges with analyzing unstructured data from patient rounding sessions and revenue cycle management notes, which previously required manual review and resulted in delayed interventions due to the 3-4 month lag in traditional HCAHPS survey feedback. Using AWS services including Amazon Bedrock with the Nova LLM model, Redshift, and S3, Huron built sentiment analysis capabilities that automatically process survey responses, staff interactions, and financial operation notes. The solution achieved 90% accuracy in sentiment classification (up from 75% initially) and now processes over 10,000 notes per week automatically, enabling real-time identification of patient dissatisfaction, revenue opportunities, and staff coaching needs that directly impact hospital funding and operational efficiency.

Agent Registry and Dynamic Prompt Management for AI Feature Development

Gitlab

Gitlab faced challenges with delivering prompt improvements for their AI-powered issue description generation feature, particularly for self-managed customers who don't update frequently. They developed an Agent Registry system within their AI Gateway that abstracts provider models, prompts, and parameters, allowing for rapid prompt updates and model switching without requiring monolith changes or new releases. This system enables faster iteration on AI features and seamless provider switching while maintaining a clean separation of concerns.

Agent Testing and Evaluation Using Autonomous Vehicle Simulation Principles

Coval

Coval addresses the challenge of testing and evaluating autonomous AI agents by applying lessons learned from self-driving car testing. The company proposes moving away from static, manual testing towards probabilistic evaluation with dynamic scenarios, drawing parallels between autonomous vehicles and AI agents in terms of system architecture, error handling, and reliability requirements. Their solution enables systematic testing of agents through simulation at different layers, measuring performance against human benchmarks, and implementing robust fallback mechanisms.

Agent-Based AI Assistants for Enterprise and E-commerce Applications

Prosus

Prosus developed two major AI agent applications: Toan, an internal enterprise AI assistant used by 15,000+ employees across 24 companies, and OLX Magic, an e-commerce assistant that enhances product discovery. Toan achieved significant reduction in hallucinations (from 10% to 1%) through agent-based architecture, while saving users approximately 50 minutes per day. OLX Magic transformed the traditional e-commerce experience by incorporating generative AI features for smarter product search and comparison.

Agent-Based Workflow Automation in Spreadsheets for Non-Technical Users

Otto

Otto, founded by Suli Omar, addresses the challenge of making AI agents accessible to non-technical users by embedding agent workflows directly into spreadsheet interfaces. The company transforms unstructured data processing tasks into spreadsheet-based workflows where each cell acts as an autonomous agent capable of executing tasks, waiting for dependencies, and outputting structured results. By leveraging the familiar spreadsheet UX instead of traditional chatbot interfaces, Otto enables finance teams, accountants, and other business users to harness agent capabilities without requiring technical expertise. The solution involves sophisticated model selection across three tiers (workhorse, middle-tier, and heavy reasoning models) to optimize cost and performance, continuous evaluation through customer usage patterns, and iterative model testing to maintain service quality as new LLM capabilities emerge.

Agent-First AI Development Platform with Multi-Surface Orchestration

Google Deepmind

Google DeepMind launched Anti-gravity, an agent-first AI development platform designed to handle increasingly complex, long-running software development tasks powered by Gemini 3 Pro. The platform addresses the challenge of managing AI agents operating across multiple surfaces (editor, browser, and agent manager) by introducing "artifacts" - dynamic representations that help organize agent outputs and enable asynchronous feedback. The solution emerged from close collaboration between product and research teams at DeepMind, creating a feedback loop where internal dogfooding identified model gaps and drove improvements. Initial launch experienced capacity constraints due to high demand, but users who accessed the product reported significant workflow improvements from the multi-surface agent orchestration approach.

Agentic AI Architecture for Investment Management Platform

Blackrock

BlackRock implemented Aladdin Copilot, an AI-powered assistant embedded across their proprietary investment management platform that serves over 11 trillion in assets under management. The system uses a supervised agentic architecture built on LangChain and LangGraph, with GPT-4 function calling for orchestration, to help users navigate complex financial workflows and democratize access to investment insights. The solution addresses the challenge of making hundreds of domain-specific APIs accessible through natural language queries while maintaining strict guardrails for responsible AI use in financial services, resulting in increased productivity and more intuitive user experiences across their global client base.

Agentic AI Architecture for Meeting Intelligence and Productivity Automation

Zoom

Zoom developed AI Companion 3.0, an agentic AI system that transforms meeting conversations into actionable outcomes through automated planning, reasoning, and execution. The system addresses the challenge of turning hours of meeting content across distributed teams into coordinated action by implementing a federated AI approach combining small language models (SLMs) with large language models (LLMs), deployed on AWS infrastructure including Bedrock and OpenSearch. The solution enables users to automatically generate meeting summaries, perform cross-meeting analysis, schedule meetings with intelligent calendar management, and prepare meeting agendas—reducing what typically takes days of administrative work to minutes while maintaining low latency and cost-effectiveness at scale.

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

Agentic AI for Automated Absence Reporting and Shift Management at Airport Operations

Manchester Airports Group

Manchester Airports Group (MAG) implemented an agentic AI solution to automate unplanned absence reporting and shift management across their three UK airports handling over 1,000 flights daily. The problem involved complex, non-deterministic workflows requiring coordination across multiple systems, with different processes at each airport and high operational costs from overtime payments when staff couldn't make shifts. MAG built a multi-agent system using Amazon Bedrock Agent Core with both text-to-text and speech-to-speech interfaces, allowing employees to report absences conversationally while the system automatically authenticated users, classified absence types, updated HR and rostering systems, and notified relevant managers. The solution achieved 99% consistency in absence reporting (standardizing previously variable processes) and reduced recording time by 90%, with measurable cost reductions in overtime payments and third-party service fees.

Agentic AI for Cloud Migration and Application Modernization at Scale

Commonwealth Bank of Australia

Commonwealth Bank of Australia (CBA) partnered with AWS ProServe to modernize legacy Windows 2012 applications and migrate them to cloud at scale. Facing challenges with time-consuming manual processes, missing documentation, and significant technical debt, CBA developed "Lumos," an internal multi-agent AI platform that orchestrates the entire modernization lifecycle—from application analysis and design through code transformation, testing, deployment, and operations. By integrating AI agents with deterministic engines and AWS services (Bedrock, ECS, OpenSearch, etc.), CBA increased their modernization velocity from 10 applications per year to 20-30 applications per quarter, while maintaining security, compliance, and quality standards through human-in-the-loop validation and multi-agent review processes.

Agentic AI for Legal Research: Building Deep Research in Westlaw and CoCounsel

Thomson Reuters

Thomson Reuters Labs developed Deep Research, an agentic AI system integrated into Westlaw Advantage and CoCounsel that conducts legal research with the sophistication of a practicing attorney. The system addresses the limitation of traditional RAG-based tools by autonomously planning multi-step research strategies, executing searches in parallel, selecting appropriate tools, adapting based on findings, and applying stopping criteria. Deep Research leverages specialized document-type agents, maintains memory across sessions, integrates Westlaw features as modular building blocks, and employs rigorous evaluation frameworks. The system reportedly takes about 10 minutes for comprehensive analyses and includes verification tools with inline citations, KeyCite flags, and highlighted excerpts to enable lawyers to quickly validate AI-generated insights.

Agentic AI Framework for Mainframe Modernization at Scale

Western Union / Unum

Western Union and Unum partnered with AWS and Accenture/Pega to modernize their mainframe-based legacy systems using AWS Transform, an agentic AI service designed for large-scale migration and modernization. Western Union aimed to modernize its 35-year-old money order platform to support growth targets and improve back-office operations, while Unum sought to streamline Colonial Life claims processing. The solution leveraged composable agentic AI frameworks where multiple specialized agents (AWS Transform agents, Accenture industry knowledge agents, and Pega Blueprint agents) worked together through orchestration layers. Results included converting 2.5 million lines of COBOL code in approximately 1.5 hours, reducing project timelines from 3+ months to 6 weeks for Western Union, and achieving a complete COBOL-to-cloud migration with testable applications in 3 months for Unum (compared to previous 7-year, $25 million estimates), while eliminating 7,000 annual manual hours in claims management.

Agentic AI Manufacturing Reasoner for Automated Root Cause Analysis

Apollo Tyres

Apollo Tyres developed a Manufacturing Reasoner powered by Amazon Bedrock Agents to automate root cause analysis for their tire curing processes. The solution replaced manual analysis that took 7 hours per issue with an AI-powered system that delivers insights in under 10 minutes, achieving an 88% reduction in manual effort. The multi-agent system analyzes real-time IoT data from over 250 automated curing presses to identify bottlenecks across 25+ subelements, enabling data-driven decision-making and targeting annual savings of approximately 15 million Indian rupees in their passenger car radial division.

Agentic AI Platform for Clinical Development and Commercial Operations in Pharmaceutical Drug Development

AstraZeneca

AstraZeneca partnered with AWS to deploy agentic AI systems across their clinical development and commercial operations to accelerate their goal of delivering 20 new medicines by 2030. The company built two major production systems: a Development Assistant serving over 1,000 users across 21 countries that integrates 16 data products with 9 agents to enable natural language queries across clinical trials, regulatory submissions, patient safety, and quality domains; and an AZ Brain commercial platform that uses 500+ AI models and agents to provide precision insights for patient identification, HCP engagement, and content generation. The implementation reduced time-to-market for various workflows from months to weeks, with field teams using the commercial assistant generating 2x more prescriptions, and reimbursement dossier authoring timelines dramatically shortened through automated agent workflows.

Agentic AI Search with Custom Evaluation Framework for Church Management

Pushpay

Pushpay, a digital giving and engagement platform for churches and faith-based organizations, developed an agentic AI search feature to help ministry leaders query community data using natural language. The initial solution achieved only 60-70% accuracy and faced challenges in systematic evaluation and improvement. To address these limitations, Pushpay built a comprehensive generative AI evaluation framework on Amazon Bedrock, incorporating a curated golden dataset of over 300 queries, an LLM-as-judge evaluator, domain-based categorization, and performance dashboards. This framework enabled rapid iteration, strategic domain-level feature rollout, and implementation of dynamic prompt construction with semantic search. The solution ultimately achieved 95% accuracy in high-priority domains, reduced time-to-insight from 120 seconds to under 4 seconds, and provided the confidence needed for production deployment.

Agentic AI System for Construction Industry Tender Management and Quote Generation

Tendos AI

Tendos AI built an agentic AI platform to automate the tendering and quoting process for manufacturers in the construction industry. The system addresses the massive inefficiency in back-office workflows where manufacturers receive customer requests via email with attachments, manually extract information, match products, and generate quotes. Their multi-agent LLM system automatically categorizes incoming requests, extracts entities from documents up to thousands of pages, matches products from complex catalogs using semantic understanding, and generates detailed quotes for human review. Starting with a narrow focus on radiators with a single design partner, they iteratively expanded to support full workflows across multiple product categories, employing sophisticated agentic architectures with planning patterns, review agents, and extensive evaluation frameworks at each pipeline step.

Agentic AI System for Document Summarization and Analysis

Moveworks

Moveworks developed "Brief Me," an AI-powered productivity tool that enables employees to upload documents (PDF, Word, PPT) and interact with them conversationally through their Copilot assistant. The system addresses the time-consuming challenge of manually processing lengthy documents for tasks like summarization, Q&A, comparisons, and insight extraction. By implementing a sophisticated two-stage agentic architecture with online content ingestion and generation capabilities, including hybrid search with custom-trained embeddings, multi-turn conversation support, operation planning, and a novel map-reduce approach for long context handling, the system achieves high accuracy metrics (97.24% correct actions, 89.21% groundedness, 97.98% completeness) with P90 latency under 10 seconds for ingestion, significantly reducing the hours typically required for document analysis tasks.

Agentic AI Systems for Drug Discovery and Business Intelligence

Loka

Loka, an AWS partner specializing in generative AI solutions, and Domo, a business intelligence platform, demonstrate production implementations of agentic AI systems across multiple industries. Loka showcases their drug discovery assistant (ADA) that integrates multiple AI models and databases to accelerate pharmaceutical research workflows, while Domo presents agentic solutions for call center optimization and financial analysis. Both companies emphasize the importance of systematic approaches to AI implementation, moving beyond simple chatbots to multi-agent systems that can take autonomous actions while maintaining human oversight through human-in-the-loop architectures.

Agentic AI Systems for Legal, Tax, and Compliance Workflows

Thomson Reuters

Thomson Reuters evolved their AI assistant strategy from helpfulness-focused tools to productive agentic systems that make judgments and produce output in high-stakes legal, tax, and compliance environments. They developed a framework treating agency as adjustable dials (autonomy, context, memory, coordination) rather than binary states, enabling them to decompose legacy applications into tools that AI agents can leverage. Their solutions include end-to-end tax return generation from source documents and comprehensive legal research systems that utilize their 1.5+ terabytes of proprietary content, with rigorous evaluation processes to handle the inherent variability in expert human judgment.

Agentic Data Analyst for Enterprise Self-Service Analytics

Ramp

Ramp faced a data bottleneck where data questions required hours of turnaround time through a single on-call analyst, causing decision delays and discouraging users from asking questions. To address this, they built Ramp Research, an AI agent deployed in Slack that answers data questions in minutes using an agentic architecture with access to dbt, Looker, and Snowflake metadata. Since launching in early August 2025, the system has answered over 1,800 questions across 1,200 conversations with 300 users, representing a 10-20x increase in data question volume compared to the traditional help channel, enabling faster decision-making and democratizing data access across the organization.

Agentic News Analysis Platform for Digital Asset Market Making

FSI

Digital asset market makers face the challenge of rapidly analyzing news events and social media posts to adjust trading strategies within seconds to avoid adverse selection and inventory risk. Traditional dictionary-based and statistical machine learning approaches proved too slow or required extensive labeled data. The solution involved building an agentic LLM-based platform on AWS that processes streaming news in near real-time, using fine-tuned embeddings for deduplication, reasoning models for sentiment analysis and impact assessment, and optimized inference infrastructure. Through progressive optimization from SageMaker JumpStart to VLLM to SGLNG, the team achieved 180 output tokens per second, enabling end-to-end latency under 10 seconds and doubling news processing capacity compared to initial deployment.

Agentic Platform Engineering Hub for Cloud Operations Automation

Thomson Reuters

Thomson Reuters' Platform Engineering team transformed their manual, labor-intensive operational processes into an automated agentic system to address challenges in providing self-service cloud infrastructure and enablement services at scale. Using Amazon Bedrock AgentCore as the foundational orchestration layer, they built "Aether," a custom multi-agent system featuring specialized agents for cloud account provisioning, database patching, network configuration, and architecture review, coordinated through a central orchestrator agent. The solution delivered a 15-fold productivity gain, achieved 70% automation rate at launch, and freed engineering teams from repetitive tasks to focus on higher-value innovation work while maintaining security and compliance standards through human-in-the-loop validation.

Agentic RAG Implementation for Retail Personalization and Customer Support

MongoDB

MongoDB and Dataworkz partnered to implement an agentic RAG (Retrieval Augmented Generation) solution for retail and e-commerce applications. The solution combines MongoDB Atlas's vector search capabilities with Dataworkz's RAG builder to create a scalable system that integrates operational data with unstructured information. This enables personalized customer experiences through intelligent chatbots, dynamic product recommendations, and enhanced search functionality, while maintaining context-awareness and real-time data access.

Agentic Search for Multi-Source Legal Research Intelligence

Harvey

Harvey, a legal AI platform, faced the challenge of enabling complex, multi-source legal research that mirrors how lawyers actually work—iteratively searching across case law, statutes, internal documents, and other sources. Traditional one-shot retrieval systems couldn't handle queries requiring reasoning about what information to gather, where to find it, and when sufficient context was obtained. Harvey implemented an agentic search system based on the ReAct paradigm that dynamically selects knowledge sources, performs iterative retrieval, evaluates completeness, and synthesizes citation-backed responses. Through a privacy-preserving evaluation process involving legal experts creating synthetic queries and systematic offline testing, they improved tool selection precision from near zero to 0.8-0.9 and enabled complex queries to scale from single tool calls to 3-10 retrieval operations as needed, raising baseline query quality across their Assistant product and powering their Deep Research feature.

Agentic Security Principles for AI-Powered Development Tools

Github

GitHub outlines the security principles and threat model they developed for their hosted agentic AI products, particularly GitHub Copilot coding agent. The company addresses three primary security concerns: data exfiltration through internet-connected agents, impersonation and action attribution, and prompt injection attacks. Their solution involves implementing six core security rules: ensuring all context is visible to users, firewalling agent network access, limiting access to sensitive information, preventing irreversible state changes without human approval, consistently attributing actions to both initiator and agent, and only gathering context from authorized users. These principles aim to balance the enhanced functionality of agentic AI with the increased security risks that come with more autonomous systems.

Agentic Workflow Automation for Financial Operations

Ramp

Ramp, a finance automation platform serving over 50,000 customers, built a comprehensive suite of AI agents to automate manual financial workflows including expense policy enforcement, accounting classification, and invoice processing. The company evolved from building hundreds of isolated agents to consolidating around a single agent framework with thousands of skills, unified through a conversational interface called Omnichat. Their Policy Agent product, which uses LLMs to interpret and enforce expense policies written in natural language, demonstrates significant production deployment challenges and solutions including iterative development starting with simple use cases, extensive evaluation frameworks, human-in-the-loop labeling sessions, and careful context engineering. Additionally, Ramp built an internal coding agent called Ramp Inspect that now accounts for over 50% of production PRs merged weekly, illustrating how AI infrastructure investments enable broader organizational productivity gains.

AI Agent Development and Evaluation Platform for Insurance Underwriting

Snorkel

Snorkel developed a comprehensive benchmark dataset and evaluation framework for AI agents in commercial insurance underwriting, working with Chartered Property and Casualty Underwriters (CPCUs) to create realistic scenarios for small business insurance applications. The system leverages LangGraph and Model Context Protocol to build ReAct agents capable of multi-tool reasoning, database querying, and user interaction. Evaluation across multiple frontier models revealed significant challenges in tool use accuracy (36% error rate), hallucination issues where models introduced domain knowledge not present in guidelines, and substantial variance in performance across different underwriting tasks, with accuracy ranging from single digits to 80% depending on the model and task complexity.

AI Agent Evaluation Framework for Travel and Accommodation Platform

Booking.com

Booking.com developed a comprehensive evaluation framework for LLM-based agents that power their AI Trip Planner and other customer-facing features. The framework addresses the unique complexity of evaluating autonomous agents that can use external tools, reason through multi-step problems, and engage in multi-turn conversations. Their solution combines black box evaluation (focusing on task completion using judge LLMs) with glass box evaluation (examining internal decision-making, tool usage, and reasoning trajectories). The framework enables data-driven decisions about deploying agents versus simpler baselines by measuring performance gains against cost and latency tradeoffs, while also incorporating advanced metrics for consistency, reasoning quality, memory effectiveness, and trajectory optimality.

AI Agent for Automated Feature Flag Removal

Duolingo

Duolingo developed an AI agent to automate the removal of feature flags from their codebase, addressing the common engineering problem of technical debt accumulation from abandoned flags. The solution leverages Anthropic's Codex CLI running on Temporal workflow orchestration, allowing engineers to initiate automated code cleanup through an internal self-service UI. The agent clones repositories, uses AI to identify and remove obsolete feature flags across Python and Kotlin codebases, and automatically creates pull requests assigned to the requesting engineer. The tool was developed rapidly—moving from prototype to production in approximately one week—and serves as a foundation pattern for future autonomous coding agents at Duolingo.

AI Agent for Automated Merchant Classification and Transaction Matching

Ramp

Ramp built an AI agent using LLMs, embeddings, and RAG to automatically fix incorrect merchant classifications that previously required hours of manual intervention from customer support teams. The agent processes user requests to reclassify transactions in under 10 seconds, handling nearly 100% of requests compared to the previous 1.5-3% manual handling rate, while maintaining 99% accuracy according to LLM-based evaluation and reducing customer support costs from hundreds of dollars to cents per request.

AI Agent for Automated Root Cause Analysis in Production Systems

Cleric

Cleric developed an AI agent system to automatically diagnose and root cause production alerts by analyzing observability data, logs, and system metrics. The agent operates asynchronously, investigating alerts when they fire in systems like PagerDuty or Slack, planning and executing diagnostic tasks through API calls, and reasoning about findings to distill information into actionable root causes. The system faces significant challenges around ground truth validation, user feedback loops, and the need to minimize human intervention while maintaining high accuracy across diverse infrastructure environments.

AI Agent for Customer Service Order Management and Training

RHI Magnesita

RHI Magnesita, facing $3 million in annual losses due to human errors in order processing, implemented an AI agent to assist their Customer Service Representatives (CSRs). The solution, developed with IT-Tomatic, focuses on error reduction, standardization of processes, and enhanced training. The AI system serves as an operating system for CSRs, consolidating information from multiple sources and providing intelligent validation of orders. Early results show improved training efficiency, standardized processes, and the transformation of entry-level CSR positions into hybrid analyst roles.

AI Agent for Real Estate Legal Document Analysis and Lease Reporting

Orbital

Orbital Witness developed Orbital Copilot, an AI agent specifically designed for real estate legal work, to address the time-intensive nature of legal due diligence and lease reporting. The solution evolved from classical machine learning models through LLM-based approaches to a sophisticated agentic architecture that combines planning, memory, and tool use capabilities. The system analyzes hundreds of pages across multiple legal documents, answers complex queries by following information trails across documents, and provides transparent reasoning with source citations. Deployed with prestigious law firms including BCLP, Clifford Chance, and others, Orbital Copilot demonstrated up to 70% time savings on lease reporting tasks, translating to significant cost reductions for complex property analyses that typically require 2-10+ hours of lawyer time.

AI Agent for Self-Service Business Intelligence with Text-to-SQL

BGL

BGL, a provider of self-managed superannuation fund administration solutions serving over 12,700 businesses, faced challenges with data analysis where business users relied on data teams for queries, creating bottlenecks, and traditional text-to-SQL solutions produced inconsistent results. BGL built a production-ready AI agent using Claude Agent SDK hosted on Amazon Bedrock AgentCore that allows business users to retrieve analytics insights through natural language queries. The solution combines a strong data foundation using Amazon Athena and dbt for data transformation with an AI agent that interprets natural language, generates SQL queries, and processes results using code execution. The implementation uses modular knowledge architecture with CLAUDE.md for project context and SKILL.md files for product-specific domain expertise, while AgentCore provides stateful execution sessions with security isolation. This democratized data access for over 200 employees, enabling product managers, compliance teams, and customer success managers to self-serve analytics without SQL knowledge or data team dependencies.

AI Agent Mode and Model Context Protocol for Autonomous Code Generation

GitHub

GitHub demonstrates the evolution of their Copilot product from simple code completion to autonomous agent mode capable of building complete applications from specifications. The problem addressed is the inefficiency of manual coding and the limitations of simple prompt-response interactions with AI. The solution involves agent mode where developers can specify complete tasks in readme files and have Copilot autonomously implement them, iterating with the developer's permission for terminal access and database operations. Integration with Model Context Protocol allows agents to securely connect to external data sources like PostgreSQL databases and GitHub APIs. The demonstration shows an agent building a full-stack travel reservation application in approximately 8 minutes from a readme specification, then using MCP to pull database schemas for test generation, and finally autonomously creating branches and pull requests through GitHub's MCP server.

AI Agent Solutions for Data Warehouse Access and Security

Meta

Meta developed a multi-agent system to address the growing complexity of data warehouse access management at scale. The solution employs specialized AI agents that assist data users in obtaining access to warehouse data while helping data owners manage security and access requests. The system includes data-user agents with three sub-agents for suggesting alternatives, facilitating low-risk exploration, and crafting permission requests, alongside data-owner agents that handle security operations and access management. Key innovations include partial data preview capabilities with context-aware access control, query-level granular permissions, data-access budgeting, and rule-based risk management, all supported by comprehensive evaluation frameworks and feedback loops.

AI Agent System for Automated B2B Research and Sales Pipeline Generation

Unify

UniFi built an AI agent system that automates B2B research and sales pipeline generation by deploying research agents at scale to answer customer-defined questions about companies and prospects. The system evolved from initial React-based agents using GPT-4 and O1 models to a more sophisticated architecture incorporating browser automation, enhanced internet search capabilities, and cost-optimized model selection, ultimately processing 36+ billion tokens monthly while reducing per-query costs from 35 cents to 10 cents through strategic model swapping and architectural improvements.

AI Agent System for Automated Security Investigation and Alert Triage

Slack

Slack's Security Engineering team developed an AI agent system to automate the investigation of security alerts from their event ingestion pipeline that handles billions of events daily. The solution evolved from a single-prompt prototype to a multi-agent architecture with specialized personas (Director, domain Experts, and a Critic) that work together through structured output tasks to investigate security incidents. The system uses a "knowledge pyramid" approach where information flows upward from token-intensive data gathering to high-level decision making, allowing strategic use of different model tiers. Results include transformed on-call workflows from manual evidence gathering to supervision of agent teams, interactive verifiable reports, and emergent discovery capabilities where agents spontaneously identified security issues beyond the original alert scope, such as discovering credential exposures during unrelated investigations.

AI Agent System for Automated Travel Itinerary Generation

Aimpoint Digital

Aimpoint Digital developed an AI agent system to automate travel itinerary generation, addressing the time-consuming nature of trip planning. The solution combines multiple RAG frameworks with vector search for up-to-date information about places, restaurants, and events, using parallel processing and optimized prompts to generate personalized itineraries within seconds. The system employs Databricks' Vector Search and LLM capabilities, with careful attention to evaluation metrics and prompt optimization.

AI Agent-Driven Software Development Platform for Enterprise Engineering Teams

Factory

Factory is building a platform to transition from human-driven to agent-driven software development, targeting enterprise organizations with 5,000+ engineers. Their platform enables delegation of entire engineering tasks to AI agents (called "droids") that can go from project management tickets to mergeable pull requests. The system emphasizes three core principles: planning with subtask decomposition and model predictive control, decision-making with contextual reasoning, and environmental grounding through AI-computer interfaces that interact with existing development tools, observability systems, and knowledge bases.

AI Agent-Powered Compliance Review Automation for Financial Services

Stripe

Stripe developed an AI agent-based solution to address the growing complexity and resource intensity of compliance reviews in financial services, where enterprises spend over $206 billion annually on financial crime operations. The company implemented ReAct agents powered by Amazon Bedrock to automate the investigative and research portions of Enhanced Due Diligence (EDD) reviews while keeping human analysts in the decision-making loop. By decomposing complex compliance workflows into bite-sized tasks orchestrated through a directed acyclic graph (DAG), the agents perform autonomous investigations across multiple data sources and jurisdictions. The solution achieved a 96% helpfulness rating from reviewers and reduced average handling time by 26%, enabling compliance teams to scale without linearly increasing headcount while maintaining complete auditability for regulatory requirements.

AI Agents and Intelligent Observability for DevOps Modernization

HRS Group / Netflix / Harness

This panel discussion brings together engineering leaders from HRS Group, Netflix, and Harness to explore how AI is transforming DevOps and SRE practices. The panelists address the challenge of teams spending excessive time on reactive monitoring, alert triage, and incident response, often wading through thousands of logs and ambiguous signals. The solution involves integrating AI agents and generative models into CI/CD pipelines, observability workflows, and incident management to enable predictive analysis, intelligent rollouts, automated summarization, and faster root cause analysis. Results include dramatically reduced mean time to resolution (from hours to minutes), elimination of low-level toil, improved context-aware decision making, and the ability to move from reactive monitoring to proactive, machine-speed remediation while maintaining human accountability for critical business decisions.

AI Agents for Automated Product Quality Testing and Bug Detection

Coinbase

Coinbase developed an AI-powered QA agent (qa-ai-agent) to dramatically scale their product testing efforts and improve quality assurance. The system addresses the challenge of maintaining high product quality standards while reducing manual testing overhead and costs. The AI agent processes natural language testing requests, uses visual and textual data to execute tests, and leverages LLM reasoning to identify issues. Results showed the agent detected 300% more bugs than human testers in the same timeframe, achieved 75% accuracy (compared to 80% for human testers), enabled new test creation in 15 minutes versus hours, and reduced costs by 86% compared to traditional manual testing, with the goal of replacing 75% of manual testing with AI-driven automation.

AI Agents for Data Labeling and Infrastructure Maintenance at Scale

Plaid

Plaid, a financial data connectivity platform, developed two internal AI agents to address operational challenges at scale. The AI Annotator agent automates the labeling of financial transaction data for machine learning model training, achieving over 95% human alignment while dramatically reducing annotation costs and time. The Fix My Connection agent proactively detects and repairs bank integration issues, having enabled over 2 million successful logins and reduced average repair time by 90%. These agents represent Plaid's strategic use of LLMs to improve data quality, maintain reliability across thousands of financial institution connections, and enhance their core product experiences.

AI Agents for Interpretability Research: Experimenter Agents in Production

Goodfire

Goodfire, an AI interpretability research company, deployed AI agents extensively for conducting experiments in their research workflow over several months. They distinguish between "developer agents" (for software development) and "experimenter agents" (for research and discovery), identifying key architectural differences needed for the latter. Their solution, code-named Scribe, leverages Jupyter notebooks with interactive, stateful access via MCP (Model Context Protocol), enabling agents to iteratively run experiments across domains like genomics, vision transformers, and diffusion models. Results showed agents successfully discovering features in genomics models, performing circuit analysis, and executing complex interpretability experiments, though validation, context engineering, and preventing reward hacking remain significant challenges that require human oversight and critic systems.

AI Agents for Travel Booking and Customer Service Automation

TPConnects

TPConnects, a software solutions provider for airlines and travel sellers, transformed their legacy travel booking APIs and UI into a production-ready AI agent system built on Amazon Bedrock. The company implemented a supervised multi-agent orchestration architecture that handles the complete travel journey from shopping and booking to order management and customer servicing. Key challenges included managing latency with large API responses (2000+ flight offers), orchestrating multiple APIs in a pipeline, handling industry-specific IATA codes, and ensuring JSON formatting consistency. The solution uses Claude 3.5 Sonnet as the primary model, incorporates prompt engineering and knowledge bases for travel domain expertise, and extends beyond traditional chat to WhatsApp Business API integration for proactive disruption management and upselling. The system took 3-4 months to develop with AWS support and represents a shift from manual UI interactions to conversational AI-driven travel experiences.

AI Agents in Production: Multi-Enterprise Implementation Strategies

Canva / KPMG / Autodesk / Lightspeed

This comprehensive case study examines how multiple enterprises (Autodesk, KPMG, Canva, and Lightspeed) are deploying AI agents in production to transform their go-to-market operations. The companies faced challenges around scaling AI from proof-of-concept to production, managing agent quality and accuracy, and driving adoption across diverse teams. Using the Relevance AI platform, these organizations built multi-agent systems for use cases including personalized marketing automation, customer outreach, account research, data enrichment, and sales enablement. Results include significant time savings (tasks taking hours reduced to minutes), improved pipeline generation, increased engagement rates, faster customer onboarding, and the successful scaling of AI agents across multiple departments while maintaining data security and compliance standards.

AI Assistant for Financial Data Discovery and Business Intelligence

Amazon Finance

Amazon Finance developed an AI-powered assistant to address analysts' challenges with data discovery across vast, disparate financial datasets and systems. The solution combines Amazon Bedrock (using Anthropic's Claude 3 Sonnet) with Amazon Kendra Enterprise Edition to create a Retrieval Augmented Generation (RAG) system that enables natural language queries for finding financial data and documentation. The implementation achieved a 30% reduction in search time, 80% improvement in search result accuracy, and demonstrated 83% precision and 88% faithfulness in knowledge search tasks, while reducing information discovery time from 45-60 minutes to 5-10 minutes.

AI Assistant for Global Customer Service Automation

Klarna

Klarna implemented an OpenAI-powered AI assistant for customer service that successfully handled two-thirds of all customer service chats within its first month of global deployment. The system processes 2.3 million conversations, matches human agent satisfaction scores, reduces repeat inquiries by 25%, and cuts resolution time from 11 to 2 minutes, while operating in 23 markets with support for over 35 languages, projected to deliver $40 million in profit improvement for 2024.

AI Assistant Integration for Manufacturing Execution System (MES)

42Q

42Q, a cloud-based Manufacturing Execution System (MES) provider, implemented an intelligent chatbot named Arthur to address the complexity of their system and improve user experience. The solution uses RAG and AWS Bedrock to combine documentation, training videos, and live production data, enabling users to query system functionality and real-time manufacturing data in natural language. The implementation showed significant improvements in user response times and system understanding, while maintaining data security within AWS infrastructure.

AI Data Analyst with Multi-Stage LLM Architecture for Enterprise Data Discovery

Delivery Hero

The BADA team at Woowa Brothers (part of Delivery Hero) developed QueryAnswerBird (QAB), an LLM-based agentic system to improve employee data literacy across the organization. The problem addressed was that employees with varying levels of data expertise struggled to discover, understand, and utilize the company's vast internal data resources, including structured tables and unstructured log data. The solution involved building a multi-layered architecture with question understanding (Router Supervisor) and information acquisition stages, implementing various features including query/table explanation, syntax verification, table/column guidance, and log data utilization. Through two rounds of beta testing with data analysts, engineers, and product managers, the team iteratively refined the system to handle diverse question types beyond simple Text-to-SQL, ultimately creating a comprehensive data discovery platform that integrates with existing tools like Data Catalog and Log Checker to provide contextualized answers and improve organizational productivity.

AI Error Summarizer Implementation: A Tiger Team Approach

CircleCI

CircleCI's engineering team formed a tiger team to explore AI integration possibilities, ultimately developing an AI error summarizer feature. The team spent 6-7 weeks on discovery, including extensive stakeholder interviews and technical exploration, before implementing a relatively simple but effective LLM-based solution that summarizes build errors for users. The case demonstrates how companies can successfully approach AI integration through focused exploration and iterative development, emphasizing that valuable AI features don't necessarily require complex implementations.

AI Managed Services and Agent Operations at Enterprise Scale

PriceWaterhouseCooper

PriceWaterhouseCooper (PWC) addresses the challenge of deploying and maintaining AI systems in production through their managed services practice focused on data analytics and AI. The organization has developed frameworks for deploying AI agents in enterprise environments, particularly in healthcare and back-office operations, using their Agent OS framework built on Python. Their approach emphasizes process standardization, human-in-the-loop validation, continuous model tuning, and comprehensive measurement through evaluations to ensure sustainable AI operations at scale. Results include successful deployments in healthcare pre-authorization processes and the establishment of specialized AI managed services teams comprising MLOps engineers and data scientists who continuously optimize production models.

AI Sales Representatives for Inbound Lead Conversion

ShowMe

ShowMe builds AI sales representatives that function as digital teammates for companies selling primarily through inbound channels. The company was founded in April 2025 after the co-founders identified a critical problem at their previous company: website visitors weren't converting to customers unless engaged directly by human sales representatives, but scaling human engagement was too expensive for unqualified leads. ShowMe's solution involves multi-agent voice and video systems that can conduct sales calls, share screens, demo products, qualify leads, and orchestrate follow-up actions across multiple channels. The AI agents use sophisticated prompt engineering, RAG-based knowledge bases, and workflow orchestration to guide prospects through the sales funnel, ultimately creating qualified meetings or closing contracts directly while reducing the need for human sales intervention by approximately 70%.

AI SRE Agents for Production System Diagnostics

Cleric

Cleric is developing an AI Site Reliability Engineering (SRE) agent system that helps diagnose and troubleshoot production system issues. The system uses knowledge graphs to map relationships between system components, background scanning to maintain system awareness, and confidence scoring to minimize alert fatigue. The solution aims to reduce the burden on human engineers by efficiently narrowing down problem spaces and providing actionable insights, while maintaining strict security controls and read-only access to production systems.

AI SRE System with Continuous Learning for Production Issue Investigation

Cleric AI

Cleric AI developed an AI-powered SRE system that automatically investigates production issues using existing observability tools and infrastructure. They implemented continuous learning capabilities using LangSmith to compare different investigation strategies, track investigation paths, and aggregate performance metrics. The system learns from user feedback and generalizes successful investigation patterns across deployments while maintaining strict privacy controls and data anonymization.

AI Strategy and LLM Application Development in Swedish Public Sector

Swedish Tax Authority

The Swedish Tax Authority (Skatteverket) has been on a multi-decade digitalization journey, progressively incorporating AI and large language models into production systems to automate and enhance tax services. The organization has developed various NLP applications including text categorization, transcription, OCR pipelines, and question-answering systems using RAG architectures. They have tested both open-source models (Llama 3.1, Mixtral 7B, Cohere) and commercial solutions (GPT-3.5), finding that open-source models perform comparably for simpler queries while commercial models excel at complex questions. The Authority operates within a regulated environment requiring on-premise deployment for sensitive data, adopting Agile/SAFe methodologies and building reusable AI infrastructure components that can serve multiple business domains across different public sector silos.

AI-Assisted Activity Onboarding in Travel Marketplace

GetYourGuide

GetYourGuide faced challenges with their lengthy 16-step activity creation process, where suppliers spent up to an hour manually entering content that often had quality issues, leading to traveler confusion and lower conversion rates. They implemented a generative AI solution that allows activity providers to paste existing content and automatically generates descriptions and fills structured fields across 8 key onboarding steps. After an initial failed experiment due to UX confusion and measurement challenges, they iterated with improved UI/UX design and developed a novel permutation testing framework. The second rollout successfully increased activity completion rates, improved content quality, and reduced onboarding time to as little as 14 minutes, ultimately achieving positive impacts on both supplier efficiency and traveler engagement metrics.

AI-Assisted Database Debugging Platform at Scale

Databricks

Databricks built an agentic AI platform to help engineers debug thousands of OLTP database instances across hundreds of regions on AWS, Azure, and GCP. The platform addresses the problem of fragmented tooling and dispersed expertise by unifying metrics, logs, and operational workflows into a single intelligent interface with a chat assistant. The solution reduced debugging time by up to 90%, enabled new engineers to start investigations in under 5 minutes, and has achieved company-wide adoption, fundamentally changing how engineers interact with their infrastructure.

AI-Assisted Product Attribute Extraction for E-commerce Content Creation

Zalando

Zalando developed a Content Creation Copilot to automate product attribute extraction during the onboarding process, addressing data quality issues and time-to-market delays. The manual content enrichment process previously accounted for 25% of production timelines with error rates that needed improvement. By implementing an LLM-based solution using OpenAI's GPT models (initially GPT-4 Turbo, later GPT-4o) with custom prompt engineering and a translation layer for Zalando-specific attribute codes, the system now enriches approximately 50,000 attributes weekly with 75% accuracy. The solution integrates multiple AI services through an aggregator architecture, auto-suggests attributes in the content creation workflow, and allows copywriters to maintain final decision authority while significantly improving efficiency and data coverage.

AI-Assisted Root Cause Analysis System for Incident Response

Meta

Meta developed an AI-assisted root cause analysis system to streamline incident investigations in their large-scale systems. The system combines heuristic-based retrieval with LLM-based ranking to identify potential root causes of incidents. Using a fine-tuned Llama 2 model and a novel ranking approach, the system achieves 42% accuracy in identifying root causes for investigations at creation time in their web monorepo, significantly reducing the investigation time and helping responders make better decisions.

AI-Augmented Code Review System for Large-Scale Software Development

Uber

Uber developed uReview, an AI-powered code review platform to address the challenges of traditional peer reviews at scale, including reviewer overload from increasing code volume and difficulty identifying subtle bugs and security issues. The system uses a modular, multi-stage GenAI architecture with prompt-chaining to break down code review into four sub-tasks: comment generation, filtering, validation, and deduplication. Currently analyzing over 90% of Uber's ~65,000 weekly code diffs, uReview achieves a 75% usefulness rating from engineers and sees 65% of its comments addressed, demonstrating significant adoption and effectiveness in production.

AI-Augmented Cybersecurity Triage Using Graph RAG for Cloud Security Operations

Deloitte

Deloitte developed a Cybersecurity Intelligence Center to help SecOps engineers manage the overwhelming volume of security alerts generated by cloud security platforms like Wiz and CrowdStrike. Using AWS's open-source Graph RAG Toolkit, Deloitte built "AI for Triage," a human-in-the-loop system that combines long-term organizational memory (stored in hierarchical lexical graphs) with short-term operational data (document graphs) to generate AI-assisted triage records. The solution reduced 50,000 security issues across 7 AWS domains to approximately 1,300 actionable items, converting them into over 6,500 nodes and 19,000 relationships for contextual analysis. This approach enables SecOps teams to make informed remediation decisions based on organizational policies, historical experiences, and production system context, while maintaining human accountability and creating automation recipes rather than brittle code-based solutions.

AI-Driven Clinical Trial Transformation with Next-Generation Data Platform

Novartis

Novartis embarked on a comprehensive data and AI modernization journey to accelerate drug development by at least 6 months per clinical trial. The company partnered with AWS Professional Services and Accenture to build a next-generation, GXP-compliant data platform that integrates fragmented data across multiple domains (including patient safety, medical imaging, and regulatory data), enabling both operational AI use cases and ambitious moonshot projects like a digital twin for clinical trial simulation. The initial implementation with the patient safety domain achieved significant results: 16 data pipelines processing 17 terabytes of data, 72% faster query speeds, 60% storage cost reduction, and over 160 hours of manual work eliminated, while protocol generation use cases demonstrated 83-87% acceleration in generating compliance-acceptable protocols.

AI-Driven Documentation Generation for dbt Data Models

Loblaw Digital

Loblaw Digital addressed the challenge of maintaining comprehensive documentation for over 3,000 dbt data models across their analytics engineering infrastructure. Manual documentation proved labor-intensive and often led to incomplete or outdated documentation that confused business users. The team implemented an LLM-based solution using the open-source dbt-documentor tool integrated with Google Cloud's Vertex AI platform, which automatically generates descriptions for models and their columns by ingesting dbt's manifest.json files without accessing actual data. This automation significantly improved documentation coverage and productivity while maintaining data security, enabling analysts to better understand model purposes and dependencies through the dbt documentation website.

AI-Driven Incident Response and Automated Remediation for Digital Media Platform

iHeart

iHeart Media, serving 250 million monthly users across broadcast radio, digital streaming, and podcasting platforms, faced significant operational challenges with incident response requiring engineers to navigate multiple monitoring systems, VPNs, and dashboards during critical 3 AM outages. The company implemented a multi-agent AI system using AWS Bedrock Agent Core and the Strands AI framework to automate incident triage, root cause analysis, and remediation. The solution reduced triage response time dramatically (from minutes of manual investigation to 30-60 seconds), improved operational efficiency by eliminating repetitive manual tasks, and enabled knowledge preservation across incidents while maintaining 24/7 uptime requirements for their infrastructure handling 5-7 billion requests per month.

AI-Driven Media Analysis and Content Assembly Platform for Large-Scale Video Archives

Bloomberg Media

Bloomberg Media, facing challenges in analyzing and leveraging 13 petabytes of video content growing at 3,000 hours per day, developed a comprehensive AI-driven platform to analyze, search, and automatically create content from their massive media archive. The solution combines multiple analysis approaches including task-specific models, vision language models (VLMs), and multimodal embeddings, unified through a federated search architecture and knowledge graphs. The platform enables automated content assembly using AI agents to create platform-specific cuts from long-form interviews and documentaries, dramatically reducing time to market while maintaining editorial trust and accuracy. This "disposable AI strategy" emphasizes modularity, versioning, and the ability to swap models and embeddings without re-engineering entire workflows, allowing Bloomberg to adapt quickly to evolving AI capabilities while expanding reach across multiple distribution platforms.

AI-Driven Multi-Agent System for Dynamic Product Taxonomy Evolution

Shopify

Shopify faced the challenge of maintaining and evolving a product taxonomy with over 10,000 categories and 2,000+ attributes at scale, processing tens of millions of daily predictions. Traditional manual curation couldn't keep pace with emerging product types, required deep domain expertise across diverse verticals, and suffered from growing inconsistencies. Shopify developed an innovative multi-agent AI system that combines specialized agents for structural analysis, product-driven analysis, intelligent synthesis, and equivalence detection, augmented by automated quality assurance through AI judges. The system has significantly improved efficiency by analyzing hundreds of categories in parallel (versus a few per day manually), enhanced quality through multi-perspective analysis, and enabled proactive rather than reactive taxonomy improvements, with validation showing enhanced classification accuracy and improved merchant/customer experience.

AI-Driven Security Posture Management Platform

LinkedIn

LinkedIn developed the Security Posture Platform (SPP) to enhance their security infrastructure management, incorporating an AI-powered interface called SPP AI. The platform streamlines security data analysis and vulnerability management across their distributed systems. By leveraging large language models and a comprehensive knowledge graph, the system improved vulnerability response speed by 150% and increased digital infrastructure coverage by 155%. The solution combines natural language querying capabilities with sophisticated data integration and automated decision-making to provide real-time security insights.

AI-Driven Student Services and Prescriptive Pathways at UCLA Anderson School of Management

UCLA

UCLA Anderson School of Management partnered with Kindle to address the challenge of helping MBA students navigate their intensive two-year program more effectively. Students were overwhelmed with coursework, career decisions, club activities, and internship searches, receiving extensive information without clear guidance. The solution involved digitizing over 2 million paper records and building an AI-powered application that provides personalized, prescriptive roadmaps for students based on their career goals. The system integrates data from multiple sources including student records, career placement systems, clubs, and course catalogs to recommend specific courses, internships, clubs, and target companies. The project took approximately 8 months (December 2023 to August 2024) and demonstrates how educational institutions can leverage agentic AI frameworks to deliver better student experiences while maintaining data security and privacy standards.

AI-Powered .NET Application Modernization at Scale

Thomson Reuters

Thomson Reuters faced the challenge of modernizing over 400 legacy .NET Framework applications comprising more than 500 million lines of code, which were running on costly Windows servers and slowing down innovation. By adopting AWS Transform for .NET during its beta phase, the company leveraged agentic AI capabilities powered by Amazon Bedrock LLMs with deep .NET expertise to automate the analysis, dependency mapping, code transformation, and validation process. This approach accelerated their modernization from months of planning to weeks of execution, enabling them to transform over 1.5 million lines of code per month while running 10 parallel modernization projects. The solution not only promised substantial cost savings by migrating to Linux containers and Graviton instances but also freed developers from maintaining legacy systems to focus on delivering customer value.

AI-Powered Accessibility Automation for E-commerce Platform

Mercado Libre

Mercado Libre's accessibility team implemented multiple AI-driven initiatives to scale their support for hundreds of designers and developers working on accessibility improvements across the platform. The team deployed four main solutions: an A11Y assistant that provides real-time support in Slack channels using RAG-based LLMs consulting internal documentation; automated enrichment of accessibility audit tickets with contextual explanations and remediation guidance; a Figma handoff assistant that analyzes UI designs and recommends accessibility annotations; and an automated ticket review system integrating Jira and GitHub to assess fix quality. These initiatives aim to multiply the effectiveness of accessibility experts by automating routine tasks, providing immediate answers, and enabling teams to become more autonomous in addressing accessibility issues, while the core team focuses on strategic challenges.

AI-Powered Account Planning System for Sales Process Optimization

AWS

AWS developed Account Plan Pulse, a generative AI solution built on Amazon Bedrock, to address the increasing complexity and manual overhead in their sales account planning process. The system automates the evaluation of customer account plans across 10 business-critical categories, generates actionable insights, and provides structured summaries to improve collaboration. The implementation resulted in a 37% improvement in plan quality year-over-year and a 52% reduction in the time required to complete, review, and approve plans, while helping sales teams focus more on strategic customer engagements rather than manual review processes.

AI-Powered Accounting Automation Using Claude and Amazon Bedrock

FloQast

FloQast developed an AI-powered accounting transformation solution to automate complex transaction matching and document annotation workflows using Anthropic's Claude 3 on Amazon Bedrock. The system combines document processing capabilities like Amazon Textract with LLM-based automation through Amazon Bedrock Agents to streamline reconciliation processes and audit workflows. The solution achieved significant efficiency gains, including 38% reduction in reconciliation time and 23% decrease in audit process duration.

AI-Powered Ad Description Generation for Classifieds Platform

Leboncoin

Leboncoin, a French classifieds platform, addressed the "blank page syndrome" where sellers struggled to write compelling ad descriptions, leading to poorly described items and reduced engagement. They developed an AI-powered feature using Claude Haiku via AWS Bedrock that automatically generates ad descriptions based on photos, titles, and item details while maintaining human control for editing. The solution was refined through extensive user testing to match the platform's authentic, conversational tone, and early results show a 20% increase in both inquiries and completed transactions for ads using the AI-generated descriptions.

AI-Powered Artwork Quality Moderation and Streaming Quality Management at Scale

Amazon Prime Video

Amazon Prime Video faced challenges in manually reviewing artwork from content partners and monitoring streaming quality for millions of concurrent viewers across 240+ countries. To address these issues, they developed two AI-powered solutions: (1) an automated artwork quality moderation system using multimodal LLMs to detect defects like safe zone violations, mature content, and text legibility issues, reducing manual review by 88% and evaluation time from days to under an hour; and (2) an agentic AI system for detecting, localizing, and mitigating streaming quality issues in real-time without manual intervention. Both solutions leveraged Amazon Bedrock, Strands agents framework, and iterative evaluation loops to achieve high precision while operating at massive scale.

AI-Powered Automated GraphQL Schema Cleanup

Whatnot

Whatnot, a livestream shopping platform, faced significant technical debt in their GraphQL schema with over 2,600 unused fields accumulated from deprecated features and old endpoints. Manual cleanup was time-consuming and risky, requiring 1-2 hours per field and deep domain knowledge. The engineering team built an AI subagent integrated into a GitHub Action that automatically identifies unused fields through traffic analysis and generates pull requests to safely remove them. The agent follows the same process an engineer would—removing schema fields, resolvers, dead code, and updating tests—but operates autonomously in the background. Running daily at $1-3 per execution, the system has successfully removed 24 of approximately 200 unused root fields with minimal human intervention, requiring edits to only three PRs, transforming schema maintenance from a neglected one-time project into an ongoing automated process.

AI-Powered Automated Issue Resolution Achieving State-of-the-Art Performance on SWE-bench

Trae

Trae developed an AI engineering system that achieved 70.6% accuracy on the SWE-bench Verified benchmark, setting a new state-of-the-art record for automated software issue resolution. The solution combines multiple large language models (Claude 3.7, Gemini 2.5 Pro, and OpenAI o4-mini) in a sophisticated multi-stage pipeline featuring generation, filtering, and voting mechanisms. The system uses specialized agents including a Coder agent for patch generation, a Tester agent for regression testing, and a Selector agent that employs both syntax-based voting and multi-selection voting to identify the best solution from multiple candidate patches.

AI-Powered Autonomous Infrastructure Monitoring and Self-Healing System

Railway

This case study presents a proof-of-concept system for autonomous infrastructure monitoring and self-healing using AI coding agents. The presenter demonstrates a workflow that automatically detects issues in deployed services on Railway (memory leaks, slow database queries, high error rates), analyzes metrics and logs using LLMs to generate diagnostic plans, and then deploys OpenCode—an open-source AI coding agent—to automatically create pull requests with fixes. The system leverages durable workflows via Inngest for reliability, combines multiple data sources (CPU/memory metrics, HTTP metrics, logs), and uses LLMs to analyze infrastructure health and generate remediation plans. While presented as a demo/concept, the approach showcases how LLMs can move from alerting engineers to autonomously proposing code-level fixes for production issues.

AI-Powered Background Coding Agents for Large-Scale Software Maintenance

Spotify

Spotify faced the challenge of scaling complex code migrations and maintenance tasks across thousands of repositories, where their existing Fleet Management system handled simple transformations well but required specialized expertise for complex changes. They integrated AI coding agents into their Fleet Management platform, allowing engineers to define fleet-wide code changes using natural language prompts instead of writing complex AST manipulation scripts. Since February 2025, this approach has generated over 1,500 merged pull requests handling complex tasks like language modernization, breaking API changes, and UI component migrations, achieving 60-90% time savings compared to manual implementation while expanding to ad hoc background coding tasks accessible via Slack and GitHub.

AI-Powered Benefits Navigation System for SNAP Recipients

Propel

Propel developed and tested AI-powered tools to help SNAP recipients diagnose and resolve benefits interruptions, addressing the problem of "program churn" that affects about 200,000 of their 5 million monthly users. They implemented two approaches: a structured triage flow using AI code generation for California users, and a conversational AI chat assistant powered by Decagon for nationwide deployment. Both tests showed promising results including strong user uptake (53% usage rate), faster benefits restoration, and improved user experience with multilingual support, while reducing administrative burden on state agencies.

AI-Powered Betting Assistant for Sports Wagering Platform

FanDuel

FanDuel, America's leading sportsbook platform handling over 16.6 million bets during Super Bowl Sunday 2025, developed AAI (an AI-powered betting assistant) to address friction in the customer betting journey. Previously, customers would leave the FanDuel app to research bets on external platforms, often getting distracted and missing betting opportunities. Working with AWS's Generative AI Innovation Center, FanDuel built an in-app conversational assistant using Amazon Bedrock that guides customers through research, discovery, bet construction, and execution entirely within their platform. The solution reduced bet construction time from hours to seconds (particularly for complex parlays), improved customer engagement, and was rolled out incrementally across states and sports using a rigorous evaluation framework with thousands of test cases to ensure accuracy and responsible gaming safeguards.

AI-Powered Business Assistant for Solopreneurs

Jimdo

Jimdo, a European website builder serving over 35 million solopreneurs across 190 countries, needed to help their customers—who often lack expertise in marketing, sales, and business strategy—drive more traffic and conversions to their websites. The company built Jimdo Companion, an AI-powered business advisor using LangChain.js and LangGraph.js for orchestration and LangSmith for observability. The system features two main components: Companion Dashboard (an agentic business advisor that queries 10+ data sources to deliver personalized insights) and Companion Assistant (a ChatGPT-like interface that adapts to each business's tone of voice). The solution resulted in 50% more first customer contacts within 30 days and 40% more overall customer activity for users with access to Companion.

AI-Powered Call Intelligence System for Multi-Location Marketing Analysis

Netsertive

Netsertive, a digital marketing solutions provider for multi-location brands and franchises, implemented an AI-powered call intelligence system using Amazon Bedrock and Amazon Nova Micro to automatically analyze customer call tracking data and extract actionable insights. The solution processes real-time phone call transcripts to provide sentiment analysis, call summaries, keyword identification, coaching suggestions, and performance tracking across locations, reducing analysis time from hours or days to minutes while enabling better customer service optimization and conversion rate improvements for their franchise clients.

AI-Powered Chatbot Automation with Hybrid NLU and LLM Approach

Scotiabank

Scotiabank developed a hybrid chatbot system combining traditional NLU with modern LLM capabilities to handle customer service inquiries. They created an innovative "AI for AI" approach using three ML models (nicknamed Luigi, Eva, and Peach) to automate the review and improvement of chatbot responses, resulting in 80% time savings in the review process. The system includes LLM-powered conversation summarization to help human agents quickly understand customer contexts, marking the bank's first production use of generative AI features.

AI-Powered Client Services Assistant for Post-Trade Services

London Stock Exchange Group

London Stock Exchange Group developed a client services assistant application using Amazon Q Business to enhance their post-trade customer support. The solution leverages RAG techniques to provide accurate and quick responses to complex member queries by accessing internal documents and public rulebooks. The system includes a robust validation process using Claude v2 to ensure response accuracy against a golden answer dataset, delivering responses within seconds and improving both customer experience and staff productivity.

AI-Powered Clinical Decision Support Platform for Healthcare Providers

Healio

Healio, a medical information platform serving healthcare providers across 20+ specialties for 125 years, developed Healio AI to address the challenge of physicians experiencing information overload while working under extreme time pressure. The solution uses a RAG-based system that combines Healio's proprietary clinical content with trusted sources like PubMed journals to provide physicians with accurate, contextual, and trustworthy answers at point of care. Through extensive user testing with over 300 healthcare professionals, the team discovered physicians primarily used the tool to prepare for patient interactions and improve patient communication rather than just diagnostic queries. The product launched successfully with predominantly positive feedback, featuring HIPAA compliance, citation transparency, and contextual advertising for monetization.

AI-Powered Clinical Documentation and Data Infrastructure for Point-of-Care Transformation

Veradigm

Veradigm, a healthcare IT company, partnered with AWS to integrate generative AI into their Practice Fusion electronic health record (EHR) system to address clinician burnout caused by excessive documentation tasks. The solution leverages AWS HealthScribe for autonomous AI scribing that generates clinical notes from patient-clinician conversations, and AWS HealthLake as a FHIR-based data foundation to provide patient context at scale. The implementation resulted in clinicians saving approximately 2 hours per day on charting, 65% of users requiring no training to adopt the technology, and high satisfaction with note quality. The system processes 60 million patient visits annually and enables ambient documentation that allows clinicians to focus on patient care rather than typing, with a clear path toward zero-edit note generation.

AI-Powered Clinical Documentation with Multi-Region Healthcare Compliance

Heidi Health

Heidi Health developed an ambient AI scribe to reduce the administrative burden on healthcare clinicians by automatically generating clinical notes from patient consultations. The company faced significant LLMOps challenges including building confidence in non-deterministic AI outputs through "clinicians in the loop" evaluation processes, scaling clinical validation beyond small teams using synthetic data generation and LLM-as-judge approaches, and managing global expansion across regions with different data sovereignty requirements, model availability constraints, and regulatory compliance needs. Their solution involved standardizing infrastructure-as-code deployments across AWS regions, using a hybrid approach of Amazon Bedrock for immediate availability and EKS for self-hosted model control, and integrating clinical ambassadors in each region to validate medical accuracy and local practice patterns. The platform now serves over 370,000 clinicians processing 10 million consultations per month globally.

AI-Powered Clinical Outcome Assessment Review Using Generative AI

Clario

Clario, a clinical trials endpoint data provider, developed an AI-powered solution to automate the analysis of Clinical Outcome Assessment (COA) interviews in clinical trials for psychosis, anxiety, and mood disorders. The traditional approach of manually reviewing audio-video recordings was time-consuming, logistically complex, and introduced variability that could compromise trial reliability. Using Amazon Bedrock and other AWS services, Clario built a system that performs speaker diarization, multi-lingual transcription, semantic search, and agentic AI-powered quality review to evaluate interviews against standardized criteria. The solution demonstrates potential for reducing manual review effort by over 90%, providing 100% data coverage versus subset sampling, and decreasing review turnaround time from weeks to hours, while maintaining regulatory compliance and improving data quality for submissions.

AI-Powered Clinical Trial Software Configuration Automation

Clario

Clario, a leading provider of endpoint data solutions for clinical trials, faced significant challenges with their manual software configuration process, which involved extracting data from multiple sources including PDF forms, study databases, and standardized protocols. The manual process was time-consuming, prone to transcription errors, and created version control challenges. To address this, Clario developed the Genie AI Service powered by Amazon Bedrock using Anthropic's Claude 3.7 Sonnet, orchestrated through Amazon ECS. The solution automates data extraction from transmittal forms, centralizes information from multiple sources, provides an interactive review dashboard for validation, and automatically generates Software Configuration Specification documents and XML configurations for their medical imaging software. This has reduced study configuration execution time while improving quality, minimizing transcription errors, and allowing teams to focus on higher-value activities like study design optimization.

AI-Powered Co-pilot System for Digital Sales Agents

Wayfair

Wayfair developed an AI-powered Agent Co-pilot system to assist their digital sales agents during customer interactions. The system uses LLMs to provide contextually relevant chat response recommendations by considering product information, company policies, and conversation history. Initial test results showed a 10% reduction in handle time, improving customer service efficiency while maintaining quality interactions.

AI-Powered Code Editor with Multi-Model Integration and Agentic Workflows

Cursor

Cursor, an AI-powered code editor, has scaled to over $300 million in revenue by integrating multiple language models including Claude 3.5 Sonnet for advanced coding tasks. The platform evolved from basic tab completion to sophisticated multi-file editing capabilities, background agents, and agentic workflows. By combining intelligent retrieval systems with large language models, Cursor enables developers to work across complex codebases, automate repetitive tasks, and accelerate software development through features like real-time code completion, multi-file editing, and background task execution in isolated environments.

AI-Powered Code Generation for Support Team Bug Fixing

Zapier

Zapier faced a backlog crisis caused by "app erosion"—constant API changes across their 8,000+ third-party integrations creating reliability issues faster than engineers could address them. They ran two parallel experiments: empowering their support team to fix bugs directly by shipping code, and building an AI-powered system called "Scout" to accelerate bug fixing through automated code generation. The solution evolved from standalone APIs to MCP-integrated tools, and ultimately to Scout Agent—an orchestrated agentic system that automatically categorizes issues, assesses fixability, generates merge requests, and iterates based on feedback. Results show that 40% of support team app fixes are now AI-generated, doubling some team members' velocity from 1-2 fixes per week to 3-4, while several support team members have successfully transitioned into engineering roles.

AI-Powered Code Review and Pull Request Automation for Developer Compliance

GitHub

GitHub explored how generative AI could transform compliance in software development by automating foundational components like separation of duties and code reviews. The company developed GitHub Copilot for Pull Requests, which uses AI to automatically generate pull request descriptions based on code changes and provide AI-assisted code review suggestions. This approach aims to maintain compliance requirements while keeping developers in the flow, reducing manual overhead for both development and audit teams, and enabling separation of duties through automated, objective code analysis rather than purely human-based processes.

AI-Powered Code Review Assistant at Enterprise Scale

Microsoft

Microsoft developed an AI-powered code review assistant to address friction in their pull request (PR) workflow, where reviewers spent time on low-value feedback while meaningful concerns were overlooked, and PRs often waited days for review. The solution integrated an AI assistant into the existing PR workflow that automatically reviews code, flags issues, suggests improvements, generates PR summaries, and answers questions interactively. This system now supports over 90% of PRs across Microsoft, impacting more than 600,000 pull requests monthly, and has resulted in 10-20% median PR completion time improvements for early adopter repositories, improved code quality through early bug detection, and accelerated developer learning, particularly for new hires.

AI-Powered Code Review Platform at Scale

Uber

Uber developed uReview, an AI-powered code review platform, to address the challenge of reviewing over 65,000 code changes weekly across six monorepos. Traditional peer reviews were becoming overwhelmed by the volume of code and struggled to consistently catch subtle bugs, security issues, and best practice violations. The solution employs a modular, multi-stage GenAI system using prompt chaining with multiple specialized assistants (Standard, Best Practices, and AppSec) that generate, filter, validate, and deduplicate code review comments. The system achieves a 75% usefulness rating from engineers, with 65% of comments being addressed, outperforming human reviewers (51% address rate), and saves approximately 1,500 developer hours weekly across Uber's engineering organization.

AI-Powered Code Review Platform Using Abstract Syntax Trees and LLM Context

Baz

Baz is building an AI code review agent that addresses the challenge of understanding complex codebases at scale. The platform combines Abstract Syntax Trees (AST) with LLM semantic understanding to provide automated code reviews that go beyond traditional static analysis. By integrating context from multiple sources including code structure, Jira/Linear tickets, CI logs, and deployment patterns, Baz aims to replicate the knowledge of a staff engineer who understands not just the code but the entire business context. The solution has evolved from basic reviews to catching performance issues and schema changes, with customers using it to review code generated by AI coding assistants like Cursor and Codex.

AI-Powered Community Voice Intelligence for Local Government

ZenCity

ZenCity builds AI-powered platforms that help local governments understand and act on community voices by synthesizing diverse data sources including surveys, social media, 311 requests, and public engagement data. The company faced the challenge of processing millions of data points daily and delivering actionable insights to government officials who need to make informed decisions about budgets, policies, and services. Their solution involves a multi-layered AI architecture that enriches raw data with sentiment analysis and topic modeling, creates trend highlights, generates topic-specific insights, and produces automated briefs for specific government workflows like annual budgeting or crisis management. By implementing LLM-driven agents with MCP (Model Context Protocol) servers, they created an AI assistant that allows government officials to query data on-demand while maintaining data accuracy through citation requirements and multi-tenancy security. The system successfully delivers personalized, timely briefs to different government roles, reducing the need for manual analysis while ensuring community voices inform every decision.

AI-Powered Compliance Investigation Agents for Enhanced Due Diligence

Stripe

Stripe developed an LLM-powered AI research agent system to address the scalability challenges of enhanced due diligence (EDD) compliance reviews in financial services. The manual review process was resource-intensive, with compliance analysts spending significant time navigating fragmented data sources across different jurisdictions rather than performing high-value analysis. Stripe built a React-based agent system using Amazon Bedrock that orchestrates autonomous investigations across multiple data sources, pre-fetches analysis before reviewers open cases, and provides comprehensive audit trails. The solution maintains human oversight for final decision-making while enabling agents to handle data gathering and initial research. This resulted in a 26% reduction in average handling time for compliance reviews, with agents achieving 96% helpfulness ratings from reviewers, allowing Stripe to scale compliance operations alongside explosive business growth without proportionally increasing headcount.

AI-Powered Contact Center Transformation for Energy Retail Customer Experience

Energy

So Energy, a UK-based independent energy retailer serving 300,000 customers, faced significant customer experience challenges stemming from fragmented communication platforms, manual processes, and escalating customer frustration during the UK energy crisis. The company implemented Amazon Connect as a unified cloud-based contact center platform, integrating voice, chat, email, and messaging channels with AI-powered capabilities including automatic identity verification, intent recognition, contact summarization, and case management. The implementation, completed in 6-7 months with an in-house tech team, resulted in a 33% reduction in call wait times, increased chat volumes from less than 1% to 15% of contacts, improved CSAT scores, and a Trustpilot rating approaching 4.5. The platform's AI foundation positioned So Energy for future deployment of chatbots, voicebots, and agentic AI capabilities while maintaining focus on human-centric customer service.

AI-Powered Contact Center Transformation for Pet Retail

PetCo

PetCo transformed its contact center operations serving over 10,000 daily customer interactions by implementing Amazon Connect with integrated AI capabilities. The company faced challenges balancing cost efficiency with customer satisfaction while managing 400 care team members handling everything from e-commerce inquiries to veterinary appointments across 1,500+ stores. By deploying call summaries, automated QA, AI-supported agent assistance, and generative AI-powered chatbots using Amazon Q and Connect, PetCo achieved reduced handle times, improved routing efficiency, and launched conversational self-service capabilities. The implementation emphasized starting with high-friction use cases like order status inquiries and grooming salon call routing, with plans to expand into conversational IVR and appointment booking through voice and chat interfaces.

AI-Powered Contact Center Transformation for Student Support Services

Anthology

Anthology, an education technology company operating a BPO for higher education institutions, transformed their traditional contact center infrastructure to an AI-first, cloud-based solution using Amazon Connect. Facing challenges with seasonal spikes requiring doubling their workforce (from 1,000 to 2,000+ agents during peak periods), homegrown legacy systems, and reliability issues causing 12 unplanned outages during busy months, they migrated to AWS to handle 8 million annual student interactions. The implementation, which went live in July 2024 just before their peak back-to-school period, resulted in 50% reduction in wait times, 14-point increase in response accuracy, 10% reduction in agent attrition, and improved system reliability (reducing unplanned outages from 12 to 2 during peak months). The solution leverages AI virtual agents for handling repetitive queries, agent assist capabilities with real-time guidance, and automated quality assurance enabling 100% interaction review compared to the previous 1%.

AI-Powered Contact Center Transformation with Amazon Connect

Traeger

Traeger Grills transformed their customer experience operations from a legacy contact center with poor performance metrics (35% CSAT, 30% first contact resolution) into a modern AI-powered system built on Amazon Connect. The company implemented generative AI capabilities for automated case note generation, email composition, and chatbot interactions while building a "single pane of glass" agent experience using Amazon Connect Cases. This eliminated their legacy CRM, reduced new hire training time by 40%, improved agent satisfaction, and enabled seamless integration of their acquired Meater thermometer brand. The implementation leveraged AI to handle non-value-added work while keeping human agents focused on building emotional connections with customers in the "Traeger Hood" community, demonstrating a shift from cost center to profit center thinking.

AI-Powered Content Curation for Financial Crime Detection

LSEG

London Stock Exchange Group (LSEG) Risk Intelligence modernized its WorldCheck platform—a global database used by financial institutions to screen for high-risk individuals, politically exposed persons (PEPs), and adverse media—by implementing generative AI to accelerate data curation. The platform processes thousands of news sources in 60+ languages to help 10,000+ customers combat financial crime including fraud, money laundering, and terrorism financing. By adopting a maturity-based approach that progressed from simple prompt-only implementations to agent orchestration with human-in-the-loop validation, LSEG reduced content curation time from hours to minutes while maintaining accuracy and regulatory compliance. The solution leverages AWS Bedrock for LLM operations, incorporating summarization, entity extraction, classification, RAG for cross-referencing articles, and multi-agent orchestration, all while keeping human analysts at critical decision points to ensure trust and regulatory adherence.

AI-Powered Content Generation and Shot Commentary System for Live Golf Tournament Coverage

PGA Tour

The PGA Tour faced the challenge of engaging fans with golf content across multiple tournaments running nearly every week of the year, generating meaningful content from 31,000+ shots per tournament across 156 players, and maintaining relevance during non-tournament days. They implemented an agentic AI system using AWS Bedrock that generates up to 800 articles per week across eight different content types (betting profiles, tournament previews, player recaps, round recaps, purse breakdowns, etc.) and a real-time shot commentary system that provides contextual narration for live tournament play. The solution achieved 95% cost reduction (generating articles at $0.25 each), enabled content publication within 5-10 minutes of live events, resulted in billions of annual page views for AI-generated content, and became their highest-engaged content on non-tournament days while maintaining brand voice and factual accuracy through multi-agent validation workflows.

AI-Powered Content Moderation at Scale: SafeChat Platform

DoorDash

DoorDash developed SafeChat, an AI-powered content moderation system to handle millions of daily messages, hundreds of thousands of images, and voice calls exchanged between delivery drivers (Dashers) and customers. The platform employs a multi-layered architecture that evolved from using three external LLMs to a more efficient two-layer approach combining an internally trained model with a precise external LLM, processing text, images, and voice communications in real-time. Since launch, SafeChat has achieved a 50% reduction in low to medium-severity safety incidents while maintaining low latency (under 300ms for most messages) and cost-effectiveness by intelligently routing only 0.2% of content to expensive, high-precision models.

AI-Powered Content Understanding and Ad Targeting Platform

Dotdash

Dotdash Meredith, a major digital publisher, developed an AI-powered system called Decipher that understands user intent from content consumption to deliver more relevant advertising. Through a strategic partnership with OpenAI, they enhanced their content understanding capabilities and expanded their targeting platform across the premium web. The system outperforms traditional cookie-based targeting while maintaining user privacy, proving that high-quality content combined with AI can drive better business outcomes.

AI-Powered Contract Data Extraction and Review Agent

OpenAI

OpenAI's internal finance team faced a bottleneck as contract volume grew from hundreds to over a thousand per month, with manual data entry becoming unsustainable. The team built a contract data agent using retrieval-augmented prompting that ingests various document formats, extracts structured data through reasoning-based inference, and presents annotated results for expert review. The system reduced review turnaround time by half, enabled the team to handle thousands of contracts without proportional headcount growth, and provides queryable, structured data in the warehouse while keeping human experts firmly in control of final decisions.

AI-Powered Conversational Assistant for Streamlined Home Buying Experience

Rocket

Rocket Companies, a Detroit-based FinTech company, developed Rocket AI Agent to address the overwhelming complexity of the home buying process by providing 24/7 personalized guidance and support. Built on Amazon Bedrock Agents, the AI assistant combines domain knowledge, personalized guidance, and actionable capabilities to transform client engagement across Rocket's digital properties. The implementation resulted in a threefold increase in conversion rates from web traffic to closed loans, 85% reduction in transfers to customer care, and 68% customer satisfaction scores, while enabling seamless transitions between AI assistance and human support when needed.

AI-Powered Conversational Contact Center for Healthcare Patient Communication

Clarus Care

Clarus Care, a healthcare contact center solutions provider serving over 16,000 users and handling 15 million patient calls annually, partnered with AWS Generative AI Innovation Center to transform their traditional menu-driven IVR system into a generative AI-powered conversational contact center. The solution uses Amazon Connect, Amazon Lex, and Amazon Bedrock (with Claude 3.5 Sonnet and Amazon Nova models) to enable natural language interactions that can handle multiple patient intents in a single conversation—such as appointment scheduling, prescription refills, and billing inquiries. The system achieves sub-3-second latency requirements, maintains 99.99% availability SLA, supports both voice and web chat interfaces, and includes smart transfer capabilities for urgent cases. The architecture leverages multi-model selection through Bedrock to optimize for specific tasks based on accuracy and latency requirements, with comprehensive analytics pipelines for monitoring system performance and patient interactions.

AI-Powered Conversational Search Assistant for B2B Foodservice Operations

Tyson Foods

Tyson Foods implemented a generative AI assistant on their website to bridge the gap with over 1 million unattended foodservice operators who previously purchased through distributors without direct company relationships. The solution combines semantic search using Amazon OpenSearch Serverless with embeddings from Amazon Titan, and an agentic conversational interface built with Anthropic's Claude 3.5 Sonnet on Amazon Bedrock and LangGraph. The system replaced traditional keyword-based search with semantic understanding of culinary terminology, enabling chefs and operators to find products using natural language queries even when their search terms don't match exact catalog descriptions, while also capturing high-value customer interactions for business intelligence.

AI-Powered CRM Insights with RAG and Text-to-SQL

TP ICAP

TP ICAP faced the challenge of extracting actionable insights from tens of thousands of vendor meeting notes stored in their Salesforce CRM system, where business users spent hours manually searching through records. Using Amazon Bedrock, their Innovation Lab built ClientIQ, a production-ready solution that combines Retrieval Augmented Generation (RAG) and text-to-SQL approaches to transform hours of manual analysis into seconds. The solution uses Amazon Bedrock Knowledge Bases for unstructured data queries, automated evaluations for quality assurance, and maintains enterprise-grade security through permission-based access controls. Since launch with 20 initial users, ClientIQ has driven a 75% reduction in time spent on research tasks and improved insight quality with more comprehensive and contextual information being surfaced.

AI-Powered Customer Conversation Analytics at Scale

GoDaddy

GoDaddy faced the challenge of extracting actionable insights from over 100,000 daily customer service transcripts, which were previously analyzed through limited manual review that couldn't surface systemic issues or emerging problems quickly enough. To address this, they developed Lighthouse, an internal AI analytics platform that uses large language models, prompt engineering, and lexical search to automatically analyze massive volumes of unstructured customer interaction data. The platform successfully processes the full daily volume of 100,000+ transcripts in approximately 80 minutes, enabling teams to identify pain points and operational issues within hours instead of weeks, as demonstrated in a real case where they quickly detected and resolved a spike in customer calls caused by a malfunctioning link before it escalated into a major service disruption.

AI-Powered Customer Feedback Analysis at Scale

Github

GitHub faced the challenge of manually processing vast amounts of customer feedback from support tickets, with data scientists spending approximately 80% of their time on data collection and organization tasks. To address this, GitHub's Customer Success Engineering team developed an internal AI analytics tool that combines open-source machine learning models (BERTopic with BERT embeddings and HDBSCAN clustering) to identify patterns in feedback, and GPT-4 to generate human-readable summaries of customer pain points. This system transformed their feedback analysis from manual classification to automated trend identification, enabling faster identification of common issues, improved feature prioritization, data-driven decision making, and discovery of self-service opportunities for customers.

AI-Powered Customer Feedback Analysis Using RAG and LLMs in Product Analytics

Meta

Meta's Reality Labs developed a self-service AI tool powered by their open-source Llama 4 LLM to analyze customer feedback for their Quest VR headsets and Ray-Ban Meta products. The challenge was that customer feedback data—from reviews, bug reports, surveys, and social media—was underutilized due to noise, bias, and lack of structure. By building a comprehensive feedback repository from internal and external sources and implementing a Retrieval Augmented Generation (RAG) system with embedding-based similarity search, Meta created a production system that transforms qualitative feedback into actionable insights. The tool is being used for bug deduplication, internal testing summaries, and strategic planning, enabling the company to bridge quantitative metrics with qualitative customer insights and dramatically reduce manual analysis time from hours to minutes.

AI-Powered Customer Interest Generation for Personalized E-commerce Recommendations

Wayfair

Wayfair developed a GenAI-powered system to generate nuanced, free-form customer interests that go beyond traditional behavioral models and fixed taxonomies. Using Google's Gemini LLM, the system processes customer search queries, product views, cart additions, and purchase history to infer deep insights about preferences, functional needs, and lifestyle values. These LLM-generated interests power personalized product carousels on the homepage and product detail pages, driving measurable engagement and revenue gains while enabling more transparent and adaptable personalization at scale across millions of customers.

AI-Powered Customer Segmentation with Natural Language Interface

Klaviyo

Klaviyo, a customer data platform serving 130,000 customers, launched Segments AI in November 2023 to address two key problems: inexperienced users struggling to express customer segments through traditional UI, and experienced users spending excessive time building repetitive complex segments. The solution uses OpenAI's LLMs combined with prompt chaining and few-shot learning techniques to transform natural language descriptions into structured segment definitions adhering to Klaviyo's JSON schema. The team tackled the significant challenge of validating non-deterministic LLM outputs by combining automated LLM-based evaluation with hand-designed test cases, ultimately deploying a production system that required ongoing maintenance due to the stochastic nature of generative AI outputs.

AI-Powered Customer Service Agent for Healthcare Navigation

Alan

Alan, a healthcare company supporting 1 million members, built AI agents to help members navigate complex healthcare questions and processes. The company transitioned from traditional workflows to playbook-based agent architectures, implementing a multi-agent system with classification and specialized agents (particularly for claims handling) that uses a ReAct loop for tool calling. The solution achieved 30-35% automation of customer service questions with quality comparable to human care experts, with 60% of reimbursements processed in under 5 minutes. Critical to their success was building custom orchestration frameworks and extensive internal tooling that empowered domain experts (customer service operators) to configure, debug, and maintain agents without engineering bottlenecks.

AI-Powered Customer Service and Call Center Transformation with Multi-Agent Systems

Fastweb / Vodafone

Fastweb / Vodafone, a major European telecommunications provider serving 9.5 million customers in Italy, transformed their customer service operations by building two AI agent systems to address the limitations of traditional customer support. They developed Super TOBi, a customer-facing agentic chatbot system, and Super Agent, an internal tool that empowers call center consultants with real-time diagnostics and guidance. Built on LangGraph and LangChain with Neo4j knowledge graphs and monitored through LangSmith, the solution achieved a 90% correctness rate, 82% resolution rate, 5.2/7 Customer Effort Score for Super TOBi, and over 86% One-Call Resolution rate for Super Agent, delivering faster response times and higher customer satisfaction while reducing agent workload.

AI-Powered Customer Support Automation for Global Transportation Service

Lime

Lime, a global micromobility company, implemented Forethought's AI solutions to scale their customer support operations. They faced challenges with manual ticket handling, language barriers, and lack of prioritization for critical cases. By implementing AI-powered automation tools including Solve for automated responses and Triage for intelligent routing, they achieved 27% case automation, 98% automatic ticket tagging, and reduced response times by 77%, while supporting multiple languages and handling 1.7 million tickets annually.

AI-Powered Data Copilot for Autonomous Analysis in IDEs

BlaBlaCar

BlaBlaCar developed an AI-powered Data Copilot to address the inefficient workflow between Software Engineers and Data Analysts, where engineers lacked data warehouse access and analysts were overwhelmed with repetitive queries. The solution embeds an LLM-powered assistant directly in VS Code that connects to BigQuery, provides contextual business logic from curated queries, generates SQL and Python code with unit tests, and enables engineers to perform their own analyses with data health checks as guardrails. The tool leverages a "zero-infrastructure" RAG approach using VS Code's native capabilities and GitHub Copilot, treating analyses as code artifacts in pull requests that analysts review, resulting in faster question resolution (from weeks to minutes) and freeing analysts to focus on high-value modeling work.

AI-Powered Developer Productivity and Product Discovery at Wholesale Marketplace

Faire

Faire, a wholesale marketplace connecting brands and retailers, implemented multiple AI initiatives across their engineering organization to enhance both internal developer productivity and external customer-facing features. The company deployed agentic development workflows using GitHub Copilot and custom orchestration systems to automate repetitive coding tasks, introduced natural-language and image-based search capabilities for retailers seeking products, and built a hybrid Python-Kotlin architecture to support multi-step AI agents that compose purchasing recommendations. These efforts aimed to reduce manual workflows, accelerate product discovery, and deliver more personalized experiences for their wholesale marketplace customers.

AI-Powered Developer Productivity Platform with MCP Servers and Agent-Based Automation

Bloomberg

Bloomberg's Technology Infrastructure team, led by Lei, implemented an enterprise-wide AI coding platform to enhance developer productivity across 9,000+ engineers working with one of the world's largest JavaScript codebases. Starting approximately two years before this presentation, the team moved beyond initial experimentation with various AI coding tools to focus on strategic use cases: automated code uplift agents for patching and refactoring, and incident response agents for troubleshooting. To avoid organizational chaos, they built a platform-as-a-service (PaaS) approach featuring a unified AI gateway for model selection, an MCP (Model Context Protocol) directory/hub for tool discovery, and standardized tool creation/deployment infrastructure. The solution was supported by integration into onboarding training programs and cross-organizational communities. Results included improved adoption, reduced duplication of efforts, faster proof-of-concepts, and notably, a fundamental shift in the cost function of software engineering that enabled teams to reconsider trade-offs in their development practices.

AI-Powered Developer Tools for Code Quality and Test Generation

Uber

Uber's developer platform team built AI-powered developer tools using LangGraph to improve code quality and automate test generation for their 5,000 engineers. Their approach focuses on three pillars: targeted product development for developer workflows, cross-cutting AI primitives, and intentional technology transfer. The team developed Validator, an IDE-integrated tool that flags best practices violations and security issues with automatic fixes, and AutoCover, which generates comprehensive test suites with coverage validation. These tools demonstrate the successful deployment of multi-agent systems in production, achieving measurable improvements including thousands of daily fix interactions, 10% increase in developer platform coverage, and 21,000 developer hours saved through automated test generation.

AI-Powered Digital Co-Workers for Customer Support and Business Process Automation

Neople

Neople, a European startup founded almost three years ago, has developed AI-powered "digital co-workers" (called Neeles) primarily targeting customer success and service teams in e-commerce companies across Europe. The problem they address is the repetitive, high-volume work that customer service agents face, which reduces job satisfaction and efficiency. Their solution evolved from providing AI-generated response suggestions to human agents, to fully automated ticket responses, to executing actions across multiple systems, and finally to enabling non-technical users to build custom workflows conversationally. The system now serves approximately 200 customers, with AI agents handling repetitive tasks autonomously while human agents focus on complex cases. Results include dramatic improvements in first response rates (from 10% to 70% in some cases), reduced resolution times, and expanded use cases beyond customer service into finance, operations, and marketing departments.

AI-Powered Ecommerce Content Optimization Platform

Pattern

Pattern developed Content Brief, an AI-driven tool that processes over 38 trillion ecommerce data points to optimize product listings across multiple marketplaces. Using Amazon Bedrock and other AWS services, the system analyzes consumer behavior, content performance, and competitive data to provide actionable insights for product content optimization. In one case study, their solution helped Select Brands achieve a 21% month-over-month revenue increase and 14.5% traffic improvement through optimized product listings.

AI-Powered Email Search Assistant with Advanced Cognitive Architecture

Superhuman

Superhuman developed Ask AI to solve the challenge of inefficient email and calendar searching, where users spent up to 35 minutes weekly trying to recall exact phrases and sender names. They evolved from a single-prompt RAG system to a sophisticated cognitive architecture with parallel processing for query classification and metadata extraction. The solution achieved sub-2-second response times and reduced user search time by 14% (5 minutes per week), while maintaining high accuracy through careful prompt engineering and systematic evaluation.

AI-Powered Engineering Team Management and Code Review Platform

Entelligence

Entelligence addresses the challenges of managing large engineering teams by providing AI agents that handle code reviews, documentation maintenance, and team performance analytics. The platform combines LLM-based code analysis with learning from team feedback to provide contextually appropriate reviews, while maintaining up-to-date documentation and offering insights into engineering productivity beyond traditional metrics like lines of code.

AI-Powered Epilepsy Diagnosis Platform Reducing Diagnostic Time Through Multimodal Data Processing

Australian Epilepsy Project

The Australian Epilepsy Project (AEP) developed a cloud-based precision medicine platform on AWS that integrates multimodal patient data (MRI scans, neuropsychological assessments, genetic data, and medical histories) to support epilepsy diagnosis and treatment planning. The platform leverages various AI/ML techniques including machine learning models for automated brain region analysis, large language models for medical text processing through RAG approaches, and generative AI for patient summaries. This resulted in a 70% reduction in diagnosis time for language area mapping prior to surgery, 10% higher lesion detection rates, and improved patient outcomes including 9% better work productivity and 8% reduction in seizures over two years.

AI-Powered Escrow Agent for Programmable Money Settlement

Circle

Circle developed an experimental AI-powered escrow agent system that combines OpenAI's multimodal models with their USDC stablecoin and smart contract infrastructure to automate agreement verification and payment settlement. The system uses AI to parse PDF contracts, extract key terms and payment amounts, deploy smart contracts programmatically, and verify work completion through image analysis, enabling near-instant settlement of escrow transactions while maintaining human oversight for final approval.

AI-Powered Fan Engagement and Content Personalization for Global Football Audiences

DFL / Bundesliga

DFL / Bundesliga, the organization behind Germany's premier football league, partnered with AWS to enhance fan engagement for their 1 billion global fans through AI and generative AI solutions. The primary challenges included personalizing content at scale across diverse geographies and languages, automating manual content creation processes, and making decades of archival footage searchable and accessible. The solutions implemented included an AI-powered live ticker providing real-time commentary in multiple languages and styles within 7 seconds of events, an intelligent metadata generation (IGM) system to analyze 9+ petabytes of historical footage using multimodal AI, automated content localization for speech-to-speech and speech-to-text translation, AI-generated "Stories" format content from existing articles, and personalized app experiences. Results demonstrated significant impact: 20% increase in overall app usage, 67% increase in articles read through personalization, 75% reduction in processing time for localized content with 5x content output, 2x increase in app dwell time from AI-generated stories, and 67% story retention rate indicating strong user engagement.

AI-Powered Fax Processing Automation for Healthcare Referrals

Providence

Providence Health System automated the processing of over 40 million annual faxes using GenAI and MLflow on Databricks to transform manual referral workflows into real-time automated triage. The system combines OCR with GPT-4.0 models to extract referral data from diverse document formats and integrates seamlessly with Epic EHR systems, eliminating months-long backlogs and freeing clinical staff to focus on patient care across 1,000+ clinics.

AI-Powered Financial Assistant for Automated Expense Management

Brex

Brex developed an AI-powered financial assistant to automate expense management workflows, addressing the pain points of manual data entry, policy compliance, and approval bottlenecks that plague traditional finance operations. Using Amazon Bedrock with Claude models, they built a comprehensive system that automatically processes expenses, generates compliant documentation, and provides real-time policy guidance. The solution achieved 75% automation of expense workflows, saving hundreds of thousands of hours monthly across customers while improving compliance rates from 70% to the mid-90s, demonstrating how LLMs can transform enterprise financial operations when properly integrated with existing business processes.

AI-Powered Food Image Generation System at Scale

Delivery Hero

Delivery Hero built a comprehensive AI-powered image generation system to address the problem that 86% of food products lacked images, which significantly impacted conversion rates. The solution involved implementing both text-to-image generation and image inpainting workflows using Stable Diffusion models, with extensive optimization for cost efficiency and quality assurance. The system successfully generated over 100,000 production images, achieved 6-8% conversion rate improvements, and reduced costs to under $0.003 per image through infrastructure optimization and model fine-tuning.

AI-Powered Government Service Assistant with Advanced RAG and Multi-Agent Architecture

City of Buenos Aires

The Government of the City of Buenos Aires partnered with AWS to enhance their existing WhatsApp-based AI assistant "Boti" with advanced generative AI capabilities to help citizens navigate over 1,300 government procedures. The solution implemented an agentic AI system using LangGraph and Amazon Bedrock, featuring custom input guardrails and a novel reasoning retrieval system that achieved 98.9% top-1 retrieval accuracy—a 12.5-17.5% improvement over standard RAG methods. The system successfully handles 3 million conversations monthly while maintaining safety through content filtering and delivering responses in culturally appropriate Rioplatense Spanish dialect.

AI-Powered Healthcare: Building Reliable Care Agents in Production

Sword Health

Sword Health, a digital health company specializing in remote physical therapy, developed Phoenix, an AI care agent that provides personalized support to patients during and after rehabilitation sessions while acting as a co-pilot for physical therapists. The company faced challenges deploying LLMs in a highly regulated healthcare environment, requiring robust guardrails, evaluation frameworks, and human oversight. Through iterative development focusing on prompt engineering, RAG for domain knowledge, comprehensive evaluation systems combining human and LLM-based ratings, and continuous data monitoring, Sword Health successfully shipped AI-powered features that improve care accessibility and efficiency while maintaining clinical safety through human-in-the-loop validation for all clinical decisions.

AI-Powered Help Desk for Accounts Payable Automation

Xelix

Xelix developed an AI-enabled help desk system to automate responses to vendor inquiries for accounts payable teams who often receive over 1,000 emails daily. The solution uses a multi-stage pipeline that classifies incoming emails, enriches them with vendor and invoice data from ERP systems, and generates contextual responses using LLMs. The system handles invoice status inquiries, payment reminders, and statement reconciliation requests, with confidence scoring to indicate response reliability. By pre-generating responses and surfacing relevant financial data, the platform reduces average handling time for tickets while maintaining human oversight through a review-and-send workflow, enabling AP teams to process high volumes of vendor communications more efficiently.

AI-Powered Home Loan Guardian for Mortgage Refinancing

Lendi

Lendi, an Australian FinTech company, developed Guardian, an agentic AI application to transform the home loan refinancing experience. The company identified that homeowners lacked visibility into their mortgage positions and faced cumbersome refinancing processes, while brokers spent excessive time on administrative tasks. Using Amazon Bedrock's foundation models, Lendi built a multi-agent system deployed on Amazon EKS that monitors loan competitiveness, tracks equity positions in real-time, and streamlines refinancing through conversational AI. The solution was developed in 16 weeks and has already settled millions in home loans with significantly reduced refinance cycle times, enabling customers to complete refinancing in as little as 10 minutes through the Rate Radar feature.

AI-Powered Hormonal Health Platform Built in 8 Weeks

FemmFlo

FemmFlo, a women's health tech startup, developed an LLM-powered platform to address the massive data gap in women's hormonal health, where millions of women wait over seven years for accurate diagnoses. Working with Millio AI and leveraging AWS services, they built a full MVP in just eight weeks that integrates hormonal tracking, lab diagnostics, mental health support, and personalized care recommendations through an AI agent named Gabby. The platform was designed for rapid deployment with beta users, lab integrations, and partnerships, specifically targeting underserved women with culturally relevant, localized healthcare guidance. The solution uses AWS Bedrock agents, API Gateway, DynamoDB, S3, and other managed services to deliver a scalable, cost-effective system that translates complex lab results into actionable health insights while maintaining clinical rigor through a controlled testing environment.

AI-Powered Hybrid Approach for Large-Scale Test Migration from Enzyme to React Testing Library

Slack

Slack faced the challenge of migrating 15,500 Enzyme test cases to React Testing Library to enable upgrading to React 18, an effort estimated at over 10,000 engineering hours across 150+ developers. The team developed an innovative hybrid approach combining Abstract Syntax Tree (AST) transformations with Large Language Models (LLMs), specifically Claude 2.1, to automate the conversion process. The solution involved a sophisticated pipeline that collected context including DOM trees, performed partial AST conversions with annotations, and leveraged LLMs to handle complex cases that traditional codemods couldn't address. This hybrid approach achieved an 80% success rate for automated conversions and saved developers 22% of their migration time, ultimately enabling the complete migration by May 2024.

AI-Powered Hyper-Personalized Email Campaign Automation

PromptLayer

PromptLayer built an automated AI sales system that creates hyper-personalized email campaigns by using three specialized AI agents to research leads, score their fit, generate subject lines, and draft tailored email sequences. The system integrates with existing sales tools like Apollo, HubSpot, and Make.com, achieving 50-60% open rates and ~7% positive reply rates while enabling non-technical sales teams to manage prompts and content directly through PromptLayer's platform without requiring engineering support.

AI-Powered Hyper-Personalized Email Marketing System

Hubspot

Hubspot developed an AI-powered system for one-to-one email personalization at scale, moving beyond traditional segmented cohort-based approaches. The system uses GPT-4 to analyze user behavior, website data, and content interactions to understand user intent, then automatically recommends and personalizes relevant educational content. The implementation resulted in dramatic improvements: 82% increase in conversion rates, 30% improvement in open rates, and over 50% increase in click-through rates.

AI-Powered Image Generation for Customizable Grocery Products

Instacart

Instacart's FoodStorm Order Management System faced the challenge of providing high-quality product images for countless customizable grocery items like deli sandwiches, cakes, and prepared foods, where professional photography for every configuration was impractical and costly. The solution involved integrating generative AI image generation capabilities through Instacart's internal Pixel service (which provides access to Google Imagen and other models) directly into FoodStorm's user interface, allowing grocery retailers to create product images on-demand with customizable prompts. Through multiple design iterations, the system evolved from simple one-click generation to a sophisticated interface where users can fine-tune prompts, preview multiple variations, and inspect details for quality control, ultimately enabling retailers to efficiently produce images for ingredients, toppings, promotional banners, and category thumbnails across the Instacart platform.

AI-Powered Incident Response System with Multi-Agent Investigation

Incident.io

Incident.io developed an AI SRE product to automate incident investigation and response for tech companies. The product uses a multi-agent system to analyze incidents by searching through GitHub pull requests, Slack messages, historical incidents, logs, metrics, and traces to build hypotheses about root causes. When incidents occur, the system automatically creates investigations that run parallel searches, generate findings, formulate hypotheses, ask clarifying questions through sub-agents, and present actionable reports in Slack within 1-2 minutes. The system demonstrates significant value by reducing mean time to detection and resolution while providing continuous ambient monitoring throughout the incident lifecycle, working collaboratively with human responders.

AI-Powered Insurance Claims Chatbot with Continuous Feedback Loop

Allianz

Allianz Benelux tackled their complex insurance claims process by implementing an AI-powered chatbot using Landbot. The system processed over 92,000 unique search terms, categorized insurance products, and implemented a real-time feedback loop with Slack and Trello integration. The solution achieved 90% positive ratings from 18,000+ customers while significantly simplifying the claims process and improving operational efficiency.

AI-Powered IT Operations Management with Multi-Agent Systems

Iberdrola

Iberdrola, a global utility company, implemented AI agents using Amazon Bedrock AgentCore to transform IT operations in ServiceNow by addressing bottlenecks in change request validation and incident management. The solution deployed three agentic architectures: a deterministic workflow for validating change requests in the draft phase, a multi-agent orchestration system for enriching incident tickets with contextual intelligence, and a conversational AI assistant for simplifying change model selection. The implementation leveraged LangGraph agents containerized and deployed through AgentCore Runtime, with specialized agents working in sequence or adaptively based on incident complexity, resulting in reduced processing times, accelerated ticket resolution, and improved data quality across departments.

AI-Powered Legal Document Analysis and Hearing Transcription for Social Security Disability Law

LexMed

LexMed developed an AI-native suite of tools leveraging large language models to streamline pain points for social security disability attorneys who advocate for claimants applying for disability benefits. The solution addresses the challenge of analyzing thousands of pages of medical records to find evidence that maps to complex regulatory requirements, as well as transcribing and auditing administrative hearings for procedural errors. By using LLMs with RAG architecture and custom logic, the platform automates the previously manual process of finding "needles in haystacks" within medical documentation and identifying regulatory compliance issues, enabling attorneys to provide more effective advocacy for all clients regardless of case complexity.

AI-Powered Legal Document Review and Analysis Platform

Lexbe

Lexbe, a legal document review software company, developed Lexbe Pilot, an AI-powered Q&A assistant integrated into their eDiscovery platform using Amazon Bedrock and associated AWS services. The solution addresses the challenge of legal professionals needing to analyze massive document sets (100,000 to over 1 million documents) to identify critical evidence for litigation. By implementing a RAG-based architecture with Amazon Bedrock Knowledge Bases, the system enables legal teams to query entire datasets and retrieve contextually relevant results that go beyond traditional keyword searches. Through an eight-month collaborative development process with AWS, Lexbe achieved a 90% recall rate with the final implementation, enabling the generation of comprehensive findings-of-fact reports and deep automated inference capabilities that can identify relationships and connections across multilingual document collections.

AI-Powered Lesson Generation System for Language Learning

Duolingo

Duolingo implemented an LLM-based system to accelerate their lesson creation process, enabling their teaching experts to generate language learning content more efficiently. The system uses carefully crafted prompts that combine fixed rules and variable parameters to generate exercises that meet specific educational requirements. This has resulted in faster course development, allowing Duolingo to expand their course offerings and deliver more advanced content while maintaining quality through human expert oversight.

AI-Powered Market Surveillance System for Financial Compliance

London Stock Exchange Group

London Stock Exchange Group (LSEG) developed an AI-powered Surveillance Guide using Amazon Bedrock and Anthropic's Claude Sonnet 3.5 to automate market abuse detection by analyzing news articles for price sensitivity. The system addresses the challenge of manual and time-consuming surveillance processes where analysts must review thousands of trading alerts and determine if suspicious activity correlates with price-sensitive news events. The solution achieved 100% precision in identifying non-sensitive news and 100% recall in detecting price-sensitive content, significantly reducing analyst workload while maintaining comprehensive market oversight and regulatory compliance.

AI-Powered Marketing Compliance Automation System

Remitly

Remitly, a global financial services company operating in 170 countries, developed an AI-based system to streamline their marketing compliance review process. The system analyzes marketing content against regulatory guidelines and internal policies, providing real-time feedback to marketers before legal review. The initial implementation focused on English text content, achieving 95% accuracy and 97% recall in identifying compliance issues, reducing the back-and-forth between marketing and legal teams, and significantly improving time-to-market for marketing materials.

AI-Powered Marketing Compliance Monitoring at Scale

PerformLine

PerformLine, a marketing compliance platform, needed to efficiently process complex product pages containing multiple overlapping products for compliance checks. They developed a serverless, event-driven architecture using Amazon Bedrock with Amazon Nova models to parse and extract contextual information from millions of web pages daily. The solution implemented prompt engineering with multi-pass inference, achieving a 15% reduction in human evaluation workload and over 50% reduction in analyst workload through intelligent content deduplication and change detection, while processing an estimated 1.5-2 million pages daily to extract 400,000-500,000 products for compliance review.

AI-Powered Marketing Content Generation and Compliance Platform at Scale

Volkswagen

Volkswagen Group Services partnered with AWS to build a production-scale generative AI platform for automotive marketing content generation and compliance evaluation. The problem was a slow, manual content supply chain that took weeks to months, created confidentiality risks with pre-production vehicles, and faced massive compliance bottlenecks across 10 brands and 200+ countries. The solution involved fine-tuning diffusion models on proprietary vehicle imagery (including digital twins from CAD), automated prompt enhancement using LLMs, and multi-stage image evaluation using vision-language models for both component-level accuracy and brand guideline compliance. Results included massive time savings (weeks to minutes), automated compliance checks across legal and brand requirements, and a reusable shared platform supporting multiple use cases across the organization.

AI-Powered Marketing Intelligence Platform Accelerates Industry Analysis

CLICKFORCE

CLICKFORCE, a digital advertising leader in Taiwan, faced challenges with generic AI outputs, disconnected internal datasets, and labor-intensive analysis processes that took two to six weeks to complete industry reports. The company built Lumos, an AI-powered marketing analysis platform using Amazon Bedrock Agents for contextualized reasoning, Amazon SageMaker for Text-to-SQL fine-tuning, Amazon OpenSearch for vector embeddings, and AWS Glue for data integration. The solution reduced industry analysis time from weeks to under one hour, achieved a 47% reduction in operational costs, and enabled multiple stakeholder groups to independently generate insights without centralized analyst teams.

AI-Powered Marketing Platform for Small and Medium Businesses

Mowie

Mowie is an AI marketing platform targeting small and medium businesses in restaurants, retail, and e-commerce sectors. Founded by Chris Okconor and Jessica Valenzuela, the platform addresses the challenge of SMBs purchasing marketing tools but barely using them due to limited time and expertise. Mowie automates the entire marketing workflow by ingesting publicly available data about a business (reviews, website content, competitive intelligence), building a comprehensive "brand dossier" using LLMs, and automatically generating personalized content calendars across social media and email channels. The platform evolved from manual concierge services into a fully automated system that requires minimal customer input—just a business name and URL—and delivers weekly content calendars that customers can approve via email, with performance tracking integrated through point-of-sale systems to measure actual business impact.

AI-Powered Medical Content Review and Revision at Scale

Flo Health

Flo Health, a leading women's health app, partnered with AWS Generative AI Innovation Center to develop MACROS (Medical Automated Content Review and Revision Optimization Solution), an AI-powered system for verifying and maintaining the accuracy of thousands of medical articles. The solution uses Amazon Bedrock foundation models to automatically review medical content against established guidelines, identify outdated or inaccurate information, and propose evidence-based revisions while maintaining Flo's editorial style. The proof of concept achieved 80% accuracy and over 90% recall in identifying content requiring updates, significantly reduced processing time from hours to minutes per guideline, and demonstrated more consistent application of medical guidelines compared to manual reviews while reducing the workload on medical experts.

AI-Powered Menu Description Generation for Restaurant Platforms

Doordash

DoorDash developed a production-grade AI system to automatically generate menu item descriptions for restaurants on their platform, addressing the challenge that many small restaurant owners face in creating compelling descriptions for every menu item. The solution combines three interconnected systems: a multimodal retrieval system that gathers relevant data even when information is sparse, a learning and generation system that adapts to each restaurant's unique voice and style, and an evaluation system that incorporates both automated and human feedback loops to ensure quality and continuous improvement.

AI-Powered Merchant Classification Correction Agent

Ramp

Ramp built an AI agent to automatically fix incorrect merchant classifications that were previously causing customer frustration and requiring hours of manual intervention from support, finance, and engineering teams. The solution uses a large language model backed by embeddings and OLAP queries, multimodal retrieval augmented generation (RAG) with receipt image analysis, and carefully constructed guardrails to validate and process user-submitted correction requests. The agent now handles nearly 100% of requests (compared to less than 3% previously handled manually) in under 10 seconds with a 99% improvement rate according to LLM-based evaluation, saving both customer time and substantial operational costs.

AI-Powered Multi-Agent Decision Support System for Enterprise Strategic Planning

Coinbase

Coinbase developed RAPID-D, an AI-powered decision support tool to augment their existing RAPID decision-making framework used for critical strategic choices. The system employs a multi-agent architecture where specialized AI agents collaborate to analyze decision documents, surface risks, challenge assumptions, and provide comprehensive recommendations to human decision-makers. By implementing a modular approach with agents serving as analysts, contextual seekers, devil's advocates, and synthesizers, Coinbase created a transparent and auditable system that helps mitigate cognitive bias while maintaining human oversight. The solution was iteratively developed based on leadership feedback, achieving strong accuracy benchmarks with Claude 3.7 Sonnet, and incorporates real-time feedback mechanisms to continuously improve recommendation quality.

AI-Powered Multi-Agent Platform for Blockchain Operations and Log Analysis

Ripple

Ripple, a fintech company operating the XRP Ledger (XRPL) blockchain, built an AI-powered multi-agent operations platform to address the challenge of monitoring and troubleshooting their decentralized network of 900+ nodes. Previously, analyzing operational issues required C++ experts to manually parse through 30-50GB of debug logs per node, taking 2-3 days per incident. The solution leverages AWS services including Amazon Bedrock, Neptune Analytics for graph-based RAG, CloudWatch for log aggregation, and a multi-agent architecture using the Strands SDK. The system features four specialized agents (orchestrator, code analysis, log analysis, and query generator) that correlate code and logs to provide engineers with actionable insights in minutes rather than days, eliminating the dependency on C++ experts and enabling faster feature development and incident response.

AI-Powered Multi-Agent System for Global Compliance Screening at Scale

Amazon

Amazon developed an AI-driven compliance screening system to handle approximately 2 billion daily transactions across 160+ businesses globally, ensuring adherence to sanctions and regulatory requirements. The solution employs a three-tier approach: a screening engine using fuzzy matching and vector embeddings, an intelligent automation layer with traditional ML models, and an AI-powered investigation system featuring specialized agents built on Amazon Bedrock AgentCore Runtime. These agents work collaboratively to analyze matches, gather evidence, and make recommendations following standardized operating procedures. The system achieves 96% accuracy with 96% precision and 100% recall, automating decision-making for over 60% of case volume while reserving human intervention only for edge cases requiring nuanced judgment.

AI-Powered Music Lyric Analysis and Semantic Search Platform

LyricLens

LyricLens, developed by Music Smatch, is a production AI system that extracts semantic meaning, themes, entities, cultural references, and sentiment from music lyrics at scale. The platform analyzes over 11 million songs using Amazon Bedrock's Nova family of foundation models to provide real-time insights for brands, artists, developers, and content moderators. By migrating from a previous provider to Amazon Nova models, Music Smatch achieved over 30% cost savings while maintaining accuracy, processing over 2.5 billion tokens. The system employs a multi-level semantic engine with knowledge graphs, supports content moderation with granular PG ratings, and enables natural language queries for playlist generation and trend analysis across demographics, genres, and time periods.

AI-Powered Natural Language Flight Search Implementation

Alaska Airlines

Alaska Airlines implemented a natural language destination search system powered by Google Cloud's Gemini LLM to transform their flight booking experience. The system moves beyond traditional flight search by allowing customers to describe their desired travel experience in natural language, considering multiple constraints and preferences simultaneously. The solution integrates Gemini with Alaska Airlines' existing flight data and customer information, ensuring recommendations are grounded in actual available flights and pricing.

AI-Powered Natural Language Search for Vehicle Marketplace

Coches.net

Coches.net, Spain's leading vehicle marketplace, implemented an AI-powered natural language search system to replace traditional filter-based search. The team completed a 15-day sprint using Amazon Bedrock and Anthropic's Claude Haiku model to translate natural language queries like "family-friendly SUV for mountain trips" into structured search filters. The solution includes content moderation, few-shot prompting, and costs approximately €19 per day to operate. While user adoption remains limited, early results show that users utilizing the AI search generate more value compared to traditional search methods, demonstrating improved efficiency and user experience through automated filter application.

AI-Powered Network Operations Assistant with Multi-Agent RAG Architecture

Swisscom

Swisscom, Switzerland's leading telecommunications provider, developed a Network Assistant using Amazon Bedrock to address the challenge of network engineers spending over 10% of their time manually gathering and analyzing data from multiple sources. The solution implements a multi-agent RAG architecture with specialized agents for documentation management and calculations, combined with an ETL pipeline using AWS services. The system is projected to reduce routine data retrieval and analysis time by 10%, saving approximately 200 hours per engineer annually while maintaining strict data security and sovereignty requirements for the telecommunications sector.

AI-Powered Nutrition Guidance with Fine-Tuned Llama Models

Omada Health

Omada Health, a virtual healthcare provider, developed OmadaSpark, an AI-powered nutrition education feature that provides real-time motivational interviewing and personalized nutritional guidance to members in their chronic condition management programs. The solution uses a fine-tuned Llama 3.1 8B model deployed on Amazon SageMaker AI, trained on 1,000 question-answer pairs derived from internal care protocols and peer-reviewed medical literature. The implementation was completed in 4.5 months and resulted in members who used the tool being three times more likely to return to the Omada app, while reducing response times from days to seconds. The solution maintains strict HIPAA compliance and includes human-in-the-loop review by registered dietitians for quality assurance.

AI-Powered On-Call Assistant for Airflow Pipeline Debugging

Wix

Wix developed AirBot, an AI-powered Slack agent to address the operational burden of managing over 3,500 Apache Airflow pipelines processing 4 billion daily HTTP transactions across a 7 petabyte data lake. The traditional manual debugging process required engineers to act as "human error parsers," navigating multiple distributed systems (Airflow, Spark, Kubernetes) and spending approximately 45 minutes per incident to identify root causes. AirBot leverages LLMs (GPT-4o Mini and Claude 4.5 Opus) in a Chain of Thought architecture to automatically investigate failures, generate diagnostic reports, create pull requests with fixes, and route alerts to appropriate team owners. The system achieved measurable impact by saving approximately 675 engineering hours per month (equivalent to 4 full-time engineers), generating 180 candidate pull requests with a 15% fully automated fix rate, and reducing debugging time by at least 15 minutes per incident while maintaining cost efficiency at $0.30 per AI interaction.

AI-Powered Onboarding Agent for Small Business CRM

HoneyBook

HoneyBook, a CRM platform for small businesses and freelancers in the United States, implemented an AI agent to transform their user onboarding experience from a generic static flow into a personalized, conversational process. The onboarding agent uses RAG for knowledge retrieval, can generate real contracts and invoices tailored to user business types, and actively guides conversations toward three specific goals while managing conversation flow to prevent endless back-and-forth. The implementation on Temporal infrastructure with custom tool orchestration resulted in a 36% increase in trial-to-subscription conversion rates compared to the control group that experienced the traditional onboarding quiz.

AI-Powered Performance Optimization System for Go Code

Uber

Uber developed PerfInsights, a production system that combines runtime profiling data with generative AI to automatically detect performance antipatterns in Go services and recommend optimizations. The system addresses the challenge of expensive manual performance tuning by using LLMs to analyze the most CPU-intensive functions identified through profiling, applying sophisticated prompt engineering and validation techniques including LLM juries and rule-based checkers to reduce false positives from over 80% to the low teens. This has resulted in hundreds of merged optimization diffs, significant engineering time savings (93% reduction from 14.5 hours to 1 hour per issue), and measurable compute cost reductions across Uber's Go services.

AI-Powered Personal Health Coach Using Gemini Models

Fitbit

Fitbit developed an AI-powered personal health coach to address the fragmented and generic nature of traditional health and fitness guidance. Using Gemini models within a multi-agent framework, the system provides proactive, personalized, and adaptive coaching grounded in behavioral science and individual health metrics such as sleep and activity data. The solution employs a conversational agent for orchestration, a data science agent for numerical reasoning on physiological time series, and domain expert agents for specialized guidance. The system underwent extensive validation through the SHARP evaluation framework, involving over 1 million human annotations and 100k hours of expert evaluation across multiple health disciplines. The health coach entered public preview for eligible US-based Fitbit Premium users, providing personalized insights, goal setting, and adaptive plans to build sustainable health habits.

AI-Powered Personalized Sales Pitch Generation for CPG Loyalty Programs

Vxceed

Vxceed developed the Lighthouse Loyalty Selling Story platform to address the critical challenge faced by consumer packaged goods (CPG) companies in emerging economies: low uptake (below 30%) of trade promotion and loyalty programs despite 15-20% revenue investment. The solution uses Amazon Bedrock with a multi-agent AI architecture to generate personalized sales pitches at scale for field sales teams targeting millions of retail outlets. The implementation achieved 95% response accuracy, automated 90% of loyalty program queries, increased program enrollment by 5-15%, reduced enrollment processing time by 20%, and decreased support time requirements by 10%, delivering annual savings of 2 person-months per region in administrative overhead.

AI-Powered Personalized Year-in-Review Campaign at Scale

Canva

Canva launched DesignDNA, a year-in-review campaign in December 2024 to celebrate their community's design achievements. The campaign needed to create personalized, shareable experiences for millions of users while respecting privacy constraints. Canva leveraged generative AI to match users to design trends using keyword analysis, generate design personalities, and create over a million unique personalized poems across 9 locales. The solution combined template metadata analysis, prompt engineering, content generation at scale, and automated review processes to produce 95 million unique DesignDNA stories. Each story included personalized statistics, AI-generated poems, design personality profiles, and predicted emerging design trends, all dynamically assembled using URL parameters and tagged template elements.

AI-Powered PLC Code Generation for Industrial Automation

Wipro PARI

Wipro PARI, a global automation company, partnered with AWS and ShellKode to develop an AI-powered solution that transforms the manual process of generating Programmable Logic Controller (PLC) ladder text code from complex process requirements. Using Amazon Bedrock with Anthropic's Claude models, advanced prompt engineering techniques, and custom validation logic, the system reduces PLC code generation time from 3-4 days to approximately 10 minutes per requirement while achieving up to 85% code accuracy. The solution automates validation against IEC 61131-3 industry standards, handles complex state management and transition logic, and provides a user-friendly interface for industrial engineers, resulting in 5,000 work-hours saved across projects and enabling Wipro PARI to win key automotive clients.

AI-Powered Postmortem Analysis for Site Reliability Engineering

Zalando

Zalando developed an LLM-powered pipeline to analyze thousands of incident postmortems accumulated over two years, transforming them from static documents into actionable strategic insights. The traditional human-centric approach to postmortem analysis was unable to scale to the volume of incidents, requiring 15-20 minutes per document and making it impossible to identify systemic patterns across the organization. Their solution involved building a multi-stage LLM pipeline that summarizes, classifies, analyzes, and identifies patterns across incidents, with a particular focus on datastore technologies (Postgres, DynamoDB, ElastiCache, S3, and Elasticsearch). Despite challenges with hallucinations and surface attribution errors, the system reduced analysis time from days to hours, achieved 3x productivity gains, and uncovered critical investment opportunities such as automated change validation that prevented 25% of subsequent datastore incidents.

AI-Powered Product Description Generation for E-commerce Marketplaces

Handmade.com

Handmade.com, a hand-crafts marketplace with over 60,000 products, automated their product description generation process to address scalability challenges and improve SEO performance. The company implemented an end-to-end AI pipeline using Amazon Bedrock's Anthropic Claude 3.7 Sonnet for multimodal content generation, Amazon Titan Text Embeddings V2 for semantic search, and Amazon OpenSearch Service for vector storage. The solution employs Retrieval Augmented Generation (RAG) to enrich product descriptions by leveraging a curated dataset of 1 million handmade products, reducing manual processing time from 10 hours per week while improving content quality and search discoverability.

AI-Powered Real Estate Transaction Newsworthiness Detection System

The Globe and Mail

A collaboration between journalists and technologists from multiple news organizations (Hearst, Gannett, The Globe and Mail, and E24) developed an AI system to automatically detect newsworthy real estate transactions. The system combines anomaly detection, LLM-based analysis, and human feedback to identify significant property transactions, with a particular focus on celebrity involvement and price anomalies. Early results showed promise with few-shot prompting, and the system successfully identified several newsworthy transactions that might have otherwise been missed by traditional reporting methods.

AI-Powered Real-Time Content Moderation with Prevalence Measurement

Pinterest

Pinterest built a real-time AI-assisted system to measure the prevalence of policy-violating content—the percentage of daily views that went to harmful content—to address the limitations of relying solely on user reports. The company developed a workflow combining ML-assisted impression-weighted sampling with multimodal LLM labeling to process daily samples at scale. This approach reduced labeling turnaround time by 15x compared to human-only review while maintaining comparable decision quality, enabling continuous monitoring across multiple policy areas, faster intervention testing, and proactive risk detection that was previously impossible with infrequent manual studies.

AI-Powered Regression Testing with Natural Language Test Case Generation

Duolingo

Duolingo's QA team faced significant challenges with manual regression testing that consumed substantial bandwidth each week, requiring multiple team members several hours to validate releases against their highly iterative product with numerous A/B tests and feature variants. To address this, they partnered with MobileBoost in 2024 to implement GPT Driver, an AI-powered testing tool that accepts natural language instructions and executes them on virtual devices. By reframing test cases from prescriptive step-by-step instructions to goal-oriented prompts (e.g., "Progress through screens until you see XYZ"), they enabled the system to adapt to changing UIs and feature variations while maintaining test reliability. The solution reduced manual regression testing workflows by 70%, allowing QA team members to shift from hours of manual execution to minutes of reviewing recorded test runs, thereby freeing the team to focus on higher-value activities like bug fixes and new feature testing.

AI-Powered Revenue Operating System with Multi-Agent Orchestration

Rox

Rox built a revenue operating system to address the challenge of fragmented sales data across CRM, marketing automation, finance, support, and product usage systems that create silos and slow down sales teams. The solution uses Amazon Bedrock with Anthropic's Claude Sonnet 4 to power intelligent AI agent swarms that unify disparate data sources into a knowledge graph and execute multi-step GTM workflows including research, outreach, opportunity management, and proposal generation. Early customers reported 50% higher representative productivity, 20% faster sales velocity, 2x revenue per rep, 40-50% increase in average selling price, 90% reduction in prep time, and 50% faster ramp time for new reps.

AI-Powered Root Cause Analysis Assistant for Race Day Operations

Formula 1

Formula 1 developed an AI-driven root cause analysis assistant using Amazon Bedrock to streamline issue resolution during race events. The solution reduced troubleshooting time from weeks to minutes by enabling engineers to query system issues using natural language, automatically checking system health, and providing remediation recommendations. The implementation combines ETL pipelines, RAG, and agentic capabilities to process logs and interact with internal systems, resulting in an 86% reduction in end-to-end resolution time.

AI-Powered Sales Assistant for Go-To-Market Team Productivity

OpenAI

OpenAI's go-to-market team faced significant productivity challenges as it tripled in size within a year while launching new products weekly. Sales representatives spent excessive time (often an hour preparing for 30-minute calls) navigating disconnected systems to gather context, while product questions overwhelmed subject matter experts. To address this, OpenAI built GTM Assistant, a Slack-based AI system using their automation platform that provides daily meeting briefs with comprehensive account history, automated recaps, and instant product Q&A with traceable sources. The solution resulted in sales reps exchanging an average of 22 messages weekly with the assistant and achieving a 20% productivity lift (approximately one extra day per week), while also piloting autonomous capabilities like CRM logging and proactive usage pattern detection.

AI-Powered Sales Intelligence and Go-to-Market Orchestration Platform

Clay

Clay is a creative sales and marketing platform that helps companies execute go-to-market strategies by turning unstructured data about companies and people into actionable insights. The platform addresses the challenge of finding unique competitive advantages in sales ("go-to-market alpha") by integrating with over 150 data providers and using LLM-powered agents to research prospects, enrich data, and automate outreach. Their flagship agent "Claygent" performs web research to extract custom data points that aren't available in traditional sales databases, while their newer "Navigator" agent can interact with web forms and complex websites. Clay has achieved significant scale, crossing one billion agent runs and targeting two billion runs annually, while maintaining a philosophy that data will be imperfect and building tools for rapid iteration, validation, and trust-building through features like session replay.

AI-Powered Security Operations Center with Agentic AI for Threat Detection and Response

Trellix

Trellix, in partnership with AWS, developed an AI-powered Security Operations Center (SOC) using agentic AI to address the challenge of overwhelming security alerts that human analysts cannot effectively process. The solution leverages AWS Bedrock with multiple models (Amazon Nova for classification, Claude Sonnet for analysis) to automatically investigate security alerts, correlate data across multiple sources, and provide detailed threat assessments. The system uses a multi-agent architecture where AI agents autonomously select tools, gather context from various security platforms, and generate comprehensive incident reports, significantly reducing the burden on human analysts while improving threat detection accuracy.

AI-Powered Self-Remediation Loop for Large-Scale Kubernetes Operations

Salesforce

Salesforce's Hyperforce Kubernetes platform team manages over 1,400 clusters scaling millions of pods, facing significant operational challenges with engineers spending over 1,000 hours monthly on support tasks. They developed a multi-agent AI-powered self-remediation loop built on AWS Bedrock's multi-agent collaboration framework, integrating with their existing monitoring and automation tools (Prometheus, K8sGPT, Argo CD, and custom tools like Sloop and Periscope). The solution features a manager AI agent that orchestrates multiple specialized worker agents to retrieve telemetry data, perform root cause analysis using RAG-augmented runbooks, and execute safe remediation actions with human-in-the-loop approval via Slack. The implementation achieved a 30% improvement in troubleshooting time and saved approximately 150 hours per month in operational toil, with plans to expand capabilities using knowledge graphs and advanced anomaly detection.

AI-Powered Semantic Job Search at Scale

Linkedin

LinkedIn transformed their traditional keyword-based job search into an AI-powered semantic search system to serve 1.2 billion members. The company addressed limitations of exact keyword matching by implementing a multi-stage LLM architecture combining retrieval and ranking models, supported by synthetic data generation, GPU-optimized embedding-based retrieval, and cross-encoder ranking models. The solution enables natural language job queries like "Find software engineer jobs that are mostly remote with above median pay" while maintaining low latency and high relevance at massive scale through techniques like model distillation, KV caching, and exhaustive GPU-based nearest neighbor search.

AI-Powered Shift-Left Testing Platform with Multiple LLM Agents

QyrusAI

QyrusAI developed a comprehensive shift-left testing platform that integrates multiple AI agents powered by Amazon Bedrock's foundation models. The solution addresses the challenge of maintaining quality while accelerating development cycles by implementing AI-driven testing throughout the software development lifecycle. Their implementation resulted in an 80% reduction in defect leakage, 20% reduction in UAT effort, and 36% faster time to market.

AI-Powered Slack Conversation Summarization System

Salesforce

Salesforce AI Research developed AI Summarist, a conversational AI-powered tool to address information overload in Slack workspaces. The system uses state-of-the-art AI to automatically summarize conversations, channels, and threads, helping users manage their information consumption based on work preferences. The solution processes messages through Slack's API, disentangles conversations, and generates concise summaries while maintaining data privacy by not storing any summarized content.

AI-Powered Sleep Coach for CBTI Protocol Delivery

Rest

Rest, a company that evolved from developing a podcast player app, built an AI sleep coach to help people solve chronic sleep problems through an 8-week protocol based on Cognitive Behavioral Therapy for Insomnia (CBTI). The problem they identified was that while CBTI is clinically proven to be effective for 80% of people with insomnia, it typically costs thousands of dollars, requires specialized practitioners who have year-long waitlists, and isn't accessible to most people. Rest's solution uses voice-first AI agents powered by OpenAI's GPT-4 and integrated with Vapi for voice capabilities, creating daily check-ins where the AI coaches users through the CBTI protocol with personalized guidance based on their sleep logs, behavioral patterns, and personal context stored in a custom memory system. The product evolved iteratively from a text-based chatbot to a sophisticated voice agent with RAG for knowledge retrieval, dynamic agenda generation tailored to each user's program stage and recent sleep data, and multi-layered memory systems that track user context over time. The company now logs hundreds of hours of voice conversations monthly with users preferring voice interactions for the intimacy and ease it provides in discussing sleep challenges.

AI-Powered SNAP Benefits Notice Interpretation System

Propel

Propel developed an AI system to help SNAP (food stamp) recipients better understand official notices they receive. The system uses LLMs to analyze notice content and provide clear explanations of importance and required actions. The prototype successfully interprets complex government communications and provides simplified, actionable guidance while maintaining high safety standards for this sensitive use case.

AI-Powered Social Intelligence for Life Sciences

Indegene

Indegene developed an AI-powered social intelligence solution to help pharmaceutical companies extract insights from digital healthcare conversations on social media. The solution addresses the challenge that 52% of healthcare professionals now prefer receiving medical content through social channels, while the life sciences industry struggles with analyzing complex medical discussions at scale. Using Amazon Bedrock, SageMaker, and other AWS services, the platform provides healthcare-focused analytics including HCP identification, sentiment analysis, brand monitoring, and adverse event detection. The layered architecture delivers measurable improvements in time-to-insight generation and operational cost savings while maintaining regulatory compliance.

AI-Powered SRE Agent for Production Infrastructure Management

Cleric AI

Cleric Ai addresses the growing complexity of production infrastructure management by developing an AI-powered agent that acts as a team member for SRE and DevOps teams. The system autonomously monitors infrastructure, investigates issues, and provides confident diagnoses through a reasoning engine that leverages existing observability tools and maintains a knowledge graph of infrastructure relationships. The solution aims to reduce engineer workload by automating investigation workflows and providing clear, actionable insights.

AI-Powered Supply Chain Visibility and ETA Prediction System

Toyota / IBM

Toyota partnered with IBM and AWS to develop an AI-powered supply chain visibility platform that addresses the automotive industry's challenges with delivery prediction accuracy and customer transparency. The system uses machine learning models (XGBoost, AdaBoost, random forest) for time series forecasting and regression to predict estimated time of arrival (ETA) for vehicles throughout their journey from manufacturing to dealer delivery. The solution integrates real-time event streaming, feature engineering with Amazon SageMaker, and batch inference every four hours to provide near real-time predictions. Additionally, the team implemented an agentic AI chatbot using AWS Bedrock to enable natural language queries about vehicle status. The platform provides customers and dealers with visibility into vehicle journeys through a "pizza tracker" style interface, improving customer satisfaction and enabling proactive delay management.

AI-Powered Teacher Assistant for Core Curriculum Alignment in K-5 Education

eSpark

eSpark, an adaptive learning platform for K-5 students, developed an LLM-powered teacher assistant to address a critical post-COVID challenge: school administrators were emphasizing expensive core curricula investments while relegating supplemental programs like eSpark to secondary status. The team built a RAG-based recommendation system that matches eSpark's 15 years of curated content with hundreds of different core curricula, enabling teachers to seamlessly integrate eSpark activities with their mandated lesson plans. Through continuous teacher interviews and iterative development, they evolved from a conversational chatbot interface (which teachers found overwhelming) to a streamlined dropdown-based system with AI-generated follow-up questions. The solution leverages embeddings databases, tool-calling agents, and a sophisticated eval framework using Brain Trust for testing across hundreds of curricula, ultimately helping teachers work more efficiently while keeping eSpark relevant in a changing educational landscape.

AI-Powered Technical Help Desk for Energy Utility Field Operations

Infosys Topaz

A large energy supplier faced challenges with technical help desk operations supporting 5,000 weekly calls from meter technicians in the field, with average handling times exceeding 5 minutes for the top 10 issue categories representing 60% of calls. Infosys Topaz partnered with AWS to build a generative AI solution using Amazon Bedrock's Claude Sonnet model to create a knowledge base from call transcripts, implement retrieval-augmented generation (RAG), and deploy an AI assistant with role-based access control. The solution reduced average handling time by 60% (from over 5 minutes to under 2 minutes), enabled the AI assistant to handle 70% of previously human-managed calls, and increased customer satisfaction scores by 30%.

AI-Powered Text Message-Based Healthcare Treatment Management System

Stride

Stride developed an AI-powered text message-based healthcare treatment management system for Aila Science to assist patients through self-administered telemedicine regimens, particularly for early pregnancy loss treatment. The system replaced manual human operators with LLM-powered agents that can interpret patient responses, provide medically-approved guidance, schedule messages, and escalate complex situations to human reviewers. The solution achieved approximately 10x capacity improvement while maintaining treatment quality and safety through a hybrid human-in-the-loop approach.

AI-Powered Tour Guide for Financial Platform Navigation

Ramp

Ramp developed an AI-powered Tour Guide agent to help users navigate their financial operations platform more effectively. The solution guides users through complex tasks by taking control of cursor movements while providing step-by-step explanations. Using an iterative action-taking approach and optimized prompt engineering, the Tour Guide increases user productivity and platform accessibility while maintaining user trust through transparent human-agent collaboration.

AI-Powered Trade Assistant for Equities Trading Workflows

Jefferies Equities

Jefferies Equities, a full-service investment bank, developed an AI Trade Assistant on Amazon Bedrock to address challenges faced by their front-office traders who struggled to access and analyze millions of daily trades stored across multiple fragmented data sources. The solution leverages LLMs (specifically Amazon Titan embeddings model) to enable traders to query trading data using natural language, automatically generating SQL queries and visualizations through a conversational interface integrated into their existing business intelligence platform. In a beta rollout to 50 users across sales and trading operations, the system delivered an 80% reduction in time spent on routine analytical tasks, high adoption rates, and reduced technical burden on IT teams while democratizing data access across trading desks.

AI-Powered Transformation of AWS Support for Mission-Critical Workloads

Whoop

AWS Support transformed from a reactive firefighting model to a proactive AI-augmented support system to handle the increasing complexity of cloud operations. The transformation involved building autonomous agents, context-aware systems, and structured workflows powered by Amazon Bedrock and Connect to provide faster incident response and proactive guidance. WHOOP, a health wearables company, utilized AWS's new Unified Operations offering to successfully launch two new hardware products with 10x mobile traffic and 200x e-commerce traffic scaling, achieving 100% availability in May 2025 and reducing critical case response times from 8 minutes to under 2.5 minutes, ultimately improving quarterly availability from 99.85% to 99.95%.

AI-Powered Transportation Planning and Safety Countermeasure Visualization

INRIX

INRIX partnered with AWS to develop an AI-powered solution that accelerates transportation planning by combining their 50 petabyte data lake with Amazon Bedrock's generative AI capabilities. The solution addresses the challenge of processing vast amounts of transportation data to identify high-risk locations for vulnerable road users and automatically generate safety countermeasures. By leveraging Amazon Nova Canvas for image visualization and RAG-powered natural language queries, the system transforms traditional manual processes that took weeks into automated workflows that can be completed in days, enabling faster deployment of safety measures while maintaining compliance with local regulations.

AI-Powered Travel Assistant for Rail and Coach Platform

Trainline

Trainline, the world's leading rail and coach ticketing platform serving 27 million customers across 40 countries, developed an AI-powered travel assistant to address underserved customer needs during the travel experience. The company identified that while they excelled at selling tickets, customers lacked support during their journeys when disruptions occurred or they had questions about their travel. They built an agentic AI system using LLMs that could answer diverse customer questions ranging from refund requests to real-time train information to unusual queries like bringing pets or motorbikes on trains. The solution went from concept to production in five months, launching in February 2025, and now handles over 300,000 conversations monthly. The system uses a central orchestrator with multiple tools including RAG with 700,000 pages of curated content, real-time train data APIs, terms and conditions lookups, and automated refund capabilities, all protected by multiple layers of guardrails to ensure safety and factual accuracy.

AI-Powered Travel Assistant for Trip Planning and Personalization

Expedia

Expedia Group launched Romie, an AI-powered travel assistant designed to simplify group trip planning and provide personalized travel experiences. The problem addressed is the complexity of coordinating travel plans among multiple people with different preferences, along with the challenge of managing itineraries and responding to travel disruptions. Romie integrates with SMS group chats, email, and the Expedia app to assist with destination recommendations, smart search based on group preferences, itinerary building, and real-time updates for disruptions. The solution was released in alpha through EG Labs in May 2024, alongside 40+ new AI-powered features including destination comparison, guest review summaries, air price comparison, and an enhanced help center. The assistant is designed to be progressively intelligent, learning user preferences over time while remaining assistive rather than intrusive.

AI-Powered Vehicle Information Platform for Dealership Sales Support

Toyota

Toyota Motor North America (TMNA) and Toyota Connected built a generative AI platform to help dealership sales staff and customers access accurate vehicle information in real-time. The problem was that customers often arrived at dealerships highly informed from internet research, while sales staff lacked quick access to detailed vehicle specifications, trim options, and pricing. The solution evolved from a custom RAG-based system (v1) using Amazon Bedrock, SageMaker, and OpenSearch to retrieve information from official Toyota data sources, to a planned agentic platform (v2) using Amazon Bedrock AgentCore with Strands agents and MCP servers. The v1 system achieved over 7,000 interactions per month across Toyota's dealer network, with citation-backed responses and legal compliance built in, while v2 aims to enable more dynamic actions like checking local vehicle availability.

AI-Powered Video Workflow Orchestration Platform for Broadcasting

Cires21

Cires21, a Spanish live streaming services company, developed MediaCoPilot to address the fragmented ecosystem of applications used by broadcasters, which resulted in slow content delivery, high costs, and duplicated work. The solution is a unified serverless platform on AWS that integrates custom AI models for video and audio processing (ASR, diarization, scene detection) with Amazon Bedrock for generating complex metadata like subtitles, highlights, and summaries. The platform uses AWS Step Functions for orchestration, exposes capabilities via API for integration into client workflows, and recently added AI agents powered by AWS Agent Core that can handle complex multi-step tasks like finding viral moments, creating social media clips, and auto-generating captions. The architecture delivers faster time-to-market, improved scalability, and automated content workflows for broadcast clients.

AI-Powered Voice Agents for Proactive Hotel Payment Verification

Perk

Perk, a business travel management platform, faced a critical problem where virtual credit cards sent to hotels sometimes weren't charged before guest arrival, leading to catastrophic check-in experiences for exhausted travelers. To prevent this, their customer care team was making approximately 10,000 proactive phone calls per week to hotels. The team built an AI voice agent system that autonomously calls hotels to verify and request payment processing. Starting with a rapid prototype using Make.com, they iterated through extensive prompt engineering, call structure refinement, and comprehensive evaluation frameworks. The solution now successfully handles tens of thousands of calls weekly across multiple languages (English, German), matching or exceeding human performance while dramatically reducing manual workload and uncovering additional operational insights through systematic call classification.

Architecture and Production Patterns of Autonomous Coding Agents

Anthropic

This talk explores the architecture and production implementation patterns behind modern autonomous coding agents like Claude Code, Cursor, and others, presented by Jared from Prompt Layer. The speaker examines why coding agents have recently become effective, arguing that the key innovation is a simple while-loop architecture with tool calling, combined with improved models, rather than complex DAGs or RAG systems. The presentation covers implementation details including tool design (particularly bash as the universal adapter), context management strategies, sandboxing approaches, and evaluation methodologies. The speaker's company, Prompt Layer, has reorganized their engineering practices around Claude Code, establishing a rule that any task completable in under an hour using the agent should be done immediately, demonstrating practical production adoption and measurable productivity gains.

Architecture Patterns for Production AI Systems: Lessons from Building and Failing with Generative AI Products

Outropy

Phil Calçado shares a post-mortem analysis of Outropy, a failed AI productivity startup that served thousands of users, revealing why most AI products struggle in production. Despite having superior technology compared to competitors like Salesforce's Slack AI, Outropy failed commercially but provided valuable insights into building production AI systems. Calçado argues that successful AI products require treating agents as objects and workflows as data pipelines, applying traditional software engineering principles rather than falling into "Twitter-driven development" or purely data science approaches.

AskNu: RAG-Based Employee Knowledge Management System

Nubank

Nubank developed AskNu, an AI-powered Slack integration to help its 9,000 employees quickly access internal documentation across multiple Confluence spaces. The solution uses a Retrieval Augmented Generation (RAG) framework with a two-stage process: first routing queries to the appropriate department using dynamic few-shot classification, then generating personalized answers from relevant documentation. After six months of deployment, the system achieved 5,000 active users, processed 280,000 messages, received 80% positive feedback, reduced support tickets by 96%, and decreased information retrieval time from 30 minutes (or up to 8 hours with tickets) down to 9 seconds.

Augmented Unit Test Generation Using LLMs

Adyen

Adyen, a global payments platform company, explored the integration of large language models to enhance their code quality practices by automating and augmenting unit test generation. The company investigated how LLMs could assist developers in creating comprehensive test coverage more efficiently, addressing the challenge of maintaining high code quality standards while managing the time investment required for writing thorough unit tests. Through this venture, Adyen aimed to leverage AI capabilities to generate contextually appropriate test cases that could complement human-written tests, potentially accelerating development cycles while maintaining or improving test coverage and code reliability.

Auto-Moderating Car Dealer Reviews with GenAI

Edmunds

Edmunds transformed their dealer review moderation process from a manual system taking up to 72 hours to an automated GenAI solution using GPT-4 through Databricks Model Serving. The solution processes over 300 daily dealer quality-of-service reviews, reducing moderation time from days to minutes and requiring only two moderators instead of a larger team. The implementation included careful prompt engineering and integration with Databricks Unity Catalog for improved data governance.

Automated Carrier Claims Management Using AI Agents

FIEGE

FIEGE, a major German logistics provider, implemented an AI agent system to handle carrier claims processing end-to-end, launched in September 2024. The system automatically processes claims from initial email receipt through resolution, handling multiple languages and document types. By implementing a controlled approach with sandboxed generative AI and templated responses, the system successfully processes 70-90% of claims automatically, resulting in eight-digit cost savings while maintaining high accuracy and reliability.

Automated Clinical Document Generation Platform for Pharmaceutical R&D

AbbVie

AbbVie developed Gaia, a generative AI platform to automate the creation of clinical and regulatory documents in their R&D organization. The platform addresses the challenge of producing hundreds of complex, regulated documents required throughout the clinical trial lifecycle, from study startup through regulatory submissions. By the end of 2024, Gaia automated 26 document types, saving 20,000 hours annually, with plans to scale to over 350 document types by 2030, targeting 115,000+ hours in annual savings. The platform uses a modular "Lego block" approach with reusable components, integrates with over 90 data sources, employs AWS Bedrock for LLM access, and implements human-in-the-loop workflows to maintain quality standards while being "GXP-ready" for future validation in life sciences regulatory environments.

Automated Code Reviews with LLMs

Faire

Faire, an e-commerce marketplace connecting retailers with brands, implemented an LLM-powered automated code review pipeline to enhance developer productivity by handling generic code review tasks. The solution leverages OpenAI's Assistants API through an internal orchestrator service called Fairey, which uses RAG (Retrieval Augmented Generation) to fetch context-specific information about pull requests including diffs, test coverage reports, and build logs. The system performs various automated reviews such as enforcing style guides, assessing PR descriptions, diagnosing build failures with auto-fix suggestions, recommending test coverage improvements, and detecting backward-incompatible changes. Early results demonstrated success with positive user satisfaction and high accuracy, freeing up engineering talent to focus on more complex review aspects like architecture decisions and long-term maintainability.

Automated Contract Processing and Rights Analysis Using Multi-Model LLM Pipeline

Condé Nast

Condé Nast, a global media company managing complex contracts across multiple brands and geographies, faced significant operational bottlenecks due to manual contract review processes that were time-consuming, error-prone, and led to missed revenue opportunities. AWS developed an automated solution using Amazon Bedrock with Anthropic's Claude 3.7 Sonnet to process contracts through a multi-stage pipeline: converting PDFs to text using visual reasoning capabilities, extracting metadata fields through structured prompting, comparing contracts to existing templates using a knowledge base with RAG, and clustering low-similarity contracts to identify new template patterns. The solution reduced processing time from weeks to hours, improved accuracy in rights management, enabled better scalability during high-volume periods, and transformed how subject matter experts could drive AI application development through prompt engineering rather than traditional software development cycles.

Automated Data Journalism Platform Using LLMs for Real-time News Generation

Realtime

Realtime built an automated data journalism platform that uses LLMs to generate news stories from continuously updated datasets and news articles. The system processes raw data sources, performs statistical analysis, and employs GPT-4 Turbo to generate contextual summaries and headlines. The platform successfully automates routine data journalism tasks while maintaining transparency about AI usage and implementing safeguards against common LLM pitfalls.

Automated Email Triage System Using Amazon Bedrock Flows

Parameta

Parameta Solutions, a financial data services provider, transformed their client email processing system from a manual workflow to an automated solution using Amazon Bedrock Flows. The system intelligently processes technical support queries by classifying emails, extracting relevant entities, validating information, and generating appropriate responses. This transformation reduced resolution times from weeks to days while maintaining high accuracy and operational control, achieved within a two-week implementation period.

Automated ESG Reporting with Agentic AI for Enterprise Sustainability Compliance

Gardenia Technologies

Gardenia Technologies partnered with AWS to develop Report GenAI, an automated ESG reporting solution that helps organizations reduce sustainability reporting time by up to 75%. The system uses agentic AI on Amazon Bedrock to automatically pre-fill ESG disclosure reports by integrating data from corporate databases, document stores, and web searches, while maintaining human oversight for validation and refinement. Omni Helicopters International successfully reduced their CDP reporting time from one month to one week using this solution.

Automated Evaluation Framework for LLM-Powered Features

Slack

Slack's machine learning team developed a comprehensive evaluation framework for their LLM-powered features, including message summarization and natural language search. They implemented a three-tiered evaluation approach using golden sets, validation sets, and A/B testing, combined with automated quality metrics to assess various aspects like hallucination detection and system integration. This framework enabled rapid prototyping and continuous improvement of their generative AI products while maintaining quality standards.

Automated GPU Kernel Generation Using LLMs and Inference-Time Scaling

NVIDIA

NVIDIA engineers developed a novel approach to automatically generate optimized GPU attention kernels using the DeepSeek-R1 language model combined with inference-time scaling. They implemented a closed-loop system where the model generates code that is verified and refined through multiple iterations, achieving 100% accuracy for Level-1 problems and 96% for Level-2 problems in Stanford's KernelBench benchmark. This approach demonstrates how additional compute resources during inference can improve code generation capabilities of LLMs.

Automated Image Generation for E-commerce Categories Using Multimodal LLMs

Ebay

eBay developed an automated image generation system to replace manual curation of category and theme images across thousands of categories. The system leverages multimodal LLMs to process item data, simplify titles, generate image prompts, and create category-representative images through text-to-image models. A novel automated evaluation framework uses a rubric-based approach to assess image quality across fidelity, clarity, and style adherence, with an iterative refinement loop that regenerates images until quality thresholds are met. Human evaluation showed 88% of automatically generated and approved images were suitable for production use, demonstrating the system's ability to scale visual content creation while maintaining brand standards and reducing manual effort.

Automated Inventory Counting with Multimodal LLMs in Grocery Fulfillment

Picnic

Picnic, an online grocery delivery company, implemented a multimodal LLM-based computer vision system to automate inventory counting in their automated warehouse. The manual stock counting process was time-consuming at scale, and traditional approaches like weighing scales proved unreliable due to measurement variance. The solution involved deploying camera setups to capture high-quality images of grocery totes, using Google Gemini's multimodal models with carefully crafted prompts and supply chain reference images to count products. Through fine-tuning, they achieved performance comparable to expensive pro-tier models using cost-effective flash models, deployed via a Fast API service with LiteLLM as a proxy layer for model interchangeability, and implemented continuous validation through selective manual checks.

Automated Knowledge Base Enhancement Using LLMs and Clustering for Customer Support

Doordash

DoorDash developed an automated system to enhance their support chatbot's knowledge base by identifying content gaps through clustering analysis of escalated customer conversations and using LLMs to generate draft articles from user-generated content. The system uses semantic clustering to identify high-impact knowledge gaps, classifies issues as actionable problems or informational queries, and automatically generates polished knowledge base articles that are then reviewed by human specialists before deployment through a RAG-based retrieval system. The implementation resulted in significant improvements, with escalation rates dropping from 78% to 43% for high-traffic clusters, while maintaining human oversight for quality control and edge case handling.

Automated LLM Evaluation and Quality Monitoring in Customer Support Analytics

Echo AI

Echo AI, leveraging Log10's platform, developed a system for analyzing customer support interactions at scale using LLMs. They faced the challenge of maintaining accuracy and trust while processing high volumes of customer conversations. The solution combined Echo AI's conversation analysis capabilities with Log10's automated feedback and evaluation system, resulting in a 20-point F1 score improvement in accuracy and the ability to automatically evaluate LLM outputs across various customer-specific use cases.

Automated LLM Evaluation Framework for Customer Support Chatbots

Instacart

Instacart developed the LLM-Assisted Chatbot Evaluation (LACE) framework to systematically evaluate their AI-powered customer support chatbot performance at scale. The company faced challenges in measuring chatbot effectiveness beyond traditional metrics, needing a system that could assess nuanced aspects like query understanding, answer correctness, and customer satisfaction. LACE employs three LLM-based evaluation methods (direct prompting, agentic reflection, and agentic debate) across five key dimensions with binary scoring criteria, validated against human judgment through iterative refinement. The framework enables continuous monitoring and improvement of chatbot interactions, successfully identifying issues like context maintenance failures and inefficient responses that directly impact customer experience.

Automated LLM Pipeline Optimization with DSPy for Multi-Stage Agent Development

JetBlue

JetBlue faced challenges in manually tuning prompts across complex, multi-stage LLM pipelines for applications like customer feedback classification and RAG-powered predictive maintenance chatbots. The airline adopted DSPy, a framework for building self-optimizing LLM pipelines, integrated with Databricks infrastructure including Model Serving and Vector Search. By leveraging DSPy's automatic optimization capabilities and modular architecture, JetBlue achieved 2x faster RAG chatbot deployment compared to their previous Langchain implementation, eliminated manual prompt engineering, and enabled automatic optimization of pipeline quality metrics using LLM-as-a-judge evaluations, resulting in more reliable and efficient LLM applications at scale.

Automated Log Classification System for Device Security Infrastructure

Palo Alto Networks

Palo Alto Networks' Device Security team faced challenges with reactively processing over 200 million daily service and application log entries, resulting in delayed response times to critical production issues. In partnership with AWS Generative AI Innovation Center, they developed an automated log classification pipeline powered by Amazon Bedrock using Anthropic's Claude Haiku model and Amazon Titan Text Embeddings. The solution achieved 95% precision in detecting production issues while reducing incident response times by 83%, transforming reactive log monitoring into proactive issue detection through intelligent caching, context-aware classification, and dynamic few-shot learning.

Automated Medical Literature Review System Using Domain-Specific LLMs

John Snow Labs

John Snow Labs developed a medical chatbot system that automates the traditionally time-consuming process of medical literature review. The solution combines proprietary medical-domain-tuned LLMs with a comprehensive medical research knowledge base, enabling researchers to analyze hundreds of papers in minutes instead of weeks or months. The system includes features for custom knowledge base integration, intelligent data extraction, and automated filtering based on user-defined criteria, while maintaining explainability and citation tracking.

Automated Performance Optimization with GenAI-Powered Code Analysis

Uber

Uber developed PerfInsights to address unsustainable compute costs from inefficient Go services, where traditionally manual performance optimization required deep expertise and days or weeks of effort. The system combines runtime CPU/memory profiling with GenAI-powered static analysis to automatically detect performance antipatterns in Go code, using LLM juries and rule-based validation (LLMCheck) to reduce hallucinations and false positives from over 80% to the low teens. Since deployment, PerfInsights has generated hundreds of merged optimization diffs, reduced antipattern detection time by 93% (from 14.5 hours to under 1 hour per issue), eliminated approximately 3,800 hours of manual engineering effort annually, and achieved a 33.5% reduction in codebase antipatterns over four months while delivering measurable compute cost savings.

Automated Product Attribute Extraction and Title Standardization Using Agentic AI

Delivery Hero

Delivery Hero Quick Commerce faced significant challenges managing vast product catalogs across multiple platforms and regions, where manual verification of product attributes was time-consuming, costly, and error-prone. They implemented an agentic AI system using Large Language Models to automatically extract 22 predefined product attributes from vendor-provided titles and images, then generate standardized product titles conforming to their format. Using a predefined agent architecture with two sequential LLM components, optimized through prompt engineering, Teacher/Student knowledge distillation for the title generation step, and confidence scoring for quality control, the system achieved significant improvements in efficiency, accuracy, data quality, and customer satisfaction while maintaining cost-effectiveness and predictability.

Automated Product Classification and Attribute Extraction Using Vision LLMs

Shopify

Shopify tackled the challenge of automatically understanding and categorizing millions of products across their platform by implementing a multi-step Vision LLM solution. The system extracts structured product information including categories and attributes from product images and descriptions, enabling better search, tax calculation, and recommendations. Through careful fine-tuning, evaluation, and cost optimization, they scaled the solution to handle tens of millions of predictions daily while maintaining high accuracy and managing hallucinations.

Automated Prompt Optimization for Intelligent Text Processing using Amazon Bedrock

Yuewen Group

Yuewen Group, a global online literature platform, transitioned from traditional NLP models to Claude 3.5 Sonnet on Amazon Bedrock for intelligent text processing. Initially facing challenges with unoptimized prompts performing worse than traditional models, they implemented Amazon Bedrock's Prompt Optimization feature to automatically enhance their prompts. This led to significant improvements in accuracy for tasks like character dialogue attribution, achieving 90% accuracy compared to the previous 70% with unoptimized prompts and 80% with traditional NLP models.

Automated Search Quality Evaluation Using LLMs for Typeahead Suggestions

LinkedIn

LinkedIn developed an automated evaluation system using GPT models served through Azure to assess the quality of their typeahead search suggestions at scale. The system replaced manual human evaluation with automated LLM-based assessment, using carefully engineered prompts and a golden test set. The implementation resulted in faster evaluation cycles (hours instead of weeks) and demonstrated significant improvements in suggestion quality, with one experiment showing a 6.8% absolute improvement in typeahead quality scores.

Automated Sign Language Translation Using Large Language Models

VSL Labs

VSL Labs is developing an automated system for translating English into American Sign Language (ASL) using generative AI models. The solution addresses the significant challenges faced by the deaf community, including limited availability and high costs of human interpreters. Their platform uses a combination of in-house and GPT-4 models to handle text processing, cultural adaptation, and generates precise signing instructions including facial expressions and body movements for realistic avatar-based sign language interpretation.

Automated Software Development Insights and Communication Platform

Blueprint AI

Blueprint AI addresses the challenge of communication and understanding between business and technical teams in software development by leveraging LLMs. The platform automatically analyzes data from various sources like GitHub and Jira, creating intelligent reports that surface relevant insights, track progress, and identify potential blockers. The system provides 24/7 monitoring and context-aware updates, helping teams stay informed about development progress without manual reporting overhead.

Automated Sports Commentary Generation using LLMs

WSC Sport

WSC Sport developed an automated system to generate real-time sports commentary and recaps using LLMs. The system takes game events data and creates coherent, engaging narratives that can be automatically translated into multiple languages and delivered with synthesized voice commentary. The solution reduced production time from 3-4 hours to 1-2 minutes while maintaining high quality and accuracy.

Automated Synopsis Generation Pipeline with Human-in-the-Loop Quality Control

Netflix

Netflix developed an automated pipeline for generating show and movie synopses using LLMs, replacing a highly manual context-gathering process. The system uses Metaflow to orchestrate LLM-based content summarization and synopsis generation, with multiple human feedback loops and automated quality control checks. While maintaining human writers and editors in the process, the system has significantly improved efficiency and enabled the creation of more synopses per title while maintaining quality standards.

Automated Unit Test Improvement Using LLMs for Android Applications

Meta

Meta developed TestGen-LLM, a tool that leverages large language models to automatically improve unit test coverage for Android applications written in Kotlin. The system uses an Assured Offline LLM-Based Software Engineering approach to generate additional test cases while maintaining strict quality controls. When deployed at Meta, particularly for Instagram and Facebook platforms, the tool successfully enhanced 10% of the targeted classes with reliable test improvements that were accepted by engineers for production use.

Automating AWS Well-Architected Reviews at Scale with GenAI

CommBank

Commonwealth Bank of Australia (CommBank) faced challenges conducting AWS Well-Architected Reviews across their workloads at scale due to the time-intensive nature of traditional reviews, which typically required 3-4 hours and 10-15 subject matter experts. To address this, CommBank partnered with AWS to develop a GenAI-powered solution called the "Well-Architected Infrastructure Analyzer" that automates the review process. The solution leverages AWS Bedrock to analyze CloudFormation templates, Terraform files, and architecture diagrams alongside organizational documentation to automatically map resources against Well-Architected best practices and generate comprehensive reports with recommendations. This automation enables CommBank to conduct reviews across all workloads rather than just the most critical ones, significantly reducing the time and expertise required while maintaining quality and enabling continuous architecture improvement throughout the workload lifecycle.

Automating Code Sample Updates with LLMs for Technical Documentation

Wix

When Wix needed to update over 2,000 code samples in their API reference documentation due to a syntax change, they implemented an LLM-based automation solution instead of manual updates. The team used GPT-4 for code classification and GPT-3.5 Turbo for code conversion, combined with TypeScript compilation for validation. This automated approach reduced what would have been weeks of manual work to a single morning of team involvement, while maintaining high accuracy in the code transformations.

Automating Community Conference Operations with AI Coding Agents

PyCon

A volunteer-run conference organization (PyData/PyConDE) with events serving up to 1,500 attendees faced significant operational overhead in managing tickets, marketing, video production, and community engagement. Over a three-month period, the team experimented with various AI coding agents (Claude, Gemini, Qwen Coder Plus, Codex) to automate tasks including LinkedIn scraping for social media content, automated video cutting using computer vision, ticket management integration, and multi-step workflow automation. The results were mixed: while AI agents proved valuable for well-documented API integration, boilerplate code generation, and specific automation tasks like screenshot capture and video processing, they struggled with multi-step procedural workflows, data normalization, and maintaining code quality without close human oversight. The team concluded that AI agents work best when kept on a "short leash" with narrow use cases, frequent commits, and human validation, delivering time savings for generalist tasks but requiring careful expectation management and not delivering the "10x productivity" improvements often claimed.

Automating Enterprise Workflows with Foundation Models in Healthcare

Various

The researchers present aCLAr (Demonstrate, Execute, Validate framework), a system that uses multimodal foundation models to automate enterprise workflows, particularly in healthcare settings. The system addresses limitations of traditional RPA by enabling passive learning from demonstrations, human-like UI navigation, and self-monitoring capabilities. They successfully demonstrated the system automating a real healthcare workflow in Epic EHR, showing how foundation models can be leveraged for complex enterprise automation without requiring API integration.

Automating Healthcare Documentation and Rule Management with GenAI

Orizon

Orizon, a healthcare tech company, faced challenges with manual code documentation and rule interpretation for their medical billing fraud detection system. They implemented a GenAI solution using Databricks' platform to automate code documentation and rule interpretation, resulting in 63% of tasks being automated and reducing documentation time to under 5 minutes. The solution included fine-tuned Llama2-code and DBRX models deployed through Mosaic AI Model Serving, with strict governance and security measures for protecting sensitive healthcare data.

Automating Healthcare Procedure Code Selection Through Domain-Specific LLM Platform

Hasura / PromptQL

A large public healthcare company specializing in radiology software deployed an AI-powered automation solution to streamline the complex process of procedure code selection during patient appointment scheduling. The traditional manual process took 12-15 minutes per call, requiring operators to navigate complex UIs and select from hundreds of procedure codes that varied by clinic, regulations, and patient circumstances. Using PromptQL's domain-specific LLM platform, non-technical healthcare administrators can now write automation logic in natural language that gets converted into executable code, reducing call times and potentially delivering $50-100 million in business impact through increased efficiency and reduced training costs.

Automating Job Role Extraction Using Prosus AI Assistant in Production

OLX

OLX faced a challenge with unstructured job roles in their job listings platform, making it difficult for users to find relevant positions. They implemented a production solution using Prosus AI Assistant, a GenAI/LLM model, to automatically extract and standardize job roles from job listings. The system processes around 2,000 daily job updates, making approximately 4,000 API calls per day. Initial A/B testing showed positive uplift in most metrics, particularly in scenarios with fewer than 50 search results, though the high operational cost of ~15K per month has led them to consider transitioning to self-hosted models.

Automating Leadership Assessment Using GenAI and LLM Operations

DDI

DDI, a leadership development company, transformed their manual behavioral simulation assessment process by implementing LLMs and MLOps practices using Databricks. They reduced report generation time from 48 hours to 10 seconds while improving assessment accuracy through prompt engineering and model fine-tuning. The solution leveraged DSPy for prompt optimization and achieved significant improvements in recall and F1 scores, demonstrating the successful automation of complex behavioral analyses at scale.

Automating Merchant Onboarding with Reinforcement Learning

Doordash

DoorDash faced challenges with menu accuracy during merchant onboarding, where their existing AI system struggled with diverse and messy real-world menu formats. Working with Applied Compute, they developed an automated grading system calibrated to internal expert standards, then used reinforcement learning to train a menu error correction model against this grader as a reward function. The solution achieved a 30% relative reduction in low-quality menus and was rolled out to all USA menu traffic, demonstrating how institutional knowledge can be encoded into automated training signals for production LLM systems.

Automating Post Incident Review Summaries with GPT-4

Canva

Canva implemented GPT-4 chat to automate the summarization of Post Incident Reports (PIRs), addressing inconsistency and workload challenges in their incident review process. The solution involves extracting PIR content from Confluence, preprocessing to remove sensitive data, using carefully crafted prompts with GPT-4 chat for summary generation, and integrating the results with their data warehouse and Jira tickets. The implementation proved successful with most AI-generated summaries requiring no human modification while maintaining high quality and consistency.

Automating Private Credit Deal Analysis with LLMs and RAG

Riskspan

Riskspan, a technology company providing analysis for complex investment asset classes, tackled the challenge of analyzing private credit deals that traditionally required 3-4 weeks of manual document review and Excel modeling. The company built a production GenAI system on AWS using Claude LLM, embeddings, RAG (Retrieval Augmented Generation), and automated code generation to extract information from unstructured documents (PDFs, emails, amendments) and dynamically generate investment waterfall models. The solution reduced deal processing time from 3-4 weeks to 3-5 days, achieved 87% faster customer onboarding, delivered 10x scalability improvement, and reduced per-deal processing costs by 90x to under $50, while enabling the company to address a $9 trillion untapped market opportunity in private credit.

Automating Root Cause Analysis Using Amazon Bedrock Agents

BMW

BMW implemented a generative AI solution using Amazon Bedrock Agents to automate and accelerate root cause analysis (RCA) for cloud incidents in their connected vehicle services. The solution combines architecture analysis, log inspection, metrics monitoring, and infrastructure evaluation tools with a ReAct (Reasoning and Action) framework to identify service disruptions. The automated RCA agent achieved 85% accuracy in identifying root causes, significantly reducing diagnosis times and enabling less experienced engineers to effectively troubleshoot complex issues.

Automating Search Engine Marketing Ad Generation with Multi-Stage LLM Pipeline

Thumbtack

Thumbtack faced significant challenges with their manual Search Engine Marketing (SEM) ad creation process, where 80% of ad assets were generic templates across all ad groups, leading to suboptimal performance and requiring extensive manual effort. They developed a multi-stage LLM-powered solution that automates the generation, review, and grouping of Google Responsive Search Ads (RSAs) headlines and descriptions, incorporating specific keywords and value propositions for each ad group. The implementation was rolled out in four phases, with initial proof-of-concept showing 20% increase in traffic and 10% increase in conversions, and the final phase demonstrating statistically significant improvements in click-through rates and conversion value using Google's Drafts and Experiments feature for robust measurement.

Automating Supplier Ticket Management with LLM Agents

Wayfair

Wayfair developed Wilma, an LLM-based ticket automation system, to automate the manual triage of supplier support tickets in their SupportHub JIRA-based system. The solution uses LangGraph to orchestrate LLM calls and tool interactions for intent classification, language detection, and supplier ID lookup through a ReAct agent with BigQuery access. The system achieved better-than-human performance with 93% accuracy on question type identification (vs. 75% human accuracy), 98% on language detection, and 88% on supplier ID identification, while reducing processing time and allowing associates to focus on higher-value work.

Automating Test Generation with LLMs at Scale

Assembled

Assembled leveraged Large Language Models to automate and streamline their test writing process, resulting in hundreds of saved engineering hours. By developing effective prompting strategies and integrating LLMs into their development workflow, they were able to generate comprehensive test suites in minutes instead of hours, leading to increased test coverage and improved engineering velocity without compromising code quality.

Automating Translation Workflows with LLMs for Post-Editing and Transcreation

TransPerfect

TransPerfect integrated Amazon Bedrock into their GlobalLink translation management system to automate and improve translation workflows. The solution addressed two key challenges: automating post-editing of machine translations and enabling AI-assisted transcreation of creative content. By implementing LLM-powered workflows, they achieved up to 50% cost savings in translation post-editing, 60% productivity gains in transcreation, and up to 80% reduction in project turnaround times while maintaining high quality standards.

Automating Weather Forecast Text Generation Using Fine-Tuned Vision-Language Models

UK MetOffice

The UK Met Office partnered with AWS to automate the generation of the Shipping Forecast, a 100-year-old maritime weather forecast that traditionally required expert meteorologists several hours daily to produce. The solution involved fine-tuning Amazon Nova foundation models (both LLM and vision-language model variants) to convert complex multi-dimensional weather data into structured text forecasts. Within four weeks of prototyping, they achieved 52-62% accuracy using vision-language models and 62% accuracy using text-based LLMs, reducing forecast generation time from hours to under 5 minutes. The project demonstrated scalable architectural patterns for data-to-text conversion tasks involving massive datasets (45GB+ per forecast run) and established frameworks for rapid experimentation with foundation models in production weather services.

Autonomous Codebase Migration at Scale Using LLM-Powered Agents

Spotify

Spotify faced the challenge of maintaining a massive, diverse codebase across thousands of repositories, with developers spending less than one hour per day actually writing code and the rest on maintenance tasks. While they had pre-existing automation through their "fleet management" system that could handle simple migrations like dependency bumps, this approach struggled with the complex "long tail" of edge cases affecting 30% of their codebase. The solution involved building an agentic LLM system that replaces deterministic scripts with AI-powered code generation combined with automated verification loops, enabling unsupervised migrations from prompt to pull request. In the first three months, the system generated over 1,000 merged production PRs, enabling previously impossible large-scale refactors and allowing non-experts to perform complex migrations through natural language prompts rather than writing complicated transformation scripts.

Autonomous Coding Agent Evolution: From Short-Burst to Extended Runtime Operations

Replit

Replit evolved their AI coding agent from V1 (running autonomously for only a couple of minutes) to V2 (running for 10-15 minutes of productive work) through significant rearchitecting and leveraging new frontier models. The company focuses on enabling non-technical users to build complete applications without writing code, emphasizing performance and cost optimization over latency while maintaining comprehensive observability through tools like Langsmith to manage the complexity of production AI agents at scale.

Autonomous Network Operations Using Agentic AI

British Telecom

British Telecom (BT) partnered with AWS to deploy agentic AI systems for autonomous network operations across their 5G standalone mobile network infrastructure serving 30 million subscribers. The initiative addresses major operational challenges including high manual operations costs (up to 20% of revenue), complex failure diagnosis in containerized networks with 20,000 macro sites generating petabytes of data, and difficulties in change impact analysis with 11,000 weekly network changes. The solution leverages AWS Bedrock Agent Core, Amazon SageMaker for multivariate anomaly detection, Amazon Neptune for network topology graphs, and domain-specific community agents for root cause analysis and service impact assessment. Early results focus on cost reduction through automation, improved service level agreements, faster customer impact identification, and enhanced change efficiency, with plans to expand coverage optimization, dynamic network slicing, and further closed-loop automation across all network domains.

Autonomous Observability with AI Agents and Model Context Protocol

Pinterest

Pinterest's observability team faced a fragmented infrastructure challenge where logs, metrics, traces, and change events existed in disconnected silos, predating modern standards like OpenTelemetry. Engineers had to navigate multiple interfaces during incident resolution, increasing mean time to resolution (MTTR) and creating steep learning curves. To address this without a complete infrastructure overhaul, Pinterest developed an MCP (Model Context Protocol) server that acts as a unified interface for AI agents to access all observability data pillars. The centerpiece is "Tricorder Agent," which autonomously gathers relevant information from alerts, generates filtered dashboard links, queries dependencies, and provides root cause hypotheses. Early results show the agent successfully navigating dependency graphs and correlating data across previously disconnected systems, streamlining incident response and reducing the time engineers spend context-switching between tools.

Autonomous Software Development Agent for Production Code Generation

Devin

Cognition AI developed Devin, an autonomous software engineering agent that can handle complex software development tasks by combining natural language understanding with practical coding abilities. The system demonstrated its capabilities by building interactive web applications from scratch and contributing to its own codebase, effectively working as a team member that can handle parallel tasks and integrate with existing development workflows through GitHub, Slack, and other tools.

Autonomous Software Development Using Multi-Model LLM System with Advanced Planning and Tool Integration

Factory.ai

Factory.ai has developed Code Droid, an autonomous software development system that leverages multiple LLMs and sophisticated planning capabilities to automate various programming tasks. The system incorporates advanced features like HyperCode for codebase understanding, ByteRank for information retrieval, and multi-model sampling for solution generation. In benchmark testing, Code Droid achieved 19.27% on SWE-bench Full and 31.67% on SWE-bench Lite, demonstrating strong performance in real-world software engineering tasks while maintaining focus on safety and explainability.

Autonomous SRE Agent for Cloud Infrastructure Monitoring Using FastMCP

FuzzyLabs

FuzzyLabs developed an autonomous Site Reliability Engineering (SRE) agent using Anthropic's Model Context Protocol (MCP) with FastMCP to automate the diagnosis of production incidents in cloud-native applications. The agent integrates with Kubernetes, GitHub, and Slack to automatically detect issues, analyze logs, identify root causes in source code, and post diagnostic summaries to development teams. While the proof-of-concept successfully demonstrated end-to-end incident response automation using a custom MCP client with optimizations like tool caching and filtering, the project raises important questions about effectiveness measurement, security boundaries, and cost optimization that require further research.

Avoiding Unearned Complexity in Production LLM Systems

Microsoft

Microsoft's ISE team shares their experiences working with large customers implementing LLM solutions in production, highlighting how premature adoption of complex frameworks like LangChain and multi-agent architectures can lead to maintenance and reliability challenges. They advocate for starting with simpler, more explicit designs before adding complexity, and provide detailed analysis of the security, dependency, and versioning considerations when adopting pre-v1.0 frameworks in production systems.

Background Coding Agents for Large-Scale Software Maintenance and Migrations

Spotify

Spotify faced challenges in scaling complex code transformations across thousands of repositories despite having a successful Fleet Management system that automated simple, repetitive maintenance tasks. The company integrated AI coding agents into their existing Fleet Management infrastructure, allowing engineers to define fleet-wide code changes using natural language prompts instead of writing complex transformation scripts. Since February 2025, this approach has generated over 1,500 merged pull requests handling complex tasks like language modernization, breaking-change upgrades, and UI component migrations, achieving 60-90% time savings compared to manual approaches while expanding the system's use to ad-hoc development tasks through IDE and chat integrations.

Background Coding Agents with Strong Feedback Loops for Large-Scale Code Transformations

Spotify

Spotify deployed background coding agents across thousands of software components to automate large-scale code transformations and maintenance tasks, addressing the challenge of ensuring correctness and reliability when agents operate without direct human supervision. The solution centered on implementing strong verification loops consisting of deterministic verifiers (for syntax, building, and testing) and an LLM-as-a-judge component to prevent scope creep. The system successfully generated over 1,500 merged pull requests, with the judge component catching roughly a quarter of problematic changes and enabling course correction in half of those cases, demonstrating that verification loops are essential for predictable agent behavior at scale.

Benchmarking AI Agents for Software Bug Detection and Maintenance Tasks

Bismuth

Bismuth, a startup focused on software agents, developed SM-100, a comprehensive benchmark to evaluate AI agents' capabilities in software maintenance tasks, particularly bug detection and fixing. The benchmark revealed significant limitations in existing popular agents, with most achieving only 7% accuracy in finding complex bugs and exhibiting high false positive rates (90%+). While agents perform well on feature development benchmarks like SWE-bench, they struggle with real-world maintenance tasks that require deep system understanding, cross-file reasoning, and holistic code evaluation. Bismuth's own agent achieved better performance (10 out of 100 bugs found vs. 7 for the next best), demonstrating that targeted improvements in model architecture, prompting strategies, and navigation techniques can enhance bug detection capabilities in production software maintenance scenarios.

Best Practices for AI Agent Development and Deployment

Microsoft

A discussion with Raj Ricky, Principal Product Manager at Microsoft, about the development and deployment of AI agents in production. He shares insights on how to effectively evaluate agent frameworks, develop MVPs, and implement testing strategies. The conversation covers the importance of starting with constrained environments, keeping humans in the loop during initial development, and gradually scaling up agent capabilities while maintaining clear success criteria.

Best Practices for Building Production-Grade MCP Servers for AI Agents

Prefect

This case study presents best practices for designing and implementing Model Context Protocol (MCP) servers for AI agents in production environments, addressing the widespread problem of poorly designed MCP servers that fail to account for agent-specific constraints. The speaker, founder and CEO of Prefect Technologies and creator of fastmcp (a widely-adopted framework downloaded 1.5 million times daily), identifies key design principles including outcome-oriented tool design, flattened arguments, comprehensive documentation, token budget management, and ruthless curation. The solution involves treating MCP servers as agent-optimized user interfaces rather than simple REST API wrappers, acknowledging fundamental differences between human and agent capabilities in discovery, iteration, and context management. Results include actionable guidelines that have shaped the MCP ecosystem, with the fastmcp framework becoming the de facto standard for building MCP servers and influencing the official Anthropic SDK design.

Best Practices for Implementing LLMs in High-Stakes Applications

Moonhub

The presentation discusses implementing LLMs in high-stakes use cases, particularly in healthcare and therapy contexts. It addresses key challenges including robustness, controllability, bias, and fairness, while providing practical solutions such as human-in-the-loop processes, task decomposition, prompt engineering, and comprehensive evaluation strategies. The speaker emphasizes the importance of careful consideration when implementing LLMs in sensitive applications and provides a framework for assessment and implementation.

Best Practices for LLM Production Deployments: Evaluation, Prompt Management, and Fine-tuning

HumanLoop

HumanLoop, based on their experience working with companies from startups to large enterprises like Jingo, shares key lessons for successful LLM deployment in production. The talk emphasizes three critical aspects: systematic evaluation frameworks for LLM applications, treating prompts as serious code artifacts requiring proper versioning and collaboration, and leveraging fine-tuning for improved performance and cost efficiency. The presentation uses GitHub Copilot as a case study of successful LLM deployment at scale.

Bridging Behavioral Silos in Multi-Vertical Recommendations with LLMs

Doordash

DoorDash addressed the challenge of behavioral silos in their multi-vertical marketplace, where customers have deep interaction history in some categories (like restaurants) but sparse data in others (like grocery or retail). They built an LLM-powered framework using hierarchical RAG to translate restaurant orders and search queries into cross-vertical affinity features aligned with their product taxonomy. These semantic features were integrated into their production multi-task ranking models. The approach delivered consistent improvements both offline and online: approximately 4.4% improvement in AUC-ROC and 4.8% in MRR offline, with similar gains in production (+4.3% AUC-ROC, +3.2% MRR). The solution proved particularly effective for cold-start scenarios while maintaining practical inference costs through prompt optimization, caching strategies, and use of smaller language models like GPT-4o-mini.

Build vs. Buy AI Agents: Enterprise Deployment Lessons from 1,000+ Companies

Dust

Dust, an AI agent platform company, shares insights from deploying AI agents across over 1,000 enterprise customers to address the common build-versus-buy dilemma. The case study explores the hidden costs of building custom AI infrastructure—including longer time-to-value (6-12 months underestimation), ongoing maintenance burden, and opportunity costs that divert engineering resources from core business objectives. Multiple customer examples demonstrate that buying a platform enabled rapid deployment (20 minutes to functional agents at November Five, 70% adoption in two months at Wakam, 95% adoption in 90 days at Ardabelle) with enterprise-grade security, continuous improvements, and significant productivity gains. The study advocates that most companies should buy AI infrastructure and focus engineering talent on competitive differentiation, though building may make sense for truly unique requirements or when AI infrastructure is the core product itself.

Building a Bot Factory: Standardizing AI Agent Development with Multi-Agent Architecture

AutoScout24

AutoScout24, Europe's leading automotive marketplace, addressed the challenge of fragmented AI experimentation across their organization by building a "Bot Factory" - a standardized framework for creating and deploying AI agents. The initial use case targeted internal developer support, where platform engineers were spending 30% of their time on repetitive tasks like answering questions and granting access. By partnering with AWS, they developed a serverless, event-driven architecture using Amazon Bedrock AgentCore, Knowledge Bases, and the Strands Agents SDK to create a multi-agent system that handles both knowledge retrieval (RAG) and action execution. The solution produced a production-ready Slack support bot and a reusable blueprint that enables teams across the organization to rapidly build secure, scalable AI agents without reinventing infrastructure.

Building a Collaborative Multi-Agent AI Ecosystem for Enterprise Knowledge Access

DoorDash

DoorDash developed an internal agentic AI platform to address the challenge of fragmented knowledge spread across experimentation platforms, metrics hubs, dashboards, wikis, and team communications. The solution evolved from deterministic workflows through single agents to hierarchical deep agents and exploratory agent swarms, built on foundational capabilities including hybrid vector search with RRF-based re-ranking, schema-aware SQL generation with pre-cached examples, multi-stage zero-data query validation, and LLM-as-judge evaluation frameworks. The platform integrates with Slack and Cursor to meet users in their existing workflows, enabling business teams and developers to access complex data and insights without context-switching, democratizing data access across the organization while maintaining rigorous guardrails and provenance tracking.

Building a Commonsense Knowledge Graph for E-commerce Product Recommendations

Amazon

Amazon developed COSMO, a framework that leverages LLMs to build a commonsense knowledge graph for improving product recommendations in e-commerce. The system uses LLMs to generate hypotheses about commonsense relationships from customer interaction data, validates these through human annotation and ML filtering, and uses the resulting knowledge graph to enhance product recommendation models. Tests showed up to 60% improvement in recommendation performance when using the COSMO knowledge graph compared to baseline models.

Building a Complex AI Answer Engine with Multi-Step Reasoning

Perplexity

Perplexity developed Pro Search, an advanced AI answer engine that handles complex, multi-step queries by breaking them down into manageable steps. The system combines careful prompt engineering, step-by-step planning and execution, and an interactive UI to deliver precise answers. The solution resulted in a 50% increase in query search volume, demonstrating its effectiveness in handling complex research questions efficiently.

Building a Comprehensive AI Platform with SageMaker and Bedrock for Experience Management

Qualtrics

Qualtrics built Socrates, an enterprise-level ML platform, to power their experience management solutions. The platform leverages Amazon SageMaker and Bedrock to enable the full ML lifecycle, from data exploration to model deployment and monitoring. It includes features like the Science Workbench, AI Playground, unified GenAI Gateway, and managed inference APIs, allowing teams to efficiently develop, deploy, and manage AI solutions while achieving significant cost savings and performance improvements through optimized inference capabilities.

Building a Comprehensive LLM Evaluation Framework with BrainTrust Integration

Hostinger

Hostinger's AI team developed a systematic approach to LLM evaluation for their chatbots, implementing a framework that combines offline development testing against golden examples with continuous production monitoring. The solution integrates BrainTrust as a third-party tool to automate evaluation workflows, incorporating both automated metrics and human feedback. This framework enables teams to measure improvements, track performance, and identify areas for enhancement through a combination of programmatic testing and user feedback analysis.

Building a Comprehensive LLM Platform for Food Delivery Services

Swiggy

Swiggy implemented various generative AI solutions to enhance their food delivery platform, focusing on catalog enrichment, review summarization, and vendor support. They developed a platformized approach with a middle layer for GenAI capabilities, addressing challenges like hallucination and latency through careful model selection, fine-tuning, and RAG implementations. The initiative showed promising results in improving customer experience and operational efficiency across multiple use cases including image generation, text descriptions, and restaurant partner support.

Building a Comprehensive LLM Platform for Healthcare Applications

IncludedHealth

IncludedHealth built Wordsmith, a comprehensive platform for GenAI applications in healthcare, starting in early 2023. The platform includes a proxy service for multi-provider LLM access, model serving capabilities, training and evaluation libraries, and prompt engineering tools. This enabled multiple production applications including automated documentation, coverage checking, and clinical documentation, while maintaining security and compliance in a regulated healthcare environment.

Building a Context-Aware AI Assistant with RAG for Developer Support

Vectorize

Vectorize, a platform for building RAG pipelines, faced a challenge where users frequently asked questions already answered in their documentation but were reluctant to leave the UI to search for answers. To address this, they built an AI assistant integrated directly into their product interface using RAG technology. The solution leverages their own platform to ingest documentation from multiple sources (docs site, Discord, Intercom), implements context-sensitive retrieval using page topics, employs reranking models to filter irrelevant results, and uses anti-hallucination prompting with Llama 3.1 70B on Groq. The resulting assistant provides users with immediate, contextually relevant answers without requiring them to leave their workflow, while the system continuously improves as new support content and documentation are added.

Building a Conversational AI Agent for Slack Integration

Linear

Linear, a project management tool for product teams, developed an experimental AI agent that operates within Slack to allow users to create issues and query workspace data without leaving their communication platform. The project faced challenges around balancing context provision to the LLM, maintaining conversation continuity, and determining appropriate boundaries between LLM-driven decisions and programmatic logic. The team solved these issues by providing localized context (10 messages) rather than full conversation history, splitting the system early to distinguish between issue creation and data lookup requests, and limiting LLM involvement to tasks it excels at (summarization, title generation) while handling complex business logic programmatically. This approach resulted in higher accuracy for issue creation, faster response times, and improved user satisfaction as the agent could quickly generate well-formed issues that users could then refine manually.

Building a Conversational Shopping Assistant with Multi-Modal Search and Agent Architecture

OLX

OLX developed "OLX Magic", a conversational AI shopping assistant for their secondhand marketplace. The system combines traditional search with LLM-powered agents to handle natural language queries, multi-modal searches (text, image, voice), and comparative product analysis. The solution addresses challenges in e-commerce personalization and search refinement, while balancing user experience with technical constraints like latency and cost. Key innovations include hybrid search combining keyword and semantic matching, visual search with modifier capabilities, and an agent architecture that can handle both broad and specific queries.

Building a Custom LLM for Automated Documentation Generation

Databricks

Databricks developed an AI-generated documentation feature for automatically documenting tables and columns in Unity Catalog. After initially using SaaS LLMs that faced challenges with quality, performance, and cost, they built a custom fine-tuned 7B parameter model in just one month with two engineers and less than $1,000 in compute costs. The bespoke model achieved better quality than cheaper SaaS alternatives, 10x cost reduction, and higher throughput, now powering 80% of table metadata updates on their platform.

Building a Customer Support AI Assistant: From PoC to Production

Elastic

Elastic's Field Engineering team developed a generative AI solution to improve customer support operations by automating case summaries and drafting initial replies. Starting with a proof of concept using Google Cloud's Vertex AI, they achieved a 15.67% positive response rate, leading them to identify the need for better input refinement and knowledge integration. This resulted in a decision to develop a unified chat interface with RAG architecture leveraging Elasticsearch for improved accuracy and response relevance.

Building a Data-Centric Multi-Agent Platform for Enterprise AI

Alibaba

Alibaba shares their approach to building and deploying AI agents in production, focusing on creating a data-centric intelligent platform that combines LLMs with enterprise data. Their solution uses Spring-AI-Alibaba framework along with tools like Higress (API gateway), Otel (observability), Nacos (prompt management), and RocketMQ (data synchronization) to create a comprehensive system that handles customer queries and anomalies, achieving over 95% resolution rate for consulting issues and 85% for anomalies.

Building a Digital Workforce with Multi-Agent Systems and User-Centric Design

Monday.com

Monday.com built a digital workforce of AI agents to handle their billion annual work tasks, focusing on user experience and trust over pure automation. They developed a multi-agent system using LangGraph that emphasizes user control, preview capabilities, and explainability, achieving 100% month-over-month growth in AI usage. The system includes specialized agents for data retrieval, board actions, and answer composition, with robust fallback mechanisms and evaluation frameworks to handle the 99% of user interactions they can't initially predict.

Building a Digital Workforce with Multi-Agent Systems for Task Automation

Monday.com

Monday.com, a work OS platform processing 1 billion tasks annually, developed a digital workforce using AI agents to automate various work tasks. The company built their agent ecosystem on LangGraph and LangSmith, focusing heavily on user experience design principles including user control over autonomy, preview capabilities, and explainability. Their approach emphasizes trust as the primary adoption barrier rather than technology, implementing guardrails and human-in-the-loop systems to ensure production readiness. The system has shown significant growth with 100% month-over-month increases in AI usage since launch.

Building a Financial Data RAG System: Lessons from Search-First Architecture

Unspecified client

A case study of implementing a RAG-based chatbot for financial executives and analysts to access company data across SEC filings, earnings calls, and analyst reports. The team initially faced challenges with context preservation, search accuracy, and response quality using standard RAG approaches. They ultimately succeeded by reimagining the search architecture to focus on GPT-4 generated summaries as the primary search target, along with custom scoring profiles and sophisticated prompt engineering techniques.

Building a Food Delivery Product Knowledge Graph with LLMs

Doordash

DoorDash leveraged LLMs to transform their retail catalog management by implementing three key systems: an automated brand extraction pipeline that identifies and deduplicates new brands at scale; an organic product labeling system combining string matching with LLM reasoning to improve personalization; and a generalized attribute extraction process using LLMs with RAG to accelerate annotation for entity resolution across merchants. These innovations significantly improved product discoverability and personalization while reducing the manual effort that previously caused long turnaround times and high costs.

Building a Foundation Model Operations Platform

Humanloop

Humanloop pivoted from automated labeling to building a comprehensive LLMOps platform that helps engineers measure and optimize LLM applications through prompt engineering, management, and evaluation. The platform addresses the challenges of managing prompts as code artifacts, collecting user feedback, and running evaluations in production environments. Their solution has been adopted by major companies like Duolingo and Gusto for managing their LLM applications at scale.

Building a Global Product Catalogue with Multimodal LLMs at Scale

Shopify

Shopify addressed the challenge of fragmented product data across millions of merchants by building a Global Catalogue using multimodal LLMs to standardize and enrich billions of product listings. The system processes over 10 million product updates daily through a four-layer architecture involving product data foundation, understanding, matching, and reconciliation. By fine-tuning open-source vision language models and implementing selective field extraction, they achieve 40 million LLM inferences daily with 500ms median latency while reducing GPU usage by 40%. The solution enables improved search, recommendations, and conversational commerce experiences across Shopify's ecosystem.

Building a Gradual, Trust-Focused GenBI Agent for Enterprise Data Democratization

Northwestern Mutual

Northwestern Mutual, a 160-year-old financial services and life insurance company, developed a GenBI (Generative AI for Business Intelligence) agent to democratize data access and reduce dependency on BI teams. Faced with the challenge of balancing innovation with risk-aversion in a highly regulated industry, they adopted an incremental, phased approach that used real messy data, focused on building trust through a crawl-walk-run user rollout strategy, and delivered tangible business value at each stage. The system uses multiple specialized agents (metadata, RAG, SQL, and BI agents) to answer business questions, initially by retrieving certified reports rather than generating SQL from scratch. This approach allowed them to automate approximately 80% of the 20% of BI team capacity spent on finding and sharing reports, while proving the value of metadata enrichment through measurable improvements in LLM performance. The incremental delivery model enabled continuous leadership buy-in and risk management, with each six-week sprint producing productizable deliverables that could be evaluated independently.

Building a Guardrail System for LLM-based Menu Transcription

Doordash

Doordash developed a system to automatically transcribe restaurant menu photos using LLMs, addressing the challenge of maintaining accurate menu information on their delivery platform. Instead of relying solely on LLMs, they created an innovative guardrail framework using traditional machine learning to evaluate transcription quality and determine whether AI or human processing should be used. This hybrid approach allowed them to achieve high accuracy while maintaining efficiency and adaptability to new AI models.

Building a Healthcare Copilot for Biology and Life Science Research

Owkin

Owkin, a company focused on drug discovery and AI for healthcare, developed a copilot system in four months to help biology and life science researchers navigate complex healthcare data and answer scientific questions. The system addresses challenges unique to healthcare including strict regulations, semantic complexity, and data sensitivity by implementing two main tools: a text-to-SQL system that queries structured biological databases (using natural language to SQL translation with Polars), and a RAG-based literature search tool that retrieves relevant information from PubMed's 26 million abstracts. The copilot was deployed for academic researchers with monitoring via LangFuse and OpenTelemetry, though the team faced challenges with evaluation in a domain where questions rarely have binary answers, and noted that frameworks and models change rapidly in the LLM space.

Building a High-Quality Q&A Assistant for Database Research

Airtable

Airtable developed Omni, an AI assistant capable of building custom apps and extracting insights from complex databases containing customer feedback, marketing data, and product information. The challenge was creating a reliable Q&A agent that could overcome LLM limitations like unpredictable reasoning, premature conclusions, and hallucinations when dealing with large table schemas and vague questions. Their solution employed an agentic framework with contextual schema exploration, planning/replanning mechanisms, hybrid search combining keyword and semantic approaches, token-efficient citation systems, and comprehensive evaluation frameworks using both curated test suites and production feedback. This multi-faceted approach enabled them to deliver a production-ready assistant that users could trust, though the post doesn't provide specific quantitative results on accuracy improvements or user adoption metrics.

Building a High-Quality RAG-based Support System with LLM Guardrails and Quality Monitoring

Doordash

Doordash implemented a RAG-based chatbot system to improve their Dasher support automation, replacing a traditional flow-based system. They developed a comprehensive quality control approach combining LLM Guardrail for real-time response verification, LLM Judge for quality monitoring, and an iterative improvement pipeline. The system successfully reduced hallucinations by 90% and severe compliance issues by 99%, while handling thousands of support requests daily and allowing human agents to focus on more complex cases.

Building a Horizontal Enterprise Agent Platform with Infrastructure-First Approach

Dust.tt

Dust.tt evolved from a developer framework competitor to LangChain into a horizontal enterprise platform for deploying AI agents, achieving remarkable 88% daily active user rates in some deployments. The company focuses on building robust infrastructure for agent deployment, maintaining its own integrations with enterprise systems like Notion and Slack, while making agent creation accessible to non-technical users through careful UX design and abstraction of technical complexities.

Building a Hyper-Personalized Food Ordering Agent for E-commerce at Scale

iFood

iFood, Brazil's largest food delivery platform with 160 million monthly orders and 55 million users, built ISO, an AI agent designed to address the paradox of choice users face when ordering food. The agent uses hyper-personalization based on user behavior, interprets complex natural language intents, and autonomously takes actions like applying coupons, managing carts, and processing payments. Deployed on both the iFood app and WhatsApp, ISO handles millions of users while maintaining sub-10 second P95 latency through aggressive prompt optimization, context window management, and intelligent tool routing. The team achieved this by moving from a 30-second to a 10-second P95 latency through techniques including asynchronous processing, English-only prompts to avoid tokenization penalties, and deflating bloated system prompts by improving tool naming conventions.

Building a Knowledge as a Service Platform with LLMs and Developer Community Data

Stack Overflow

Stack Overflow addresses the challenges of LLM brain drain, answer quality, and trust by transforming their extensive developer Q&A platform into a Knowledge as a Service offering. They've developed API partnerships with major AI companies like Google, OpenAI, and GitHub, integrating their 40 billion tokens of curated technical content to improve LLM accuracy by up to 20%. Their approach combines AI capabilities with human expertise while maintaining social responsibility and proper attribution.

Building a Knowledge Base Chatbot for Data Team Support Using RAG

HP

HP's data engineering teams were spending 20-30% of their time handling support requests and SQL queries, creating a significant productivity bottleneck. Using Databricks Mosaic AI, they implemented a RAG-based knowledge base chatbot that could answer user queries about data models, platform features, and access requests in real-time. The solution, which included a web crawler for knowledge ingestion and vector search capabilities, was built in just three weeks and led to substantial productivity gains while reducing operational costs by 20-30% compared to their previous data warehouse solution.

Building a Large-Scale AI Recruiting Assistant with Experiential Memory

LinkedIn

LinkedIn developed their first AI agent, Hiring Assistant, to automate and enhance recruiting workflows at scale. The system combines large language models with novel features like experiential memory for personalization and an agent orchestration layer for complex task management. The assistant helps recruiters with tasks from job description creation to candidate sourcing and interview coordination, while maintaining human oversight and responsible AI principles.

Building a Memory System for No-Code Agent Development

Langchain

LangChain developed a memory system for their LangSmith Agent Builder, a no-code platform for creating task-specific agents. The problem was that agents performing repetitive specialized tasks needed to retain learnings across sessions to avoid poor user experience. Their solution represented memory as files in a virtual filesystem (stored in Postgres but exposed as files), allowing agents to read and modify their own memory using familiar filesystem operations. The memory system covers procedural memory (AGENTS.md, tools.json), semantic memory (agent skills, knowledge files), and enables agents to self-improve through natural language feedback, eliminating the need for manual configuration updates and creating a more iterative agent building experience.

Building a Microservices-Based Multi-Agent Platform for Financial Advisors

Prudential

Prudential Financial, in partnership with AWS GenAI Innovation Center, built a scalable multi-agent platform to support 100,000+ financial advisors across insurance and financial services. The system addresses fragmented workflows where advisors previously had to navigate dozens of disconnected IT systems for client engagement, underwriting, product information, and servicing. The solution features an orchestration agent that routes requests to specialized sub-agents (quick quote, forms, product, illustration, book of business) while maintaining context and enforcing governance. The platform-based microservices architecture reduced time-to-value from 6-8 weeks to 3-4 weeks for new agent deployments, enabled cross-business reusability, and provided standardized frameworks for authentication, LLM gateway access, knowledge management, and observability while handling the complexity of scaling multi-agent systems in a regulated financial services environment.

Building a Multi-Agent Healthcare Analytics Assistant with LLM-Powered Natural Language Queries

Komodo Health

Komodo Health, a company with a large database of anonymized American patient medical events, developed an AI assistant over two years to answer complex healthcare analytics queries through natural language. The system evolved from a simple chaining architecture with fine-tuned models to a sophisticated multi-agent system using a supervisor pattern, where an intelligent agent-based supervisor routes queries to either deterministic workflows or sub-agents as needed. The architecture prioritizes trust by ensuring raw database outputs are presented directly to users rather than LLM-generated content, with LLMs primarily handling natural language to structured query conversion and explanations. The production system balances autonomous AI capabilities with control, avoiding the cost and latency issues of pure agentic approaches while maintaining flexibility for unexpected user queries.

Building a Multi-Agent LLM Platform for Customer Service Automation

Deutsche Telekom

Deutsche Telekom developed a comprehensive multi-agent LLM platform to automate customer service across multiple European countries and channels. They built their own agent computing platform called LMOS to manage agent lifecycles, routing, and deployment, moving away from traditional chatbot approaches. The platform successfully handled over 1 million customer queries with an 89% acceptable answer rate and showed 38% better performance compared to vendor solutions in A/B testing.

Building a Multi-Agent Research System for Complex Information Tasks

Anthropic

Anthropic developed a production multi-agent system for their Claude Research feature that uses multiple specialized AI agents working in parallel to conduct complex research tasks across web and enterprise sources. The system employs an orchestrator-worker architecture where a lead agent coordinates and delegates to specialized subagents that operate simultaneously, achieving 90.2% performance improvement over single-agent systems on internal evaluations. The implementation required sophisticated prompt engineering, robust evaluation frameworks, and careful production engineering to handle the stateful, non-deterministic nature of multi-agent interactions at scale.

Building a Multi-Model AI Platform and Agent Marketplace

Quora

Quora built Poe as a unified platform providing consumer access to multiple large language models and AI agents through a single interface and subscription. Starting with experiments using GPT-3 for answer generation on Quora, the company recognized the paradigm shift toward chat-based AI interactions and developed Poe to serve as a "web browser for AI" - enabling users to access diverse models, create custom agents through prompting or server integrations, and monetize AI applications. The platform has achieved significant scale with creators earning millions annually while supporting various modalities including text, image, and voice models.

Building a Multi-Model LLM API Marketplace and Infrastructure Platform

OpenRouter

OpenRouter was founded in early 2023 to address the fragmented landscape of large language models by creating a unified API marketplace that aggregates over 400 models from 60+ providers. The company identified that the LLM inference market would not be winner-take-all, and built infrastructure to normalize different model APIs, provide intelligent routing, caching, and uptime guarantees. Their platform enables developers to switch between models with near-zero switching costs while providing better prices, uptime, and choice compared to using individual model providers directly.

Building a Multi-Provider GenAI Gateway for Enterprise-Scale LLM Access

Grab

Grab developed an AI Gateway to provide centralized, secure access to multiple GenAI providers (including OpenAI, Azure, AWS Bedrock, and Google VertexAI) for their internal developers. The gateway handles authentication, cost management, auditing, and rate limiting while providing a unified API interface. Since its launch in 2023, it has enabled over 300 unique use cases across the organization, from real-time audio analysis to content moderation, while maintaining security and cost efficiency through centralized management.

Building a Natural Language Agent Builder with Comprehensive LLMOps Practices

Vellum

Vellum, a company that has spent three years building tools for production-grade agent development, launched a beta natural language agent builder that allows users to create agents through conversation rather than drag-and-drop interfaces or code. The speaker shares lessons learned from building this meta-level agent, focusing on tool design, testing strategies, execution monitoring, and user experience considerations. Key insights include the importance of carefully designing tool abstractions from first principles, balancing vibes-based testing with rigorous test suites, storing and analyzing all execution data to iterate on agent performance, and creating enhanced UI/UX by parsing agent outputs into interactive elements beyond simple text responses.

Building a Natural Language Business Intelligence Interface with MCP

Ramp

Ramp built an MCP (Model Context Protocol) server to enable natural language querying of business spend data through their developer API. The initial prototype allowed Claude to generate visualizations and run analyses, but struggled with scale due to context window limitations and high token usage. By pivoting to a SQL-based approach using an in-memory SQLite database with a lightweight ETL pipeline, they enabled Claude to query tens of thousands of transactions efficiently. The solution includes load tools for API data extraction, data transformation capabilities, and query execution tools, allowing users to gain insights into business spend patterns through conversational queries while addressing security concerns through audit logging and OAuth scopes.

Building a Next-Generation AI-Enhanced Code Editor with Real-Time Inference

Cursor

Cursor built a modern AI-enhanced code editor by forking VS Code and incorporating advanced LLM capabilities. Their approach focused on creating a more responsive and predictive coding environment that goes beyond simple autocompletion, using techniques like mixture of experts (MoE) models, speculative decoding, and sophisticated caching strategies. The editor aims to eliminate low-entropy coding actions and predict developers' next actions, while maintaining high performance and low latency.

Building a Next-Generation AI-Powered Code Editor

Cursor

Cursor, founded by MIT graduates, developed an AI-powered code editor that goes beyond simple code completion to reimagine how developers interact with AI while coding. By focusing on innovative features like instructed edits and codebase indexing, along with developing custom models for specific tasks, they achieved rapid growth to $100M in revenue. Their success demonstrates how combining frontier LLMs with custom-trained models and careful UX design can transform developer productivity.

Building a Privacy-Preserving LLM Usage Analytics System (Clio)

Anthropic

Anthropic developed Clio, a privacy-preserving system to understand how their LLM Claude is being used in the real world while maintaining strict user privacy. The system uses Claude itself to analyze and cluster conversations, extracting high-level insights without humans ever reading the raw data. This allowed Anthropic to improve their safety evaluations, understand usage patterns across languages and domains, and detect potential misuse - all while maintaining strong privacy guarantees through techniques like minimum cluster sizes and privacy auditing.

Building a Production AI Agent System for Customer Support

Decagon

Decagon has developed a comprehensive AI agent system for customer support that handles multiple communication channels including chat, email, and voice. Their system includes a core AI agent brain, intelligent routing, agent assistance capabilities, and robust testing and monitoring infrastructure. The solution aims to improve traditionally painful customer support experiences by providing consistent, quick responses while maintaining brand voice and safely handling sensitive operations like refunds.

Building a Production Fantasy Football AI Assistant in 8 Weeks

NFL

The NFL, in collaboration with AWS Generative AI Innovation Center, developed a fantasy football AI assistant for NFL Plus users that went from concept to production in just 8 weeks. Fantasy football managers face overwhelming amounts of data and conflicting expert advice, making roster decisions stressful and time-consuming. The team built an agentic AI system using Amazon Bedrock, Strands Agent framework, and Model Context Protocol (MCP) to provide analyst-grade fantasy advice in under 5 seconds, achieving 90% analyst approval ratings. The system handles complex multi-step reasoning, accesses NFL NextGen Stats data through semantic data layers, and successfully manages peak Sunday traffic loads with zero reported incidents in the first month of 10,000+ questions.

Building a Production RAG-based Customer Support Assistant with Elasticsearch

Elastic

Elastic's Field Engineering team developed a customer support chatbot using RAG instead of fine-tuning, leveraging Elasticsearch for document storage and retrieval. They created a knowledge library of over 300,000 documents from technical support articles, product documentation, and blogs, enriched with AI-generated summaries and embeddings using ELSER. The system uses hybrid search combining semantic and BM25 approaches to provide relevant context to the LLM, resulting in more accurate and trustworthy responses.

Building a Production Text-to-SQL Assistant with Multi-Agent Architecture

LinkedIn

LinkedIn developed SQL Bot, an AI-powered assistant integrated within their DARWIN data science platform, to help employees access data insights independently. The system uses a multi-agent architecture built on LangChain and LangGraph, combining retrieval-augmented generation with knowledge graphs and LLM-based ranking and correction systems. The solution has been deployed successfully with hundreds of users across LinkedIn's business verticals, achieving a 95% query accuracy satisfaction rate and demonstrating particular success with its query debugging feature.

Building a Production Voice AI Agent for Customer Support in 100 Days

Intercom

Intercom developed Finn Voice, a voice AI agent for phone-based customer support, in approximately 100 days. The solution builds on their existing text-based AI agent Finn, which already served over 5,000 customers with a 56% average resolution rate. Finn Voice handles phone calls, answers customer questions using knowledge base content, and escalates to human agents when needed. The system uses a speech-to-text, language model, text-to-speech architecture with RAG capabilities and achieved deployment across several enterprise customers' main phone lines, offering significant cost savings compared to human-only support.

Building a Production-Grade Email AI Assistant Using RAG and Multi-Stage Retrieval

Shortwave

Shortwave built an AI email assistant that helps users interact with their email history as a knowledge base. They implemented a sophisticated Retrieval Augmented Generation (RAG) system with a four-step process: tool selection, data retrieval, question answering, and post-processing. The system combines multiple AI technologies including LLMs, embeddings, vector search, and cross-encoder models to provide context-aware responses within 3-5 seconds, while handling complex infrastructure challenges around prompt engineering, context windows, and data retrieval.

Building a Production-Grade GenAI Customer Support Assistant with Comprehensive Observability

Elastic

Elastic developed a customer support chatbot using generative AI and RAG, focusing heavily on production-grade observability practices. They implemented a comprehensive observability strategy using Elastic's own stack, including APM traces, custom dashboards, alerting systems, and detailed monitoring of LLM interactions. The system successfully launched with features like streaming responses, rate limiting, and abuse prevention, while maintaining high reliability through careful monitoring of latency, errors, and usage patterns.

Building a Production-Grade LLM Orchestration System for Conversational Search

Perplexity

Perplexity has built a conversational search engine that combines LLMs with various tools and knowledge sources. They tackled key challenges in LLM orchestration including latency optimization, hallucination prevention, and reliable tool integration. Through careful engineering and prompt management, they reduced query latency from 6-7 seconds to near-instant responses while maintaining high quality results. The system uses multiple specialized LLMs working together with search indices, tools like Wolfram Alpha, and custom embeddings to deliver personalized, accurate responses at scale.

Building a Production-Ready Business Analytics Assistant with ChatGPT

Microsoft

A detailed case study on automating data analytics using ChatGPT, where the challenge of LLMs' limitations in quantitative reasoning is addressed through a novel multi-agent system. The solution implements two specialized ChatGPT agents - a data engineer and data scientist - working together to analyze structured business data. The system uses ReAct framework for reasoning, SQL for data retrieval, and Streamlit for deployment, demonstrating how to effectively operationalize LLMs for complex business analytics tasks.

Building a Production-Ready MCP Server for AI Agents to Manage Feature Flags

DevCycle

DevCycle developed an MCP (Model Context Protocol) server to enable AI coding agents to manage feature flags directly within development workflows. The project began as a hackathon proof-of-concept that adapted their existing CLI interface to work with AI agents, allowing natural language interactions for creating flags, investigating incidents, and cleaning up stale features. Through iterative refinement, the team identified key production requirements including clear input schemas, descriptive error handling, tool call pruning, OAuth authentication via Cloudflare Workers, and remote server architecture. The result was a production-ready integration that enables developers to create and manage feature flags without leaving their code editor, with early results showing approximately 3x more users reaching SDK installation compared to their previous onboarding flow.

Building a Production-Ready Multi-Agent Coding Assistant

Replit

Replit developed a coding agent system that helps users create software applications without writing code. The system uses a multi-agent architecture with specialized agents (manager, editor, verifier) and focuses on user engagement rather than full autonomy. The agent achieved hundreds of thousands of production runs and maintains around 90% success rate in tool invocations, using techniques like code-based tool calls, memory management, and state replay for debugging.

Building a Property Management AI Copilot with LangGraph and LangSmith

AppFolio

AppFolio developed Realm-X Assistant, an AI-powered copilot for property management, using LangChain ecosystem tools. By transitioning from LangChain to LangGraph for complex workflow management and leveraging LangSmith for monitoring and debugging, they created a system that helps property managers save over 10 hours per week. The implementation included dynamic few-shot prompting, which improved specific feature performance from 40% to 80%, along with robust testing and evaluation processes to ensure reliability.

Building a Property Question-Answering Chatbot to Replace 8-Hour Email Responses with Instant AI-Powered Answers

Agoda

Agoda, an online travel platform, developed the Property AMA (Ask Me Anything) Bot to address the challenge of users waiting an average of 8 hours for property-related question responses, with only 55% of inquiries receiving answers. The solution leverages ChatGPT integrated with Agoda's Property API to provide instant, accurate answers to property-specific questions through a conversational interface deployed across desktop, mobile web, and native app platforms. The implementation includes sophisticated prompt engineering with input topic guardrails, in-context learning that fetches real-time property data, and a comprehensive evaluation framework using response labeling and A/B testing to continuously improve accuracy and reliability.

Building a RAG-Based Documentation Chatbot: Lessons from Fiddler's LLMOps Journey

Fiddler

Fiddler AI developed a documentation chatbot using OpenAI's GPT-3.5 and Retrieval-Augmented Generation (RAG) to help users find answers in their documentation. The project showcases practical implementation of LLMOps principles including continuous evaluation, monitoring of chatbot responses and user prompts, and iterative improvement of the knowledge base. Through this implementation, they identified and documented key lessons in areas like efficient tool selection, query processing, document management, and hallucination reduction.

Building a RAG-Based Premium Audit Assistant for Insurance Workflows

Verisk

Verisk developed PAAS AI, a generative AI-powered conversational assistant to help premium auditors efficiently search and retrieve information from their vast repository of insurance documentation. Using a RAG architecture built on Amazon Bedrock with Claude, along with ElastiCache, OpenSearch, and custom evaluation frameworks, the system reduced document processing time by 96-98% while maintaining high accuracy. The solution demonstrates effective use of hybrid search, careful data chunking, and comprehensive evaluation metrics to ensure reliable AI-powered customer support.

Building a Reliable AI Quote Generation Assistant with LangGraph

Tradestack

Tradestack developed an AI-powered WhatsApp assistant to automate quote generation for trades businesses, reducing quote creation time from 3.5-10 hours to under 15 minutes. Using LangGraph Cloud, they built and launched their MVP in 6 weeks, improving end-to-end performance from 36% to 85% through rapid iteration and multimodal input processing. The system incorporated sophisticated agent architectures, human-in-the-loop interventions, and robust evaluation frameworks to ensure reliability and accuracy.

Building a Rust-Based AI Agentic Framework for Multimodal Data Quality Monitoring

Zectonal

Zectonal, a data quality monitoring company, developed a custom AI agentic framework in Rust to scale their multimodal data inspection capabilities beyond traditional rules-based approaches. The framework enables specialized AI agents to autonomously call diagnostic function tools for detecting defects, errors, and anomalous conditions in large datasets, while providing full audit trails through "Agent Provenance" tracking. The system supports multiple LLM providers (OpenAI, Anthropic, Ollama) and can operate both online and on-premise, packaged as a single binary executable that the company refers to as their "genie-in-a-binary."

Building a Scalable AI Feature Evaluation System

Notion

Notion developed an advanced evaluation system for their AI features, transitioning from a manual process using JSONL files to a sophisticated automated workflow powered by Braintrust. This transformation enabled them to improve their testing and deployment of AI features like Q&A and workspace search, resulting in a 10x increase in issue resolution speed, from 3 to 30 issues per day.

Building a Scalable Conversational Video Agent with LangGraph and Twelve Labs APIs

Jockey

Jockey is an open-source conversational video agent that leverages LangGraph and Twelve Labs' video understanding APIs to process and analyze video content intelligently. The system evolved from v1.0 to v1.1, transitioning from basic LangChain to a more sophisticated LangGraph architecture, enabling better scalability and precise control over video workflows through a multi-agent system consisting of a Supervisor, Planner, and specialized Workers.

Building a Scalable LLM Gateway for E-commerce Recommendations

Mercado Libre

Mercado Libre developed a centralized LLM gateway to handle large-scale generative AI deployments across their organization. The gateway manages multiple LLM providers, handles security, monitoring, and billing, while supporting 50,000+ employees. A key implementation was a product recommendation system that uses LLMs to generate personalized recommendations based on user interactions, supporting multiple languages across Latin America.

Building a Search Engine for AI Agents: Infrastructure, Product Development, and Production Deployment

Exa.ai

Exa.ai has built the first search engine specifically designed for AI agents rather than human users, addressing the fundamental problem that existing search engines like Google are optimized for consumer clicks and keyword-based queries rather than semantic understanding and agent workflows. The company trained its own models, built its own index, and invested heavily in compute infrastructure (including purchasing their own GPU cluster) to enable meaning-based search that returns raw, primary data sources rather than listicles or summaries. Their solution includes both an API for developers building AI applications and an agentic search tool called Websites that can find and enrich complex, multi-criteria queries. The results include serving hundreds of millions of queries across use cases like sales intelligence, recruiting, market research, and research paper discovery, with 95% inbound growth and expanding from 7 to 28+ employees within a year.

Building a Secure AI Assistant for Visual Effects Artists Using Amazon Bedrock

Untold Studios

Untold Studios developed an AI assistant integrated into Slack to help their visual effects artists access internal resources and tools more efficiently. Using Amazon Bedrock with Claude 3.5 Sonnet and a serverless architecture, they created a natural language interface that handles 120 queries per day, reducing information search time from minutes to seconds while maintaining strict data security. The solution combines RAG capabilities with function calling to access multiple knowledge bases and internal systems, significantly reducing the support team's workload.

Building a Secure and Scalable LLM Gateway for Enterprise GenAI Adoption

Wealthsimple

Wealthsimple developed a comprehensive LLM platform to enable secure and productive use of generative AI across their organization. They started with a basic gateway for audit trails, evolved to include PII redaction, self-hosted models, and RAG capabilities, while focusing on user adoption and security. The platform now serves over half the company with 2,200+ daily messages, demonstrating successful enterprise-wide GenAI adoption while maintaining data security.

Building a Secure and Scalable LLM Gateway for Financial Services

Wealthsimple

Wealthsimple, a Canadian FinTech company, developed a comprehensive LLM platform to securely leverage generative AI while protecting sensitive financial data. They built an LLM gateway with built-in security features, PII redaction, and audit trails, eventually expanding to include self-hosted models, RAG capabilities, and multi-modal inputs. The platform achieved widespread adoption with over 50% of employees using it monthly, leading to improved productivity and operational efficiencies in client service workflows.

Building a Secure Enterprise AI Assistant with Amazon Bedrock for Financial Services

PayU

PayU, a Central Bank-regulated financial services company in India, faced the challenge of employees using unsecured public generative AI tools that posed data security and regulatory compliance risks. The company implemented a comprehensive enterprise AI solution using Amazon Bedrock, Open WebUI, and AWS PrivateLink to create a secure, role-based AI assistant that enables employees to perform tasks like technical troubleshooting, email drafting, and business data querying while maintaining strict data residency requirements and regulatory compliance. The solution achieved a reported 30% improvement in business analyst team productivity while ensuring sensitive data never leaves the company's VPC.

Building a Secure Enterprise AI Assistant with RAG and Custom Infrastructure

Hexagon

Hexagon's Asset Lifecycle Intelligence division developed HxGN Alix, an AI-powered digital worker to enhance user interaction with their Enterprise Asset Management products. They implemented a secure solution using AWS services, custom infrastructure, and RAG techniques. The solution successfully balanced security requirements with AI capabilities, deploying models on Amazon EKS with private subnets, implementing robust guardrails, and solving various RAG-related challenges to provide accurate, context-aware responses while maintaining strict data privacy standards.

Building a Structured AI Evaluation Framework for Educational Tools

Coursera

Coursera developed a robust AI evaluation framework to support the deployment of their Coursera Coach chatbot and AI-assisted grading tools. They transitioned from fragmented offline evaluations to a structured four-step approach involving clear evaluation criteria, curated datasets, combined heuristic and model-based scoring, and rapid iteration cycles. This framework resulted in faster development cycles, increased confidence in AI deployments, and measurable improvements in student engagement and course completion rates.

Building a Systematic LLM Evaluation Framework from Scratch

Coda

Coda's journey in developing a robust LLM evaluation framework, evolving from manual playground testing to a comprehensive automated system. The team faced challenges with model upgrades affecting prompt behavior, leading them to create a systematic approach combining automated checks with human oversight. They progressed through multiple phases using different tools (OpenAI Playground, Coda itself, Vellum, and Brain Trust), ultimately achieving scalable evaluation running 500+ automated checks weekly, up from 25 manual evaluations initially.

Building a Systematic SNAP Benefits LLM Evaluation Framework

Propel

Propel is developing a comprehensive evaluation framework for testing how well different LLMs handle SNAP (food stamps) benefit-related queries. The project aims to assess model accuracy, safety, and appropriateness in handling complex policy questions while balancing strict accuracy with practical user needs. They've built a testing infrastructure including a Slackbot called Hydra for comparing multiple LLM outputs, and plan to release their evaluation framework publicly to help improve AI models' performance on SNAP-related tasks.

Building a Universal Search Product with RAG and AI Agents

Dropbox

Dropbox developed Dash, a universal search and knowledge management product that addresses the challenges of fragmented business data across multiple applications and formats. The solution combines retrieval-augmented generation (RAG) and AI agents to provide powerful search capabilities, content summarization, and question-answering features. They implemented a custom Python interpreter for AI agents and developed a sophisticated RAG system that balances latency, quality, and data freshness requirements for enterprise use.

Building a Video Q&A System with RAG and Speaker Detection

Vimeo

Vimeo developed a sophisticated video Q&A system that enables users to interact with video content through natural language queries. The system uses RAG (Retrieval Augmented Generation) to process video transcripts at multiple granularities, combined with an innovative speaker detection system that identifies speakers without facial recognition. The solution generates accurate answers, provides relevant video timestamps, and suggests related questions to maintain user engagement.

Building a Visual Agentic Tool for AI-First Workflow Transformation

Craft

Craft, a five-year-old startup with over 1 million users and a 20-person engineering team, spent three years experimenting with AI features that lacked user stickiness before achieving a breakthrough in late 2025. During the 2025 Christmas holidays, the founder built "Craft Agents," a visual UI wrapper around Claude Code and the Claude Agent SDK, completing it in just two weeks using Electron despite no prior experience with that stack. The tool connected multiple data sources (APIs, databases, MCP servers) and provided a more accessible interface than terminal-based alternatives. After mandating company-wide adoption in January 2026, non-engineering teams—particularly customer support—became the heaviest users, automating workflows that previously took 20-30 minutes down to 2-3 minutes, while engineering teams experienced dramatic productivity gains with difficult migrations completing in a week instead of months.

Building a Voice Assistant from Open Source LLMs: A Home Project Case Study

Weights & Biases

A developer built a custom voice assistant similar to Alexa using open-source LLMs, demonstrating the journey from prototype to production-ready system. The project used Whisper for speech recognition and various LLM models (Llama 2, Mistral) running on consumer hardware, with systematic improvements through prompt engineering and fine-tuning to achieve 98% accuracy in command interpretation, showing how iterative improvement and proper evaluation frameworks are crucial for LLM applications.

Building a Voice Assistant with Open Source LLMs: From Demo to Production

Weights & Biases

A case study of building an open-source Alexa alternative using LLMs, demonstrating the journey from prototype to production. The project used Llama 2 and Mistral models running on affordable hardware, combined with Whisper for speech recognition. Through iterative improvements including prompt engineering and fine-tuning with QLoRA, the system's accuracy improved from 0% to 98%, while maintaining real-time performance requirements.

Building Agentic AI Assistant for Observability Platform

Grafana

Grafana Labs developed an agentic AI assistant integrated into their observability platform to help users query data, create dashboards, troubleshoot issues, and learn the platform. The team started with a hackathon project that ran entirely in the browser, iterating rapidly from a proof-of-concept to a production system. The assistant uses Claude as the primary LLM, implements tool calling with extensive context about Grafana's features, and employs multiple techniques including tool overloading, error feedback loops, and natural language tool responses. The solution enables users to investigate incidents, generate queries across multiple data sources, and modify visualizations through conversational interfaces while maintaining transparency by showing all intermediate steps and data to keep humans in the loop.

Building AI Assist: LLM Integration for E-commerce Product Listings

Mercari

Mercari developed an AI Assist feature to help sellers create better product listings using LLMs. They implemented a two-part system using GPT-4 for offline attribute extraction and GPT-3.5-turbo for real-time title suggestions, conducting both offline and online evaluations to ensure quality. The team focused on practical implementation challenges including prompt engineering, error handling, and addressing LLM output inconsistencies in a production environment.

Building AI Developer Tools Using LangGraph for Large-Scale Software Development

Uber

Uber's developer platform team built a suite of AI-powered developer tools using LangGraph to improve productivity for 5,000 engineers working on hundreds of millions of lines of code. The solution included tools like Validator (for detecting code violations and security issues), AutoCover (for automated test generation), and various other AI assistants. By creating domain-expert agents and reusable primitives, they achieved significant impact including thousands of daily code fixes, 10% improvement in developer platform coverage, and an estimated 21,000 developer hours saved through automated test generation.

Building AI Products at Stack Overflow: From Conversational Search to Technical Benchmarking

Stack Overflow

Stack Overflow faced a significant disruption when ChatGPT launched in late 2022, as developers began changing their workflows and asking AI tools questions that would traditionally be posted on Stack Overflow. In response, the company formed an "Overflow AI" team to explore how AI could enhance their products and create new revenue streams. The team pursued two main initiatives: first, developing a conversational search feature that evolved through multiple iterations from basic keyword search to semantic search with RAG, ultimately being rolled back due to insufficient accuracy (below 70%) for developer expectations; and second, creating a data licensing business that involved fine-tuning models with Stack Overflow's corpus and developing technical benchmarks to demonstrate improved model performance. The initiatives showcased rapid iteration, customer-focused evaluation methods, and ultimately led to a new revenue stream while strengthening Stack Overflow's position in the AI era.

Building AI-Assisted Development at Scale with Cursor

Cursor

This case study explores how Cursor's solutions team has observed enterprise companies successfully deploying AI-assisted coding in production environments. The problem addressed is helping developers leverage LLMs effectively for coding tasks while avoiding common pitfalls like context window bloat, over-reliance on AI, and hallucinations. The solution involves teaching developers to break down problems into appropriately-sized tasks, maintain clean context windows, use semantic search for brownfield codebases, and build deterministic harnesses around non-deterministic LLM outputs. Results include significant productivity gains when developers learn proper prompt engineering, context management, and maintain responsibility for AI-generated code, with specific improvements like bench scores jumping from 45% to 65% through harness optimization.

Building AI-Native Platforms: Agentic Systems, Infrastructure Evolution, and Production LLM Deployment

Delphi / Seam AI / APIsec

This panel discussion features three AI-native companies—Delphi (personal AI profiles), Seam AI (sales/marketing automation agents), and APIsec (API security testing)—discussing their journeys building production LLM systems over three years. The companies address infrastructure evolution from single-shot prompting to fully agentic systems, the shift toward serverless and scalable architectures, managing costs at scale (including burning through a trillion OpenAI tokens), balancing deterministic workflows with model autonomy, and measuring ROI through outcome-based metrics rather than traditional productivity gains. Key technical themes include moving away from opinionated architectures to let models reason autonomously, implementing state machines for high-confidence decisions, using tools like Pydantic AI and Logfire for instrumentation, and leveraging Pinecone for vector search at scale.

Building Alfred: Production-Ready Agentic Orchestration Layer for E-commerce

Loblaws

Loblaws Digital, the technology arm of one of Canada's largest retail companies, developed Alfred—a production-ready orchestration layer for running agentic AI workflows across their e-commerce, pharmacy, and loyalty platforms. The system addresses the challenge of moving agent prototypes into production at enterprise scale by providing a reusable template-based architecture built on LangGraph, FastAPI, and Google Cloud Platform components. Alfred enables teams across the organization to quickly deploy conversational commerce applications and agentic workflows (such as recipe-based shopping) while handling critical enterprise requirements including security, privacy, PII masking, observability, and integration with 50+ platform APIs through their Model Context Protocol (MCP) ecosystem.

Building Alyx: An AI Agent for LLM Observability and Debugging

Arize AI

Arize AI built "Alyx," an AI agent embedded in their observability platform to help users debug and optimize their machine learning and LLM applications. The problem they addressed was that their platform had advanced features that required significant expertise to use effectively, with customers needing guidance from solutions architects to extract maximum value. Their solution was to create an AI agent that emulates an expert solutions architect, capable of performing complex debugging workflows, optimizing prompts, generating evaluation templates, and educating users on platform features. Starting in November 2023 with GPT-3.5 and launching at their July 2024 conference, Alyx evolved from a highly structured, on-rails decision tree architecture to a more autonomous agent leveraging modern LLM capabilities. The team used their own platform to build and evaluate Alex, establishing comprehensive evaluation frameworks across multiple levels (tool calls, tasks, sessions, traces) and involving cross-functional stakeholders in defining success criteria.

Building an Agentic AI System for Healthcare Support Using LangGraph

Doctolib

Doctolib developed an agentic AI system called Alfred to handle customer support requests for their healthcare platform. The system uses multiple specialized AI agents powered by LLMs, working together in a directed graph structure using LangGraph. The initial implementation focused on managing calendar access rights, combining RAG for knowledge base integration with careful security measures and human-in-the-loop confirmation for sensitive actions. The system was designed to maintain high customer satisfaction while managing support costs efficiently.

Building an Agentic DevOps Copilot for Infrastructure Automation

Qovery

Qovery developed an agentic DevOps copilot to automate infrastructure tasks and eliminate repetitive DevOps work. The solution evolved through four phases: from basic intent-to-tool mapping, to a dynamic agentic system that plans tool sequences, then adding resilience and recovery mechanisms, and finally incorporating conversation memory. The copilot now handles complex multi-step workflows like deployments, infrastructure optimization, and configuration management, currently using Claude Sonnet 3.7 with plans for self-hosted models and improved performance.

Building an Agentic Enterprise with AI Agents in Production

Salesforce

Salesforce transformed itself into what it calls an "agentic enterprise" by deploying AI agents (branded as Agentforce) across sales, service, and marketing operations to address capacity constraints where demand exceeded headcount. The company deployed agents that autonomously handled over 2 million customer service conversations, followed up with previously untouched leads (75% of total leads), and provided 24/7 multilingual support. Key results included over $100 million in annualized cost savings from the service agent implementation, increased lead engagement leading to new revenue opportunities, and the ability to scale operations without proportional headcount increases. The initiative required significant iteration, data unification through their Data 360 platform, continuous testing and tuning of agent performance, cross-functional collaboration breaking down traditional departmental silos, and process redesigns to enable human-AI collaboration.

Building an AI Agent for Real Estate with Systematic Evaluation Framework

Rechat

Rechat developed an AI agent to assist real estate agents with tasks like contact management, email marketing, and website creation. Initially struggling with reliability and performance issues using GPT-3.5, they implemented a comprehensive evaluation framework that enabled systematic improvement through unit testing, logging, human review, and fine-tuning. This methodical approach helped them achieve production-ready reliability and handle complex multi-step commands that combine natural language with UI elements.

Building an AI Agent Platform for Enterprise Automation and Collaboration

Abundly.ai

Abundly.ai developed an AI agent platform that enables companies to deploy autonomous AI agents as digital colleagues. The company evolved from experimental hobby projects to a production platform serving multiple industries, addressing challenges in agent lifecycle management, guardrails, context engineering, and human-AI collaboration. The solution encompasses agent creation, monitoring, tool integration, and governance frameworks, with successful deployments in media (SVT journalist agent), investment screening, and business intelligence. Results include 95% time savings in repetitive tasks, improved decision quality through diligent agent behavior, and the ability for non-technical users to create and manage agents through conversational interfaces and dynamic UI generation.

Building an AI Agent Platform with Cloud-Based Virtual Machines and Extended Context

Manus

Manus AI, founded in late 2024, developed a consumer-focused AI agent platform that addresses the limitation of frontier LLMs having intelligence but lacking the ability to take action in digital environments. The company built a system where each user task is assigned a fully functional cloud-based virtual machine (Linux, with plans for Windows and Android) running real applications including file systems, terminals, VS Code, and Chromium browsers. By adopting a "less structure, more intelligence" philosophy that avoids predefined workflows and multi-role agent systems, and instead provides rich context to foundation models (primarily Anthropic's Claude), Manus created an agent capable of handling diverse long-horizon tasks from office location research to furniture shopping to data extraction, with users reporting up to 2 hours of daily GPU consumption. The platform launched publicly in March 2024 after five months of development and reportedly spent $1 million on Claude API usage in its first 14 days.

Building an AI API Gateway for Streamlined GenAI Service Development

DeliveryHero

DeliveryHero's Woowa Brothers division developed an AI API Gateway to address the challenges of managing multiple GenAI providers and streamlining development processes. The gateway serves as a central infrastructure component to handle credential management, prompt management, and system stability while supporting various GenAI services like AWS Bedrock, Azure OpenAI, and GCP Imagen. The initiative was driven by extensive user interviews and aims to democratize AI usage across the organization while maintaining security and efficiency.

Building an AI Co-Pilot Application: Patterns and Best Practices

Thoughtworks

Thoughtworks built Boba, an experimental AI co-pilot for product strategy and ideation, to learn about building generative AI experiences beyond chat interfaces. The team implemented several key patterns including templated prompts, structured responses, real-time progress streaming, context management, and external knowledge integration. The case study provides detailed insights into practical LLMOps patterns for building production LLM applications with enhanced user experiences.

Building an AI Co-pilot for Product Strategy with LLM Integration Patterns

Thoughtworks

Thoughtworks built Boba, an experimental AI co-pilot for product strategy and ideation, to explore effective patterns for LLM-powered applications beyond simple chat interfaces. The team developed and documented key patterns including templated prompts, structured responses, real-time progress streaming, context management, and external knowledge integration. The case study provides detailed implementation insights for building sophisticated LLM applications with better user experiences.

Building an AI Hiring Assistant with Agentic LLMs

LinkedIn

LinkedIn developed an AI Hiring Assistant as part of their LinkedIn Recruiter product to help enterprise recruiters evaluate candidate applications more efficiently. The assistant uses large language models to orchestrate complex recruitment workflows, retain knowledge across sessions, and reason over candidate profiles and external hiring systems. By taking a curated rollout approach with select enterprise customers, implementing transparency mechanisms, maintaining human-in-the-loop control, and continuously monitoring user signals for implicit and explicit learning, LinkedIn achieved significant efficiency gains where users spend 48% less time reviewing applications and review 62% fewer profiles before making hiring decisions, while also seeing a 69% higher InMail acceptance rate compared to traditional sourcing methods.

Building an AI Innovation Team and Platform with Safeguards at Scale

Twilio

Twilio's Emerging Tech and Innovation team tackled the challenge of integrating AI capabilities into their customer engagement platform while maintaining quality and trust. They developed an AI assistance platform that bridges structured and unstructured customer data, implementing a novel approach using a separate "Twilio Alpha" brand to enable rapid iteration while managing customer expectations. The team successfully balanced innovation speed with enterprise requirements through careful team structure, flexible architecture, and open communication practices.

Building an AI Interview Coach for Product Discovery Training

Product Talk

Teresa Torres, a product discovery coach, built an AI-powered interview coach to provide automated feedback to students in her continuous interviewing course. Starting with simple ChatGPT and Claude prototypes, she progressively developed a production system using Replit, Zapier, and eventually AWS Lambda and Step Functions. The system analyzes student interview transcripts against a rubric for story-based interviewing, providing detailed feedback on multiple dimensions including opening questions, scene-setting, timeline building, and redirecting generalizations. Through rigorous evaluation methodology including error analysis, code-based evals, and LLM-as-judge evals, she achieved sufficient quality to deploy the tool to course students. The tool now processes interviews automatically, with continuous monitoring and iteration based on comprehensive evaluation frameworks, and is being scaled through a partnership with Vistily for handling real customer interview data with appropriate SOC 2 compliance.

Building an AI Legal Assistant: From Early Testing to Production Deployment

Casetext

Casetext transformed their legal research platform into an AI-powered legal assistant called Co-Counsel using GPT-4, leading to a $650M acquisition by Thomson Reuters. The company shifted their entire 120-person team to focus on building this AI assistant after early access to GPT-4 showed promising results. Through rigorous testing, prompt engineering, and a test-driven development approach, they created a reliable AI system that could perform complex legal tasks like document review and research that previously took lawyers days to complete. The product achieved rapid market acceptance and true product-market fit within months of launch.

Building an AI Private Banker with Agentic Systems for Customer Service and Financial Operations

Nubank

Nubank, one of Brazil's largest banks serving 120 million users, implemented large-scale LLM systems to create an AI private banker for their customers. They deployed two main applications: a customer service chatbot handling 8.5 million monthly contacts with 60% first-contact resolution through LLMs, and an agentic money transfer system that reduced transaction time from 70 seconds across nine screens to under 30 seconds with over 90% accuracy and less than 0.5% error rate. The implementation leveraged LangChain, LangGraph, and LangSmith for development and evaluation, with a comprehensive four-layer ecosystem including core engines, testing tools, and developer experience platforms. Their evaluation strategy combined offline and online testing with LLM-as-a-judge systems that achieved 79% F1 score compared to 80% human accuracy through iterative prompt engineering and fine-tuning.

Building an AI Sales Development Representative with Advanced RAG Knowledge Base

Alice

11X developed Alice, an AI Sales Development Representative (SDR) that automates lead generation and email outreach at scale. The key innovation was replacing a manual product library system with an intelligent knowledge base that uses advanced RAG (Retrieval Augmented Generation) techniques to automatically ingest and understand seller information from various sources including documents, websites, and videos. This system processes multiple resource types through specialized parsing vendors, chunks content strategically, stores embeddings in Pinecone vector database, and uses deep research agents for context retrieval. The result is an AI agent that sends 50,000 personalized emails daily compared to 20-50 for human SDRs, while serving 300+ business organizations with contextually relevant outreach.

Building an AI Teaching Assistant: ChatLTV at Harvard Business School

Harvard

Harvard Business School developed ChatLTV, a specialized AI teaching assistant for the Launching Tech Ventures course. Using RAG with a corpus of course materials including case studies, teaching notes, and historical Q&A, the system helped 250 MBA students prepare for classes and understand course content. The implementation leveraged Azure OpenAI for security, Pinecone for vector storage, and Langchain for development, resulting in over 3000 student queries and improved class preparation and engagement.

Building an AI Tutor with Enhanced LLM Accuracy Through Knowledge Base Integration

Clipping

Clipping developed an AI tutor called ClippingGPT to address the challenge of LLM hallucinations and accuracy in educational settings. By implementing embeddings and training the model on a specialized knowledge base, they created a system that outperformed GPT-4 by 26% on the Brazilian Diplomatic Career Examination. The solution focused on factual recall from a reliable proprietary knowledge base before generating responses, demonstrating how domain-specific knowledge integration can enhance LLM accuracy for educational applications.

Building an AI-Assisted Content Creation Platform for Language Learning

Babbel

Babbel developed an AI-assisted content creation tool to streamline their traditional 35-hour content creation pipeline for language learning materials. The solution integrates LLMs with human expertise through a gradio-based interface, enabling prompt management, content generation, and evaluation while maintaining quality standards. The system successfully reduced content creation time while maintaining high acceptance rates (>85%) from editors.

Building an AI-Generated Movie Quiz Game with RAG and Real-Time Multiplayer

Datastax

Datastax developed UnReel, a multiplayer movie trivia game that combines AI-generated questions with real-time gaming. The system uses RAG to generate movie-related questions and fake movie quotes, implemented through Langflow, with data storage in Astra DB and real-time multiplayer functionality via PartyKit. The project demonstrates practical challenges in production AI deployment, particularly in fine-tuning LLM outputs for believable content generation and managing distributed system state.

Building an AI-Native Browser with Integrated LLM Tools and Evaluation Systems

The Browser Company

The Browser Company transitioned from their Arc browser to building Dia, an AI-native browser, requiring a fundamental shift in how they approached product development and LLMOps. The company invested heavily in tooling for rapid prototyping, evaluation systems, and automated prompt optimization using techniques like Jeba (a sample-efficient prompt optimization method). They created a "model behavior" discipline to define and ship desired LLM behaviors, treating it as a craft analogous to product design. Additionally, they built security considerations into the product design from the ground up, particularly addressing prompt injection vulnerabilities through user confirmation workflows. The result was a browser that provides an AI assistant alongside users, personalizing experiences and helping with tasks, while enabling their entire company—from CEO to strategy team members—to iterate on AI features.

Building an AI-Native Code Editor in a Competitive Market

Cursor

Cursor, an AI-powered code editor startup, entered an extremely competitive market dominated by Microsoft's GitHub Copilot and well-funded competitors like Poolside, Augment, and Magic.dev. Despite initial skepticism from advisors about competing against Microsoft's vast resources and distribution, Cursor succeeded by focusing on the right short-term product decisions—specifically deep IDE integration through forking VS Code and delivering immediate value through "Cursor Tab" code completion. The company differentiated itself through rapid iteration, concentrated talent, bottom-up adoption among developers, and eventually building their own fast agent models. Cursor demonstrated that startups can compete against tech giants by moving quickly, dog-fooding their own product, and correctly identifying what developers need in the near term rather than betting solely on long-term agent capabilities.

Building an AI-Powered Browser Extension for Product Documentation with RAG and Chain-of-Thought

Reforge

Reforge developed a browser extension to help product professionals draft and improve documents like PRDs by integrating expert knowledge directly into their workflow. The team evolved from simple RAG (Retrieve and Generate) to a sophisticated Chain-of-Thought approach that classifies document types, generates tailored suggestions, and filters content based on context. Operating with a lean team of 2-3 people, they built the extension through rapid prototyping and iterative development, integrating into popular tools like Google Docs, Notion, and Confluence. The extension uses OpenAI models with Pinecone for vector storage, emphasizing privacy by not storing user data, and leverages innovative testing approaches like analyzing course recommendation distributions and reference counts to optimize model performance without accessing user content.

Building an AI-Powered Email Writing Assistant with Personalized Style Matching

Ghostwriter

Shortwave developed Ghostwriter, an AI writing feature that helps users compose emails that match their personal writing style. The system uses embedding-based semantic search to find relevant past emails, combines them with system prompts and custom instructions, and uses fine-tuned LLMs to generate contextually appropriate suggestions. The solution addresses two key challenges: making AI-generated text sound authentic to each user's style and incorporating accurate, relevant information from their email history.

Building an AI-Powered Help Desk with RAG and Model Evaluation

Vimeo

Vimeo developed a prototype AI help desk chat system that leverages RAG (Retrieval Augmented Generation) to provide accurate customer support responses using their existing Zendesk help center content. The system uses vector embeddings to store and retrieve relevant help articles, integrates with various LLM providers through Langchain, and includes comprehensive testing of different models (Google Vertex AI Chat Bison, GPT-3.5, GPT-4) for performance and cost optimization. The prototype demonstrates successful integration of modern LLMOps practices including prompt engineering, model evaluation, and production-ready architecture considerations.

Building an AI-Powered Interview Coach with Comprehensive Evaluation Framework

Product Talk

Teresa Torres, founder of Product Talk, describes her journey building an AI interview coach over four months to help students in her Continuous Discovery course practice customer interviewing skills. Starting from a position of limited AI engineering experience, she developed a production system that analyzes interview transcripts and provides detailed feedback across four dimensions of interviewing technique. The case study focuses extensively on her implementation of a comprehensive evaluation (eval) framework, including human annotation, code-based assertions, and LLM-as-judge evaluations, to ensure quality and reliability of the AI coach's feedback before deploying it to real students.

Building an AI-Powered Software Development Platform with Multiple LLM Integration

Lovable

Lovable addresses the challenge of making software development accessible to non-programmers by creating an AI-powered platform that converts natural language descriptions into functional applications. The solution integrates multiple LLMs (including OpenAI and Anthropic models) in a carefully orchestrated system that prioritizes speed and reliability over complex agent architectures. The platform has achieved significant success, with over 1,000 projects being built daily and a rapidly growing user base that doubled its paying customers in a recent month.

Building an Asynchronous Event-Driven Agentic Framework for AI-Powered App Building

Airtable

Airtable built a custom agentic framework to power AI features including Omni (conversational app builder) and Field Agents (AI-powered fields). The problem was that early AI capabilities couldn't handle complex tasks requiring dynamic decision-making, data retrieval, or multi-step reasoning. The solution was an asynchronous event-driven state machine architecture with three core components: a context manager for maintaining information, a tool dispatcher for executing predefined actions, and a decision engine (LLM-powered) for autonomous planning. The framework enables agents to reason through complex tasks, self-correct errors, and handle large context windows through trimming and summarization strategies, resulting in production AI agents capable of automating thousands of hours of work.

Building an Autonomous AI Software Engineer with Multi-Turn RL and Codebase Understanding

Devin

Cognition, the company behind Devon (an AI software engineer), addresses the challenge of enabling AI agents to work effectively within large, existing codebases where traditional LLMs struggle with limited context windows and complex dependencies. Their solution involves creating DeepWiki, a continuously-updated interactive knowledge graph and wiki system that indexes codebases using both code and metadata (pull requests, git history, team discussions), combined with Devon Search for deep codebase research, and custom post-training using multi-turn reinforcement learning to optimize models for specific narrow domains. Results include Devon being used by teams worldwide to autonomously go from ticket to pull request, the release of Kevin 32B (an open-source model achieving 91% correctness on CUDA kernel generation, outperforming frontier models like GPT-4), and thousands of open-source projects incorporating DeepWiki into their official documentation.

Building an Enterprise AI Productivity Platform: From Slack Bot to Integrated AI Workforce

Toqan

Proess (previously called Prous) developed Toqan, an internal AI productivity platform that evolved from a simple Slack bot to a comprehensive enterprise AI system serving 30,000+ employees across 100+ portfolio companies. The platform addresses the challenge of enterprise AI adoption by providing access to multiple LLMs through conversational interfaces, APIs, and system integrations, while measuring success through user engagement metrics like daily active users and "super users" who ask 5+ questions per day. The solution demonstrates how large organizations can systematically deploy AI tools across diverse business functions while maintaining security and enabling bottom-up adoption through hands-on training and cultural change management.

Building an Enterprise GenAI Platform with Standardized LLMOps Framework

FactSet

FactSet, a financial data and analytics provider, faced challenges with fragmented LLM development approaches across teams, leading to collaboration barriers and inconsistent quality. They implemented a standardized LLMOps framework using Databricks Mosaic AI and MLflow, enabling unified governance, efficient model development, and improved deployment capabilities. This transformation resulted in significant performance improvements, including a 70% reduction in response time for code generation and 60% reduction in end-to-end latency for formula generation, while maintaining high accuracy and enabling cost-effective use of fine-tuned open-source models alongside commercial LLMs.

Building an Enterprise LLMOps Stack: Lessons from Doordash

Doordash

The ML Platform team at Doordash shares their exploration and strategy for building an enterprise LLMOps stack, discussing the unique challenges of deploying LLM applications at scale. The presentation covers key components needed for production LLM systems, including gateway services, prompt management, RAG implementations, and fine-tuning capabilities, while drawing insights from industry leaders like LinkedIn and Uber's approaches to LLMOps architecture.

Building an Enterprise RAG-based AI Assistant with Vector Search and LLM Integration

Elastic

Elastic developed ElasticGPT, an internal generative AI assistant built on their own technology stack to provide secure, context-aware knowledge discovery for their employees. The system combines RAG (Retrieval Augmented Generation) capabilities through their SmartSource framework with private access to OpenAI's GPT models, all built on Elasticsearch as a vector database. The solution demonstrates how to build a production-grade AI assistant that maintains security and compliance while delivering efficient knowledge retrieval and generation capabilities.

Building an Enterprise-Grade AI Agent for Recruiting at Scale

LinkedIn

LinkedIn developed Hiring Assistant, an AI agent designed to transform the recruiting workflow by automating repetitive tasks like candidate sourcing, evaluation, and engagement across 1.2+ billion profiles. The system addresses the challenge of recruiters spending excessive time on pattern-recognition tasks rather than high-value decision-making and relationship building. Using a plan-and-execute agent architecture with specialized sub-agents for intake, sourcing, evaluation, outreach, screening, and learning, Hiring Assistant combines real-time conversational interfaces with large-scale asynchronous execution. The solution leverages LinkedIn's Economic Graph for talent insights, custom fine-tuned LLMs for candidate evaluation, and cognitive memory systems that learn from recruiter behavior over time. The result is a globally available agentic product that enables recruiters to work with greater speed, scale, and intelligence while maintaining human-in-the-loop control for critical decisions.

Building an Evaluation-First Development Strategy for AI Service Agents

Monday

Monday Service built an AI-native Enterprise Service Management platform featuring customizable, role-based AI agents to automate customer service across IT, HR, and Legal departments. The team embedded evaluation into their development cycle from Day 0, creating a dual-layered approach with offline "safety net" evaluations for regression testing and online "monitor" evaluations for real-time production quality. This eval-driven development framework, built on LangGraph agents with LangSmith and Vitest integration, achieved 8.7x faster evaluation feedback loops (from 162 seconds to 18 seconds), comprehensive testing across hundreds of examples in minutes, real-time end-to-end quality monitoring on production traces using multi-turn evaluators, and GitOps-style CI/CD deployment with evaluations managed as version-controlled code.

Building an Event Assistant Agent in 5 Days with Agentforce and Data Cloud RAG

Salesforce

Salesforce's engineering team built "Ask Astro Agent," an AI-powered event assistant for their Dreamforce conference, in just five days by migrating from a homegrown OpenAI-based solution to their Agentforce platform with Data Cloud RAG capabilities. The agent helped attendees find information grounded in FAQs, manage schedules, and receive personalized session recommendations. The team leveraged vector and hybrid search indexing, streaming data updates via Mulesoft, knowledge article integration, and Salesforce's native tooling to create a production-ready agent that demonstrated the power of their enterprise AI stack while handling real-time event queries from thousands of attendees.

Building an Internal AI-Powered Customer Reference Discovery Platform

Databricks

Databricks faced a significant challenge in helping sales and marketing teams discover and utilize their vast collection of over 2,400 customer stories scattered across multiple platforms including YouTube, LinkedIn, internal documents, and their website. The tribal knowledge problem meant that finding the right customer reference at the right time was difficult, leading to overused references, missed opportunities, and inefficient manual searching. To solve this, they built Reffy—a full-stack agentic application using RAG (Retrieval-Augmented Generation), Vector Search, AI Functions, and Lakebase on the Databricks platform. Since its launch in December 2025, over 1,800 employees have executed more than 7,500 queries, resulting in faster campaign execution, more relevant storytelling, and democratized access to customer proof points that were previously siloed in tribal knowledge.

Building an Internal Background Coding Agent with Full Development Environment Integration

Ramp

Ramp built Inspect, an internal background coding agent that automates code generation while closing the verification loop with comprehensive testing and validation capabilities. The agent runs in sandboxed VMs on Modal with full access to all engineering tools including databases, CI/CD pipelines, monitoring systems, and feature flags. Within months of deployment, Inspect reached approximately 30% of all pull requests merged to frontend and backend repositories, demonstrating rapid adoption without mandating usage. The system's key innovation is providing agents with the same context and tools as human engineers while enabling unlimited concurrent sessions with near-instant startup times.

Building an Internal ChatGPT for Enterprise: From Failed Support Bot to Company-Wide AI Tool

Grab

Grab's ML Platform team was overwhelmed with support inquiries in Slack channels, prompting an engineer to experiment with building an LLM-powered chatbot for platform documentation. After the initial attempt failed due to token limitations and poor embedding search results, the project pivoted to creating GrabGPT—an internal ChatGPT-like tool for all employees. Deployed over a weekend with Google authentication and leveraging Grab's existing model-serving infrastructure (Catwalk), GrabGPT rapidly grew from 300 users on day one to becoming nearly universally adopted across the company, with over 3,000 users and 600 daily active users within three months. The success was attributed to data security controls, global accessibility (especially in regions where ChatGPT is blocked), model-agnostic architecture supporting multiple LLM providers, and full auditability for governance.

Building an Internal ChatGPT-like Tool for Enterprise-wide AI Access

Grab

Grab's ML Platform team faced overwhelming support channel inquiries that consumed engineering time with repetitive questions. An engineer initially attempted to build a RAG-based chatbot for platform documentation but encountered context window limitations with GPT-3.5-turbo and scalability issues. Pivoting from this failed experiment, the engineer built GrabGPT, an internal ChatGPT-like tool accessible to all employees, deployed over a weekend using existing frameworks and Grab's model-serving platform. The tool rapidly scaled to nearly company-wide adoption, with over 3000 users within three months and 600 daily active users, providing secure, auditable, and globally accessible LLM capabilities across multiple model providers including OpenAI, Claude, and Gemini.

Building an LLM-Powered Support Response System

Stripe

Stripe developed an LLM-based system to help support agents handle customer inquiries more efficiently by providing relevant response prompts. The solution evolved from a simple GPT implementation to a sophisticated multi-stage framework incorporating fine-tuned models for question validation, topic classification, and response generation. Despite strong offline performance, the team faced challenges with agent adoption and online monitoring, leading to valuable lessons about the importance of UX consideration, online feedback mechanisms, and proper data management in LLM production systems.

Building an On-Premise Health Insurance Appeals Generation System

HealthInsuranceLLM

Development of an LLM-based system to help generate health insurance appeals, deployed on-premise with limited resources. The system uses fine-tuned models trained on publicly available medical review board data to generate appeals for insurance claim denials. The implementation includes Kubernetes deployment, GPU inference, and a Django frontend, all running on personal hardware with multiple internet providers for reliability.

Building Analytics Applications with LLMs for E-commerce Review Analysis

Microsoft

The case study explores how Large Language Models (LLMs) can revolutionize e-commerce analytics by analyzing customer product reviews. Traditional methods required training multiple models for different tasks like sentiment analysis and aspect extraction, which was time-consuming and lacked explainability. By implementing OpenAI's LLMs with careful prompt engineering, the solution enables efficient multi-task analysis including sentiment analysis, aspect extraction, and topic clustering while providing better explainability for stakeholders.

Building and Automating Comprehensive LLM Evaluation Framework for SNAP Benefits

Propel

Propel developed a sophisticated evaluation framework for testing and benchmarking LLM performance in handling SNAP (food stamp) benefit inquiries. The company created two distinct evaluation approaches: one for benchmarking current base models on SNAP topics, and another for product development. They implemented automated testing using Promptfoo and developed innovative ways to evaluate model responses, including using AI models as judges for assessing response quality and accessibility.

Building and Debugging Web Automation Agents with LangChain Ecosystem

Airtop

Airtop developed a web automation platform that enables AI agents to interact with websites through natural language commands. They leveraged the LangChain ecosystem (LangChain, LangSmith, and LangGraph) to build flexible agent architectures, integrate multiple LLM models, and implement robust debugging and testing processes. The platform successfully enables structured information extraction and real-time website interactions while maintaining reliability and scalability.

Building and Deploying a Pokemon-Playing LLM Agent at Anthropic

Anthropic

David Hershey from Anthropic developed a side project that evolved into a significant demonstration of LLM agent capabilities, where Claude (Anthropic's LLM) plays Pokemon through an agent framework. The system processes screen information, makes decisions, and executes actions, demonstrating long-horizon decision making and learning. The project not only served as an engaging public demonstration but also provided valuable insights into model capabilities and improvements across different versions.

Building and Deploying AI Agents for Account Qualification

Unify

Unify developed an AI agent system for automating account qualification in sales processes, using LangGraph for agent orchestration and LangSmith for experimentation and tracing. They evolved their agent architecture through multiple iterations, focusing on improving planning, reflection, and execution capabilities while optimizing for speed and user experience. The final system features real-time progress visualization and parallel tool execution, demonstrating practical solutions to common challenges in deploying LLM-based agents in production.

Building and Deploying an AI-Powered Incident Summary Generator

Incident.io

incident.io developed an AI feature to automatically generate and suggest incident summaries using OpenAI's models. The system processes incident updates, Slack conversations, and metadata to create comprehensive summaries that help newcomers get up to speed quickly. The feature achieved a 63% direct acceptance rate, with an additional 26% of suggestions being edited before use, demonstrating strong practical utility in production.

Building and Deploying Production AI Agents for Enterprise Data Analysis

Asterrave

Rosco's CTO shares their two-year journey of rebuilding their product around AI agents for enterprise data analysis. They focused on enabling agents to reason rather than rely on static knowledge, developing discrete tool calls for data warehouse queries, and creating effective agent-computer interfaces. The team discovered key insights about model selection, response formatting, and multi-agent architectures while avoiding fine-tuning and third-party frameworks. Their solution successfully enabled AI agents to query enterprise data warehouses with proper security credentials and user permissions.

Building and Deploying Production LLM Code Review Agents: Architecture and Best Practices

Ellipsis

Ellipsis developed an AI-powered code review system that uses multiple specialized LLM agents to analyze pull requests and provide feedback. The system employs parallel comment generators, sophisticated filtering pipelines, and advanced code search capabilities backed by vector stores. Their approach emphasizes accuracy over latency, uses extensive evaluation frameworks including LLM-as-judge, and implements robust error handling. The system successfully processes GitHub webhooks and provides automated code reviews with high accuracy and low false positive rates.

Building and Deploying Repository-Specific AI Assistants for GitHub

PeterCat.ai

PeterCat.ai developed a system to create customized AI assistants for GitHub repositories, focusing on improving code review and issue management processes. The solution combines LLMs with RAG for enhanced context awareness, implements PR review and issue handling capabilities, and uses a GitHub App for seamless integration. Within three months of launch, the system was adopted by 178 open source projects, demonstrating its effectiveness in streamlining repository management and developer support.

Building and Deploying the Codex App: A Multi-Agent AI Development Environment

OpenAI

OpenAI's Codex team developed a dedicated GUI application for AI-powered coding that serves as a command center for multi-agent systems, moving beyond traditional IDE and terminal interfaces. The team addressed the challenge of making AI coding agents accessible to broader audiences while maintaining professional-grade capabilities for software developers. By combining the GPT-5.3 Codex model with agent skills, automations, and a purpose-built interface, they created a production system that enables delegation-based development workflows where users supervise AI agents performing complex coding tasks. The result was over one million downloads in the first week, widespread internal adoption at OpenAI including by research teams, and a strategic shift positioning AI coding tools for mainstream use, culminating in a Super Bowl advertisement.

Building and Evaluating a Financial Earnings Call Summarization System

Aiera

Aiera, an investor intelligence platform, developed a system for automated summarization of earnings call transcripts. They created a custom dataset from their extensive collection of earnings call transcriptions, using Claude 3 Opus to extract targeted insights. The project involved comparing different evaluation metrics including ROUGE and BERTScore, ultimately finding Claude 3.5 Sonnet performed best for their specific use case. Their evaluation process revealed important insights about the trade-offs between different scoring methodologies and the challenges of evaluating generative AI outputs in production.

Building and Evaluating a RAG-based Menopause Information Chatbot

Vira Health

Vira Health developed and evaluated an AI chatbot to provide reliable menopause information using peer-reviewed position statements from The Menopause Society. They implemented a RAG (Retrieval Augmented Generation) architecture using GPT-4, with careful attention to clinical safety and accuracy. The system was evaluated using both AI judges and human clinicians across four criteria: faithfulness, relevance, harmfulness, and clinical correctness, showing promising results in terms of safety and effectiveness while maintaining strict adherence to trusted medical sources.

Building and Evaluating Legal AI at Scale with Domain Expert Integration

Harvey

Harvey, a legal AI company, has developed a comprehensive approach to building and evaluating AI systems for legal professionals, serving nearly 400 customers including one-third of the largest 100 US law firms. The company addresses the complex challenges of legal document analysis, contract review, and legal drafting through a suite of AI products ranging from general-purpose assistants to specialized workflows for large-scale document extraction. Their solution integrates domain experts (lawyers) throughout the entire product development process, implements multi-layered evaluation systems combining human preference judgments with automated LLM-based evaluations, and has built custom benchmarks and tooling to assess quality in this nuanced domain where mistakes can have career-impacting consequences.

Building and Evaluating Legal AI with Multi-Modal Evaluation Systems

Unify

Harvey, a legal AI company, has developed a comprehensive approach to building and evaluating AI systems for legal professionals, addressing the unique challenges of document complexity, nuanced outputs, and high-stakes accuracy requirements. Their solution combines human-in-the-loop evaluation with automated model-based assessments, custom benchmarks like BigLawBench, and a "lawyer-in-the-loop" product development philosophy that embeds legal domain experts throughout the engineering process. The company has achieved significant scale with nearly 400 customers globally, including one-third of the largest 100 US law firms, demonstrating measurable improvements in evaluation quality and product iteration speed through their systematic LLMOps approach.

Building and Evaluating Maya: An AI-Powered Data Pipeline Generation System

Maia

Matillion developed Maya, a digital data engineer product that uses LLMs to help data engineers build data pipelines more productively. Starting as a simple chatbot co-pilot in mid-2022, Maya evolved into a core interface for the Data Productivity Cloud (DPC), generating data pipelines through natural language prompts. The company faced challenges transitioning from informal "vibes-based" evaluation to rigorous testing frameworks required for enterprise deployment. They implemented a multi-phase approach: starting with simple certification exam tests, progressing to LLM-as-judge evaluation with human-in-the-loop validation, and finally building automated testing harnesses integrated with Langfuse for observability. This evolution enabled them to confidently upgrade models (like moving to Claude Sonnet 3.5 within 24 hours) and successfully launch Maya to enterprise customers in June 2024, while navigating challenges around PII handling in trace data and integrating MLOps skillsets into traditional software engineering teams.

Building and Evaluating Production AI Agents: From Function Calling to Complex Multi-Agent Systems

Google Deepmind

This case study explores the evolution of LLM-based systems in production through discussions with Raven Kumar from Google DeepMind about building products like Notebook LM, Project Mariner, and working with the Gemini and Gemma model families. The conversation covers the rapid progression from simple function calling to complex agentic systems capable of multi-step reasoning, the critical importance of evaluation harnesses as competitive advantages, and practical considerations around context engineering, tool orchestration, and model selection. Key insights include how model improvements are causing teams to repeatedly rebuild agent architectures, the importance of shipping products quickly to learn from real users, and strategies for evaluating increasingly complex multi-modal agentic systems across different scales from edge devices to cloud-based deployments.

Building and Evaluating Production Voice Agents: From Custom Infrastructure to Platform Solutions

Nomore Engineering

A team explored building a phone agent system for handling doctor appointments in Polish primary care, initially attempting to build their own infrastructure before evaluating existing platforms. They implemented a complex system involving speech-to-text, LLMs, text-to-speech, and conversation orchestration, along with comprehensive testing approaches. After building the complete system, they ultimately decided to use a third-party platform (Vapi.ai) due to the complexities of maintaining their own infrastructure, while gaining valuable insights into voice agent architecture and testing methodologies.

Building and Evolving a Production GenAI Application Stack

LinkedIn

LinkedIn's journey in developing their GenAI application tech stack, transitioning from simple prompt-based solutions to complex conversational agents. The company evolved from Java-based services to a Python-first approach using LangChain, implemented comprehensive prompt management, developed a skill-based task automation framework, and built robust conversational memory infrastructure. This transformation included migrating existing applications while maintaining production stability and enabling both commercial and fine-tuned open-source LLM deployments.

Building and Implementing a Company-Wide GenAI Strategy

Thumbtack

Thumbtack developed and implemented a comprehensive generative AI strategy focusing on three key areas: enhancing their consumer product with LLMs for improved search and data analysis, transforming internal operations through AI-powered business processes, and boosting employee productivity. They established new infrastructure and policies for secure LLM deployment, demonstrated value through early wins in policy violation detection, and successfully drove company-wide adoption through executive sponsorship and careful expectation management.

Building and Managing Production Agents with Testing and Evaluation Infrastructure

Nearpod

Nearpod, an edtech company, implemented a sophisticated agent-based architecture to help teachers generate educational content. They developed a framework for building, testing, and deploying AI agents with robust evaluation capabilities, ensuring 98-100% accuracy while managing costs. The system includes specialized agents for different tasks, an agent registry for reuse across teams, and extensive testing infrastructure to ensure reliable production deployment of non-deterministic systems.

Building and Operating a CLI-Based LLM Coding Assistant

Anthropic

Anthropic developed Claude Code, a CLI-based coding assistant that provides direct access to their Sonnet LLM for software development tasks. The tool started as an internal experiment but gained rapid adoption within Anthropic, leading to its public release. The solution emphasizes simplicity and Unix-like utility design principles, achieving an estimated 2-10x developer productivity improvement for active users while maintaining a pay-as-you-go pricing model averaging $6/day per active user.

Building and Operating an MCP Server for LLM-Powered Cloud Infrastructure Queries

CloudQuery

CloudQuery built a Model Context Protocol (MCP) server in Go to enable Claude and Cursor to directly query their cloud infrastructure database. They encountered significant challenges with LLM tool selection, context window limitations, and non-deterministic behavior. By rewriting tool descriptions to be longer and more domain-specific, renaming tools to better match user intent, implementing schema filtering to reduce token usage by 90%, and embedding recommended multi-tool workflows, they dramatically improved how the LLM engaged with their system. The solution transformed Claude's interaction from hallucinating queries to systematically following a discovery-to-execution pipeline.

Building and Operating Production LLM Agents: Lessons from the Trenches

Ellipsis

A comprehensive analysis of 15 months experience building LLM agents, focusing on the practical aspects of deployment, testing, and monitoring. The case study covers essential components of LLMOps including evaluation pipelines in CI, caching strategies for deterministic and cost-effective testing, and observability requirements. The author details specific challenges with prompt engineering, the importance of thorough logging, and the limitations of existing tools while providing insights into building reliable AI agent systems.

Building and Optimizing a RAG-based Customer Service Chatbot

HDI

HDI, a German insurance company, implemented a RAG-based chatbot system to help customer service agents quickly find and access information across multiple knowledge bases. The system processes complex insurance documents, including tables and multi-column layouts, using various chunking strategies and vector search optimizations. After 120 experiments to optimize performance, the production system now serves 800+ users across multiple business lines, handling 26 queries per second with 88% recall rate and 6ms query latency.

Building and Optimizing AI Programming Agents with MLOps Infrastructure at Scale

Weights & Biases

This case study describes Weights & Biases' development of programming agents that achieved top performance on the SWEBench benchmark, demonstrating how MLOps infrastructure can systematically improve AI agent performance through experimental workflows. The presenter built "Tiny Agent," a command-line programming agent, then optimized it through hundreds of experiments using OpenAI's O1 reasoning model to achieve the #1 position on SWEBench leaderboard. The approach emphasizes systematic experimentation with proper tracking, evaluation frameworks, and infrastructure scaling, while introducing tools like Weave for experiment management and WB Launch for distributed computing. The work also explores reinforcement learning for agent improvement and introduces the concept of "researcher agents" that can autonomously improve AI systems.

Building and Orchestrating Multi-Agent Systems at Scale with CrewAI

CrewAI

CrewAI developed a production-ready framework for building and orchestrating multi-agent AI systems, demonstrating its capabilities through internal use cases including marketing content generation, lead qualification, and documentation automation. The platform has achieved significant scale, executing over 10 million agents in 30 days, and has been adopted by major enterprises. The case study showcases how the company used their own technology to scale their operations, from automated content creation to lead qualification, while addressing key challenges in production deployment of AI agents.

Building and Scaling a Production Generative AI Assistant for Professional Networking

LinkedIn

LinkedIn developed a generative AI-powered experience to enhance job searches and professional content browsing. The system uses a RAG-based architecture with specialized AI agents to handle different query types, integrating with internal APIs and external services. Key challenges included evaluation at scale, API integration, maintaining consistent quality, and managing computational resources while keeping latency low. The team achieved basic functionality quickly but spent significant time optimizing for production-grade reliability.

Building and Scaling AI-Powered Password Detection in Production

Github

Github developed and deployed Copilot secret scanning to detect generic passwords in codebases using AI/LLMs, addressing the limitations of traditional regex-based approaches. The team iteratively improved the system through extensive testing, prompt engineering, and novel resource management techniques, ultimately achieving a 94% reduction in false positives while maintaining high detection accuracy. The solution successfully scaled to handle enterprise workloads through sophisticated capacity management and workload-aware request handling.

Building and Scaling an AI Coding Agent Through Rapid Iteration and User Feedback

Anthropic

Anthropic developed Claude Code, an AI-powered coding agent that started as an internal prototyping tool and evolved into a widely-adopted product through organic growth and rapid iteration. The team faced challenges in making an LLM-based coding assistant that could handle complex, multi-step software engineering tasks while remaining accessible and customizable across diverse developer environments. Their solution involved a minimalist terminal-first interface, extensive customization capabilities through hooks and sub-agents, rigorous internal dogfooding with over 1,000 Anthropic employees, and tight feedback loops that enabled weekly iteration cycles. The product achieved high viral adoption internally before external launch, expanded beyond professional developers to designers and product managers who now contribute code directly, and established a fast-shipping culture where features often go from prototype to production within weeks based on real user feedback rather than extensive upfront planning.

Building and Scaling an Enterprise AI Assistant with GPT Models

Instacart

Instacart developed Ava, an internal AI assistant powered by GPT-4 and GPT-3.5, which evolved from a hackathon project to a company-wide productivity tool. The assistant features a web interface, Slack integration, and a prompt exchange platform, achieving widespread adoption with over half of Instacart employees using it monthly and 900 weekly users. The system includes features like conversation search, automatic model upgrades, and thread summarization, significantly improving productivity across engineering and non-engineering teams.

Building and Scaling an LLM-Powered Query Assistant in Production

Honeycomb

Honeycomb implemented a Query Assistant powered by LLMs to help users better understand and utilize their observability platform's querying capabilities. The feature was developed rapidly with a "ship to learn" mindset, using GPT-3.5 Turbo and text embeddings. While the initial adoption varied across pricing tiers (82% Enterprise/Pro, 75% Self-Serve, 39% Free) and some metrics didn't meet expectations, it achieved significant successes: teams using Query Assistant showed 26.5% retention in manual querying vs 4.5% for non-users, higher complex query creation (33% vs 15.7%), and increased board creation (11% vs 3.6%). Notably, the implementation proved extremely cost-effective at around $30/month in OpenAI costs, demonstrated strong integration with existing workflows, and revealed unexpected user behaviors like handling DSL expressions and trace IDs. The project validated Honeycomb's approach to AI integration while providing valuable insights for future AI features.

Building and Scaling Codex: OpenAI's Production Coding Agent

OpenAI

OpenAI developed Codex, a coding agent that serves as an AI-powered software engineering teammate, addressing the challenge of accelerating software development workflows. The solution combines a specialized coding model (GPT-5.1 Codex Max), a custom API layer with features like context compaction, and an integrated harness that works through IDE extensions and CLI tools using sandboxed execution environments. Since launching and iterating based on user feedback in August, Codex has grown 20x, now serves many trillions of tokens per week, has become the most-served coding model both in first-party use and via API, and has enabled dramatic productivity gains including shipping the Sora Android app (which became the #1 app in the app store) in just 28 days with 2-3 engineers, demonstrating significant acceleration in production software development at scale.

Building and Scaling Conversational Voice AI Agents for Enterprise Go-to-Market

Thoughtly / Gladia

Thoughtly, a voice AI platform founded in late 2023, provides conversational AI agents for enterprise sales and customer support operations. The company orchestrates speech-to-text, large language models, and text-to-speech systems to handle millions of voice calls with sub-second latency requirements. By optimizing every layer of their stack—from telephony providers to LLM inference—and implementing sophisticated caching, conditional navigation, and evaluation frameworks, Thoughtly delivers 3x conversion rates over traditional methods and 15x ROI for customers. The platform serves enterprises with HIPAA and SOC 2 compliance while handling both inbound customer support and outbound lead activation at massive scale across multiple languages and regions.

Building and Scaling Enterprise LLMOps Platforms: From Team Topology to Production

Various

A comprehensive overview of how enterprises are implementing LLMOps platforms, drawing from DevOps principles and experiences. The case study explores the evolution from initial AI adoption to scaling across teams, emphasizing the importance of platform teams, enablement, and governance. It highlights the challenges of testing, model management, and developer experience while providing practical insights into building robust AI infrastructure that can support multiple teams within an organization.

Building and Scaling GitHub Copilot: From Prototype to Enterprise AI Coding Assistant

GitHub

GitHub shares the three-year journey of developing GitHub Copilot, an LLM-powered code completion tool, from concept to general availability. The team followed a "find it, nail it, scale it" framework to identify the problem space (helping developers code faster), create a smooth product experience through rapid iteration and A/B testing, and scale to enterprise readiness. Starting with a focused problem of function-level code completion in IDEs, they leveraged OpenAI's LLMs and Microsoft Azure infrastructure, implementing techniques like neighboring tabs processing, caching for consistency, and security filters. Through technical previews and community feedback, they achieved a 55% faster coding speed and 74% reduction in developer frustration, while addressing responsible AI concerns through code reference tools and vulnerability filtering.

Building and Scaling Internal Data Agents and AI-Powered Frontend Development Tools

Vercel

Vercel developed two significant production AI applications: DZ, an internal text-to-SQL data agent that enables employees to query Snowflake using natural language in Slack, and V0, a public-facing AI tool for generating full-stack web applications. The company initially built DZ as a traditional tool-based agent but completely rebuilt it as a coding-style agent with simplified architecture (just two tools: bash and SQL execution), dramatically improving performance by leveraging models' native coding capabilities. V0 evolved from a 2023 prototype targeting frontend engineers into a comprehensive full-stack development tool as models improved, finding strong product-market fit with tech-adjacent users and enabling significant internal productivity gains. Both products demonstrate Vercel's philosophy that building custom agents is straightforward and preferable to buying off-the-shelf solutions, with the company successfully deploying these AI systems at scale while maintaining reliability and supporting their core infrastructure business.

Building and Scaling LLM Applications at Discord

Discord

Discord shares their comprehensive approach to building and deploying LLM-powered features, from ideation to production. They detail their process of identifying use cases, defining requirements, prototyping with commercial LLMs, evaluating prompts using AI-assisted evaluation, and ultimately scaling through either hosted or self-hosted solutions. The case study emphasizes practical considerations around latency, quality, safety, and cost optimization while building production LLM applications.

Building and Scaling Production Code Agents: Lessons from Replit

Replit

Replit developed and deployed a production-grade code agent that helps users create and modify code through natural language interaction. The team faced challenges in defining their target audience, detecting failure cases, and implementing comprehensive evaluation systems. They scaled from 3 to 20 engineers working on the agent, developed custom evaluation frameworks, and successfully launched features like rapid build mode that reduced initial application setup time from 7 to 2 minutes. The case study highlights key learnings in agent development, testing, and team scaling in a production environment.

Building and Scaling Production-Ready AI Agents: Lessons from Agent Force

Salesforce

Salesforce introduced Agent Force, a low-code/no-code platform for building, testing, and deploying AI agents in enterprise environments. The case study explores the challenges of moving from proof-of-concept to production, emphasizing the importance of comprehensive testing, evaluation, monitoring, and fine-tuning. Key insights include the need for automated evaluation pipelines, continuous monitoring, and the strategic use of fine-tuning to improve performance while reducing costs.

Building and Sunsetting Ada: An Internal LLM-Powered Chatbot Assistant

Leboncoin

Leboncoin, a French e-commerce platform, built Ada—an internal LLM-powered chatbot assistant—to provide employees with secure access to GenAI capabilities while protecting sensitive data from public LLM services. Starting in late 2023, the project evolved from a general-purpose Claude-based chatbot to a suite of specialized RAG-powered assistants integrated with internal knowledge sources like Confluence, Backstage, and organizational data. Despite achieving strong technical results and valuable learning outcomes around evaluation frameworks, retrieval optimization, and enterprise LLM deployment, the project was phased out in early 2025 in favor of ChatGPT Enterprise with EU data residency, allowing the team to redirect their expertise toward more user-facing use cases while reducing operational overhead.

Building and Testing a Production LLM-Powered Quiz Application

Google

A case study of transforming a traditional trivia quiz application into an LLM-powered system using Google's Vertex AI platform. The team evolved from using static quiz data to leveraging PaLM and later Gemini models for dynamic quiz generation, addressing challenges in prompt engineering, validation, and testing. They achieved significant improvements in quiz accuracy from 70% with Gemini Pro to 91% with Gemini Ultra, while implementing robust validation methods using LLMs themselves to evaluate quiz quality.

Building and Testing Production AI Applications at CircleCI

CircleCI

CircleCI shares their experience building AI-enabled applications like their error summarizer tool, focusing on the challenges of testing and evaluating LLM-powered applications in production. They discuss implementing model-graded evals, handling non-deterministic outputs, managing costs, and building robust testing strategies that balance thoroughness with practicality. The case study provides insights into applying traditional software development practices to AI applications while addressing unique challenges around evaluation, cost management, and scaling.

Building ART·E: Reinforcement Learning for Email Search Agent Development

OpenPipe

OpenPipe developed ART·E, an email research agent that outperforms OpenAI's o3 model on email search tasks. The project involved creating a synthetic dataset from the Enron email corpus, implementing a reinforcement learning training pipeline using Group Relative Policy Optimization (GRPO), and developing a multi-objective reward function. The resulting model achieved higher accuracy while being faster and cheaper than o3, taking fewer turns to answer questions correctly and hallucinating less frequently, all while being trained on a single H100 GPU for under $80.

Building Ask Learn: A Large-Scale RAG-Based Knowledge Service for Azure Documentation

Microsoft

Microsoft's Skilling organization built "Ask Learn," a retrieval-augmented generation (RAG) system that powers AI-driven question-answering capabilities for Microsoft Q&A and serves as ground truth for Microsoft Copilot for Azure. Starting from a 2023 hackathon project, the team evolved a naïve RAG implementation into an advanced RAG system featuring sophisticated pre- and post-processing pipelines, continuous content ingestion from Microsoft Learn documentation, vector database management, and comprehensive evaluation frameworks. The system handles massive scale, provides accurate and verifiable answers, and serves multiple use cases including direct question answering, grounding data for other chat handlers, and fallback functionality when the Copilot cannot complete requested tasks.

Building Claude Code: Scaling AI-Powered Development from Terminal Prototype to Production

Anthropic

Anthropic's Boris Churnney, creator of Claude Code, describes the journey from an accidental terminal prototype in September 2024 to a production coding tool used by 70% of startups and responsible for 4% of all public commits globally. Starting as a simple API testing tool, Claude Code evolved through continuous user feedback and rapid iteration, with the entire codebase rewritten every few months to adapt to improving model capabilities. The tool achieved remarkable productivity gains at Anthropic itself, with engineers seeing 70% productivity increases per capita despite team doubling, and total productivity improvements of 150% since launch. The development philosophy centered on building for future model capabilities rather than current ones, anticipating improvements 6 months ahead, and minimizing scaffolding that would become obsolete with each new model release.

Building Custom AI Review Dashboards for Production LLM Monitoring

Anterior

Anterior developed "Scalpel," a custom review dashboard enabling a small team of clinicians to review over 100,000 medical decisions made by their AI system. The dashboard was built around three core principles: optimizing context surfacing for high-quality reviews, streamlining review flow sequences to minimize time per review, and designing reviews to generate actionable data for AI system improvements. This approach allowed domain experts to efficiently evaluate AI outputs while providing structured feedback that could be directly translated into system enhancements, demonstrating how custom tooling can bridge the gap between production AI performance and iterative improvement processes.

Building Customer Intelligence MCP Server for AI Agent Integration

Dovetail

Dovetail, a customer intelligence platform, developed an MCP (Model Context Protocol) server to enable AI agents to access and utilize customer feedback data stored in their platform. The solution addresses the challenge of teams wanting to integrate their customer intelligence into internal AI workflows, allowing for automated report generation, roadmap development, and faster decision-making across product management, customer success, and design teams.

Building Deep Research: A Production AI Research Assistant Agent

Google Deepmind

Google Deepmind developed Deep Research, a feature that acts as an AI research assistant using Gemini to help users learn about any topic in depth. The system takes a query, browses the web for about 5 minutes, and outputs a comprehensive research report that users can review and ask follow-up questions about. The system uses iterative planning, transparent research processes, and a sophisticated orchestration backend to manage long-running autonomous research tasks.

Building Economic Infrastructure for AI with Foundation Models and Agentic Commerce

Stripe

Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.

Building Effective Agents: Practical Framework and Design Principles

Anthropic

Anthropic presents a practical framework for building production-ready AI agents, addressing the challenge of when and how to deploy agentic systems effectively. The presentation introduces three core principles: selective use of agents for appropriate use cases, maintaining simplicity in design, and adopting the agent's perspective during development. The solution emphasizes a checklist-based approach for evaluating agent suitability considering task complexity, value justification, capability validation, and error costs. Results include successful deployment of coding agents and other domain-specific agents that share a common backbone of environment, tools, and system prompts, demonstrating that simple architectures can deliver sophisticated behavior when properly designed and iterated upon.

Building Enterprise-Grade GenAI Platform with Multi-Cloud Architecture

Coinbase

Coinbase developed CB-GPT, an enterprise GenAI platform, to address the challenges of deploying LLMs at scale across their organization. Initially focused on optimizing cost versus accuracy, they discovered that enterprise-grade LLM deployment requires solving for latency, availability, trust and safety, and adaptability to the rapidly evolving LLM landscape. Their solution was a multi-cloud, multi-LLM platform that provides unified access to models across AWS Bedrock, GCP VertexAI, and Azure, with built-in RAG capabilities, guardrails, semantic caching, and both API and no-code interfaces. The platform now serves dozens of internal use cases and powers customer-facing applications including a conversational chatbot launched in June 2024 serving all US consumers.

Building Enterprise-Ready AI Development Infrastructure from Day One

Windsurf

Codeium's journey in building their AI-powered development tools showcases how investing early in enterprise-ready infrastructure, including containerization, security, and comprehensive deployment options, enabled them to scale from individual developers to large enterprise customers. Their "go slow to go fast" approach in building proprietary infrastructure for code completion, retrieval, and agent-based development culminated in Windsurf IDE, demonstrating how thoughtful early architectural decisions can create a more robust foundation for AI tools in production.

Building Enterprise-Scale AI Applications with LangChain and LangSmith

Rakuten

Rakuten Group leveraged LangChain and LangSmith to build and deploy multiple AI applications for both their business clients and employees. They developed Rakuten AI for Business, a comprehensive AI platform that includes tools like AI Analyst for market intelligence, AI Agent for customer support, and AI Librarian for documentation management. The team also created an employee-focused chatbot platform using OpenGPTs package, achieving rapid development and deployment while maintaining enterprise-grade security and scalability.

Building Evaluation Frameworks for AI Product Managers: A Workshop on Production LLM Testing

Arize

This workshop, presented by Aman, an AI product manager at Arize, addresses the challenge of shipping reliable AI applications in production by establishing evaluation frameworks specifically designed for product managers. The problem identified is that LLMs inherently hallucinate and are non-deterministic, making traditional software testing approaches insufficient. The solution involves implementing "LLM as a judge" evaluation systems, building comprehensive datasets, running experiments with prompt variations, and establishing human-in-the-loop validation workflows. The approach demonstrates how product managers can move from "vibe coding" to "thrive coding" by using data-driven evaluation methods, prompt playgrounds, and continuous monitoring. Results show that systematic evaluation can catch issues like mismatched tone, missing features, and hallucinations before production deployment, though the workshop candidly acknowledges that evaluations themselves require validation and iteration.

Building Evaluation Systems for AI-Powered Healthcare at Scale

Sword Health

Sword Health developed Phoenix, an AI care specialist that provides clinical support to patients during physical therapy sessions and between appointments. The company addressed the challenge of deploying large language models safely in healthcare by implementing a comprehensive evaluation framework combining offline and online assessments. Their approach includes building diverse evaluation datasets through strategic sampling and synthetic data generation, developing multiple types of evaluators (human-based, code-based, and LLM-as-judge), conducting vibe checks before release, and maintaining continuous monitoring in production through guardrails, A/B testing, manual audits, and automated evaluation of production traces. This eval-driven development process enables iterative improvement, quality assurance, objective model comparison, and cost optimization while ensuring patient safety.

Building Fair Housing Guardrails for Real Estate LLMs: Zillow's Multi-Strategy Approach to Preventing Discrimination

Zillow

Zillow developed a comprehensive Fair Housing compliance system for LLMs in real estate applications, combining three distinct strategies to prevent discriminatory responses: prompt engineering, stop lists, and a custom classifier model. The system addresses critical Fair Housing Act requirements by detecting and preventing responses that could enable steering or discrimination based on protected characteristics. Using a BERT-based classifier trained on carefully curated and augmented datasets, combined with explicit stop lists and prompt engineering, Zillow created a dual-layer protection system that validates both user inputs and model outputs. The approach achieved high recall in detecting non-compliant content while maintaining reasonable precision, demonstrating how domain-specific guardrails can be successfully implemented for LLMs in regulated industries.

Building Foundation Models for Computer Use Agents

Tzafon

Tzafon, a research lab focused on training foundation models for computer use agents, tackled the challenge of enabling LLMs to autonomously interact with computers through visual understanding and action execution. The company identified fundamental limitations in existing models' ability to ground visual information and coordinate actions, leading them to develop custom infrastructure (Waypoint) for data generation at scale, fine-tune vision encoders on screenshot data, and ultimately pre-train models from scratch with specialized computer interaction capabilities. While initial approaches using supervised fine-tuning and reinforcement learning on successful trajectories showed limited generalization, their focus on solving the grounding problem through improved vision-language integration and domain-specific pre-training has positioned them to release models and desktop applications for autonomous computer use, though performance on benchmarks like OS World remains a challenge across the industry.

Building Fully Autonomous Coding Agents for Non-Technical Users

Replit

Replit developed autonomous coding agents designed specifically for non-technical users, evolving from basic code completion tools to fully autonomous agents capable of running for hours while handling all technical decisions. The company identified that autonomy shouldn't be conflated with long runtimes but rather defined by the agent's ability to make technical decisions without user intervention. Their solution involved three key pillars: leveraging frontier model capabilities, implementing comprehensive autonomous testing using browser automation and Playwright, and sophisticated context management through sub-agent orchestration. The approach reduced context compression needs significantly (from 35 to 45-50 memories per compression), enabled agents to run coherently for extended periods without technical user input, and achieved order-of-magnitude improvements in testing cost and latency compared to computer vision approaches.

Building Gemini Deep Research: An Agentic Research Assistant with Custom-Tuned Models

Google Deepmind

Google DeepMind developed Gemini Deep Research, an AI-powered research assistant that autonomously browses the web for 5-10 minutes to generate comprehensive research reports with citations. The product addresses the challenge of users wanting to go from "zero to 50" on new topics quickly, automating what would typically require opening dozens of browser tabs and hours of manual research. The team solved key technical challenges around agentic planning, transparent UX design with editable research plans, asynchronous orchestration, and post-training custom models (initially Gemini 1.5 Pro, moving toward 2.0 Flash) to reliably perform iterative web search and synthesis. The product launched in December 2024 and has been widely praised as potentially the most useful public-facing AI agent to date, with users reporting it can compress hours or days of research work into minutes.

Building GitHub Copilot: Working with OpenAI's LLMs in Production

GitHub

GitHub developed GitHub Copilot by integrating OpenAI's large language models, starting with GPT-3 and evolving through multiple iterations of the Codex model. The problem was creating an effective AI-powered code generation tool that could work seamlessly within developer IDEs. The solution involved extensive prompt crafting to create optimal "pseudo-documents" that guide the model toward better completions, fine-tuning on specific codebases, and implementing contextual improvements such as incorporating code from neighboring editor tabs and file paths. The results included dramatic improvements in code acceptance rates, with the multilingual model eventually solving over 90% of test problems compared to about 50% initially, and noticeable quality improvements particularly for non-top-five programming languages when new model versions were deployed.

Building Healthcare-Specific LLM Pipelines for Oncology Patient Timelines

Roche Diagnostics / John Snow Labs

Roche Diagnostics developed an AI-assisted data abstraction solution using healthcare-specific LLMs to extract and structure oncology patient timelines from unstructured clinical notes. The system leverages natural language processing and machine learning to automatically detect medical concepts, focusing particularly on chemotherapy treatment timelines. The solution addresses the challenge of processing diverse, unstructured healthcare data formats while maintaining high accuracy through domain-specific LLMs and carefully engineered prompts.

Building Internal LLM Tools with Security and Privacy Focus

Wealthsimple

Wealthsimple developed an internal LLM Gateway and suite of generative AI tools to enable secure and privacy-preserving use of LLMs across their organization. The gateway includes features like PII redaction, multi-model support, and conversation checkpointing. They achieved significant adoption with over 50% of employees using the tools, primarily for programming support, content generation, and information retrieval. The platform also enabled operational improvements like automated customer support ticket triaging using self-hosted models.

Building ISO: A Hyperpersonalized AI Food Ordering Agent for Millions of Users

iFood

iFood, Brazil's largest food delivery company, built Ailo, an AI-powered food ordering agent to address the decision paralysis users face when choosing what to eat from overwhelming options. The agent operates both within the iFood app and on WhatsApp, providing hyperpersonalized recommendations based on user behavior, handling complex intents beyond simple search, and autonomously taking actions like applying coupons, managing carts, and facilitating payments. Through careful context management, latency optimization (reducing P95 from 30 to 10 seconds), and sophisticated evaluation frameworks, the team deployed ISO to millions of users in Brazil, demonstrating significant improvements in user experience through proactive engagement and intelligent personalization.

Building LinkedIn's First Production Agent: Hiring Assistant Platform and Architecture

LinkedIn

LinkedIn evolved from simple GPT-based collaborative articles to sophisticated AI coaches and finally to production-ready agents, culminating in their Hiring Assistant product announced in October 2025. The company faced the challenge of moving from conversational assistants with prompt chains to task automation using agent-based architectures that could handle high-scale candidate evaluation while maintaining quality and enabling rapid iteration. They built a comprehensive agent platform with modular sub-agent architecture, centralized prompt management, LLM inference abstraction, messaging-based orchestration for resilience, and a skill registry for dynamic tool discovery. The solution enabled parallel development of agent components, independent quality evaluation, and the ability to serve both enterprise recruiters and SMB customers with variations of the same underlying platform, processing thousands of candidate evaluations at scale while maintaining the flexibility to iterate on product design.

Building Low-Latency Voice AI Agents for Home Services

Elyos AI

Elyos AI built end-to-end voice AI agents for home services companies (plumbers, electricians, HVAC installers) to handle customer calls, emails, and messages 24/7. The company faced challenges achieving human-like conversation latency (targeting sub-400ms response times) while maintaining reliability and accuracy for complex workflows including appointment booking, payment processing, and emergency dispatch. Through careful orchestration, they optimized speech-to-text, LLM, and text-to-speech components, implemented just-in-time context engineering, state machine-based workflows, and parallel monitoring streams to achieve consistent performance with approximately 85% call automation (15% requiring human involvement).

Building Multi-Agent AI Systems for Developer Support and Infrastructure Operations

Electrolux

Electrolux, a Swedish home appliances manufacturer with over 100 years of history, developed "Infra Assistant," an AI-powered multi-agent system to support their internal development teams and reduce bottlenecks in their platform engineering organization. The company faced challenges with their small Site Reliability Engineering (SRE) team being overwhelmed with repetitive support requests via Slack channels. Using Amazon Bedrock agents with both retrieval-augmented generation (RAG) and multi-agent collaboration patterns, they built a sophisticated system that answers questions based on organizational documentation, executes operations via API integrations, and can even troubleshoot cloud infrastructure issues autonomously. The system has proven cost-efficient compared to manual effort, successfully handles repetitive tasks like access management, and provides context-aware responses by accessing multiple organizational knowledge sources, though challenges remain around response latency and achieving consistent accuracy across all interactions.

Building Multi-Agent Systems with MCP and Pydantic AI for Document Processing

Deepsense

Deepsense AI built a multi-agent system for a customer who operates a document processing platform that handles various file types and data sources at scale. The problem was to create both an MCP (Model Context Protocol) server for the platform's internal capabilities and a demonstration multi-agent system that could structure data on demand from documents. Using Pydantic AI as the core agent framework and Anthropic's Claude models, the team developed a solution where users specify goals for document processing, and the system automatically extracts structured information into tables. The implementation involved creating custom MCP servers, integrating with Databricks MCP, and applying 10 key lessons learned around tool design, token optimization, model selection, observability, testing, and security. The result was a modular, scalable system that demonstrates practical patterns for building production-ready agentic applications.

Building Omega: A Multi-Agent Sales Assistant Embedded in Slack

Netguru

Netguru developed Omega, an AI agent designed to support their sales team by automating routine tasks and reinforcing workflow processes directly within Slack. The problem they faced was that as their sales team scaled, key information became scattered across multiple systems (Slack, CRM, call transcripts, shared drives), slowing down coordination and making it difficult to maintain consistency with their Sales Framework 2.0. Omega was built as a modular, multi-agent system using AutoGen for role-based orchestration, deployed on serverless AWS infrastructure (Lambda, Step Functions) with integrations to Google Drive, Apollo, and BlueDot for call transcription. The solution provides context-aware assistance for preparing expert calls, summarizing sales conversations, navigating documentation, generating proposal feature lists, and tracking deal momentum—all within the team's existing Slack workflow, resulting in improved efficiency and process consistency.

Building Open-Source RL Environments from Real-World Coding Tasks for Model Training

Cline

Cline's head of AI presents their experience operating a model-agnostic AI coding agent platform, arguing that the industry has over-invested in "clever scaffolding" like RAG and tool-calling frameworks when frontier models can succeed with simpler approaches. The real bottleneck to progress, they contend, isn't prompt engineering or agent architecture but rather the quality of benchmarks and RL environments used to train models. Cline developed an automated "RL environments factory" system that transforms real-world coding tasks captured from actual user interactions into standardized, containerized training environments. They announce Cline Bench, an open-source benchmark derived from genuine software development work, inviting the community to contribute by simply working on open-source projects with Cline and opting into the initiative, thereby creating a shared substrate for improving frontier models.

Building Personalized Financial and Gardening Experiences with LLMs

Bud Financial / Scotts Miracle-Gro

This case study explores how Bud Financial and Scotts Miracle-Gro leverage Google Cloud's AI capabilities to create personalized customer experiences. Bud Financial developed a conversational AI solution for personalized banking interactions, while Scotts Miracle-Gro implemented an AI assistant called MyScotty for gardening advice and product recommendations. Both companies utilize various Google Cloud services including Vertex AI, GKE, and AI Search to deliver contextual, regulated, and accurate responses to their customers.

Building Product Copilots: Engineering Challenges and Best Practices

Various

A comprehensive study examining the challenges faced by 26 professional software engineers in building AI-powered product copilots. The research reveals significant pain points across the entire engineering process, including prompt engineering difficulties, orchestration challenges, testing limitations, and safety concerns. The study provides insights into the need for better tooling, standardized practices, and integrated workflows for developing AI-first applications.

Building Production Agentic AI Systems for IT Operations and Support Automation

WEX

WEX, a global commerce platform processing over $230 billion in transactions annually, built a production agentic AI system called "Chat GTS" to address their 40,000+ annual IT support requests. The company's Global Technology Services team developed specialized agents using AWS Bedrock and Agent Core Runtime to automate repetitive operational tasks, including network troubleshooting and autonomous EBS volume management. Starting with Q&A capabilities, they evolved into event-driven agents that can autonomously respond to CloudWatch alerts, execute remediation playbooks via SSM documents exposed as MCP tools, and maintain infrastructure drift through automated pull requests. The system went from pilot to production in under 3 months, now serving over 2,000 internal users, with multi-agent architectures handling both user-initiated chat interactions and autonomous incident response workflows.

Building Production Agentic Systems with Platform-Level LLMOps Features

Anthropic

Anthropic's presentation at the AI Engineer conference outlined their platform evolution for building high-performance agentic systems, using Claude Code as the primary example. The company identified three core challenges in production LLM deployments: harnessing model capabilities through API features, managing context windows effectively, and providing secure computational infrastructure for autonomous agent operation. Their solution involved developing platform-level features including extended thinking modes, tool use APIs, Model Context Protocol (MCP) for standardized external system integration, memory management for selective context retrieval, context editing capabilities, and secure code execution environments with container orchestration. The combination of memory tools and context editing demonstrated a 39% performance improvement on internal benchmarks, while their infrastructure solutions enabled Claude Code to run autonomously on web and mobile platforms with session persistence and secure sandboxing.

Building Production AI Agents and Agentic Platforms at Scale

Vercel

This AWS re:Invent 2025 session explores the challenges organizations face moving AI projects from proof-of-concept to production, addressing the statistic that 46% of AI POC projects are canceled before reaching production. AWS Bedrock team members and Vercel's director of AI engineering present a comprehensive framework for production AI systems, focusing on three critical areas: model switching, evaluation, and observability. The session demonstrates how Amazon Bedrock's unified APIs, guardrails, and Agent Core capabilities combined with Vercel's AI SDK and Workflow Development Kit enable rapid development and deployment of durable, production-ready agentic systems. Vercel showcases real-world applications including V0 (an AI-powered prototyping platform), Vercel Agent (an AI code reviewer), and various internal agents deployed across their organization, all powered by Amazon Bedrock infrastructure.

Building Production AI Agents for E-commerce and Food Delivery at Scale

Prosus

This case study explores how Prosus builds and deploys AI agents across e-commerce and food delivery businesses serving two billion customers globally. The discussion covers critical lessons learned from deploying conversational agents in production, with a particular focus on context engineering as the most important factor for success—more so than model selection or prompt engineering alone. The team found that successful production deployments require hybrid approaches combining semantic and keyword search, generative UI experiences that mix chat with dynamic visual components, and sophisticated evaluation frameworks. They emphasize that technology has advanced faster than user adoption, leading to failures when pure chatbot interfaces were tested, and success only came through careful UI/UX design, contextual interventions, and extensive testing with both synthetic and real user data.

Building Production AI Agents for Enterprise HR, IT, and Finance Platform

Rippling

Rippling, an enterprise platform providing HR, payroll, IT, and finance solutions, has evolved its AI strategy from simple content summarization to building complex production agents that assist administrators and employees across their entire platform. Led by Anker, their head of AI, the company has developed agents that handle payroll troubleshooting, sales briefing automation, interview transcript summarization, and talent performance calibration. They've transitioned from deterministic workflow-based approaches to more flexible deep agent paradigms, leveraging LangChain and LangSmith for development and tracing. The company maintains a dual focus: embedding AI capabilities within their product for customers running businesses on their platform, and deploying AI internally to increase productivity across all teams. Early results show promise in handling complex, context-dependent queries that traditional rule-based systems couldn't address.

Building Production AI Agents Platform for Non-Technical Users

Zapier

Zapier developed Zapier Agents, an AI-powered automation platform that allows non-technical users to build and deploy AI agents for business process automation. The company learned that building production AI agents is challenging due to the non-deterministic nature of AI and unpredictable user behavior. They implemented comprehensive instrumentation, feedback collection systems, and a hierarchical evaluation framework including unit tests, trajectory evaluations, and A/B testing to create a data flywheel for continuous improvement of their AI agent platform.

Building Production AI Agents with Advanced Testing, Voice Architecture, and Multi-Model Orchestration

Sierra

Sierra, an AI agent platform company, discusses their comprehensive approach to deploying LLMs in production for customer service automation across voice and chat channels. The company addresses fundamental challenges in productionizing AI agents including non-deterministic behavior, latency requirements, and quality assurance through novel solutions like simulation-based testing that runs thousands of parallel test scenarios, speculative execution for voice latency optimization, and constellation-based multi-model orchestration where 10-20 different models handle various aspects of each conversation. Their outcome-based pricing model aligns incentives with customer success, while their hybrid no-code/code platform enables both business and technical teams to collaboratively build, test, and deploy agents. The platform serves large enterprise customers across multiple industries, with agents handling millions of customer interactions in production environments.

Building Production AI Agents with API Platform and Multi-Modal Capabilities

Manus AI

Manus AI demonstrates their production-ready AI agent platform through a technical workshop showcasing their API and application framework. The session covers building complex AI applications including a Slack bot, web applications, browser automation, and invoice processing systems. The platform addresses key production challenges such as infrastructure scaling, sandboxed execution environments, file handling, webhook management, and multi-turn conversations. Through live demonstrations and code walkthroughs, the workshop illustrates how their platform enables developers to build and deploy AI agents that handle millions of daily conversations while providing consistent pricing and functionality across web, mobile, Slack, and API interfaces.

Building Production AI Agents with Vector Databases and Automated Data Collection

Devin Kearns

Over 18 months, a company built and deployed autonomous AI agents for business automation, focusing on lead generation and inbox management. They developed a comprehensive approach using vector databases (Pinecone), automated data collection, structured prompt engineering, and custom tools through n8n for deployment. Their solution emphasizes the importance of up-to-date data, proper agent architecture, and tool integration, resulting in scalable AI agent teams that can effectively handle complex business workflows.

Building Production AI Agents: Lessons from Claude Code and Enterprise Deployments

Anthropic

Anthropic's Applied AI team shares learnings from building and deploying AI agents in production throughout 2024-2025, focusing on their Claude Code product and enterprise customer implementations. The presentation covers the evolution from simple Q&A chatbots and RAG systems to sophisticated agentic architectures that run LLMs in loops with tools. Key technical challenges addressed include context engineering, prompt optimization, tool design, memory management, and handling long-running tasks that exceed context windows. The team transitioned from workflow-based architectures (chained LLM calls with deterministic logic) to agent-based systems where models autonomously use tools to solve open-ended problems, resulting in more robust error handling and the ability to tackle complex tasks like multi-hour coding sessions.

Building Production AI Coding Assistants and Agents at Scale

Sourcegraph

Sourcegraph's CTO discusses the evolution from their code search engine to building Cody, an enterprise AI coding assistant, and AMP, a coding agent released in 2024. The company serves hundreds of Fortune 500 companies and government agencies, deploying LLM-powered tools that achieve 30-60% developer productivity gains. Their approach emphasizes multi-model architectures, rapid iteration without traditional code review processes, and building application scaffolds around frontier models to generate training data for next-generation systems. The discussion explores the transition from chat-based LLM applications (requiring sophisticated RAG systems) to agentic architectures (using simple tool-calling loops), the challenges of scaling in enterprise environments, and philosophical debates about whether pure model scaling will lead to AGI or whether alternating between application development and model training is necessary for continued progress.

Building Production AI Products: A Framework for Continuous Calibration and Development

OpenAI / Various

AI practitioners Aishwarya Raanti and Kiti Bottom, who have collectively supported over 50 AI product deployments across major tech companies and enterprises, present their framework for successfully building AI products in production. They identify that building AI products differs fundamentally from traditional software due to non-determinism on both input and output sides, and the agency-control tradeoff inherent in autonomous systems. Their solution involves a phased approach called Continuous Calibration Continuous Development (CCCD), which recommends starting with high human control and low AI agency, then gradually increasing autonomy as trust is built through behavior calibration. This iterative methodology, combined with a balanced approach to evaluation metrics and production monitoring, has helped companies avoid common pitfalls like premature full automation, inadequate reliability, and user trust erosion.

Building Production Analytics Agents with Semantic Layer Integration

Wobby

Wobby, a company that helps business teams get insights from their data warehouses in under one minute, shares their journey building production-ready analytics agents over two years. The team developed three specialized agents (Quick, Deep, and Steward) that work with semantic layers to answer business questions. Their solution emphasizes Slack/Teams integration for adoption, building their own semantic layer to encode business logic, preferring prompt-based logic over complex workflows, implementing comprehensive testing strategies beyond just evals, and optimizing for latency through caching and progressive disclosure. The approach led to successful adoption by clients, with analytics agents being actively used in production to handle ad-hoc business intelligence queries.

Building Production Audio Agents with Real-Time Speech-to-Speech Models

OpenAI

OpenAI's solution architecture team presents their learnings on building practical audio agents using speech-to-speech models in production environments. The presentation addresses the evolution from slow, brittle chained architectures combining speech-to-text, LLM processing, and text-to-speech into unified real-time APIs that reduce latency and improve user experience. Key considerations include balancing trade-offs across latency, cost, accuracy, user experience, and integrations depending on use case requirements. The talk covers architectural patterns like tool delegation to specialized agents, prompt engineering for voice expressiveness, evaluation strategies including synthetic conversations, and asynchronous guardrails implementation. Examples from Lemonade and Tinder demonstrate successful production deployments focusing on evaluation frameworks and brand customization respectively.

Building Production Evaluation Systems for GitHub Copilot at Scale

Github

This case study examines the challenges of building evaluation systems for AI products in production, drawing from the author's experience leading the evaluation team at GitHub Copilot serving 100M developers. The problem addressed was the gap between evaluation tooling and developer workflows, as most AI teams consist of engineers rather than data scientists, yet evaluation tools are designed for data science workflows. The solution involved building a comprehensive evaluation stack including automated harnesses for code completion testing, A/B testing infrastructure, and implicit user behavior metrics like acceptance rates. The results showed that while sophisticated evaluation systems are valuable, successful AI products in practice rely heavily on rapid iteration, monitoring in production, and "vibes-based" testing, with the dominant strategy being to ship fast and iterate based on real user feedback rather than extensive offline evaluation.

Building Production LLM Applications with DSPy Framework

AlixPartners

A technical consultant presents a comprehensive workshop on using DSPy, a declarative framework for building modular LLM-powered applications in production. The presenter demonstrates how DSPy enables rapid iteration on LLM applications by treating LLMs as first-class citizens in Python programs, with built-in support for structured outputs, type guarantees, tool calling, and automatic prompt optimization. Through multiple real-world use cases including document classification, contract analysis, time entry correction, and multi-modal processing, the workshop shows how DSPy's core primitives—signatures, modules, tools, adapters, optimizers, and metrics—allow teams to build production-ready systems that are transferable across models, optimizable without fine-tuning, and maintainable at scale.

Building Production LLM Pipelines for Insurance Risk Assessment and Document Processing

Vouch

Vouch Insurance implemented a production machine learning system using Metaflow to handle risk classification and document processing for their technology-focused insurance business. The system combines traditional data warehousing with LLM-powered predictions, processing structured and unstructured data through hourly pipelines. They built a comprehensive stack that includes data transformation, LLM integration via OpenAI, and a FastAPI service layer with an SDK for easy integration by product engineers.

Building Production Multi-Agent Research Systems with Claude

Anthropic

Anthropic developed a production-grade multi-agent research system for their Claude Research feature that uses multiple LLM agents working in parallel to explore complex topics across web, Google Workspace, and integrated data sources. The system employs an orchestrator-worker pattern where a lead agent coordinates specialized subagents that search and filter information simultaneously, addressing challenges in agent coordination, evaluation, and reliability. Internal evaluations showed the multi-agent approach with Claude Opus 4 and Sonnet 4 outperformed single-agent Claude Opus 4 by 90.2% on research tasks, with token usage explaining 80% of performance variance, though the architecture consumes approximately 15× more tokens than standard chat interactions, requiring careful consideration of economic viability and deployment strategies.

Building Production Security Features with LangChain and LLMs

Elastic

Elastic developed three security-focused generative AI features - Automatic Import, Attack Discovery, and Elastic AI Assistant - by integrating LangChain and LangGraph into their Search AI Platform. The solution leverages RAG and controllable agents to expedite labor-intensive SecOps tasks, including ES|QL query generation and data integration automation. The implementation includes LangSmith for debugging and performance monitoring, reaching over 350 users in production.

Building Production Web Agents for Food Ordering

iFood

A team at Prosus built web agents to help automate food ordering processes across their e-commerce platforms. Rather than relying on APIs, they developed web agents that could interact directly with websites, handling complex tasks like searching, navigating menus, and placing orders. Through iterative development and optimization, they achieved an 80% success rate target for specific e-commerce tasks by implementing a modular architecture that separated planning and execution, combined with various operational modes for different scenarios.

Building Production-Grade Agentic AI Analytics: Lessons from Real-World Deployment

Tellius

Tellius shares hard-won lessons from building their agentic analytics platform that transforms natural language questions into trustworthy SQL-based insights. The core problem addressed is that chat-based analytics requires far more than simple text-to-SQL conversion—it demands deterministic planning, governed semantic layers, ambiguity management, multi-step consistency, transparency, performance engineering, and comprehensive observability. Their solution architecture separates language understanding from execution through typed plan artifacts that validate against schemas and policies before execution, implements clarification workflows for ambiguous queries, maintains plan/result fingerprinting for consistency, provides inline transparency with preambles and lineage, enforces latency budgets across execution hops, and treats feedback as governed policy changes. The result is a production system that achieves determinism, explainability, and sub-second interactive performance while avoiding the common pitfalls that cause 95% of AI pilot failures.

Building Production-Grade AI Agents with Distributed Architecture and Error Recovery

Parcha

Parcha's journey in building enterprise-grade AI Agents for automating compliance and operations workflows, evolving from a simple Langchain-based implementation to a sophisticated distributed system. They overcame challenges in reliability, context management, and error handling by implementing async processing, coordinator-worker patterns, and robust error recovery mechanisms, while maintaining clean context windows and efficient memory management.

Building Production-Grade AI Agents with Guardrails, Context Management, and Security

Portia / Riff / Okta

This panel discussion features founders from Portia AI and Rift.ai (formerly Databutton) discussing the challenges of moving AI agents from proof-of-concept to production. The speakers address critical production concerns including guardrails for agent reliability, context engineering strategies, security and access control challenges, human-in-the-loop patterns, and identity management. They share real-world customer examples ranging from custom furniture makers to enterprise CRM enrichment, emphasizing that while approximately 40% of companies experimenting with AI have agents in production, the journey requires careful attention to trust, security, and supportability. Key solutions include conditional example-based prompting, sandboxed execution environments, role-based access controls, and keeping context windows smaller for better precision rather than utilizing maximum context lengths.

Building Production-Grade AI Agents with Observability, Evaluation, and Insights

Langchain

Langchain discusses the evolution of their LangSmith platform for managing AI agents in production, addressing the challenge of bringing rigor and reliability to deployed LLM applications. The company describes launching two major feature sets: Insights, which automatically discovers patterns and trends in millions of production traces to help teams understand user interactions and agent behavior, and thread-based evaluations, which enable assessment of multi-turn conversations and complete user sessions rather than just individual interactions. These features aim to help teams transition from informal "vibe testing" to more methodical approaches as agents move from initial prototypes to production deployments handling millions of daily traces, with the goal of reducing unknowns and improving reliability in production AI systems.

Building Production-Grade AI Agents: Overcoming Reasoning and Tool Challenges

Kentauros AI

Kentauros AI presents their experience building production-grade AI agents, detailing the challenges in developing agents that can perform complex, open-ended tasks in real-world environments. They identify key challenges in agent reasoning (big brain, little brain, and tool brain problems) and propose solutions through reinforcement learning, generalizable algorithms, and scalable data approaches. Their evolution from G2 to G5 agent architectures demonstrates practical solutions to memory management, task-specific reasoning, and skill modularity.

Building Production-Grade Conversational Analytics with LangGraph and Waii

Waii

The case study demonstrates how to build production-ready conversational analytics applications by integrating LangGraph's multi-agent framework with Waii's advanced text-to-SQL capabilities. The solution tackles complex database operations through sophisticated join handling, knowledge graph construction, and agentic flows, enabling natural language interactions with complex data structures while maintaining high accuracy and scalability.

Building Production-Grade Generative AI Applications with Comprehensive LLMOps

Block (Square)

Block (Square) implemented a comprehensive LLMOps strategy across multiple business units using a combination of retrieval augmentation, fine-tuning, and pre-training approaches. They built a scalable architecture using Databricks' platform that allowed them to manage hundreds of AI endpoints while maintaining operational efficiency, cost control, and quality assurance. The solution enabled them to handle sensitive data securely, optimize model performance, and iterate quickly while maintaining version control and monitoring capabilities.

Building Production-Grade Heterogeneous RAG Systems

AWS GenAIIC

AWS GenAIIC shares practical insights from implementing RAG systems with heterogeneous data formats in production. The case study explores using routers for managing diverse data sources, leveraging LLMs' code generation capabilities for structured data analysis, and implementing multimodal RAG solutions that combine text and image data. The solutions include modular components for intent detection, data processing, and retrieval across different data types with examples from multiple industries.

Building Production-Grade LLM Applications: An Architectural Guide

Github

A comprehensive technical guide on building production LLM applications, covering the five key steps from problem definition to evaluation. The article details essential components including input processing, enrichment tools, and responsible AI implementations, using a practical customer service example to illustrate the architecture and deployment considerations.

Building Production-Grade LLM Evaluation Systems for HR Tech Interview Intelligence

Zebra

Spotted Zebra, an HR tech company building AI-powered hiring software for large enterprises, faced challenges scaling their interview intelligence product when transitioning from slow research-phase development to rapid client-driven iterations. The company developed a comprehensive evaluation framework centered on six key lessons: codifying human judgment through golden examples, versioning prompts systematically, using LLM-as-a-judge for open-ended tasks, building adversarial testing banks, implementing robust API logging, and treating evaluation as a strategic capability. This approach enabled faster development cycles, improved product quality, better client communication around fairness and transparency, and successful compliance certification (ISO 42001), positioning them for EU AI Act requirements.

Building Production-Grade RAG Systems for Financial Document Analysis

Microsoft

Microsoft's team shares their experience implementing a production RAG system for analyzing financial documents, including analyst reports and SEC filings. They tackled complex challenges around metadata extraction, chart/graph analysis, and evaluation methodologies. The system needed to handle tens of thousands of documents, each containing hundreds of pages with tables, graphs, and charts spanning different time periods and fiscal years. Their solution incorporated multi-modal models for image analysis, custom evaluation frameworks, and specialized document processing pipelines.

Building Production-Ready Agentic AI Systems in Financial Services

Fitch Group

Jayeeta Putatunda, Director of AI Center of Excellence at Fitch Group, shares lessons learned from deploying agentic AI systems in the financial services industry. The discussion covers the challenges of moving from proof-of-concept to production, emphasizing the importance of evaluation frameworks, observability, and the "data prep tax" required for reliable AI agent deployments. Key insights include the need to balance autonomous agents with deterministic workflows, implement comprehensive logging at every checkpoint, combine LLMs with traditional predictive models for numerical accuracy, and establish strong business-technical partnerships to define success metrics. The conversation highlights that while agentic frameworks enable powerful capabilities, production success requires careful system design, multi-layered evaluation, human-in-the-loop validation patterns, and a focus on high-ROI use cases rather than chasing the latest model architectures.

Building Production-Ready Agentic Systems with the Claude Developer Platform

Anthropic

Anthropic's Claude Developer Platform team discusses their evolution from a simple API to a comprehensive platform for building autonomous AI agents in production. The conversation covers their philosophy of "unhobbling" models by reducing scaffolding and giving Claude more autonomous decision-making capabilities through tools like web search, code execution, and context management. They introduce the Claude Code SDK as a general-purpose agentic harness that handles the tool-calling loop automatically, making it easier for developers to prototype and deploy agents. The platform addresses key production challenges including prompt caching, context window management, observability for long-running tasks, and agentic memory, with a roadmap focused on higher-order abstractions and self-improving systems.

Building Production-Ready AI Agent Systems: Multi-Agent Orchestration and LLMOps at Scale

Galileo / Crew AI

This podcast discussion between Galileo and Crew AI leadership explores the challenges and solutions for deploying AI agents in production environments at enterprise scale. The conversation covers the technical complexities of multi-agent systems, the need for robust evaluation and observability frameworks, and the emergence of new LLMOps practices specifically designed for non-deterministic agent workflows. Key topics include authentication protocols, custom evaluation metrics, governance frameworks for regulated industries, and the democratization of agent development through no-code platforms.

Building Production-Ready AI Agents and Monitoring Systems

Portkey, Airbyte, Comet

The panel discussion and demo sessions showcase how companies like Portkey, Airbyte, and Comet are tackling the challenges of deploying LLMs and AI agents in production. They address key issues including monitoring, observability, error handling, data movement, and human-in-the-loop processes. The solutions presented range from AI gateways for enterprise deployments to experiment tracking platforms and tools for building reliable AI agents, demonstrating both the challenges and emerging best practices in LLMOps.

Building Production-Ready AI Agents for Internal Workflow Automation

Vercel

Vercel, a web hosting and deployment platform, addressed the challenge of identifying and implementing successful AI agent projects across their organization by focusing on employee pain points—specifically repetitive, boring tasks that humans disliked. The company deployed three internal production agents: a lead processing agent that automated sales qualification and research (saving hundreds of days of manual work), an anti-abuse agent that accelerated content moderation decisions by 59%, and a data analyst agent that automated SQL query generation for business intelligence. Their methodology centered on asking employees "What do you hate most about your job?" to identify tasks that were repetitive enough for current AI models to handle reliably while still delivering high business impact.

Building Production-Ready AI Agents: Lessons from BeeAI Framework Development

IBM

IBM Research's team spent a year developing and deploying AI agents in production, leading to the creation of the open-source BeeAI Framework. The project addressed the challenge of making LLM-powered agents accessible to developers while maintaining production-grade reliability. Their journey included creating custom evaluation frameworks, developing novel user interfaces for agent interaction, and establishing robust architecture patterns for different use cases. The team successfully launched an open-source stack that gained particular traction with TypeScript developers.

Building Production-Ready AI Agents: OpenAI Codex CLI Architecture and Agent Loop Design

OpenAI

OpenAI's Codex CLI is a cross-platform software agent that executes reliable code changes on local machines, demonstrating production-grade LLMOps through its sophisticated agent loop architecture. The system orchestrates interactions between users, language models, and tools through an iterative process that manages inference calls, tool execution, and conversation state. Key technical achievements include stateless request handling for Zero Data Retention compliance, strategic prompt caching optimization to achieve linear rather than quadratic performance, automatic context window management through intelligent compaction, and robust handling of multi-turn conversations while maintaining conversation coherence across potentially hundreds of model-tool iterations.

Building Production-Ready AI Analytics Agents Through Advanced Prompt Engineering

Explai

Explai, a company building AI-powered data analytics companions, encountered significant challenges when deploying multi-agent LLM systems for enterprise analytics use cases. Their initial approach of pre-loading agent contexts with extensive domain knowledge, business information, and intermediate results led to context pollution and degraded instruction following at scale. Through iterative learning over two years, they developed three key prompt engineering tactics: reversing the traditional RAG approach by using trigger messages with pull-based document retrieval, writing structured artifacts instead of raw data to context, and allowing agents to generate full executable code in sandboxed environments. These tactics enabled more autonomous agent behavior while maintaining accuracy and reducing context window bloat, ultimately creating a more robust production system for complex, multi-step data analysis workflows.

Building Production-Ready AI Analytics with LLMs: Lessons from Jira Integration

Luna

Luna developed an AI-powered Jira analytics system using GPT-4 and Claude 3.7 to extract actionable insights from complex project management data, helping engineering and product teams track progress, identify risks, and predict delays. Through iterative development, they identified seven critical lessons for building reliable LLM applications in production, including the importance of data quality over prompt engineering, explicit temporal context handling, optimal temperature settings for structured outputs, chain-of-thought reasoning for accuracy, focused constraints to reduce errors, leveraging reasoning models effectively, and addressing the "yes-man" effect where models become overly agreeable rather than critically analytical.

Building Production-Ready AI Assistant with Agentic Architecture

Shopify

Shopify developed Sidekick, an AI-powered assistant that helps merchants manage their stores through natural language interactions, evolving from a simple tool-calling system into a sophisticated agentic platform. The team faced scaling challenges with tool complexity and system maintainability, which they addressed through Just-in-Time instructions, robust LLM evaluation systems using Ground Truth Sets, and Group Relative Policy Optimization (GRPO) training. Their approach resulted in improved system performance and maintainability, though they encountered and had to address reward hacking issues during reinforcement learning training.

Building Production-Ready Conversational AI Voice Agents: Latency, Voice Quality, and Integration Challenges

Deepgram

Deepgram, a leader in transcription services, shares insights on building effective conversational AI voice agents. The presentation covers critical aspects of implementing voice AI in production, including managing latency requirements (targeting 300ms benchmark), handling end-pointing challenges, ensuring voice quality through proper prosody, and integrating LLMs with speech-to-text and text-to-speech services. The company introduces their new text-to-speech product Aura, designed specifically for conversational AI applications with low latency and natural voice quality.

Building Production-Ready CRM Integration for ChatGPT using Model Context Protocol

Hubspot

HubSpot developed the first third-party CRM connector for ChatGPT using the Model Context Protocol (MCP), creating a remote MCP server that enables 250,000+ businesses to perform deep research through conversational AI without requiring local installations. The solution involved building a homegrown MCP server infrastructure using Java and Dropwizard, implementing OAuth-based user-level permissions, creating a distributed service discovery system for automatic tool registration, and designing a query DSL that allows AI models to generate complex CRM searches through natural language interactions.

Building Production-Ready Customer Support AI Agents: Challenges and Solutions

Gradient Labs

Gradient Labs shares their experience building and deploying AI agents for customer support automation in production. While prototyping with LLMs is relatively straightforward, deploying agents to production introduces complex challenges around state management, knowledge integration, tool usage, and handling race conditions. The company developed a state machine-based architecture with durable execution engines to manage these challenges, successfully handling hundreds of conversations per day with high customer satisfaction.

Building Production-Ready Healthcare AI That Scales With Model Progress

Anterior

This case study examines Anterior's experience building LLM-powered products for healthcare prior authorization over three years. The company faced the challenge of building production systems around rapidly evolving AI capabilities, where approaches designed around current model limitations could quickly become obsolete. Through experimentation with techniques like hierarchical query reasoning, finetuning, domain knowledge injection, and expert review systems, they learned which approaches compound with model progress versus those that compete with it. The result was a framework for "Sour Lesson-pilled" product development that emphasizes building systems that benefit from model improvements rather than being made redundant by them, with key surviving techniques including dynamic domain knowledge injection and scalable expert review infrastructure.

Building Production-Ready LLM Agents with State Management and Workflow Engineering

Renovai

A comprehensive technical presentation on building production-grade LLM agents, covering the evolution from basic agents to complex multi-agent systems. The case study explores implementing state management for maintaining conversation context, workflow engineering patterns for production deployment, and advanced techniques including multimodal agents using GPT-4V for web navigation. The solution demonstrates practical approaches to building reliable, maintainable agent systems with proper tracing and debugging capabilities.

Building Production-Ready SQL and Charting Agents with RAG Integration

Numbers Station

Numbers Station addresses the challenge of overwhelming data team requests in enterprises by developing an AI-powered self-service analytics platform. Their solution combines LLM agents with RAG and a comprehensive knowledge layer to enable accurate SQL query generation, chart creation, and multi-agent workflows. The platform demonstrated significant improvements in real-world benchmarks compared to vanilla LLM approaches, reducing setup time from weeks to hours while maintaining high accuracy through contextual knowledge integration.

Building Production-Scale AI Agents with Extended GenAI Tech Stack

LinkedIn

LinkedIn extended their generative AI application tech stack to support building complex AI agents that can reason, plan, and act autonomously while maintaining human oversight. The evolution from their original GenAI stack to support multi-agent orchestration involved leveraging existing infrastructure like gRPC for agent definitions, messaging systems for multi-agent coordination, and comprehensive observability through OpenTelemetry and LangSmith. The platform enables agents to work both synchronously and asynchronously, supports background processing, and includes features like experiential memory, human-in-the-loop controls, and cross-device state synchronization, ultimately powering products like LinkedIn's Hiring Assistant which became globally available.

Building Production-Scale AI Search with Knowledge Graphs, MCP, and DSPy

Dropbox

Dropbox faced the challenge of enabling users to search and query their work content scattered across 50+ SaaS applications and tabs, which proprietary LLMs couldn't access. They built Dash, an AI-powered universal search and agent platform using a sophisticated context engine that combines custom connectors, content understanding, knowledge graphs, and index-based retrieval (primarily BM25) over federated approaches. The system addresses MCP scalability challenges through "super tools," uses LLM-as-a-judge for relevancy evaluation (achieving high agreement with human evaluators), and leverages DSPy for prompt optimization across 30+ prompts in their stack. This infrastructure enables cross-app intelligence with fast, accurate, and ACL-compliant retrieval for agentic queries at enterprise scale.

Building Production-Scale Code Completion Tools with Continuous Evaluation and Prompt Engineering

Gitlab

Gitlab's ModelOps team developed a sophisticated code completion system using multiple LLMs, implementing a continuous evaluation and improvement pipeline. The system combines both open-source and third-party LLMs, featuring a comprehensive architecture that includes continuous prompt engineering, evaluation benchmarks, and reinforcement learning to consistently improve code completion accuracy and usefulness for developers.

Building QueryAnswerBird: An AI Data Analyst with Text-to-SQL and RAG

Delivery Hero

Woowa Brothers, part of Delivery Hero, developed QueryAnswerBird (QAB), an LLM-based AI data analyst to address employee challenges with SQL query generation and data literacy. Through a company-wide survey, they identified that 95% of employees used data for work, but over half struggled with SQL due to time constraints or difficulty translating business logic into queries. The solution leveraged RAG, LangChain, and GPT-4 to build a Slack-integrated assistant that automatically generates SQL queries from natural language, interprets queries, validates syntax, and explores tables. After winning first place at an internal hackathon in 2023, a dedicated task force spent six months developing the production system with comprehensive LLMOps practices including A/B testing, monitoring dashboards, API load balancing, GPT caching, and CI/CD deployment, conducting over 500 tests to optimize performance.

Building QueryAnswerBird: An LLM-Powered AI Data Analyst with RAG and Text-to-SQL

Delivery Hero

Woowa Brothers, part of Delivery Hero, developed QueryAnswerBird (QAB), an LLM-based AI data analyst to address the challenge that while 95% of employees used data in their work, over half struggled with SQL proficiency and data extraction reliability. The solution leveraged GPT-4, RAG architecture, LangChain, and comprehensive LLMOps practices to create a Slack-based chatbot that could generate SQL queries from natural language, interpret queries, validate syntax, and provide data discovery features. The development involved building automated unstructured data pipelines with vector stores, implementing multi-chain RAG architecture with router supervisors, establishing LLMOps infrastructure including A/B testing and monitoring dashboards, and conducting over 500 experiments to optimize performance, resulting in a 24/7 accessible service that provides high-quality query responses within 30 seconds to 1 minute.

Building Reliable Agentic Systems in Production

Factory.ai

Factory.ai shares their experience building reliable AI agent systems for software engineering automation. They tackle three key challenges: planning (keeping agents focused on goals), decision-making (improving accuracy and consistency), and environmental grounding (interfacing with real-world systems). Their approach combines techniques from robotics like model predictive control, consensus mechanisms for decision-making, and careful tool/interface design for production deployment.

Building Reliable AI Agents for Application Development with Multi-Agent Architecture

Replit

Replit developed an AI agent system to help users create applications from scratch, addressing the challenge of blank page syndrome in software development. They implemented a multi-agent architecture with manager, editor, and verifier agents, focusing on reliability and user engagement. The system incorporates advanced prompt engineering techniques, human-in-the-loop workflows, and comprehensive monitoring through LangSmith, resulting in a powerful tool that simplifies application development while maintaining user control and visibility.

Building Reliable AI Agents Through Production Monitoring and Intent Discovery

Raindrop

Raindrop, a monitoring platform for AI products, addresses the challenge of building reliable AI agents in production where traditional offline evaluations fail to capture real-world usage patterns. The company developed a "Sentry for AI products" approach that emphasizes experimentation, production monitoring, and discovering user intents through clustering and signal detection. Their solution combines explicit signals (like thumbs up/down, regenerations) and implicit signals (detecting refusals, task failures, user frustration) to identify issues that don't manifest as traditional software errors. The platform trains custom models to detect issues across production data at scale, enabling teams to discover unknown problems, track their impact on users, and fix them systematically without breaking existing functionality.

Building Reliable AI DevOps Agents: Engineering Practices for Nondeterministic LLM Output

Trunk

Trunk developed an AI DevOps agent to handle root cause analysis (RCA) for test failures in CI pipelines, facing challenges with nondeterministic LLM outputs. They applied traditional software engineering principles adapted for LLMs, including starting with narrow use cases, switching between models (Claude to Gemini) for better tool calling, implementing comprehensive testing with mocked LLM responses, and establishing feedback loops through internal usage and user feedback collection. The approach resulted in a more reliable agent that performs well on specific tasks like analyzing test failures and posting summaries to GitHub PRs.

Building Reliable Background Coding Agents with Verification Loops

Spotify

Spotify developed a background coding agent system to automate large-scale software maintenance across thousands of components, addressing the challenge of ensuring reliable and correct code changes without direct human supervision. The solution centers on implementing strong verification loops consisting of deterministic verifiers (for formatting, building, and testing) and an LLM-as-judge layer to prevent the agent from making out-of-scope changes. After generating over 1,500 pull requests, the system demonstrates that verification loops are essential for maintaining predictability, with the judge layer vetoing approximately 25% of proposed changes and the agent successfully course-correcting about half the time, significantly reducing the risk of functionally incorrect code reaching production.

Building Reliable LLM Workflows in Biotech Research

Moderna

Moderna Therapeutics applies large language models primarily for document reformatting and regulatory submission preparation within their research organization, deliberately avoiding autonomous agents in favor of highly structured workflows. The team, led by Eric Maher in research data science, focuses on automating what they term "intellectual drudgery" - reformatting laboratory records and experiment documentation into regulatory-compliant formats. Their approach prioritizes reliability over novelty, implementing rigorous evaluation processes matched to consequence levels, with particular emphasis on navigating the complex security and permission mapping challenges inherent in regulated biotech environments. The team employs a "non-LLM filter" methodology, only reaching for generative AI after exhausting simpler Python or traditional ML approaches, and leverages serverless infrastructure like Modal and reactive notebooks with Marimo to enable rapid experimentation and deployment.

Building Reliable Multi-Agent Systems for Application Development

Replit

Replit developed a sophisticated AI agent system to help users create applications from scratch, focusing on reliability and human-in-the-loop workflows. Their solution employs a multi-agent architecture with specialized roles, advanced prompt engineering techniques, and a custom DSL for tool execution. The system includes robust version control, clear user feedback mechanisms, and comprehensive observability through LangSmith, successfully lowering the barrier to entry for software development while maintaining user engagement and control.

Building Resilient Multi-Provider AI Agent Infrastructure for Financial Services

Gradient Labs

Gradient Labs built an AI agent that handles customer interactions for financial services companies, requiring high reliability in production. The company architected a sophisticated failover system that spans multiple LLM providers (OpenAI, Anthropic, Google) and hosting platforms (native APIs, Azure, AWS, GCP), enabling both traffic distribution across rate limits and automatic failover during errors, rate limiting, or latency spikes. They use Temporal for durable execution to checkpoint progress across long-running agentic workflows, and have implemented both provider-level and model-level failover strategies with tailored prompts for backup models, ensuring continuous operation even during catastrophic provider outages.

Building Robust Evaluation Systems for Auto-Generated Video Titles

Loom

Loom developed a systematic approach to evaluating and improving their AI-powered video title generation feature. They created a comprehensive evaluation framework combining code-based scorers and LLM-based judges, focusing on specific quality criteria like relevance, conciseness, and engagement. This methodical approach to LLMOps enabled them to ship AI features faster and more confidently while ensuring consistent quality in production.

Building Robust Evaluation Systems for GitHub Copilot

Github

This case study explores how Github developed and evolved their evaluation systems for Copilot, their AI code completion tool. Initially skeptical about the feasibility of code completion, the team built a comprehensive evaluation framework called "harness lib" that tested code completions against actual unit tests from open source repositories. As the product evolved to include chat capabilities, they developed new evaluation approaches including LLM-as-judge for subjective assessments, along with A/B testing and algorithmic evaluations for function calls. This systematic approach to evaluation helped transform Copilot from an experimental project to a robust production system.

Building Robust Legal Document Processing Applications with LLMs

Anzen

The case study explores how Anzen builds robust LLM applications for processing insurance documents in environments where accuracy is critical. They employ a multi-model approach combining specialized models like LayoutLM for document structure analysis with LLMs for content understanding, implement comprehensive monitoring and feedback systems, and use fine-tuned classification models for initial document sorting. Their approach demonstrates how to effectively handle LLM hallucinations and build production-grade systems with high accuracy (99.9% for document classification).

Building Robust LLM Evaluation Frameworks: W&B's Evaluation-Driven Development Approach

Weights & Biases

Weights & Biases details their evaluation-driven development approach in upgrading Wandbot to version 1.1, showcasing how systematic evaluation can guide LLM application improvements. The case study describes the development of a sophisticated auto-evaluation framework aligned with human annotations, implementing comprehensive metrics across response quality and context assessment. Key improvements include enhanced data ingestion with better MarkdownX parsing, a query enhancement system using Cohere for language detection and intent classification, and a hybrid retrieval system combining FAISS, BM25, and web knowledge integration. The new version demonstrated significant improvements across multiple metrics, with GPT-4-1106-preview-v1.1 showing superior performance in answer correctness, relevancy, and context recall compared to previous versions.

Building Secure and Private Enterprise Search with LLMs

Slack

Slack built an enterprise search feature that extends their AI-powered search capabilities to external sources like Google Drive and GitHub while maintaining strict security and privacy standards. The problem was enabling users to search across multiple knowledge sources without compromising data security or violating privacy principles. Their solution uses a federated, real-time approach with OAuth-based authentication, Retrieval Augmented Generation (RAG), and LLMs hosted in an AWS escrow VPC to ensure customer data never leaves Slack's trust boundary, isn't used for model training, and respects user permissions. The result is a production system that surfaces relevant, up-to-date, permissioned content from both internal and external sources while maintaining enterprise-grade security standards, with explicit user and admin control over data access.

Building Secure Generative AI Applications at Scale: Amazon's Journey from Experimental to Production

Amazon

Amazon faced the challenge of securing generative AI applications as they transitioned from experimental proof-of-concepts to production systems like Rufus (shopping assistant) and internal employee chatbots. The company developed a comprehensive security framework that includes enhanced threat modeling, automated testing through their FAST (Framework for AI Security Testing) system, layered guardrails, and "golden path" templates for secure-by-default deployments. This approach enabled Amazon to deploy customer-facing and internal AI applications while maintaining security, compliance, and reliability standards through continuous monitoring, evaluation, and iterative refinement processes.

Building State-of-the-Art AI Programming Agents with OpenAI's o1 Model

Weights & Biases

Weights & Biases developed an advanced AI programming agent using OpenAI's o1 model that achieved state-of-the-art performance on the SWE-Bench-Verified benchmark, successfully resolving 64.6% of software engineering issues. The solution combines o1 with custom-built tools, including a Python code editor toolset, memory components, and parallel rollouts with crosscheck mechanisms, all developed and evaluated using W&B's Weave toolkit and newly created Eval Studio platform.

Building Stateful AI Agents with In-Context Learning and Memory Management

Letta

Letta addresses the fundamental limitation of current LLM-based agents: their inability to learn and retain information over time, leading to degraded performance as context accumulates. The platform enables developers to build stateful agents that learn by updating their context windows rather than model parameters, making learning interpretable and model-agnostic. The solution includes a developer platform with memory management tools, context window controls, and APIs for creating production agents that improve over time. Real-world deployments include a support agent that has been learning from Discord interactions for a month and recommendation agents for Built Rewards, demonstrating that agents with persistent memory can achieve performance comparable to fine-tuned models while remaining flexible and debuggable.

Building Trust in RAG Systems Through Structured Feedback and User Collaboration

Needl.ai

Needl.ai's AskNeedl product faced challenges with user trust in their RAG-based AI system, where issues like missing citations, incomplete answers, and vague responses undermined confidence despite technical correctness. The team addressed this through a structured feedback loop involving query logging, pattern annotation, themed QA sets, and close collaboration with early adopter users from compliance and market analysis domains. Without retraining the underlying model, they improved retrieval strategies, tuned prompts for clarity, enhanced citation formatting, and prioritized fixes based on high-frequency queries and high-trust personas, ultimately transforming scattered user frustration into actionable improvements that restored trust in production.

Building Trustworthy LLM Agents for Automated Expense Management

Ramp

Ramp developed and deployed a suite of LLM-powered agents to automate expense management workflows, with a particular focus on their "policy agent" that automates expense approvals. The company faced the challenge of building AI systems that finance teams could trust in a domain where low-quality outputs could quickly erode confidence. Their solution emphasized explainable reasoning with citations, built-in uncertainty handling, collaborative context refinement, user-controlled autonomy levels, and comprehensive evaluation frameworks. Since deployment, the policy agent has handled over 65% of expense approvals autonomously, demonstrating that carefully designed LLM systems can deliver significant automation value while maintaining user trust through transparency and control.

Building Trustworthy LLM-Powered Agents for Automated Expense Management

Ramp

Ramp developed a suite of LLM-backed agents to automate expense management processes, focusing on building user trust through transparent reasoning, escape hatches for uncertainty, and collaborative context management. The team addressed the challenge of deploying LLMs in a finance environment where accuracy and trust are critical by implementing clear explanations for decisions, allowing users to control agent autonomy levels, and creating feedback loops for continuous improvement. Their policy agent now handles over 65% of expense approvals automatically while maintaining user confidence through transparent decision-making and the ability to defer to human judgment when uncertain.

Building Uma: In-House AI Research and Custom Fine-Tuning for Marketplace Intelligence

Upwork

Upwork developed Uma, their "mindful AI" assistant, by rejecting off-the-shelf LLM solutions in favor of building custom-trained models using proprietary platform data and in-house AI research. The company hired expert freelancers to create high-quality training datasets, generated synthetic data anchored in real platform interactions, and fine-tuned open-source LLMs specifically for hiring workflows. This approach enabled Uma to handle complex, business-critical tasks including crafting job posts, matching freelancers to opportunities, autonomously coordinating interviews, and evaluating candidates. The strategy resulted in models that substantially outperform generic alternatives on domain-specific tasks while reducing costs by up to 10x and improving reliability in production environments. Uma now operates as an increasingly agentic system that takes meaningful actions across the full hiring lifecycle.

Building Unified API Infrastructure for AI Integration at Scale

Merge

Merge, a unified API provider founded in 2020, helps companies offer native integrations across multiple platforms (HR, accounting, CRM, file storage, etc.) through a single API. As AI and LLMs emerged, Merge adapted by launching Agent Handler, an MCP-based product that enables live API calls for agentic workflows while maintaining their core synced data product for RAG-based use cases. The company serves major LLM providers including Mistral and Perplexity, enabling them to access customer data securely for both retrieval-augmented generation and real-time agent actions. Internally, Merge has adopted AI tools across engineering, support, recruiting, and operations, leading to increased output and efficiency while maintaining their core infrastructure focus on reliability and enterprise-grade security.

Building Verifiable Retrieval Infrastructure for Agentic Systems

Hornet

Hornet is developing a retrieval engine specifically designed for AI agents, addressing the challenge that their API surface isn't in any LLM's pre-training data and traditional documentation-in-prompt approaches proved insufficient. Their solution centers on making the entire API surface verifiable through three validation layers (syntactic, semantic, and behavioral), structured similarly to code with configuration files that agents can write, edit, and test. This approach enables agents to not only use Hornet but to learn, configure, and optimize retrieval on their own through feedback loops, similar to how coding agents verify output through compilers and tests, ultimately creating self-improving systems where agents can tune their own context retrieval without human intervention.

Building Voice-Enabled AI Assistants with Real-Time Processing

Bee

A detailed exploration of building real-time voice-enabled AI assistants, featuring multiple approaches from different companies and developers. The case study covers how to achieve low-latency voice processing, transcription, and LLM integration for interactive AI assistants. Solutions demonstrated include both commercial services like Deepgram and open-source implementations, with a focus on achieving sub-second latency, high accuracy, and cost-effective deployment.

Business Intelligence Agent for Automotive Dealers with Dynamic UI and Instant Actions

Prosus

Prosus, a machine learning engineering team, built an AI-powered business intelligence assistant for Otomoto, Poland's largest secondhand car dealer platform with thousands of dealers and millions of users. The problem was that dealers were overwhelmed by the platform's rich data and struggled to organize listings and take actionable insights. The initial chat-based agent achieved only 10% engagement with negligible repeat usage, revealing "chat fatigue" - users didn't know what to ask and found the open text box intimidating. The solution involved moving away from pure chat interfaces to a dynamic UI with context-aware action buttons, interactive responses with clickable elements, streaming for perceived faster responses, and purpose-built data aggregation tools using CSV format to reduce token consumption. Results showed that users were significantly more likely to engage when presented with clickable buttons rather than open-ended questions, with button clicks leading to follow-up questions and improved engagement metrics.

Challenges and Opportunities in Building Product Copilots: An Industry Interview Study

Microsoft / GitHub

Microsoft and GitHub researchers conducted a comprehensive interview study with 26 professional software engineers across various companies who are building AI-powered product copilots—conversational agents that assist users with natural language interactions. The study identified significant pain points across the entire engineering lifecycle, including the time-consuming and fragile nature of prompt engineering, difficulties in orchestration and managing multi-turn workflows, the lack of standardized testing and benchmarking approaches, challenges in learning best practices in a rapidly evolving field, and concerns around safety, privacy, and compliance. The research reveals that existing software engineering processes and tools have not yet adapted to the unique challenges of building AI-powered applications, leaving engineers to improvise without established best practices. Through subsequent brainstorming sessions, the researchers collaboratively identified opportunities for improved tooling, including prompt linters, automated benchmark creation, better visibility into model behavior, and more integrated development workflows.

Challenges in Building Enterprise Chatbots with LLMs: A Banking Case Study

Invento Robotics

A bank's attempt to implement a customer support chatbot using GPT-4 and RAG reveals the complexities and challenges of deploying LLMs in production. What was initially estimated as a three-month project struggled to deliver after a year, highlighting key challenges in domain knowledge management, retrieval effectiveness, conversation flow design, state management, latency, and regulatory compliance.

Charlotte AI: Agentic AI for Cloud Detection and Response

Crowdstrike

CrowdStrike developed Charlotte AI, an agentic AI system that automates cloud security incident detection, investigation, and response workflows. The system addresses the challenge of rapidly increasing cloud threats and alert volumes by providing automated triage, investigation assistance, and incident response recommendations for cloud security teams. Charlotte AI integrates with CrowdStrike's Falcon platform to analyze security events, correlate cloud control plane and workload-level activities, and generate detailed incident reports with actionable recommendations, significantly reducing the manual effort required for tier-one security operations.

Claude Code Agent Architecture: Single-Threaded Master Loop for Autonomous Coding

Anthropic

Anthropic's Claude Code implements a production-ready autonomous coding agent using a deceptively simple architecture centered around a single-threaded master loop (codenamed nO) enhanced with real-time steering capabilities, comprehensive developer tools, and controlled parallelism through limited sub-agent spawning. The system addresses the complexity of autonomous code generation and editing by prioritizing debuggability and transparency over multi-agent swarms, using a flat message history design with TODO-based planning, diff-based workflows, and robust safety measures including context compression and permission systems. The architecture achieved significant user engagement, requiring Anthropic to implement weekly usage limits due to users running Claude Code continuously, demonstrating the effectiveness of the simple-but-disciplined approach to agentic system design.

Clinical-Grade Patient Education Agent with LangGraph and LangSmith

Lubu Labs

Lubu Labs built a production AI agent for a digital health platform that helps patients understand their health test results from camera-based scans measuring 30+ vital signs. The system needed to provide plain-language medical explanations, answer follow-up questions conversationally, and route uncertain cases to clinicians—all while meeting healthcare regulatory requirements. The solution used LangGraph for explicit control flow with confidence-based routing decisions, RAG over a versioned medical knowledge base, and LangSmith for audit-grade observability. Key results included approximately 15% of conversations appropriately triggering human review, an 80% accuracy rate in routing decisions validated by clinicians, a 40% reduction in false positive reviews after threshold tuning, and very low rates of inappropriate clinical advice in production validated through weekly audits.

Cloud-Native Synthetic Data Generator for Data Pipeline Testing

GoDaddy

GoDaddy faced challenges in testing data pipelines without production data due to privacy concerns and the labor-intensive nature of manual test data creation. They built a cloud-native synthetic data generator that combines LLM intelligence (via their internal GoCode API) with scalable traditional data generation tools (Databricks Labs Datagen and EMR Serverless). The system uses LLMs to understand schemas and automatically generate intelligent data generation templates rather than generating each row directly, achieving a 99.9% cost reduction compared to pure LLM generation. This hybrid approach resulted in a 90% reduction in time spent creating test data, complete elimination of production data in test environments, and 5x faster pipeline development cycles.

Collaborative Prompt Engineering Platform for Production LLM Development

LinkedIn

LinkedIn developed a collaborative prompt engineering platform using Jupyter Notebooks to bridge the gap between technical and non-technical teams in developing LLM-powered features. The platform enabled rapid prototyping and testing of prompts, with built-in access to test data and external APIs, leading to successful deployment of features like AccountIQ which reduced company research time from two hours to five minutes. The solution addressed challenges in LLM configuration management, prompt template handling, and cross-functional collaboration while maintaining production-grade quality.

Company-Wide AI Integration: From Experimentation to Production at Scale

Trivago

Trivago transformed its approach to AI between 2023 and 2025, moving from isolated experimentation to company-wide integration across nearly 700 employees. The problem addressed was enabling a relatively small workforce to achieve outsized impact through AI tooling and cultural transformation. The solution involved establishing an AI Ambassadors group, deploying internal AI tools like trivago Copilot (used daily by 70% of employees), implementing governance frameworks for tool procurement and compliance, and fostering knowledge-sharing practices across departments. Results included over 90% daily or weekly AI adoption, 16 days saved per person per year through AI-driven efficiencies (doubled from 2023), 70% positive sentiment toward AI tools, and concrete production deployments including an IT support chatbot with 35% automatic resolution rate, automated competitive intelligence systems, and AI-powered illustration agents for internal content creation.

Company-Wide GenAI Transformation Through Hackathon-Driven Culture and Centralized Infrastructure

Agoda

Agoda transformed from GenAI experiments to company-wide adoption through a strategic approach that began with a 2023 hackathon, grew into a grassroots culture of exploration, and was supported by robust infrastructure including a centralized GenAI proxy and internal chat platform. Starting with over 200 developers prototyping 40+ ideas, the initiative evolved into 200+ applications serving both internal productivity (73% employee adoption, 45% of tech support tickets automated) and customer-facing features, demonstrating how systematic enablement and community-driven innovation can scale GenAI across an entire organization.

Comprehensive Debugging and Observability Framework for Production Agent AI Systems

DocuSign

The presentation addresses the critical challenge of debugging and maintaining agent AI systems in production environments. While many organizations are eager to implement and scale AI agents, they often hit productivity plateaus due to insufficient tooling and observability. The speaker proposes a comprehensive rubric for assessing AI agent systems' operational maturity, emphasizing the need for complete visibility into environment configurations, system logs, model versioning, prompts, RAG implementations, and fine-tuning pipelines across the entire organization.

Comprehensive LLM Evaluation Framework for Production AI Code Assistants

Github

Github describes their robust evaluation framework for testing and deploying new LLM models in their Copilot product. The team runs over 4,000 offline tests, including automated code quality assessments and chat capability evaluations, before deploying any model changes to production. They use a combination of automated metrics, LLM-based evaluation, and manual testing to assess model performance, quality, and safety across multiple programming languages and frameworks.

Comprehensive Security and Risk Management Framework for Enterprise LLM Deployments

PredictionGuard

PredictionGuard presents a comprehensive framework for addressing key challenges in deploying LLMs securely in enterprise environments. The case study outlines solutions for hallucination detection, supply chain vulnerabilities, server security, data privacy, and prompt injection attacks. Their approach combines traditional security practices with AI-specific safeguards, including the use of factual consistency models, trusted model registries, confidential computing, and specialized filtering layers, all while maintaining reasonable latency and performance.

Contact Center Transformation with AI-Powered Customer Service and Agent Assistance

Canada Life

Canada Life, a leading financial services company serving 14 million customers (one in three Canadians), faced significant contact center challenges including 5-minute average speed to answer, wait times up to 40 minutes, complex routing, high transfer rates, and minimal self-service options. The company migrated 21 business units from a legacy system to Amazon Connect in 7 months, implementing AI capabilities including chatbots, call summarization, voice-to-text, automated authentication, and proficiency-based routing. Results included 94% reduction in wait time, 10% reduction in average handle time, $7.5 million savings in first half of 2025, 92% reduction in average speed to answer (now 18 seconds), 83% chatbot containment rate, and 1900 calls deflected per week. The company plans to expand AI capabilities including conversational AI, agent assist, next best action, and fraud detection, projecting $43 million in cost savings over five years.

Context Engineering and Agent Development at Scale: Building Open Deep Research

LangChain

Lance Martin from LangChain discusses the emerging discipline of "context engineering" through his experience building Open Deep Research, a deep research agent that evolved over a year to become the best-performing open-source solution on Deep Research Bench. The conversation explores how managing context in production agent systems—particularly across dozens to hundreds of tool calls—presents challenges distinct from simple prompt engineering, requiring techniques like context offloading, summarization, pruning, and multi-agent isolation. Martin's iterative development journey illustrates the "bitter lesson" for AI engineering: structured workflows that work well with current models can become bottlenecks as models improve, requiring engineers to continuously remove structure and embrace more general approaches to capture exponential model improvements.

Context Engineering and Tool Design for Background Coding Agents at Scale

Spotify

Spotify deployed a background coding agent to automate large-scale software maintenance across thousands of repositories, initially experimenting with open-source tools like Goose and Aider before building a custom agentic loop, and ultimately adopting Claude Code with the Anthropic Agent SDK. The primary challenge shifted from building the agent to effective context engineering—crafting prompts that produce reliable, mergeable pull requests at scale. Through extensive experimentation, Spotify developed prompt engineering principles (tailoring to the agent, stating preconditions, using examples, defining end states through tests) and designed a constrained tool ecosystem (limited bash commands, custom verify tool, git tool) to maintain predictability. The system has successfully merged approximately 50 migrations with thousands of AI-generated pull requests into production, demonstrating that careful prompt design and strategic tool limitation are critical for production LLM deployments in code generation scenarios.

Context Engineering for Agentic AI Systems

Dropbox

Dropbox evolved their Dash AI assistant from a traditional RAG-based search system into an agentic AI capable of interpreting, summarizing, and acting on information. As they added more tools and capabilities, they encountered "analysis paralysis" where too many tool options degraded model performance and accuracy, particularly in longer-running jobs. Their solution centered on context engineering: limiting tool definitions by consolidating retrieval through a universal search index, filtering context using a knowledge graph to surface only relevant information, and introducing specialized agents for complex tasks like query construction. These strategies improved decision-making speed, reduced token consumption, and maintained model focus on the actual task rather than tool selection.

Context Engineering for AI-Assisted Employee Onboarding

Etsy

Etsy explored using prompt engineering as an alternative to fine-tuning for AI-assisted employee onboarding, focusing on Travel & Entertainment policy questions and community forum support. They implemented a RAG-style approach using embeddings-based search to augment prompts with relevant Etsy-specific documents. The system achieved 86% accuracy on T&E policy questions and 72% on community forum queries, with various prompt engineering techniques like chain-of-thought reasoning and source citation helping to mitigate hallucinations and improve reliability.

Context Engineering for Background Coding Agents at Scale

Spotify

Spotify built a background coding agent system to automate large-scale software maintenance and migrations across thousands of repositories. The company initially experimented with open-source agents like Goose and Aider, then built a custom agentic loop, before ultimately adopting Claude Code from Anthropic. The core challenge centered on context engineering—crafting effective prompts and selecting appropriate tools to enable the agent to reliably generate mergeable pull requests. By developing sophisticated prompt engineering practices and carefully constraining the agent's toolset, Spotify has successfully applied this system to approximately 50 migrations with thousands of merged PRs across hundreds of repositories.

Context Engineering for Production AI Agents at Scale

Manus

Manus, a general AI agent platform, addresses the challenge of context explosion in long-running autonomous agents that can accumulate hundreds of tool calls during typical tasks. The company developed a comprehensive context engineering framework encompassing five key dimensions: context offloading (to file systems and sandbox environments), context reduction (through compaction and summarization), context retrieval (using file-based search tools), context isolation (via multi-agent architectures), and context caching (for KV cache optimization). This approach has been refined through five major refactors since launch in March, with the system supporting typical tasks requiring around 50 tool calls while maintaining model performance and managing token costs effectively through their layered action space architecture.

Context Engineering for Production AI Assistants at Scale

Spotify

Shopify developed Sidekick, an AI assistant serving millions of merchants on their commerce platform. The challenge was managing context windows effectively while maintaining performance, latency, and cost efficiency for an agentic system operating at massive scale. Their solution involved sophisticated "context engineering" techniques including aggressive token management (removing processed tool messages, trimming old conversation turns), a three-tier memory system (explicit user preferences, implicit user profiles, and episodic memory via RAG), and just-in-time instruction injection that collocates instructions with tool outputs. These techniques reportedly improved instruction adherence by 5-10% while reducing jailbreak likelihood and maintaining acceptable latency despite the system managing over 20 tools and handling complex multi-step agentic workflows.

Context Engineering Platform for Multi-Domain RAG and Agentic Systems

Contextual

Contextual has developed an end-to-end context engineering platform designed to address the challenges of building production-ready RAG and agentic systems across multiple domains including e-commerce, code generation, and device testing. The platform combines multimodal ingestion, hierarchical document processing, hybrid search with reranking, and dynamic agents to enable effective reasoning over large document collections. In a recent context engineering hackathon, Contextual's dynamic agent achieved competitive results on a retail dataset of nearly 100,000 documents, demonstrating the value of constrained sub-agents, turn limits, and intelligent tool selection including MCP server management.

Context Engineering Strategies for Production AI Agents

Manus

Manus AI developed a production AI agent system that uses context engineering instead of fine-tuning to enable rapid iteration and deployment. The company faced the challenge of building an effective agentic system that could operate reliably at scale while managing complex multi-step tasks. Their solution involved implementing several key strategies including KV-cache optimization, tool masking instead of removal, file system-based context management, attention manipulation through task recitation, and deliberate error preservation for learning. These approaches allowed Manus to achieve faster development cycles, improved cost efficiency, and better agent performance across millions of users while maintaining system stability and scalability.

Context Rot: Evaluating LLM Performance Degradation with Increasing Input Tokens

ChromaDB

ChromaDB's technical report examines how large language models (LLMs) experience performance degradation as input context length increases, challenging the assumption that models process context uniformly. Through evaluation of 18 state-of-the-art models including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 across controlled experiments, the research reveals that model reliability decreases significantly with longer inputs, even on simple tasks like retrieval and text replication. The study demonstrates that factors like needle-question similarity, presence of distractors, haystack structure, and semantic relationships all impact performance non-uniformly as context length grows, suggesting that current long-context benchmarks may not adequately reflect real-world performance challenges.

Context-Aware AI Code Generation and Assistant at Scale

Windsurf

Windsurf, an AI coding toolkit company, addresses the challenge of generating contextually relevant code for individual developers and organizations. While generating generic code has become straightforward, the real challenge lies in producing code that fits into existing large codebases, adheres to organizational standards, and aligns with personal coding preferences. Windsurf's solution centers on a sophisticated context management system that combines user behavioral heuristics (cursor position, open files, clipboard content, terminal activity) with hard evidence from the codebase (code, documentation, rules, memories). Their approach optimizes for relevant context selection rather than simply expanding context windows, leveraging their background in GPU optimization to efficiently find and process relevant context at scale.

Context-Aware Item Recommendations Using Hybrid LLM and Embedding-Based Retrieval

DoorDash

DoorDash's Core Consumer ML team developed a GenAI-powered context shopping engine to address the challenge of lost user intent during in-app searches for items like "fresh vegetarian sushi." The traditional search system struggled to preserve specific user context, leading to generic recommendations and decision fatigue. The team implemented a hybrid approach combining embedding-based retrieval (EBR) using FAISS with LLM-based reranking to balance speed and personalization. The solution achieved end-to-end latency of approximately six seconds with store page loads under two seconds, while significantly improving user satisfaction through dynamic, personalized item carousels that maintained user context and preferences. This hybrid architecture proved more practical than pure LLM or deep neural network approaches by optimizing for both performance and cost efficiency.

Context-Seeking Conversational AI for Health Information Navigation

Google

Google Research developed a "Wayfinding AI" prototype based on Gemini to address the challenge of people struggling to find relevant, personalized health information online. Through formative user research with 33 participants and iterative design, they created an AI agent that proactively asks clarifying questions to understand user goals and context before providing answers. In a randomized study with 130 participants, the Wayfinding AI was significantly preferred over a baseline Gemini model across multiple dimensions including helpfulness, relevance, goal understanding, and tailoring, demonstrating that a context-seeking, conversational approach creates more empowering health information experiences than traditional question-answering systems.

Contextual Agent Playbooks and Tools: Enterprise-Scale AI Coding Agent Integration

LinkedIn

LinkedIn faced the challenge that while AI coding agents were powerful, they lacked organizational context about the company's thousands of microservices, internal frameworks, data infrastructure, and specialized systems. To address this, they built CAPT (Contextual Agent Playbooks & Tools), a unified framework built on the Model Context Protocol (MCP) that provides AI agents with access to internal tools and executable playbooks encoding institutional workflows. The system enables over 1,000 engineers to perform complex tasks like experiment cleanup, data analysis, incident debugging, and code review with significant productivity gains: 70% reduction in issue triage time, 3× faster data analysis workflows, and automated debugging that cuts time spent by more than half in many cases.

Conversational AI Data Agent for Financial Analytics

Uber

Uber developed Finch, a conversational AI agent integrated into Slack, to address the inefficiencies of traditional financial data retrieval processes where analysts had to manually navigate multiple platforms, write complex SQL queries, or wait for data science team responses. The solution leverages generative AI, RAG, and self-querying agents to transform natural language queries into structured data retrieval, enabling real-time financial insights while maintaining enterprise-grade security through role-based access controls. The system reportedly reduces query response times from hours or days to seconds, though the text lacks quantified performance metrics or third-party validation of claimed benefits.

Converting Natural Language to Structured GraphQL Queries Using LLMs

Cato Networks

Cato Networks implemented a natural language search interface for their SASE management console's events page using Amazon Bedrock's foundation models. They transformed free-text queries into structured GraphQL queries by employing prompt engineering and JSON schema validation, reducing query time from minutes to near-instant while making the system more accessible to new users and non-English speakers. The solution achieved high accuracy with an error rate below 0.05 while maintaining reasonable costs and latency.

Cost Optimization and Performance Panel Discussion: Strategies for Running LLMs in Production

Various

A panel discussion featuring experts from Neva, Intercom, Prompt Layer, and OctoML discussing strategies for optimizing costs and performance when running LLMs in production. The panel explores various approaches from using API services to running models in-house, covering topics like model compression, hardware selection, latency optimization, and monitoring techniques. Key insights include the trade-offs between API usage and in-house deployment, strategies for cost reduction, and methods for performance optimization.

Cost-Effective LLM Transaction Categorization for Business Banking

ANNA

ANNA, a UK business banking provider, implemented LLMs to automate transaction categorization for tax and accounting purposes across diverse business types. They achieved this by combining traditional ML with LLMs, particularly focusing on context-aware categorization that understands business-specific nuances. Through strategic optimizations including offline predictions, improved context utilization, and prompt caching, they reduced their LLM costs by 75% while maintaining high accuracy in their AI accountant system.

Customer Service Transformation with AI-Based Email Automation and Chatbot Implementation

Sixt

Sixt, a mobility service provider with over €4 billion in revenue, transformed their customer service operations using generative AI to handle the complexity of multiple product lines across 100+ countries. The company implemented "Project AIR" (AI-based Replies) to automate email classification, generate response proposals, and deploy chatbots across multiple channels. Within five months of ideation, they moved from proof-of-concept to production, achieving over 90% classification accuracy using Amazon Bedrock with Anthropic Claude models (up from 70% with out-of-the-box solutions), while reducing classification costs by 70%. The solution now handles customer inquiries in multiple languages, integrates with backend reservation systems, and has expanded from email automation to messaging and chatbot services deployed across all corporate countries by Q1 2025.

Data and AI Governance Integration in Enterprise GenAI Adoption

Various

A panel discussion featuring leaders from Mercado Libre, ATB Financial, LBLA retail, and Collibra discussing how they are implementing data and AI governance in the age of generative AI. The organizations are leveraging Google Cloud's Dataplex and other tools to enable comprehensive data governance, while also exploring GenAI applications for automating governance tasks, improving data discovery, and enhancing data quality management. Their approaches range from careful regulated adoption in finance to rapid e-commerce implementation, all emphasizing the critical connection between solid data governance and successful AI deployment.

Data Engineering Challenges and Best Practices in LLM Production

QuantumBlack

Data engineers from QuantumBlack discuss the evolving landscape of data engineering with the rise of LLMs, highlighting key challenges in handling unstructured data, maintaining data quality, and ensuring privacy. They share experiences dealing with vector databases, data freshness in RAG applications, and implementing proper guardrails when deploying LLM solutions in enterprise settings.

Debating the Value and Future of LLMOps: Industry Perspectives

Various

A detailed discussion between Patrick Barker (CTO of Guaros) and Farud (ML Engineer from Iran) about the relevance and future of LLMOps, with Patrick arguing that LLMOps represents a distinct field from traditional MLOps due to different user profiles and tooling needs, while Farud contends that LLMOps may be overhyped and should be viewed as an extension of existing MLOps practices rather than a separate discipline.

Democratizing Prompt Engineering Through Platform Architecture and Employee Empowerment

Pinterest

Pinterest developed a comprehensive LLMOps platform strategy to enable their 570 million user visual discovery platform to rapidly adopt generative AI capabilities. The company built a multi-layered architecture with vendor-agnostic model access, centralized proxy services, and employee-facing tools, combined with innovative training approaches like "Prompt Doctors" and company-wide hackathons. Their solution included automated batch labeling systems, a centralized "Prompt Hub" for prompt development and evaluation, and an "AutoPrompter" system that uses LLMs to automatically generate and optimize prompts through iterative critique and refinement. This approach enabled non-technical employees to become effective prompt engineers, resulted in the fastest-adopted platform at Pinterest, and demonstrated that democratizing AI capabilities across all employees can lead to breakthrough innovations.

Deploying Agentic AI for Clinical Trial Protocol Deviation Monitoring

Bayezian Limited

Bayezian Limited deployed a multi-agent AI system to monitor protocol deviations in clinical trials, where traditional manual review processes were time-consuming and error-prone. The system used specialized LLM agents, each responsible for checking specific protocol rules (visit timing, medication use, inclusion criteria, etc.), working on top of a pipeline that processed clinical documents and used FAISS for semantic retrieval of protocol requirements. While the system successfully identified patterns early and improved reviewer efficiency by shifting focus from manual checking to intelligent triage, it encountered significant challenges including handover failures between agents, memory lapses causing coordination breakdowns, and difficulties handling real-world data ambiguities like time windows and exceptions. The team improved performance through structured memory snapshots, flexible prompt engineering, stronger handoff signals, and process tracking, ultimately creating a useful but imperfect system that highlighted the gap between agentic AI theory and production reality.

Deploying Agentic Code Review at Scale with GPT-5 Codex

OpenAI

OpenAI addresses the challenge of verifying AI-generated code at scale by deploying an autonomous code reviewer built on GPT-5-Codex and GPT-5.1-Codex-Max. As autonomous coding systems produce code volumes that exceed human oversight capacity, the risk of severe bugs and vulnerabilities increases. The solution involves training a dedicated agentic code reviewer with repository-wide tool access and code execution capabilities, optimizing for precision over recall to maintain developer trust and minimize false alarms. The system now reviews over 100,000 external PRs daily, with authors making code changes in response to 52.7% of comments internally, demonstrating actionable impact while maintaining a low "alignment tax" on developer workflows.

Deploying AI Agents for Scalable Immigration Automation

Navismart AI

Navismart AI developed a multi-agent AI system to automate complex immigration processes that traditionally required extensive human expertise. The platform addresses challenges including complex sequential workflows, varying regulatory compliance across different countries, and the need for human oversight in high-stakes decisions. Built on a modular microservices architecture with specialized agents handling tasks like document verification, form filling, and compliance checks, the system uses Kubernetes for orchestration and scaling. The solution integrates REST APIs for inter-agent communication, implements end-to-end encryption for security, and maintains human-in-the-loop capabilities for critical decisions. The team started with US immigration processes due to their complexity and is expanding to other countries and domains like education.

Deploying AI Coding Agents in Highly Regulated Environments with Secure Infrastructure

ONA

ONA addresses the challenge faced by companies in highly regulated sectors (finance, government) that need to leverage AI coding assistants while maintaining strict data security and compliance requirements. The problem stems from the fact that many organizations initially ban AI tools like ChatGPT due to data leakage concerns, but employees use them anyway (with surveys showing 45% admit using banned AI tools and 58% sending sensitive data to public AI services). ONA's solution is a software engineering agent platform that runs entirely within the customer's own virtual private cloud (VPC), using isolated disposable development environments (virtual machines with dev containers), providing admin controls and audit logs, and ensuring all data remains within the customer's network with client-side encryption. The platform enables secure AI-assisted development with direct connections to customers' Git providers and LLM services without ONA accessing any code or sensitive data.

Deploying an AI SDR Chatbot for Lead Qualification with Production-Grade Observability

Lubu Labs

Lubu Labs deployed an AI SDR (Sales Development Representative) chatbot for a loyalty platform to qualify inbound leads, answer product questions, and route conversations appropriately. The implementation faced challenges around quality drift on real traffic, debugging complex tool and model interactions, and occasional duplicate CRM actions that could damage revenue operations. The team used LangSmith's tracing, feedback loops, and evaluation workflows to make the system debuggable and production-ready, implementing idempotent tool calls, structured state management with LangGraph, and regression testing against representative conversation datasets to ensure reliable operation.

Deploying Generative AI at Scale Across 5,000 Developers

Liberty IT

Liberty IT, the technology division of Fortune 100 insurance company Liberty Mutual, embarked on a large-scale deployment of generative AI tools across their global workforce of over 5,000 developers and 50,000+ employees. The initiative involved rolling out custom GenAI platforms including Liberty GPT (an internal ChatGPT variant) to 70% of employees and GitHub Copilot to over 90% of IT staff within the first year. The company faced challenges including rapid technology evolution, model availability constraints, cost management, RAG implementation complexity, and achieving true adoption beyond basic usage. Through building a centralized AI platform with governance controls, implementing comprehensive learning programs across six streams, supporting 28 different models optimized for various use cases, and developing custom dashboards for cost tracking and observability, Liberty IT successfully navigated these challenges while maintaining enterprise security and compliance requirements.

Deploying Secure AI Agents in Highly Regulated Financial and Gaming Environments

Sicoob / Holland Casino

Two organizations operating in highly regulated industries—Sicoob, a Brazilian cooperative financial institution, and Holland Casino, a government-mandated Dutch gaming operator—share their approaches to deploying generative AI workloads while maintaining strict compliance requirements. Sicoob built a scalable infrastructure using Amazon EKS with GPU instances, leveraging open-source tools like Karpenter, KEDA, vLLM, and Open WebUI to run multiple open-source LLMs (Llama, Mistral, DeepSeek, Granite) for code generation, robotic process automation, investment advisory, and document interaction use cases, achieving cost efficiency through spot instances and auto-scaling. Holland Casino took a different path, using Anthropic's Claude models via Amazon Bedrock and developing lightweight AI agents using the Strands framework, later deploying them through Bedrock Agent Core to provide management stakeholders with self-service access to cost, security, and operational insights. Both organizations emphasized the importance of security, governance, compliance frameworks (including ISO 42001 for AI), and responsible AI practices while demonstrating that regulatory requirements need not inhibit AI adoption when proper architectural patterns and AWS services are employed.

Detecting and Mitigating Prompt Injection via Control Characters in ChatGPT

Dropbox

Dropbox's security team discovered that control characters like backspace and carriage return can be used to circumvent prompt constraints in OpenAI's GPT-3.5 and GPT-4 models. By inserting large sequences of these characters, they were able to make the models forget context and instructions, leading to prompt injection vulnerabilities. This research revealed previously undocumented behavior that could be exploited in LLM-powered applications, highlighting the importance of proper input sanitization for secure LLM deployments.

Distributed Agent Systems Architecture for AI Agent Platform

Dust.tt

Dust.tt, an AI agent platform that allows users to build custom AI agents connected to their data and tools, presented their technical approach to building distributed agent systems at scale. The company faced challenges with their original synchronous, stateless architecture when deploying AI agents that could run for extended periods, handle tool orchestration, and maintain state across failures. Their solution involved redesigning their infrastructure around a continuous orchestration loop with versioning systems for idempotency, using Temporal workflows for coordination, and implementing a database-driven communication protocol between agent components. This architecture enables reliable, scalable deployment of AI agents that can handle complex multi-step tasks while surviving infrastructure failures and preventing duplicate actions.

Document Metadata Extraction at Scale Using Generative AI for Healthcare and Financial Services

AArete

AArete, a management and technology consulting firm serving healthcare payers and financial services, developed Doxy AI to extract structured metadata from complex business documents like provider and vendor contracts. The company evolved from manual document processing (100 documents per week per person) through rules-based approaches (50-60% accuracy) to a generative AI solution built on AWS Bedrock using Anthropic's Claude models. The production system achieved 99% accuracy while processing up to 500,000 documents per week, resulting in a 97% reduction in manual effort and $330 million in client savings through improved contract analysis, claims overpayment identification, and operational efficiency.

Document Processing Automation with LLMs: Evolution of Evaluation Strategies

Tola Capital / Klarity

Klarity, a document processing automation company, transformed their approach to evaluating LLM systems in production as they moved from traditional ML to generative AI. The company processes over half a million documents for B2B SaaS customers, primarily handling complex financial and accounting workflows. Their journey highlights the challenges and solutions in developing robust evaluation frameworks for LLM-powered systems, particularly focusing on non-deterministic performance, rapid feature development, and the gap between benchmark performance and real-world results.

Document-Wide AI Editing in Microsoft Word Add-In

Harvey

Harvey developed an AI-powered Word Add-In that enables comprehensive document-wide edits on 100+ page legal documents through a single query. The system addresses the challenges of OOXML complexity by creating reversible mappings between document structure and natural language, while using an orchestrator-subagent architecture to overcome position bias and ensure thorough coverage. The solution transforms hours of manual legal editing into seamless single-query interactions, supporting complex use cases like contract conformance, template creation, and jurisdiction-specific adaptations.

Dogfooding AI Features in GitLab's Development Workflow

Gitlab

GitLab shares their experience of integrating and testing their AI-powered features suite, GitLab Duo, within their own development workflows. The case study demonstrates how different teams within GitLab leverage AI capabilities for various tasks including code review, documentation, incident response, and feature testing. The implementation has resulted in significant efficiency gains, reduced manual effort, and improved quality across their development processes.

Domain Adaptation of LLMs for Enterprise Use Through Multi-Task Fine-Tuning

Wix

Wix developed a customized LLM for their enterprise needs by applying multi-task supervised fine-tuning (SFT) and domain adaptation using full weights fine-tuning (DAPT). Despite having limited data and tokens, their smaller customized model outperformed GPT-3.5 on various Wix-specific tasks. The project focused on three key components: comprehensive evaluation benchmarks, extensive data collection methods, and advanced modeling processes to achieve full domain adaptation capabilities.

Domain-Adapted Foundation Models for Enterprise-Scale LLM Deployment

LinkedIn

LinkedIn developed a family of domain-adapted foundation models (EON models) to enhance their GenAI capabilities across their platform serving 1B+ members. By adapting open-source models like Llama through multi-task instruction tuning and safety alignment, they created cost-effective models that maintain high performance while being 75x more cost-efficient than GPT-4. The EON-8B model demonstrated significant improvements in production applications, including a 4% increase in candidate-job-requirements matching accuracy compared to GPT-4o mini in their Hiring Assistant product.

Domain-Native LLM Application for Healthcare Insurance Administration

Anterior

Anterior, a clinician-led healthcare technology company, developed an AI system called Florence to automate medical necessity reviews for health insurance providers covering 50 million lives in the US. The company addressed the "last mile problem" in LLM applications by building an adaptive domain intelligence engine that enables domain experts to continuously improve model performance through systematic failure analysis, domain knowledge injection, and iterative refinement. Through this approach, they achieved 99% accuracy in care request approvals, moving beyond the 95% baseline achieved through model improvements alone.

Domain-Specific Agentic AI for Personalized Korean Skincare Recommendations

Glowe / Weaviate

Glowe, developed by Weaviate, addresses the challenge of finding effective skincare product combinations by building a domain-specific AI agent that understands Korean skincare science. The solution leverages dual embedding strategies with TF-IDF weighting to capture product effects from 94,500 user reviews, uses Weaviate's vector database for similarity search, and employs Gemini 2.5 Flash for routine generation. The system includes an agentic chat interface powered by Elysia that provides real-time personalized guidance, resulting in scientifically-grounded skincare recommendations based on actual user experiences rather than marketing claims.

DoorDash Summer 2025 Intern Projects: LLM-Powered Feature Extraction and RAG Chatbot Infrastructure

Doordash

DoorDash's Summer 2025 interns developed multiple LLM-powered production systems to solve operational challenges. The first project automated never-delivered order feature extraction using a custom DistilBERT model that processes customer-Dasher conversations, achieving 0.8289 F1 score while reducing manual review burden. The second built a scalable chatbot-as-a-service platform using RAG architecture, enabling any team to deploy knowledge-based chatbots with centralized embedding management and customizable prompt templates. These implementations demonstrate practical LLMOps approaches including model comparison, data balancing techniques, and infrastructure design for enterprise-scale conversational AI systems.

Dutch YouTube Interface Localization and Content Management

Tastewise

This appears to be the Dutch footer section of YouTube's interface, showcasing the platform's localization and content management system. However, without more context about specific LLMOps implementation details, we can only infer that YouTube likely employs language models for content translation, moderation, and user interface localization.

Dynamic Context Discovery for Production Coding Agents

Cursor

Cursor, a coding agent platform, developed a "dynamic context discovery" approach to optimize how their AI agents use context windows and token budgets when working on long-running software development tasks. Instead of loading all potentially relevant information upfront (static context), their system enables agents to dynamically pull only the necessary context as needed. They implemented five key techniques: converting long tool outputs to files, using chat history files during summarization, supporting the Agent Skills standard, selectively loading MCP tools (reducing tokens by 46.9%), and treating terminal sessions as files. This approach improves token efficiency and response quality by reducing context window bloat and preventing information overload for the underlying LLM.

Dynamic Knowledge and Instruction RAG System for Production Chatbots

Wix

Wix developed an innovative approach to enhance their AI Site-Chat system by creating a hybrid framework that combines LLMs with traditional machine learning classifiers. They introduced DDKI-RAG (Dynamic Domain Knowledge and Instruction Retrieval-Augmented Generation), which addresses limitations of traditional RAG systems by enabling real-time learning and adaptability based on site owner feedback. The system uses a novel classification approach combining LLMs for feature extraction with CatBoost for final classification, allowing chatbots to continuously improve their responses and incorporate unwritten domain knowledge.

Dynamic LLM Selection and Prompt Optimization Through Automated Evaluation and User Feedback

Beekeeper

Beekeeper, a digital workplace platform for frontline workers, faced the challenge of selecting and optimizing LLMs and prompts across rapidly evolving models while personalizing responses for different users and use cases. They built an Amazon Bedrock-powered system that continuously evaluates multiple model/prompt combinations using synthetic test data and real user feedback, ranks them on a live leaderboard based on quality, cost, and speed metrics, and automatically routes requests to the best-performing option. The system also mutates prompts based on user feedback to create personalized variations while using drift detection to ensure quality standards are maintained. This approach resulted in 13-24% better ratings on responses when aggregated per tenant, reduced manual labor in model selection, and enabled rapid adaptation to new models and user preferences.

Dynamic Prompt Injection for Reliable AI Agent Behavior

Control Plain

Control Plain addressed the challenge of unreliable AI agent behavior in production environments by developing "intentional prompt injection," a technique that dynamically injects relevant instructions at runtime based on semantic matching rather than bloating system prompts with edge cases. Using an airline customer support agent as their test case, they demonstrated that this approach improved reliability from 80% to 100% success rates on challenging passenger modification scenarios while maintaining clean, maintainable prompts and avoiding "prompt debt."

Email Classification System Using Foundation Models and Prompt Engineering

Travelers Insurance

Travelers Insurance developed an automated email classification system using Amazon Bedrock and Anthropic's Claude models to categorize millions of service request emails into 13 different categories. Through advanced prompt engineering techniques and without model fine-tuning, they achieved 91% classification accuracy, potentially saving tens of thousands of manual processing hours. The system combines email text analysis, PDF processing using Amazon Textract, and foundation model-based classification in a serverless architecture.

Emotionally Aware AI Tutoring Agents with Multimodal Affect Detection

GlowingStar

GlowingStar Inc. develops emotionally aware AI tutoring agents that detect and respond to learner emotional states in real-time to provide personalized learning experiences. The system addresses the gap in current AI agents that focus solely on cognitive processing without emotional attunement, which is critical for effective learning and engagement. By incorporating multimodal affect detection (analyzing tone of voice, facial expressions, interaction patterns, latency, and silence) into an expanded agent architecture, the platform aims to deliver world-class personalized education while navigating significant challenges around emotional data privacy, cross-cultural generalization, and ethical deployment in sensitive educational contexts.

Empowering Non-Technical Domain Experts to Drive AI Quality in Conversational AI

Portola

Portola built Tolan, an AI companion app focused on creating authentic emotional connections through natural voice conversations. The challenge was ensuring conversation quality, emotional intelligence, and authentic behavior—qualities that couldn't be captured by automated evaluations alone. Portola's solution involved creating a workflow that empowered non-technical subject matter experts (behavioral researchers, writers, game designers) to review logs, curate problem-specific datasets, iterate on prompts using playground environments, and deploy changes directly to production without engineering handoffs. This approach resulted in a 4x improvement in prompt iteration velocity and systematic improvements in conversation quality, memory authenticity, and brand voice consistency.

End-to-End LLM Observability for RAG-Powered AI Assistant

Splunk

Splunk built an AI Assistant leveraging Retrieval-Augmented Generation (RAG) to answer FAQs using curated public content from .conf24 materials. The system was developed in a hackathon-style sprint using their internal CIRCUIT platform. To operationalize this LLM-powered application at scale, Splunk integrated comprehensive observability across the entire RAG pipeline—from prompt handling and document retrieval to LLM generation and output evaluation. By instrumenting structured logs, creating unified dashboards in Splunk Observability Cloud, and establishing proactive alerts for quality degradation, hallucinations, and cost overruns, they achieved full visibility into response quality, latency, source document reliability, and operational health. This approach enabled rapid iteration, reduced mean time to resolution for quality issues, and established reproducible governance practices for production LLM deployments.

Engineering Principles and Practices for Production LLM Systems

Langchain

This case study captures insights from Lance Martin, ML engineer at Langchain, discussing the evolution from traditional ML to LLM-based systems and the emerging engineering discipline of building production GenAI applications. The discussion covers key challenges including the shift from model training to model orchestration, the need to continuously rearchitect systems as foundation models rapidly improve, and the critical importance of context engineering to manage token usage and prevent context degradation. Solutions explored include workflow versus agent architectures, the three-part context engineering playbook (reduce, offload, isolate), and evaluation strategies that emphasize user feedback and tracing over static benchmarks. Results demonstrate that teams like Manis have rearchitected their systems five times since March 2025, and that simpler approaches with proper observability often outperform complex architectures, with the understanding that today's solutions must be rebuilt as models improve.

Enhanced Agentic RAG for On-Call Engineering Support

Uber

Uber developed Genie, an internal on-call copilot that uses an enhanced agentic RAG (EAg-RAG) architecture to provide real-time support for engineering security and privacy queries through Slack. The system addressed significant accuracy issues in traditional RAG approaches by implementing LLM-powered agents for query optimization, source identification, and context refinement, along with enriched document processing that improved table extraction and metadata enhancement. The enhanced system achieved a 27% relative improvement in acceptable answers and a 60% relative reduction in incorrect advice, enabling deployment across critical security and privacy channels while reducing the support load on subject matter experts and on-call engineers.

Enhanced Agentic-RAG for Internal On-Call Support Copilot

Uber

Uber developed Genie, an internal on-call copilot powered by LLMs, to provide real-time support for engineering queries in Slack. When initial testing revealed significant accuracy issues with responses in the engineering security and privacy domain, the team transitioned from traditional RAG to an Enhanced Agentic RAG (EAg-RAG) architecture. This involved enriched document processing with custom Google Docs loaders and LLM-powered content formatting, plus pre- and post-processing agents for query optimization, source identification, and context refinement. The improvements resulted in a 27% relative increase in acceptable answers and a 60% relative reduction in incorrect advice, enabling deployment across critical security and privacy channels while reducing the support load on subject matter experts.

Enhancing E-commerce Search with GPT-based Query Expansion

Whatnot

Whatnot improved their e-commerce search functionality by implementing a GPT-based query expansion system to handle misspellings and abbreviations. The system processes search queries offline through data collection, tokenization, and GPT-based correction, storing expansions in a production cache for low-latency serving. This approach reduced irrelevant content by more than 50% compared to their previous method when handling misspelled queries and abbreviations.

Enhancing E-commerce Search with LLM-Powered Semantic Retrieval

Picnic

Picnic, an e-commerce grocery delivery company, implemented LLM-enhanced search retrieval to improve product and recipe discovery across multiple languages and regions. They used GPT-3.5-turbo for prompt-based product description generation and OpenAI's text-embedding-3-small model for embedding generation, combined with OpenSearch for efficient retrieval. The system employs precomputation and caching strategies to maintain low latency while serving millions of customers across different countries.

Enhancing E-commerce Search with LLMs at Scale

Instacart

Instacart integrated LLMs into their search stack to improve query understanding, product attribute extraction, and complex intent handling across their massive grocery e-commerce platform. The solution addresses challenges with tail queries, product attribute tagging, and complex search intents while considering production concerns like latency, cost optimization, and evaluation metrics. The implementation combines offline and online LLM processing to enhance search relevance and enable new capabilities like personalized merchandising and improved product discovery.

Enhancing Healthcare Service Delivery with RAG and LLM-Powered Search

Accolade

Accolade, facing challenges with fragmented healthcare data across multiple platforms, implemented a Retrieval Augmented Generation (RAG) solution using Databricks' DBRX model to improve their internal search capabilities and customer service. By consolidating their data in a lakehouse architecture and leveraging LLMs, they enabled their teams to quickly access accurate information and better understand customer commitments, resulting in improved response times and more personalized care delivery.

Enhancing Memory Retrieval Systems Using LangSmith Testing and Evaluation

New Computer

New Computer improved their AI assistant Dot's memory retrieval system using LangSmith for testing and evaluation. By implementing synthetic data testing, comparison views, and prompt optimization, they achieved 50% higher recall and 40% higher precision in their dynamic memory retrieval system compared to their baseline implementation.

Enhancing Vector Similarity Search with LLM-Based Reranking

Grab

Grab experimented with combining vector similarity search and LLMs to improve search result relevance. The approach uses vector similarity search (using FAISS and OpenAI embeddings) for initial candidate retrieval, followed by LLM-based reranking of results using GPT-4. Testing on synthetic datasets showed superior performance for complex queries involving constraints and negations compared to traditional vector search alone, though with comparable results for simpler queries.

Enterprise Agent Orchestration Platform for Secure LLM Deployment

Airia

This case study explores how Airia developed an orchestration platform to help organizations deploy AI agents in production environments. The problem addressed is the significant complexity and security challenges that prevent businesses from moving beyond prototype AI agents to production-ready systems. The solution involves a comprehensive platform that provides agent building capabilities, security guardrails, evaluation frameworks, red teaming, and authentication controls. Results include successful deployments across multiple industries including hospitality (customer profiling across hotel chains), HR, legal (contract analysis), marketing (personalized content generation), and operations (real-time incident response through automated data aggregation), with customers reporting significant efficiency gains while maintaining enterprise security standards.

Enterprise Agentic AI Deployment: Panel Discussion on Production Realities and Technical Bottlenecks

Various

This panel discussion features leaders from Writer, You.com, Glean, and Google discussing the current state of deploying agentic AI systems in enterprise environments. The panelists address the gap between prototype development (which can now take 90 seconds) and production-ready systems that Fortune 500 companies can rely on. They identify key technical bottlenecks including data quality and governance issues, information retrieval challenges, function calling limitations, security vulnerabilities, and the difficulty of verifying agent actions. The consensus is that while every large enterprise has built some AI agents adding business value, they are far from having 50% of enterprise work handled by AI, with action agents for larger enterprises likely requiring several more years for major adoption.

Enterprise Agentic AI for Customer Support and Sales Using Amazon Bedrock AgentCore

Swisscom

Swisscom, Switzerland's leading telecommunications provider, implemented Amazon Bedrock AgentCore to build and scale enterprise AI agents for customer support and sales operations across their organization. The company faced challenges in orchestrating AI agents across different departments while maintaining Switzerland's strict data protection compliance, managing secure cross-departmental authentication, and preventing redundant efforts. By leveraging Amazon Bedrock AgentCore's Runtime, Identity, and Memory services along with the Strands Agents framework, Swisscom deployed two B2C use cases—personalized sales pitches and automated technical support—achieving stakeholder demos within 3-4 weeks, handling thousands of monthly requests with low latency, and establishing a scalable foundation that enables secure agent-to-agent communication while maintaining regulatory compliance.

Enterprise AI Adoption Journey: From Experimentation to Core Operations

Credal

A comprehensive analysis of how enterprises adopt and scale AI/LLM technologies, based on observations from multiple companies. The journey typically progresses through four stages: early experimentation, chat with docs workflows, enterprise search, and core operations integration. The case study explores key challenges including data security, use case discovery, and technical implementation hurdles, while providing insights into critical decisions around build vs. buy, platform selection, and LLM provider strategy.

Enterprise AI Agent Development: Lessons from Production Deployments

IBM, The Zig, Augmented AI Labs

This panel discussion features three companies - IBM, The Zig, and Augmented AI Labs - sharing their experiences building and deploying AI agents in enterprise environments. The panelists discuss the challenges of scaling AI agents, including cost management, accuracy requirements, human-in-the-loop implementations, and the gap between prototype demonstrations and production realities. They emphasize the importance of conservative approaches, proper evaluation frameworks, and the need for human oversight in high-stakes environments, while exploring emerging standards like agent communication protocols and the evolving landscape of enterprise AI adoption.

Enterprise AI Agents with Structured and Unstructured Data Integration

Snowflake

This case study explores the challenges and solutions for deploying AI agents in enterprise environments, focusing on the integration of structured database data with unstructured documents through retrieval augmented generation (RAG). The presentation by Snowflake's Jeff Holland outlines a comprehensive agentic workflow that addresses common enterprise challenges including semantic mapping, ambiguity resolution, data model complexity, and query classification. The solution demonstrates a working prototype with fitness wearable company Whoop, showing how agents can combine sales data, manufacturing data, and forecasting information with unstructured Slack conversations to provide real-time business intelligence and recommendations for product launches.

Enterprise AI Platform Deployment for Multi-Company Productivity Enhancement

Payfit, Alan

This case study presents the deployment of Dust.tt's AI platform across multiple companies including Payfit and Alan, focusing on enterprise-wide productivity improvements through LLM-powered assistants. The companies implemented a comprehensive AI strategy involving both top-down leadership support and bottom-up adoption, creating custom assistants for various workflows including sales processes, customer support, performance reviews, and content generation. The implementation achieved significant productivity gains of approximately 20% across teams, with some specific use cases reaching 50% improvements, while addressing challenges around security, model selection, and user adoption through structured rollout processes and continuous iteration.

Enterprise AI Platform Integration for Secure Production Deployment

Rubrik

Predibase, a fine-tuning and model serving platform, announced its acquisition by Rubrik, a data security and governance company, with the goal of combining Predibase's generative AI capabilities with Rubrik's secure data infrastructure. The integration aims to address the critical challenge that over 50% of AI pilots never reach production due to issues with security, model quality, latency, and cost. By combining Predibase's post-training and inference capabilities with Rubrik's data security posture management, the merged platform seeks to provide an end-to-end solution that enables enterprises to deploy generative AI applications securely and efficiently at scale.

Enterprise AI Transformation: Holiday Extras' ChatGPT Enterprise Implementation Case Study

Holiday Extras

Holiday Extras successfully deployed ChatGPT Enterprise across their organization, demonstrating how enterprise-wide AI adoption can transform business operations and culture. The implementation led to significant measurable outcomes including 500+ hours saved weekly, $500k annual savings, and 95% weekly adoption rate. The company leveraged AI across multiple functions - from multilingual content creation and data analysis to engineering support and customer service - while improving their NPS from 60% to 70%. The case study provides valuable insights into successful enterprise AI deployment, showing how proper implementation can drive both efficiency gains and cultural transformation toward data-driven operations, while empowering employees across technical and non-technical roles.

Enterprise Autonomous Software Engineering with AI Droids

Factory

Factory.ai built an enterprise-focused autonomous software engineering platform using AI "droids" that can handle complex coding tasks independently. The founders met at a LangChain hackathon and developed a browser-based system that allows delegation rather than collaboration, enabling developers to assign tasks to AI agents that can work across entire codebases, integrate with enterprise tools, and complete large-scale migrations. Their approach focuses on enterprise customers with legacy codebases, achieving dramatic results like reducing 4-month migration projects to 3.5 days, while maintaining cost efficiency through intelligent retrieval rather than relying on large context windows.

Enterprise Challenges and Opportunities in Large-Scale LLM Deployment

Barclays

A senior leader in industry discusses the key challenges and opportunities in deploying LLMs at enterprise scale, highlighting the differences between traditional MLOps and LLMOps. The presentation covers critical aspects including cost management, infrastructure needs, team structures, and organizational adaptation required for successful LLM deployment, while emphasizing the importance of leveraging existing MLOps practices rather than completely reinventing the wheel.

Enterprise Document Data Extraction Using Agentic AI Workflows

Box

Box, an enterprise content platform serving over 115,000 customers including two-thirds of the Fortune 500, transformed their document data extraction capabilities by evolving from simple single-shot LLM prompting to sophisticated agentic AI workflows. Initially successful with basic document extraction using off-the-shelf models like GPT, Box encountered significant challenges when customers demanded extraction from complex 300-page documents with hundreds of fields, multilingual content, and poor OCR quality. The company implemented an agentic architecture using directed graphs that orchestrate multiple AI models, tools for validation and cross-checking, and iterative refinement processes. This approach dramatically improved accuracy and reliability while maintaining the flexibility to handle diverse document types and complex extraction requirements across their enterprise customer base.

Enterprise GenAI Implementation Strategies Across Industries

AstraZeneca / Adobe / Allianz Technology

A panel discussion featuring leaders from AstraZeneca, Adobe, and Allianz Technology sharing their experiences implementing GenAI in production. The case study covers how these enterprises prioritized use cases, managed legal considerations, and scaled AI adoption. Key successes included AstraZeneca's viral research assistant tool, Adobe's approach to legal frameworks for AI, and Allianz's code modernization efforts. The discussion highlights the importance of early legal engagement, focusing on impactful use cases, and treating AI implementation as a cultural transformation rather than just a tool rollout.

Enterprise GenAI Virtual Assistant for Operations and Underwriting Knowledge Access

Radian

Radian Group, a financial services company serving the mortgage and real estate ecosystem, developed the Radian Virtual Assistant (RVA) to address the challenge of inefficient information access among operations and underwriting teams who were spending excessive time searching through thousands of pages of documentation. The solution leverages AWS Bedrock Knowledge Base to create an enterprise-grade GenAI assistant that provides natural language querying capabilities across multiple knowledge sources including SharePoint and Confluence. The implementation achieved significant measurable results including 70% reduction in guideline triage time, 30% faster training ramp-up for new employees, and 96% positive user feedback, while maintaining enterprise security, governance, and scalability requirements through AWS services and role-based access controls.

Enterprise Infrastructure Challenges for Agentic AI Systems in Production

Various (Meta / Google / Monte Carlo / Azure)

A panel discussion featuring engineers from Meta, Google, Monte Carlo, and Microsoft Azure explores the fundamental infrastructure challenges that arise when deploying autonomous AI agents in production environments. The discussion reveals that agentic workloads differ dramatically from traditional software systems, requiring complete reimagining of reliability, security, networking, and observability approaches. Key challenges include non-deterministic behavior leading to incidents like chatbots selling cars for $1, massive scaling requirements as agents work continuously, and the need for new health checking mechanisms, semantic caching, and comprehensive evaluation frameworks to manage systems where 95% of outcomes are unknown unknowns.

Enterprise Knowledge Base Assistant Using Multi-Model GenAI Architecture

Accenture

Accenture developed Knowledge Assist, a generative AI solution for a public health sector client to transform how enterprise knowledge is accessed and utilized. The solution combines multiple foundation models through Amazon Bedrock to provide accurate, contextual responses to user queries in multiple languages. Using a hybrid intent approach and RAG architecture, the system achieved over 50% reduction in new hire training time and 40% reduction in query escalations while maintaining high accuracy and compliance requirements.

Enterprise Knowledge Management with LLMs: Morgan Stanley's GPT-4 Implementation

Morgan Stanley

Morgan Stanley's wealth management division successfully implemented GPT-4 to transform their vast institutional knowledge base into an instantly accessible resource for their financial advisors. The system processes hundreds of thousands of pages of investment strategies, market research, and analyst insights, making them immediately available through an internal chatbot. This implementation demonstrates how large enterprises can effectively leverage LLMs for knowledge management, with over 200 employees actively using the system daily. The case study highlights the importance of combining advanced AI capabilities with domain-specific content and human expertise, while maintaining appropriate internal controls and compliance measures in a regulated industry.

Enterprise LLM Application Development: GitHub Copilot's Journey

Github

GitHub shares their three-year journey of developing and scaling GitHub Copilot, their enterprise-grade AI code completion tool. The case study details their approach through three stages: finding the right problem space, nailing the product experience through rapid iteration and testing, and scaling the solution for enterprise deployment. The result was a successful launch that showed developers coding up to 55% faster and reporting 74% less frustration when coding.

Enterprise LLM Deployment with Multi-Cloud Data Platform Integration

Databricks

This presentation by Databricks' Product Management lead addresses the challenges large enterprises face when deploying LLMs into production, particularly around data governance, evaluation, and operational control. The talk centers on two primary case studies: FactSet's transformation of their query language translation system (improving from 59% to 85% accuracy while reducing latency from 15 to 6 seconds), and Databricks' internal use of Claude for automating analyst questionnaire responses. The solution involves decomposing complex prompts into multi-step agentic workflows, implementing granular governance controls across data and model access, and establishing rigorous evaluation frameworks to achieve production-grade reliability in high-risk enterprise environments.

Enterprise LLM Implementation Panel: Lessons from Box, Glean, Tyace, Security AI and Citibank

Various

A panel discussion featuring leaders from multiple enterprises sharing their experiences implementing LLMs in production. The discussion covers key challenges including data privacy, security, cost management, and enterprise integration. Speakers from Box discuss content management challenges, Glean covers enterprise search implementations, Tyace shares content generation experiences, Security AI addresses data safety, and Citibank provides CIO perspective on enterprise-wide AI deployment. The panel emphasizes the importance of proper data governance, security controls, and the need for systematic approach to move from POCs to production.

Enterprise LLMOps Platform with Focus on Model Customization and API Optimization

IBM

IBM's Watson X platform addresses enterprise LLMOps challenges by providing a comprehensive solution for model access, deployment, and customization. The platform offers both open-source and proprietary models, focusing on specialized use cases like banking and insurance, while emphasizing API optimization for LLM interactions and robust evaluation capabilities. The case study highlights how enterprises are implementing LLMOps at scale with particular attention to data security, model evaluation, and efficient API design for LLM consumption.

Enterprise LLMOps: Development, Operations and Security Framework

Cisco

At Cisco, the challenge of integrating LLMs into enterprise-scale applications required developing new DevSecOps workflows and practices. The presentation explores how Cisco approached continuous delivery, monitoring, security, and on-call support for LLM-powered applications, showcasing their end-to-end model for LLMOps in a large enterprise environment.

Enterprise RAG System with Coveo Passage Retrieval and Amazon Bedrock Agents

Coveo

Coveo addresses the challenge of LLM accuracy and trustworthiness in enterprise environments by integrating their AI-Relevance Platform with Amazon Bedrock Agents. The solution uses Coveo's Passage Retrieval API to provide contextually relevant, permission-aware enterprise knowledge to LLMs through a two-stage retrieval process. This RAG implementation combines semantic and lexical search with machine learning-driven relevance tuning, unified indexing across multiple data sources, and enterprise-grade security to deliver grounded responses while maintaining data protection and real-time performance.

Enterprise RAG-Based Virtual Assistant with LLM Evaluation Pipeline

Santalucía Seguros

Santalucía Seguros implemented a GenAI-based Virtual Assistant to improve customer service and agent productivity in their insurance operations. The solution uses a RAG framework powered by Databricks and Microsoft Azure, incorporating MLflow for LLMOps and Mosaic AI Model Serving for LLM deployment. They developed a sophisticated LLM-based evaluation system that acts as a judge for quality assessment before new releases, ensuring consistent performance and reliability of the virtual assistant.

Enterprise-Grade Memory Agents for Patent Processing with Deep Lake

Activeloop

Activeloop developed a solution for processing and generating patents using enterprise-grade memory agents and their Deep Lake vector database. The system handles 600,000 annual patent filings and 80 million total patents, reducing the typical 2-4 week patent generation process through specialized AI agents for different tasks like claim search, abstract generation, and question answering. The solution combines vector search, lexical search, and their proprietary Deep Memory technology to improve information retrieval accuracy by 5-10% without changing the underlying vector search architecture.

Enterprise-Grade RAG System for Internal Knowledge Management

PDI

PDI Technologies, a global leader in convenience retail and petroleum wholesale, built PDIQ (PDI Intelligence Query), an AI-powered internal knowledge assistant to address the challenge of fragmented information across websites, Confluence, SharePoint, and other enterprise systems. The solution implements a custom Retrieval Augmented Generation (RAG) system on AWS using serverless technologies including Lambda, ECS, DynamoDB, S3, Aurora PostgreSQL, and Amazon Bedrock models (Nova Pro, Nova Micro, Nova Lite, and Titan Embeddings V2). The system features sophisticated document processing with image captioning, dynamic token management for chunking (70% content, 10% overlap, 20% summary), and role-based access control. PDIQ improved customer satisfaction scores, reduced resolution times, increased accuracy approval rates from 60% to 79%, and enabled cost-effective scaling through serverless architecture while supporting multiple business units with configurable data sources.

Enterprise-Scale AI Agent Deployment in Insurance

Wakam

Wakam, a European digital insurance leader with 250 employees across 5 countries, faced critical knowledge silos that hampered productivity across insurance operations, business development, customer service, and legal teams. After initially attempting to build custom AI chatbots in-house with their data science team, they pivoted to implementing Dust, a commercial AI agent platform, to unlock organizational knowledge trapped across Notion, SharePoint, Slack, and other systems. Through strategic executive sponsorship, comprehensive employee enablement, and empowering workers to build their own agents, Wakam achieved 70% employee adoption and deployed 136 AI agents within two months, resulting in a 50% reduction in legal contract analysis time and dramatic improvements in self-service data intelligence across the organization.

Enterprise-Scale AI-First Translation Platform with Agentic Workflows

Smartling

Smartling operates an enterprise-scale AI-first agentic translation delivery platform serving major corporations like Disney and IBM. The company addresses challenges around automation, centralization, compliance, brand consistency, and handling diverse content types across global markets. Their solution employs multi-step agentic workflows where different model functions validate each other's outputs, combining neural machine translation with large language models, RAG for accessing validated linguistic assets, sophisticated prompting, and automated post-editing for hyper-localization. The platform demonstrates measurable improvements in throughput (from 2,000 to 6,000-7,000 words per day), cost reduction (4-10x cheaper than human translation), and quality approaching 70% human parity for certain language pairs and content types, while maintaining enterprise requirements for repeatability, compliance, and brand voice consistency.

Enterprise-Scale Cloud Event Management with Generative AI for Operational Intelligence

Fidelity Investments

Fidelity Investments faced the challenge of managing massive volumes of AWS health events and support case data across 2,000+ AWS accounts and 5 million resources in their multi-cloud environment. They built CENTS (Cloud Event Notification Transport Service), an event-driven data pipeline that ingests, enriches, routes, and acts on AWS health and support data at scale. Building upon this foundation, they developed and published the MAKI (Machine Augmented Key Insights) framework using Amazon Bedrock, which applies generative AI to analyze support cases and health events, identify trends, provide remediation guidance, and enable agentic workflows for vulnerability detection and automated code fixes. The solution reduced operational costs by 57%, improved stakeholder engagement through targeted notifications, and enabled proactive incident prevention by correlating patterns across their infrastructure.

Enterprise-Scale Data Product AI Agent for Multi-Domain Knowledge Discovery

Bosch

Bosch, a global manufacturing and technology company with over 400,000 employees across 60+ countries, faced the challenge of accessing and understanding its vast distributed data ecosystem spanning automotive, consumer goods, power tools, and industrial equipment divisions. The company developed DPAI (Data Product AI Agent), an enterprise AI platform that enables natural language interaction with Bosch's data by combining a data mesh architecture, a centralized data marketplace, and generative AI capabilities. The solution integrates semantic understanding through ontologies, data catalogs, and Bosch-specific context to provide accurate, business-relevant answers across divisions. While still in development with an estimated one to two years until full completion, the platform demonstrates how large enterprises can overcome data fragmentation and contextual complexity to make organizational knowledge accessible through conversational AI.

Enterprise-Scale Deployment of AI Ambient Scribes Across Multiple Healthcare Systems

Memorial Sloan Kettering / McLeod Health / UCLA

This panel discussion features three major healthcare systems—McLeod Health, Memorial Sloan Kettering Cancer Center, and UCLA Health—discussing their experiences deploying generative AI-powered ambient clinical documentation (AI scribes) at scale. The organizations faced challenges in vendor evaluation, clinician adoption, and demonstrating ROI while addressing physician burnout and documentation burden. Through rigorous evaluation processes including randomized controlled trials, head-to-head vendor comparisons, and structured pilots, these systems successfully deployed AI scribes to hundreds to thousands of physicians. Results included significant reductions in burnout (20% at UCLA), improved patient satisfaction scores (5-6% increases at McLeod), time savings of 1.5-2 hours per day, and positive financial ROI through improved coding and RVU capture. Key learnings emphasized the importance of robust training, encounter-based pricing models, workflow integration, and managing expectations that AI scribes are not a universal solution for all specialties and clinicians.

Enterprise-Scale GenAI and Agentic AI Deployment in B2B Supply Chain Operations

Wesco

Wesco, a B2B supply chain and industrial distribution company, presents a comprehensive case study on deploying enterprise-grade AI applications at scale, moving from POC to production. The company faced challenges in transitioning from traditional predictive analytics to cognitive intelligence using generative AI and agentic systems. Their solution involved building a composable AI platform with proper governance, MLOps/LLMOps pipelines, and multi-agent architectures for use cases ranging from document processing and knowledge retrieval to fraud detection and inventory management. Results include deployment of 50+ use cases, significant improvements in employee productivity through "everyday AI" applications, and quantifiable ROI through transformational AI initiatives in supply chain optimization, with emphasis on proper observability, compliance, and change management to drive adoption.

Enterprise-Scale GenAI Infrastructure Template and Starter Framework

Microsoft

Microsoft developed a solution to address the challenge of repeatedly setting up GenAI projects in enterprise environments. The team created a reusable template and starter framework that automates infrastructure setup, pipeline configuration, and tool integration. This solution includes reference architecture, DevSecOps and LLMOps pipelines, and automated project initialization through a template-starter wizard, significantly reducing setup time and ensuring consistency across projects while maintaining enterprise security and compliance requirements.

Enterprise-Scale Healthcare LLM System for Unified Patient Journeys

John Snow Labs

John Snow Labs developed a comprehensive healthcare LLM system that integrates multimodal medical data (structured, unstructured, FHIR, and images) into unified patient journeys. The system enables natural language querying across millions of patient records while maintaining data privacy and security. It uses specialized healthcare LLMs for information extraction, reasoning, and query understanding, deployed on-premises via Kubernetes. The solution significantly improves clinical decision support accuracy and enables broader access to patient data analytics while outperforming GPT-4 in medical tasks.

Enterprise-Scale LLM Deployment with Licensed Content for Business Intelligence

Factiva

Factiva, a Dow Jones business intelligence platform, implemented a secure, enterprise-scale LLM solution for their content aggregation service. They developed "Smart Summaries" that allows natural language querying across their vast licensed content database of nearly 3 billion articles. The implementation required securing explicit GenAI licensing agreements from thousands of publishers, ensuring proper attribution and royalty tracking, and deploying a secure cloud infrastructure using Google's Gemini model. The solution successfully launched in November 2023 with 4,000 publishers, growing to nearly 5,000 publishers by early 2024.

Enterprise-Scale LLM Deployment with Self-Evolving Models and Graph-Based RAG

Writer

Writer, an enterprise AI company founded in 2020, has evolved from building basic transformer models to delivering full-stack GenAI solutions for Fortune 500 companies. They've developed a comprehensive approach to enterprise LLM deployment that includes their own Palmera model series, graph-based RAG systems, and innovative self-evolving models. Their platform focuses on workflow automation and "action AI" in industries like healthcare and financial services, achieving significant efficiency gains through a hybrid approach that combines both no-code interfaces for business users and developer tools for IT teams.

Enterprise-Scale LLM Integration into CRM Platform

Salesforce

Salesforce developed Einstein GPT, the first generative AI system for CRM, to address customer expectations for faster, personalized responses and automated tasks. The solution integrates LLMs across sales, service, marketing, and development workflows while ensuring data security and trust. The implementation includes features like automated email generation, content creation, code generation, and analytics, all grounded in customer-specific data with human-in-the-loop validation.

Enterprise-Scale LLM Platform with Multi-Model Support and Copilot Customization

Telus

Telus developed Fuel X, an enterprise-scale LLM platform that provides centralized management of multiple AI models and services. The platform enables creation of customized copilots for different use cases, with over 30,000 custom copilots built and 35,000 active users. Key features include flexible model switching, enterprise security, RAG capabilities, and integration with workplace tools like Slack and Google Chat. Results show significant impact, including 46% self-resolution rate for internal support queries and 21% reduction in agent interactions.

Enterprise-Scale Prompt Engineering Toolkit with Lifecycle Management and Production Integration

Uber

Uber developed a comprehensive prompt engineering toolkit to address the challenges of managing and deploying LLMs at scale. The toolkit provides centralized prompt template management, version control, evaluation frameworks, and production deployment capabilities. It includes features for prompt creation, iteration, testing, and monitoring, along with support for both offline batch processing and online serving. The system integrates with their existing infrastructure and supports use cases like rider name validation and support ticket summarization.

Enterprise-Wide AI Assistant Deployment for Collective Discovery

Prosus

Prosus, a global technology investment company serving a quarter of the world's population across 100+ countries, developed and deployed an internal AI assistant called Toqan.ai to enable collective discovery and exploration of generative AI capabilities across their organization. Starting with early LLM experiments in 2019-2021 using models like BERT and GPT-2, they conducted over 20 field experiments before launching a comprehensive chatbot accessible via Slack to approximately 13,000 employees across 24 companies. The assistant integrates over 20 models and tools including commercial and open-source LLMs, image generation, voice encoding, document processing, and code creation capabilities, with robust privacy guardrails. Results showed that over 81% of users reported productivity increases exceeding 5-10%, with 50% of usage devoted to engineering tasks and the remainder spanning diverse business functions. The platform reduced "Pinocchio" (hallucination) feedback from 10% to 1.5% through model improvements and user education, while enabling bottom-up use case discovery that graduated into production applications at multiple portfolio companies including learning assistants, conversational ordering systems, and coding mentors.

Enterprise-Wide Generative AI Implementation for Marketing Content Generation and Translation

Bosch

Bosch, a global industrial and consumer goods company, implemented a centralized generative AI platform called "Gen playground" to address their complex marketing content needs across 3,500+ websites and numerous social media channels. The solution enables their 430,000+ associates to create text content, generate images, and perform translations without relying on external agencies, significantly reducing costs and turnaround time from 6-12 weeks to near-immediate results while maintaining brand consistency and quality standards.

Enterprise-Wide LLM Assistant Deployment and Evolution Towards Fine-Tuned Models

Marsh McLennan

Marsh McLennan, a global professional services firm, implemented a comprehensive LLM-based assistant solution reaching 87% of their 90,000 employees worldwide, processing 25 million requests annually. Initially focused on productivity enhancement through API access and RAG, they evolved their strategy from using out-of-the-box models to incorporating fine-tuned models for specific tasks, achieving better accuracy than GPT-4 while maintaining cost efficiency. The implementation has conservatively saved over a million hours annually across the organization.

Enterprise-Wide LLM Framework for Manufacturing and Knowledge Management

Toyota

Toyota implemented a comprehensive LLMOps framework to address multiple production challenges, including battery manufacturing optimization, equipment maintenance, and knowledge management. The team developed a unified framework combining LangChain and LlamaIndex capabilities, with special attention to data ingestion pipelines, security, and multi-language support. Key applications include Battery Brain for manufacturing expertise, Gear Pal for equipment maintenance, and Project Cura for knowledge management, all showing significant operational improvements including reduced downtime and faster problem resolution.

Enterprise-Wide RAG Implementation with Amazon Q Business

Principal Financial

Principal Financial implemented Amazon Q Business to address challenges with scattered enterprise knowledge and inefficient search capabilities across multiple repositories. The solution integrated QnABot on AWS with Amazon Q Business to enable natural language querying of over 9,000 pages of work instructions. The implementation resulted in 84% accuracy in document retrieval, with 97% of queries receiving positive feedback and users reporting 50% reduction in some workloads. The project demonstrated successful scaling from proof-of-concept to enterprise-wide deployment while maintaining strict governance and security requirements.

Enterprise-Wide Virtual Assistant for Employee Knowledge Access

BNY Mellon

BNY Mellon implemented an LLM-based virtual assistant to help their 50,000 employees efficiently access internal information and policies across the organization. Starting with small pilot deployments in specific departments, they scaled the solution enterprise-wide using Google's Vertex AI platform, while addressing challenges in document processing, chunking strategies, and context-awareness for location-specific policies.

Error Handling in LLM Systems

Uber

This case study examines a common scenario in LLM systems where proper error handling and response validation is essential. The "Not Acceptable" error demonstrates the importance of implementing robust error handling mechanisms in production LLM applications to maintain system reliability and user experience.

Eval-Driven Development for AI Applications

Vercel

Vercel presents their approach to building and deploying AI applications through eval-driven development, moving beyond traditional testing methods to handle AI's probabilistic nature. They implement a comprehensive evaluation system combining code-based grading, human feedback, and LLM-based assessments to maintain quality in their v0 product, an AI-powered UI generation tool. This approach creates a positive feedback loop they call the "AI-native flywheel," which continuously improves their AI systems through data collection, model optimization, and user feedback.

Evaluating Context Compression Strategies for Long-Running AI Agent Sessions

Factory AI

Factory AI developed an evaluation framework to assess context compression strategies for AI agents working on extended software development tasks that generate millions of tokens across hundreds of messages. The company compared three approaches—their structured summarization method, OpenAI's compact endpoint, and Anthropic's built-in compression—using probe-based evaluation that tests factual retention, file tracking, task planning, and reasoning chains. Testing on over 36,000 production messages from debugging, code review, and feature implementation sessions, Factory's structured summarization approach scored 3.70 overall compared to 3.44 for Anthropic and 3.35 for OpenAI, demonstrating superior retention of technical details like file paths and error messages while maintaining comparable compression ratios.

Evaluating Long Context Performance in Legal AI Applications

Thomson Reuters

Thomson Reuters details their comprehensive approach to evaluating and deploying long-context LLMs in their legal AI assistant CoCounsel. They developed rigorous testing protocols to assess LLM performance with lengthy legal documents, implementing a multi-LLM strategy rather than relying on a single model. Through extensive benchmarking and testing, they found that using full document context generally outperformed RAG for most document-based legal tasks, leading to strategic decisions about when to use each approach in production.

Evaluating Product Image Integrity in AI-Generated Advertising Content

Microsoft

Microsoft worked with an advertising customer to enable 1:1 ad personalization while ensuring product image integrity in AI-generated content. They developed a comprehensive evaluation system combining template matching, Mean Squared Error (MSE), Peak Signal to Noise Ratio (PSNR), and Cosine Similarity to verify that AI-generated backgrounds didn't alter the original product images. The solution successfully enabled automatic verification of product image fidelity in AI-generated advertising materials.

Evaluation Driven Development for LLM Reliability at Scale

Dosu

Dosu, a company providing an AI teammate for software development and maintenance, implemented Evaluation Driven Development (EDD) to ensure reliability of their LLM-based product. As their system scaled to thousands of repositories, they integrated LangSmith for monitoring and evaluation, enabling them to identify failure modes, maintain quality, and continuously improve their AI assistant's performance through systematic testing and iteration.

Evaluation Patterns for Deep Agents in Production

Langchain

LangChain built and deployed four production applications powered by "Deep Agents" - stateful, long-running AI agents capable of complex tasks including coding, email assistance, and agent building. The challenge was developing comprehensive evaluation strategies for these agents that went beyond traditional LLM evaluation approaches. Their solution involved five key patterns: bespoke test logic for each datapoint with custom assertions, single-step evaluations for validating specific decision points, full agent turn testing for end-to-end behavior, multi-turn conversations with conditional logic to simulate realistic interactions, and proper environment setup with clean, reproducible test conditions. Using LangSmith's Pytest and Vitest integrations, they implemented flexible evaluation frameworks that could assess agent trajectories, final responses, and state artifacts while maintaining fast, debuggable test suites through techniques like API mocking and containerized environments.

Evaluation-Driven LLM Production Workflows with Morgan Stanley and Grab Case Studies

OpenAI

OpenAI's applied evaluation team presented best practices for implementing LLMs in production through two case studies: Morgan Stanley's internal document search system for financial advisors and Grab's computer vision system for Southeast Asian mapping. Both companies started with simple evaluation frameworks using just 5 initial test cases, then progressively scaled their evaluation systems while maintaining CI/CD integration. Morgan Stanley improved their RAG system's document recall from 20% to 80% through iterative evaluation and optimization, while Grab developed sophisticated vision fine-tuning capabilities for recognizing road signs and lane counts in Southeast Asian contexts. The key insight was that effective evaluation systems enable rapid iteration cycles and clear communication between teams and external partners like OpenAI for model improvement.

Evaluations Driven Development for Production LLM Applications

Anaconda

Anaconda developed a systematic approach called Evaluations Driven Development (EDD) to improve their AI coding assistant's performance through continuous testing and refinement. Using their in-house "llm-eval" framework, they achieved dramatic improvements in their assistant's ability to handle Python debugging tasks, increasing success rates from 0-13% to 63-100% across different models and configurations. The case study demonstrates how rigorous evaluation, prompt engineering, and automated testing can significantly enhance LLM application reliability in production.

Evolution from Centralized to Federated Generative AI Governance

Pictet AM

Pictet Asset Management faced the challenge of governing a rapidly proliferating landscape of generative AI use cases across marketing, compliance, investment research, and sales functions while maintaining regulatory compliance in the financial services industry. They initially implemented a centralized governance approach using a single AWS account with Amazon Bedrock, featuring a custom "Gov API" to track all LLM interactions. However, this architecture encountered resource limitations, cost allocation difficulties, and operational bottlenecks as the number of use cases scaled. The company pivoted to a federated model with decentralized execution but centralized governance, allowing individual teams to manage their own Bedrock services while maintaining cross-account monitoring and standardized guardrails. This evolution enabled better scalability, clearer cost ownership, and faster team iteration while preserving compliance and oversight capabilities.

Evolution from Monolithic to Task-Oriented LLM Pipelines in a Developer Assistant Product

Outropy

The case study details how Outropy evolved their LLM inference pipeline architecture while building an AI-powered assistant for engineering leaders. They started with simple pipelines for daily briefings and context-aware features, but faced challenges with context windows, relevance, and error cascades. The team transitioned from monolithic pipelines to component-oriented design, and finally to task-oriented pipelines using Temporal for workflow management. The product successfully scaled to 10,000 users and expanded from a Slack-only tool to a comprehensive browser extension.

Evolution from Open-Ended LLM Agents to Guided Workflows

Lindy.ai

Lindy.ai evolved from an open-ended LLM agent platform to a more structured workflow-based approach, demonstrating how constraining LLM behavior through visual workflows and rails leads to more reliable and usable AI agents. The company found that by moving away from free-form prompts to guided, step-by-step workflows, they achieved better reliability and user adoption while maintaining the flexibility to handle complex automation tasks like meeting summaries, email processing, and customer support.

Evolution from Task-Specific Models to Multi-Agent Orchestration Platform

AI21

AI21 Labs evolved their production AI systems from task-specific models (2022-2023) to RAG-as-a-Service, and ultimately to Maestro, a multi-agent orchestration platform. The company identified that while general-purpose LLMs demonstrated impressive capabilities, they weren't optimized for specific business use cases that enterprises actually needed, such as contextual question answering and summarization. AI21 developed smaller language models fine-tuned for specific tasks, wrapped them with pre- and post-processing operations (including hallucination filters), and eventually built a comprehensive RAG system when customers struggled to identify relevant context from large document corpora. The Maestro platform emerged to handle complex multi-hop queries by automatically breaking them into subtasks, parallelizing execution, and orchestrating multiple agents and tools, achieving dramatically improved quality with full traceability for enterprise requirements.

Evolution of AI Agents: From Manual Workflows to End-to-End Training

OpenAI

OpenAI's journey in developing agentic products showcases the evolution from manually designed workflows with LLMs to end-to-end trained agents. The company has developed three main agentic products - Deep Research, Operator, and Codeex CLI - each addressing different use cases from web research to code generation. These agents demonstrate how end-to-end training with reinforcement learning enables better error recovery and more natural interaction compared to traditional manually designed workflows.

Evolution of AI Systems and LLMOps from Research to Production: Infrastructure Challenges and Application Design

NVIDA / Lepton

This lecture transcript from Yangqing Jia, VP at NVIDIA and founder of Lepton AI (acquired by NVIDIA), explores the evolution of AI system design from an engineer's perspective. The talk covers the progression from research frameworks (Caffe, TensorFlow, PyTorch) to production AI infrastructure, examining how LLM applications are built and deployed at scale. Jia discusses the emergence of "neocloud" infrastructure designed specifically for AI workloads, the challenges of GPU cluster management, and practical considerations for building consumer and enterprise LLM applications. Key insights include the trade-offs between open-source and closed-source models, the importance of RAG and agentic AI patterns, infrastructure design differences between conventional cloud and AI-specific platforms, and the practical challenges of operating LLMs in production, including supply chain management for GPUs and cost optimization strategies.

Evolution of an Internal AI Platform from No-Code LLM Apps to Agentic Systems

Grab

Grab developed SpellVault, an internal no-code AI platform that evolved from a simple RAG-based LLM app builder into a sophisticated agentic system supporting thousands of apps across the organization. Initially designed to democratize AI access for non-technical users through knowledge integrations and plugins, the platform progressively incorporated advanced capabilities including workflow orchestration, ReAct agent execution, unified tool frameworks, and Model Context Protocol (MCP) compatibility. This evolution enabled SpellVault to transform from supporting static question-answering apps into powering dynamic AI agents capable of reasoning, acting, and interacting with internal and external systems, while maintaining its core mission of accessibility and ease of use.

Evolution of Code Assistant Integration in a Cloud Development Platform

Val Town

Val Town's journey in implementing and evolving code assistance features showcases the challenges and opportunities in productionizing LLMs for code generation. Through iterative improvements and fast-following industry innovations, they progressed from basic ChatGPT integration to sophisticated features including error detection, deployment automation, and multi-file code generation, while addressing key challenges like generation speed and accuracy.

Evolution of Code Evaluation Benchmarks: From Single-Line Completion to Full Codebase Translation

Cursor

This research presentation details four years of work developing evaluation methodologies for coding LLMs across varying time horizons, from second-level code completions to hour-long codebase translations. The speaker addresses critical challenges in evaluating production coding AI systems including data contamination, insufficient test suites, and difficulty calibration. Key solutions include LiveCodeBench's dynamic evaluation approach with periodically updated problem sets, automated test generation using LLM-driven approaches, and novel reward hacking detection systems for complex optimization tasks. The work demonstrates how evaluation infrastructure must evolve alongside model capabilities, incorporating intermediate grading signals, latency-aware metrics, and LLM-as-judge approaches to detect non-idiomatic coding patterns that pass traditional tests but fail real-world quality standards.

Evolution of Hermes V3: Building a Conversational AI Data Analyst

Swiggy

Swiggy transformed their basic text-to-SQL assistant Hermes into a sophisticated conversational AI analyst capable of contextual querying, agentic reasoning, and transparent explanations. The evolution from a simple English-to-SQL translator to an intelligent agent involved implementing vector-based prompt retrieval, conversational memory, agentic workflows, and explanation layers. These enhancements improved query accuracy from 54% to 93% while enabling natural language interactions, context retention across sessions, and transparent decision-making processes for business analysts and non-technical teams.

Evolution of LLM Integration in GitHub Copilot Development

Github

The case study details GitHub's journey in developing GitHub Copilot by working with OpenAI's large language models. Starting with GPT-3 experimentation in 2020, the team evolved from basic code generation testing to creating an interactive IDE integration. Through multiple iterations of model improvements, prompt engineering, and fine-tuning techniques, they enhanced the tool's capabilities, ultimately leading to features like multi-language support, context-aware suggestions, and the development of GitHub Copilot X.

Evolution of ML Platform to Support GenAI Infrastructure

Lyft

Lyft's journey of evolving their ML platform to support GenAI infrastructure, focusing on how they adapted their existing ML serving infrastructure to handle LLMs and built new components for AI operations. The company transitioned from self-hosted models to vendor APIs, implemented comprehensive evaluation frameworks, and developed an AI assistants interface, while maintaining their established ML lifecycle principles. This evolution enabled various use cases including customer support automation and internal productivity tools.

Evolving a Conversational AI Platform for Production LLM Applications

AirBnB

AirBnB evolved their Automation Platform from a static workflow-based conversational AI system to a comprehensive LLM-powered platform. The new version (v2) combines traditional workflows with LLM capabilities, introducing features like Chain of Thought reasoning, robust context management, and a guardrails framework. This hybrid approach allows them to leverage LLM benefits while maintaining control over sensitive operations, ultimately enabling customer support agents to work more efficiently while ensuring safe and reliable AI interactions.

Evolving Agent Architecture Through Model Capability Improvements

Aomni

David from Aomni discusses how their company evolved from building complex agent architectures with multiple guardrails to simpler, more model-centric approaches as LLM capabilities improved. The company provides AI agents for revenue teams, helping automate research and sales workflows while keeping humans in the loop for customer relationships. Their journey demonstrates how LLMOps practices need to continuously adapt as model capabilities expand, leading to removal of scaffolding and simplified architectures.

Evolving GitHub Copilot through LLM Experimentation and User-Centered Design

Github

GitHub's evolution of GitHub Copilot showcases their systematic approach to integrating LLMs across the development lifecycle. Starting with experimental access to GPT-4, the GitHub Next team developed and tested various AI-powered features including Copilot Chat, Copilot for Pull Requests, Copilot for Docs, and Copilot for CLI. Through iterative development and user feedback, they learned key lessons about AI tool design, emphasizing the importance of predictability, tolerability, steerability, and verifiability in AI interactions.

Evolving GitHub Copilot with LLM Experimentation Across the Developer Lifecycle

GitHub

GitHub details their internal experimentation process with GPT-4 and other large language models to extend GitHub Copilot beyond code completion into multiple stages of the software development lifecycle. The GitHub Next research team received early access to GPT-4 and prototyped numerous AI-powered features including Copilot for Pull Requests, Copilot for Docs, Copilot for CLI, and GitHub Copilot Chat. Through iterative experimentation and internal testing with GitHub employees, the team discovered that user experience design, particularly how AI suggestions are presented and allow for developer control, is as critical as model accuracy for successful adoption. The experiments resulted in technical previews released in March 2023 that demonstrated AI integration across documentation, command-line interfaces, and pull request workflows, with key learnings around making AI outputs predictable, tolerable, steerable, and verifiable.

Evolving LLMOps Architecture for Enterprise Supplier Discovery

Various

A detailed case study of implementing LLMs in a supplier discovery product at Scoutbee, evolving from simple API integration to a sophisticated LLMOps architecture. The team tackled challenges of hallucinations, domain adaptation, and data quality through multiple stages: initial API integration, open-source LLM deployment, RAG implementation, and finally a comprehensive data expansion phase. The result was a production-ready system combining knowledge graphs, Chain of Thought prompting, and custom guardrails to provide reliable supplier discovery capabilities.

Evolving ML Infrastructure for Production Systems: From Traditional ML to LLMs

Doordash

A comprehensive overview of ML infrastructure evolution and LLMOps practices at major tech companies, focusing on Doordash's approach to integrating LLMs alongside traditional ML systems. The discussion covers how ML infrastructure needs to adapt for LLMs, the importance of maintaining guard rails, and strategies for managing errors and hallucinations in production systems, while balancing the trade-offs between traditional ML models and LLMs in production environments.

Evolving Quality Control AI Agents with LangGraph

Rexera

Rexera transformed their real estate transaction quality control process by evolving from single-prompt LLM checks to a sophisticated LangGraph-based solution. The company initially faced challenges with single-prompt LLMs and CrewAI implementations, but by migrating to LangGraph, they achieved significant improvements in accuracy, reducing false positives from 8% to 2% and false negatives from 5% to 2% through more precise control and structured decision paths.

Expert-in-the-Loop Generative AI for Creative Content at Scale

Stitch Fix

Stitch Fix implemented expert-in-the-loop generative AI systems to automate creative content generation at scale, specifically for advertising headlines and product descriptions. The company leveraged GPT-3 with few-shot learning for ad headlines, combining latent style understanding and word embeddings to generate brand-aligned content. For product descriptions, they advanced to fine-tuning pre-trained language models on expert-written examples to create high-quality descriptions for hundreds of thousands of inventory items. The hybrid approach achieved significant time savings for copywriters who review and edit AI-generated content rather than writing from scratch, while blind evaluations showed AI-generated product descriptions scoring higher than human-written ones in quality assessments.

Expert-in-the-Loop Generative AI for Marketing Content and Product Descriptions

Stitch Fix

Stitch Fix implemented generative AI solutions to automate the creation of ad headlines and product descriptions for their e-commerce platform. The problem was the time-consuming and costly nature of manually writing marketing copy and product descriptions for hundreds of thousands of inventory items. Their solution combined GPT-3 with an "expert-in-the-loop" approach, using few-shot learning for ad headlines and fine-tuning for product descriptions, while maintaining human copywriter oversight for quality assurance. The results included significant time savings for copywriters, scalable content generation without sacrificing quality, and product descriptions that achieved higher quality scores than human-written alternatives in blind evaluations.

Federal Government AI Platform Adoption and Scalability Initiatives

Various

The U.S. federal government agencies are working to move AI applications from pilots to production, focusing on scalable and responsible deployment. The Department of Energy (DOE) has implemented Energy GPT using open models in their environment, while the Department of State is utilizing LLMs for diplomatic cable summarization. The U.S. Navy's Project AMMO showcases successful MLOps implementation, reducing model retraining time from six months to one week for underwater vehicle operations. Agencies are addressing challenges around budgeting, security compliance, and governance while ensuring user-friendly AI implementations.

Field AI Assistant for Sales Team Automation

Databricks

Databricks developed an AI-powered assistant to transform their sales operations by automating routine tasks and improving data access. The Field AI Assistant, built on their Mosaic AI agent framework, integrates multiple data sources including their Lakehouse, CRM, and collaboration platforms to provide conversational interactions, automate document creation, and execute actions based on data insights. The solution streamlines workflows for sales teams, allowing them to focus on high-value activities while ensuring proper governance and security measures.

Financial Transaction Categorization at Scale Using LLMs and Custom Embeddings

Mercado Libre

Mercado Libre (MELI) faced the challenge of categorizing millions of financial transactions across Latin America in multiple languages and formats as Open Finance unlocked access to customer financial data. Starting with a brittle regex-based system in 2021 that achieved only 60% accuracy and was difficult to maintain, they evolved through three generations: first implementing GPT-3.5 Turbo in 2023 to achieve 80% accuracy with 75% cost reduction, then transitioning to GPT-4o-mini in 2024, and finally developing custom BERT-based semantic embeddings trained on regional financial text to reach 90% accuracy with an additional 30% cost reduction. This evolution enabled them to scale from processing tens of millions of transactions per quarter to tens of millions per week, while enabling near real-time categorization that powers personalized financial insights across their ecosystem.

Fine-tuned LLM for Message Content Moderation and Trust & Safety

Thumbtack

Thumbtack implemented a fine-tuned LLM solution to enhance their message review system for detecting policy violations in customer-professional communications. After experimenting with prompt engineering and finding it insufficient (AUC 0.56), they successfully fine-tuned an LLM model achieving an AUC of 0.93. The production system uses a cost-effective two-tier approach: a CNN model pre-filters messages, with only suspicious ones (20%) processed by the LLM. Using LangChain for deployment, the system has processed tens of millions of messages, improving precision by 3.7x and recall by 1.5x compared to their previous system.

Fine-tuning and Deploying LLMs for Customer Service Contact Centers

Swisscom

Swisscom, a leading telecommunications provider in Switzerland, partnered with AWS to deploy fine-tuned large language models in their customer service contact centers to enable personalized, fast, and efficient customer interactions. The problem they faced was providing 24/7 customer service with high accuracy, low latency (critical for voice interactions), and the ability to handle hundreds of requests per minute during peak times while maintaining control over the model lifecycle. Their solution involved using AWS SageMaker to fine-tune a smaller LLM (Llama 3.1 8B) using synthetic data generated by a larger teacher model, implementing LoRA for efficient training, and deploying the model with infrastructure-as-code using AWS CDK. The results achieved median latency below 250 milliseconds in production, accuracy comparable to larger models, cost-efficient scaling with hourly infrastructure charging instead of per-token pricing, and successful handling of 50% of production traffic with the ability to scale for unexpected peaks.

Fine-Tuning and Multi-Stage Model Optimization for Financial AI Agents

Robinhood Markets

Robinhood Markets developed a sophisticated LLMOps platform to deploy AI agents serving millions of users across multiple use cases including customer support, content generation (Cortex Digest), and code generation (custom indicators and scans). To address the "generative AI trilemma" of balancing cost, quality, and latency in production, they implemented a hierarchical tuning approach starting with prompt optimization, progressing to trajectory tuning with dynamic few-shot examples, and culminating in LoRA-based fine-tuning. Their CX AI agent achieved over 50% latency reduction (from 3-6 seconds to under 1 second) while maintaining quality parity with frontier models, supported by a comprehensive three-layer evaluation system combining LLM-as-judge, human feedback, and task-specific metrics.

Fine-Tuning and Quantizing LLMs for Dynamic Attribute Extraction

Mercari

Mercari tackled the challenge of extracting dynamic attributes from user-generated marketplace listings by fine-tuning a 2B parameter LLM using QLoRA. The team successfully created a model that outperformed GPT-3.5-turbo while being 95% smaller and 14 times more cost-effective. The implementation included careful dataset preparation, parameter efficient fine-tuning, and post-training quantization using llama.cpp, resulting in a production-ready model with better control over hallucinations.

Fine-tuning and Scaling LLMs for Search Relevance Prediction

Faire

Faire, an e-commerce marketplace, tackled the challenge of evaluating search relevance at scale by transitioning from manual human labeling to automated LLM-based assessment. They first implemented a GPT-based solution and later improved it using fine-tuned Llama models. Their best performing model, Llama3-8b, achieved a 28% improvement in relevance prediction accuracy compared to their previous GPT model, while significantly reducing costs through self-hosted inference that can handle 70 million predictions per day using 16 GPUs.

Fine-Tuning LLMs for Multi-Agent Orchestration in Code Generation

Cosine

Cosine, a company building enterprise coding agents, faced the challenge of deploying high-performance AI systems in highly constrained environments including on-premise and air-gapped deployments where large frontier models were not viable. They developed a multi-agent architecture using specialized orchestrator and worker models, leveraging model distillation, supervised fine-tuning, preference optimization, and reinforcement fine-tuning to create smaller models that could match or exceed the performance of much larger models. The result was a 31% performance increase on the SWE-bench Freelancer benchmark, 3X latency improvement, 60% reduction in GPU footprint, and 20% fewer errors in generated code, all while operating on as few as 4 H100 GPUs and maintaining full deployment flexibility across cloud, VPC, and on-premise environments.

Fine-tuning Mistral 7B for Multilingual Defense Intelligence Sentiment Analysis

Vannevar Labs

Vannevar Labs needed to improve their sentiment analysis capabilities for defense intelligence across multiple languages, finding that GPT-4 provided insufficient accuracy (64%) and high costs. Using Databricks Mosaic AI, they successfully fine-tuned a Mistral 7B model on domain-specific data, achieving 76% accuracy while reducing latency by 75%. The entire process from development to deployment took only two weeks, enabling efficient processing of multilingual content for defense-related applications.

Five Critical Lessons for LLM Production Deployment

Amberflo

A former Apple messaging team lead shares five crucial insights for deploying LLMs in production, based on real-world experience. The presentation covers essential aspects including handling inappropriate queries, managing prompt diversity across different LLM providers, dealing with subtle technical changes that can impact performance, understanding the current limitations of function calling, and the critical importance of data quality in LLM applications.

Forward Deployed Engineering for Enterprise LLM Deployments

OpenAI

OpenAI's Forward Deployed Engineering (FDE) team embeds with enterprise customers to solve high-value problems using LLMs, aiming for production deployments that generate tens of millions to billions in value. The team works on complex use cases across industries—from wealth management at Morgan Stanley to semiconductor verification and automotive supply chain optimization—building custom solutions while extracting generalizable patterns that inform OpenAI's product development. Through an "eval-driven development" approach combining LLM capabilities with deterministic guardrails, the FDE team has grown from 2 to 52 engineers in 2025, successfully bridging the gap between AI capabilities and enterprise production requirements while maintaining focus on zero-to-one problem solving rather than long-term consulting engagements.

Forward Deployed Engineering: Bringing Enterprise LLM Applications to Production

OpenAI

OpenAI's Forward Deployed Engineering (FDE) team, led by Colin Jarvis, embeds with enterprise customers to solve high-value problems using LLMs and deliver production-grade AI applications. The team focuses on problems worth tens of millions to billions in value, working with companies across industries including finance (Morgan Stanley), manufacturing (semiconductors, automotive), telecommunications (T-Mobile, Klarna), and others. By deeply understanding customer domains, building evaluation frameworks, implementing guardrails, and iterating with users over months, the FDE team achieves 20-50% efficiency improvements and high adoption rates (98% at Morgan Stanley). The approach emphasizes solving hard, novel problems from zero-to-one, extracting learnings into reusable products and frameworks (like Swarm and Agent Kit), then scaling solutions across the market while maintaining strategic focus on product development over services revenue.

Four Critical Lessons from Building 50+ Global Chatbots: A Practitioner's Guide to Real-World Implementation

Campfire AI

Drawing from experience building over 50 chatbots across five continents, this case study outlines four crucial lessons for successful chatbot implementation. Key insights include treating chatbot projects as AI initiatives rather than traditional IT projects, anticipating out-of-scope queries through "99-intents", organizing intents hierarchically for more natural interactions, planning for unusual user expressions, and eliminating unhelpful "I don't understand" responses. The study emphasizes that successful chatbots require continuous optimization, aiming for 90-95% recognition rates for in-scope questions, while maintaining effective fallback mechanisms for edge cases.

From Mega-Prompts to Production: Lessons Learned Scaling LLMs in Enterprise Customer Support

GoDaddy

GoDaddy has implemented large language models across their customer support infrastructure, particularly in their Digital Care team which handles over 60,000 customer contacts daily through messaging channels. Their journey implementing LLMs for customer support revealed several key operational insights: the need for both broad and task-specific prompts, the importance of structured outputs with proper validation, the challenges of prompt portability across models, the necessity of AI guardrails for safety, handling model latency and reliability issues, the complexity of memory management in conversations, the benefits of adaptive model selection, the nuances of implementing RAG effectively, optimizing data for RAG through techniques like Sparse Priming Representations, and the critical importance of comprehensive testing approaches. Their experience demonstrates both the potential and challenges of operationalizing LLMs in a large-scale enterprise environment.

From MVP to Production: LLM Application Evaluation and Deployment Challenges

Various

A panel discussion featuring experts from Databricks, Last Mile AI, Honeycomb, and other companies discussing the challenges of moving LLM applications from MVP to production. The discussion focuses on key challenges around user feedback collection, evaluation methodologies, handling domain-specific requirements, and maintaining up-to-date knowledge in production LLM systems. The experts share experiences on implementing evaluation pipelines, dealing with non-deterministic outputs, and establishing robust observability practices.

From Pilot to Profit: Three Enterprise GenAI Case Studies in Manufacturing, Aviation, and Telecommunications

Various

A comprehensive analysis of three enterprise GenAI implementations showcasing the journey from pilot to profit. The cases cover a top 10 automaker's use of GenAI for manufacturing maintenance, an aviation entertainment company's predictive maintenance system, and a telecom provider's sales automation solution. Each case study reveals critical "hidden levers" for successful GenAI deployment: adoption triggers, lean workflows, and revenue accelerators. The analysis demonstrates that while GenAI projects typically cost between $200K to $1M and take 15-18 months to achieve ROI, success requires careful attention to implementation details, user adoption, and business process integration.

From Simple RAG to Multi-Agent Architecture for Document Data Extraction

Box

Box evolved their document data extraction system from a simple single-model approach to a sophisticated multi-agent architecture to handle enterprise-scale unstructured data processing. The initial straightforward approach of preprocessing documents and feeding them to an LLM worked well for basic use cases but failed when customers presented complex challenges like 300-page documents, poor OCR quality, hundreds of extraction fields, and confidence scoring requirements. By redesigning the system using an agentic approach with specialized sub-agents for different tasks, Box achieved better accuracy, easier system evolution, and improved maintainability while processing millions of pages for enterprise customers.

From SMS to AI: Lessons from 5 Years of Chatbot Development for Social Impact

ONE

ONE's journey deploying chatbots for advocacy work from 2018-2024 provides valuable insights into operating messaging systems at scale for social impact. Starting with a shift from SMS to Facebook Messenger, and later expanding to WhatsApp, ONE developed two chatbots reaching over 38,000 users across six African countries. The project demonstrated both the potential and limitations of non-AI chatbots, achieving 17,000+ user actions while identifying key challenges in user acquisition costs ($0.17-$1.77 per user), retention, and re-engagement restrictions. Their experience highlights the importance of starting small, continuous user testing, marketing investment planning, systematic re-engagement strategies, and organization-wide integration of chatbot initiatives.

Gen AI On-Call Copilot for Internal Support

Uber

Uber faced a challenge managing approximately 45,000 monthly questions across internal Slack support channels, creating productivity bottlenecks for both users waiting for responses and on-call engineers fielding repetitive queries. To address this, Uber built Genie, an on-call copilot using Retrieval-Augmented Generation (RAG) to automatically answer user questions by retrieving information from internal documentation sources including their internal wiki (Engwiki), internal Stack Overflow, and engineering requirement documents. Since launching in September 2023, Genie has expanded to 154 Slack channels, answered over 70,000 questions with a 48.9% helpfulness rate, and is estimated to have saved approximately 13,000 engineering hours.

GenAI Agent for Partner-Guest Messaging Automation

Booking.com

Booking.com developed a GenAI agent to assist accommodation partners in responding to guest inquiries more efficiently. The problem was that manual responses through their messaging platform were time-consuming, especially during busy periods, potentially leading to delayed responses and lost bookings. The solution involved building a tool-calling agent using LangGraph and GPT-4 Mini that can suggest relevant template responses, generate custom free-text answers, or abstain from responding when appropriate. The system includes guardrails for PII redaction, retrieval tools using embeddings for template matching, and access to property and reservation data. Early results show the system handles tens of thousands of daily messages, with pilots demonstrating 70% improvement in user satisfaction, reduced follow-up messages, and faster response times.

GenAI Agent for Partner-Guest Messaging in Travel Accommodation

Booking

Booking.com developed a GenAI agent to assist accommodation partners in responding to guest inquiries more efficiently. The problem addressed was the manual effort required by partners to search for and select response templates, particularly during busy periods, which could lead to delayed responses and potential booking cancellations. The solution is a tool-calling agent built with LangGraph and GPT-4 Mini that autonomously decides whether to suggest a predefined template, generate a custom response, or refrain from answering. The system retrieves relevant templates using semantic search with embeddings stored in Weaviate, accesses property and reservation data via GraphQL, and implements guardrails for PII redaction and topic filtering. Deployed as a microservice on Kubernetes with FastAPI, the agent processes tens of thousands of daily messages and achieved a 70% increase in user satisfaction in live pilots, along with reduced follow-up messages and faster response times.

GenAI Governance in Practice: Access Control, Data Quality, and Monitoring for Production LLM Systems

Xomnia

Martin Der, a data scientist at Xomnia, presents practical approaches to GenAI governance addressing the challenge that only 5% of GenAI projects deliver immediate ROI. The talk focuses on three key pillars: access and control (enabling self-service prototyping through tools like Open WebUI while avoiding shadow AI), unstructured data quality (detecting contradictions and redundancies in knowledge bases through similarity search and LLM-based validation), and LLM ops monitoring (implementing tracing platforms like LangFuse and creating dynamic golden datasets for continuous testing). The solutions include deploying Chrome extensions for workflow integration, API gateways for centralized policy enforcement, and developing a knowledge agent called "Genie" for internal use cases across telecom, healthcare, logistics, and maritime industries.

GenAI Transformation of Manufacturing and Supply Chain Operations

Jabil

Jabil, a global manufacturing company with $29B in revenue and 140,000 employees, implemented Amazon Q to transform their manufacturing and supply chain operations. They deployed GenAI solutions across three key areas: shop floor operations assistance (Ask Me How), procurement intelligence (PIP), and supply chain management (V-command). The implementation helped reduce downtime, improve operator efficiency, enhance procurement decisions, and accelerate sales cycles for their supply chain services. The company established robust governance through AI and GenAI councils while ensuring responsible AI usage and clear value creation.

GenAI-Powered Accessory Recommendations for Large-Scale E-commerce Catalog

Target

Target's Product Recommendations Team developed GRAM (GenAI-based Related Accessory Model) to address the challenge of recommending appropriate accessories across their vast Electronics and Home categories. The system uses LLMs to automatically analyze product attributes, assign importance weights to different attribute combinations, and generate aesthetic matches that consider color harmony and stylistic coherence. By incorporating human-in-the-loop processes with site merchant insights, the solution balances algorithmic recommendations with cross-category expertise. An A/B test conducted in February 2025 showed approximately 11% increase in interaction rate, 12% increase in display-to-conversion rates, and over 9% growth in attributable demand. The model was fully rolled out to production in April 2025.

GenAI-Powered Automated Resource Leak Fixing in Java Codebases

Uber

Uber developed FixrLeak, a generative AI-based framework to automate the detection and repair of resource leaks in their Java codebase. Resource leaks—where files, database connections, or streams aren't properly released—cause performance degradation and system failures, and while tools like SonarQube detect them, fixing remains manual and error-prone. FixrLeak combines Abstract Syntax Tree (AST) analysis with generative AI (specifically OpenAI ChatGPT-4O) to produce accurate, idiomatic fixes following Java best practices like try-with-resources. When tested on 124 resource leaks in Uber's codebase, FixrLeak successfully automated fixes for 93 out of 102 eligible cases (after filtering out deprecated code and complex inter-procedural leaks), significantly reducing manual effort and improving code quality at scale.

GenAI-Powered Document Classification for Community Management

Associa

Associa, North America's largest community management company managing 48 million documents across 26 TB of data, faced significant operational inefficiencies due to manual document classification processes that consumed employee hours and created bottlenecks. Collaborating with the AWS Generative AI Innovation Center, Associa built a generative AI-powered document classification system using Amazon Bedrock and the GenAI IDP Accelerator. The solution achieved 95% classification accuracy across eight document types at an average cost of 0.55 cents per document, using Amazon Nova Pro with a first-page-only approach combined with OCR and image inputs. The system processes documents automatically, integrates seamlessly into existing workflows, and delivers substantial cost savings while reducing manual classification effort and improving operational efficiency.

GenAI-Powered Dual-Loop System for Automated Documentation Management

Intuit

Intuit developed a sophisticated dual-loop GenAI system to address challenges in technical documentation management. The system combines an inner loop that continuously improves individual documents through analysis, enhancement, and augmentation, with an outer loop that leverages embeddings and semantic search to make knowledge more accessible. This approach not only improves document quality and maintains consistency but also enables context-aware information retrieval and synthesis.

GenAI-Powered Invoice Document Processing and Automation

Uber

Uber faced significant challenges processing a high volume of invoices daily from thousands of global suppliers, with diverse formats, 25+ languages, and varying templates requiring substantial manual intervention. The company developed TextSense, a GenAI-powered document processing platform that leverages OCR, computer vision, and large language models (specifically OpenAI GPT-4 after evaluating multiple options including fine-tuned Llama 2 and Flan T5) to automate invoice data extraction. The solution achieved 90% overall accuracy, reduced manual processing by 2x, cut average handling time by 70%, and delivered 25-30% cost savings compared to manual processes, while providing a scalable, configuration-driven platform adaptable to diverse document types.

GenAI-Powered Personalized Homepage Carousels for Food Delivery

Doordash

DoorDash developed a GenAI-powered system to create personalized store carousels on their homepage, addressing limitations in their previous heuristic-based content system that featured only 300 curated carousels with insufficient diversity and overly broad categories. The new system leverages LLMs to analyze comprehensive consumer profiles and generate unique carousel titles with metadata for each user, then uses embedding-based retrieval to populate carousels with relevant stores and dishes. Early A/B tests in San Francisco and Manhattan showed double-digit improvements in click rates, improved conversion rates and homepage relevance metrics, and increased merchant discovery, particularly benefiting small and mid-sized businesses.

GenAI-Powered Work Order Management System POC

NTT Data

An international infrastructure company partnered with NTT Data to evaluate whether GenAI could improve their work order management system that handles 500,000+ annual maintenance requests. The POC focused on automating classification, urgency assessment, and special handling requirements identification. Using a privately hosted LLM with company-specific knowledge base, the solution demonstrated improved accuracy and consistency in work order processing compared to the manual approach, while providing transparent reasoning for classifications.

Generating Production-Ready MCP Servers from OpenAPI Specifications

SpeakEasy

SpeakEasy tackled the challenge of enabling AI agents to interact with existing APIs by developing a tool that automatically generates Model Context Protocol (MCP) servers from OpenAPI documents. The company identified critical issues when generating over 50 production MCP servers for customers, including tool explosion (too many exposed operations), verbose descriptions consuming excessive tokens, complex data formats confusing LLMs, and inadequate access controls. Their solution involved a three-layer optimization approach: pruning OpenAPI documents with custom extensions, building intelligence into the generator to handle complex formats and streaming responses, and providing customization files for precise tool control. The result is production-ready MCP servers that balance LLM context window constraints with functional completeness, using techniques like scope-based access control, automatic data transformation, and optimized descriptions.

Generative AI Assistant for Agricultural Field Trial Analysis

Agmatix

Agmatix developed Leafy, a generative AI assistant powered by Amazon Bedrock, to streamline agricultural field trial analysis. The solution addresses challenges in analyzing complex trial data by enabling agronomists to query data using natural language, automatically selecting appropriate visualizations, and providing insights. Using Amazon Bedrock with Anthropic Claude, along with AWS services for data pipeline management, the system achieved 20% improved efficiency, 25% better data integrity, and tripled analysis throughput.

Generative AI Contact Center Solution with Amazon Bedrock and Claude

DoorDash

DoorDash implemented a generative AI-powered self-service contact center solution using Amazon Bedrock, Amazon Connect, and Anthropic's Claude to handle hundreds of thousands of daily support calls. The solution leverages RAG with Knowledge Bases for Amazon Bedrock to provide accurate responses to Dasher inquiries, achieving response latency of 2.5 seconds or less. The implementation reduced development time by 50% and increased testing capacity 50x through automated evaluation frameworks.

Generative AI Customer Service Agent Assist with RAG Implementation

Newday

NewDay, a UK financial services company handling 2.5 million customer calls annually, developed NewAssist, a real-time generative AI assistant to help customer service agents quickly find answers from nearly 200 knowledge articles. Starting as a hackathon project, the solution evolved from a voice assistant concept to a chatbot implementation using Amazon Bedrock and Claude 3 Haiku. Through iterative experimentation and custom data processing, the team achieved over 90% accuracy, reducing answer retrieval time from 90 seconds to 4 seconds while maintaining costs under $400 per month using a serverless AWS architecture.

Generative AI for Secondary Manuscript Generation in Life Sciences

Sorcero

Sorcero, a life sciences AI company, addresses the challenge of generating secondary manuscripts (particularly patient-reported outcomes manuscripts) from clinical study reports, a process that traditionally takes months and is costly, inconsistent, and delays patient access to treatments. Their solution uses generative AI to create foundational manuscript drafts within hours from source materials including clinical study reports, statistical analysis plans, and protocols. The system emphasizes trust, traceability, and regulatory compliance through rigorous validation frameworks, industry benchmarks (like CONSORT guidelines), comprehensive audit trails, and human oversight. The approach generates complete manuscripts with proper structure, figures, and tables while ensuring all assertions are traceable to source data, hallucinations are controlled, and industry standards are met.

Generative AI Implementation in Banking Customer Service and Knowledge Management

Various

Multiple banks, including Discover Financial Services, Scotia Bank, and others, share their experiences implementing generative AI in production. The case study focuses particularly on Discover's implementation of gen AI for customer service, where they achieved a 70% reduction in agent search time by using RAG and summarization for procedure documentation. The implementation included careful consideration of risk management, regulatory compliance, and human-in-the-loop validation, with technical writers and agents providing continuous feedback for model improvement.

Generative AI Integration in Financial Crime Detection Platform

NICE Actimize

NICE Actimize implemented generative AI into their financial crime detection platform "Excite" to create an automated machine learning model factory and enhance MLOps capabilities. They developed a system that converts natural language requests into analytical artifacts, helping analysts create aggregations, features, and models more efficiently. The solution includes built-in guardrails and validation pipelines to ensure safe deployment while significantly reducing time to market for analytical solutions.

Generative AI-Powered Enhancements for Streaming Video Platform

Amazon

Amazon Prime Video addresses the challenge of differentiating their streaming platform in a crowded market by implementing multiple generative AI features powered by AWS services, particularly Amazon Bedrock. The solution encompasses personalized content recommendations, AI-generated episode recaps (X-Ray Recaps), real-time sports analytics insights, dialogue enhancement features, and automated video content understanding with metadata extraction. These implementations have resulted in improved content discoverability, enhanced viewer engagement through features that prevent spoilers while keeping audiences informed, deeper sports broadcast insights, increased accessibility through AI-enhanced audio, and enriched metadata for hundreds of thousands of marketing assets, collectively improving the overall streaming experience and reducing time spent searching for content.

Generative AI-Powered Intelligent Document Processing for Healthcare Operations

Myriad Genetics

Myriad Genetics, a genetic testing and precision medicine provider, faced challenges processing thousands of healthcare documents daily with their existing Amazon Comprehend and Amazon Textract solution, which cost $15,000 monthly per business unit with 8.5-minute processing times and required manual information extraction involving up to 10 full-time employees. Partnering with AWS Generative AI Innovation Center, they deployed the open-source GenAI IDP Accelerator using Amazon Bedrock with Amazon Nova models, implementing advanced prompt engineering techniques including AI-driven prompt engineering, negative prompting, few-shot learning, and chain-of-thought reasoning. The solution increased classification accuracy from 94% to 98%, reduced classification costs by 77%, decreased processing time by 80% (from 8.5 to 1.5 minutes), and automated key information extraction at 90% accuracy, projected to save $132K annually while reducing prior authorization processing time by 2 minutes per submission.

Generative AI-Powered Knowledge Sharing System for Travel Expertise

Hotelplan Suisse

Hotelplan Suisse implemented a generative AI solution to address the challenge of sharing travel expertise across their 500+ travel experts. The system integrates multiple data sources and uses semantic search to provide instant, expert-level travel recommendations to sales staff. The solution reduced response time from hours to minutes and includes features like chat history management, automated testing, and content generation capabilities for marketing materials.

GitHub Copilot Deployment at Scale: Enhancing Developer Productivity

Mercado Libre

Mercado Libre, Latin America's largest e-commerce platform, implemented GitHub Copilot across their development team of 9,000+ developers to address the need for more efficient development processes. The solution resulted in approximately 50% reduction in code writing time, improved developer satisfaction, and enhanced productivity by automating repetitive tasks. The implementation was part of a broader GitHub Enterprise strategy that includes security features and automated workflows.

GitHub Copilot Integration for Enhanced Developer Productivity

Duolingo

Duolingo implemented GitHub Copilot to address challenges with developer efficiency and code consistency across their expanding codebase. The solution led to a 25% increase in developer speed for those new to specific repositories, and a 10% increase for experienced developers. The implementation of GitHub Copilot, along with Codespaces and custom API integrations, helped maintain consistent standards while accelerating development workflows and reducing context switching.

Global News Organization's AI-Powered Content Production and Verification System

Reuters

Reuters has implemented a comprehensive AI strategy to enhance its global news operations, focusing on reducing manual work, augmenting content production, and transforming news delivery. The organization developed three key tools: a press release fact extraction system, an AI-integrated CMS called Leon, and a content packaging tool called LAMP. They've also launched the Reuters AI Suite for clients, offering transcription and translation capabilities while maintaining strict ethical guidelines around AI-generated imagery and maintaining journalistic integrity.

Google Photos Magic Editor: Transitioning from On-Device ML to Cloud-Based Generative AI for Image Editing

Google

Google Photos evolved from using on-device machine learning models for basic image editing features like background blur and object removal to implementing cloud-based generative AI for their Magic Editor feature. The team transitioned from small, specialized models (10MB) running locally on devices to large-scale generative models hosted in the cloud to enable more sophisticated image editing capabilities like scene reimagination, object relocation, and advanced inpainting. This shift required significant changes in infrastructure, capacity planning, evaluation methodologies, and user experience design while maintaining focus on grounded, memory-preserving edits rather than fantastical image generation.

GPT Integration for SQL Stored Procedure Optimization in CI/CD Pipeline

Agoda

Agoda integrated GPT into their CI/CD pipeline to automate SQL stored procedure optimization, addressing a significant operational bottleneck where database developers were spending 366 man-days annually on manual optimization tasks. The system provides automated analysis and suggestions for query improvements, index recommendations, and performance optimizations, leading to reduced manual review time and improved merge request processing. While achieving approximately 25% accuracy, the solution demonstrates practical benefits in streamlining database development workflows despite some limitations in handling complex stored procedures.

GPT-4 Visit Notes System

Summer Health

Summer Health successfully deployed GPT-4 to revolutionize pediatric visit note generation, addressing both provider burnout and parent communication challenges. The implementation reduced note-writing time from 10 to 2 minutes per visit (80% reduction) while making medical information more accessible to parents. By carefully considering HIPAA compliance through BAAs and implementing robust clinical review processes, they demonstrated how LLMs can be safely and effectively deployed in healthcare settings. The case study showcases how AI can simultaneously improve healthcare provider efficiency and patient experience, while maintaining high standards of medical accuracy and regulatory compliance.

Graph RAG and Multi-Agent Systems for Legal Case Discovery and Document Analysis

WhyHow

WhyHow.ai, a legal technology company, developed a system that combines graph databases, multi-agent architectures, and retrieval-augmented generation (RAG) to identify class action and mass tort cases before competitors by scraping web data, structuring it into knowledge graphs, and generating personalized reports for law firms. The company claims to find potential cases within 15 minutes compared to the industry standard of 8-9 months, using a pipeline that processes complaints from various online sources, applies lawyer-specific filtering schemas, and generates actionable legal intelligence through automated multi-agent workflows backed by graph-structured knowledge representation.

Hardening AI Agents for E-commerce at Scale: Multi-Company Perspectives on RL Alignment and Reliability

Prosus / Microsoft / Inworld AI / IUD

This panel discussion features experts from Microsoft, Google Cloud, InWorld AI, and Brazilian e-commerce company IUD (Prosus partner) discussing the challenges of deploying reliable AI agents for e-commerce at scale. The panelists share production experiences ranging from Google Cloud's support ticket routing agent that improved policy adherence from 45% to 90% using DPO adapters, to Microsoft's shift away from prompt engineering toward post-training methods for all Copilot models, to InWorld AI's voice agent architecture optimization through cascading models, and IUD's struggles with personalization balance in their multi-channel shopping agent. Key challenges identified include model localization for UI elements, cost efficiency, real-time voice adaptation, and finding the right balance between automation and user control in commerce experiences.

Harness Engineering for Agentic Coding Systems

Langchain

LangChain improved their coding agent (deepagents-cli) from 52.8% to 66.5% on Terminal Bench 2.0, advancing from Top 30 to Top 5 performance, solely through harness engineering without changing the underlying model (gpt-5.2-codex). The solution focused on three key areas: system prompts emphasizing self-verification loops, enhanced tools and context injection to help agents understand their environment, and middleware hooks to detect problematic patterns like doom loops. The approach leveraged LangSmith tracing at scale to identify failure modes and iteratively optimize the harness through automated trace analysis, demonstrating that systematic engineering around the model can yield significant performance improvements in production agentic systems.

Healthcare Conversational AI and Multi-Model Cost Management in Production

Amberflo / Interactly.ai

A panel discussion featuring Interactly.ai's development of conversational AI for healthcare appointment management, and Amberflo's approach to usage tracking and cost management for LLM applications. The case study explores how Interactly.ai handles the challenges of deploying LLMs in healthcare settings with privacy and latency constraints, while Amberflo addresses the complexities of monitoring and billing for multi-model LLM applications in production.

Healthcare Data Analytics Democratization with MapAI and LLM Integration

Komodo

Komodo Health developed MapAI, an NLP-powered AI assistant integrated into their MapLab enterprise platform, to democratize healthcare data analytics. The solution enables non-technical users to query complex healthcare data using natural language, transforming weeks-long data analysis processes into instant insights. The system leverages multiple foundation models, LangChain, and LangGraph for deployment, with an API-first approach for seamless integration with their Healthcare Map platform.

Healthcare Patient Journey Analysis Platform with Multimodal LLMs

John Snow Labs

John Snow Labs developed a comprehensive healthcare analytics platform that uses specialized medical LLMs to process and analyze patient data across multiple modalities including unstructured text, structured EHR data, FIR resources, and images. The platform enables healthcare professionals to query patient histories and build cohorts using natural language, while handling complex medical terminology mapping and temporal reasoning. The system runs entirely within the customer's infrastructure for security, uses Kubernetes for deployment, and significantly outperforms GPT-4 on medical tasks while maintaining consistency and explainability in production.

Healthcare Search Discovery Using ML and Generative AI on E-commerce Platform

Amazon Health Services

Amazon Health Services faced the challenge of integrating healthcare services into Amazon's e-commerce search experience, where traditional product search algorithms weren't designed to handle complex relationships between symptoms, conditions, treatments, and healthcare services. They developed a comprehensive solution combining machine learning for query understanding, vector search for product matching, and large language models for relevance optimization. The solution uses AWS services including Amazon SageMaker for ML models, Amazon Bedrock for LLM capabilities, and Amazon EMR for data processing, implementing a three-component architecture: query understanding pipeline to classify health searches, LLM-enhanced product knowledge base for semantic search, and hybrid relevance optimization using both human labeling and LLM-based classification. This system now serves daily health-related search queries, helping customers find everything from prescription medications to primary care services through improved discovery pathways.

Human-AI Co-Annotation System for Efficient Data Labeling

Appen

Appen developed a hybrid approach combining LLMs with human annotators to address the growing challenges in data annotation for AI models. They implemented a co-annotation engine that uses model uncertainty metrics to efficiently route annotation tasks between LLMs and human annotators. Using GPT-3.5 Turbo for initial annotations and entropy-based confidence scoring, they achieved 87% accuracy while reducing costs by 62% and annotation time by 63% compared to purely human annotation, demonstrating an effective balance between automation and human expertise.

Human-AI Synergy in Pharmaceutical Research and Document Processing

Merantix

Merantix has implemented AI systems that focus on human-AI collaboration across multiple domains, particularly in pharmaceutical research and document processing. Their approach emphasizes progressive automation where AI systems learn from human input, gradually taking over more tasks while maintaining high accuracy. In pharmaceutical applications, they developed a system for analyzing rodent behavior videos, while in document processing, they created solutions for legal and compliance cases where error tolerance is minimal. The systems demonstrate a shift from using AI as mere tools to creating collaborative AI-human workflows that maintain high accuracy while improving efficiency.

Hybrid AI System for Large-Scale Product Categorization

Walmart

Walmart developed Ghotok, an innovative AI system that combines predictive and generative AI to improve product categorization across their digital platforms. The system addresses the challenge of accurately mapping relationships between product categories and types across 400 million SKUs. Using an ensemble approach with both predictive and generative AI models, along with sophisticated caching and deployment strategies, Ghotok successfully reduces false positives and improves the efficiency of product categorization while maintaining fast response times in production.

Hybrid LLM-Optimization System for Trip Planning with Real-World Constraints

Google

Google Research developed a hybrid system for trip planning that combines LLMs with optimization algorithms to address the challenge of generating practical travel itineraries. The system uses Gemini models to generate initial trip plans based on user preferences and qualitative goals, then applies a two-stage optimization algorithm that incorporates real-world constraints like opening hours, travel times, and budget considerations to produce feasible itineraries. This approach was implemented in Google's "AI trip ideas in Search" feature, demonstrating how LLMs can be effectively deployed in production while maintaining reliability through algorithmic correction of potential feasibility issues.

Hybrid ML and LLM Approach for Automated Question Quality Feedback

Stack Overflow

Stack Overflow developed Question Assistant to provide automated feedback on question quality for new askers, addressing the repetitive nature of human reviewer comments in their Staging Ground platform. Initial attempts to use LLMs alone to rate question quality failed due to unreliable predictions and generic feedback. The team pivoted to a hybrid approach combining traditional logistic regression models trained on historical reviewer comments to flag quality indicators, paired with Google's Gemini LLM to generate contextual, actionable feedback. While the solution didn't significantly improve approval rates or review times, it achieved a meaningful 12% increase in question success rates (questions that remain open and receive answers or positive scores) across two A/B tests, leading to full deployment in March 2025.

Hybrid RAG for Technical Training Knowledge Assistant in Mining Operations

Rio Tinto

Rio Tinto Aluminium faced challenges in providing technical experts in refining and smelting sectors with quick and accurate access to vast amounts of specialized institutional knowledge during their internal training programs. They developed a generative AI-powered knowledge assistant using hybrid RAG (retrieval augmented generation) on Amazon Bedrock, combining both vector search and knowledge graph databases to enable more accurate, contextually rich responses. The hybrid system significantly outperformed traditional vector-only RAG across all metrics, particularly in context quality and entity recall, showing over 53% reduction in standard deviation while maintaining high mean scores, and leveraging 11-17 technical documents per query compared to 2-3 for vector-only approaches, ultimately streamlining how employees find and utilize critical business information.

Implementing Effective Safety Filters in a Game-Based LLM Application

JOBifAI

JOBifAI, a game leveraging LLMs for interactive gameplay, encountered significant challenges with LLM safety filters in production. The developers implemented a retry-based solution to handle both technical failures and safety filter triggers, achieving a 99% success rate after three retries. However, the experience highlighted fundamental issues with current safety filter implementations, including lack of transparency, inconsistent behavior, and potential cost implications, ultimately limiting the game's development from proof-of-concept to full production.

Implementing Evaluation Framework for MCP Server Tool Selection

Neon

Neon developed a comprehensive evaluation framework to test their Model Context Protocol (MCP) server's ability to correctly use database migration tools. The company faced challenges with LLMs selecting appropriate tools from a large set of 20+ tools, particularly for complex stateful workflows involving database migrations. Their solution involved creating automated evals using Braintrust, implementing "LLM-as-a-judge" scoring techniques, and establishing integrity checks to ensure proper tool usage. Through iterative prompt engineering guided by these evaluations, they improved their tool selection success rate from 60% to 100% without requiring code changes.

Implementing Generative AI in Manufacturing: A Multi-Use Case Study

Accenture

Accenture's Industry X division conducted extensive experiments with generative AI in manufacturing settings throughout 2023. They developed and validated nine key use cases including operations twins, virtual mentors, test case generation, and technical documentation automation. The implementations showed significant efficiency gains (40-50% effort reduction in some cases) while maintaining a human-in-the-loop approach. The study emphasized the importance of using domain-specific data, avoiding generic knowledge management solutions, and implementing multi-agent orchestrated solutions rather than standalone models.

Implementing LLM Observability for Natural Language Querying Interface

Honeycomb

Honeycomb implemented a natural language querying interface for their observability product and faced challenges in maintaining and improving it post-launch. They solved this by implementing comprehensive observability practices, capturing everything from user inputs to LLM responses using distributed tracing. This approach enabled them to monitor the entire user experience, isolate issues, and establish a continuous improvement flywheel, resulting in higher product retention and conversion rates.

Implementing LLMOps in Restricted Networks with Long-Running Evaluations

Microsoft

A case study detailing Microsoft's experience implementing LLMOps in a restricted network environment using Azure Machine Learning. The team faced challenges with long-running evaluations (6+ hours) and network restrictions, developing solutions including opt-out mechanisms for lengthy evaluations, implementing Git Flow for controlled releases, and establishing a comprehensive CI/CE/CD pipeline. Their approach balanced the needs of data scientists, engineers, and platform teams while maintaining security and evaluation quality.

Implementing LLMs for Patient Education and Healthcare Communication

National Healthcare Group

National Healthcare Group addressed the challenge of inconsistent and time-consuming patient education by implementing LLM-powered chatbots integrated into their existing healthcare apps and messaging platforms. The solution provides 24/7 multilingual patient education, focusing on conditions like eczema and medical test preparation, while ensuring privacy and accuracy. The implementation emphasizes integration with existing platforms rather than creating new standalone solutions, combined with careful monitoring and refinement of responses.

Implementing MCP Remote Server for CRM Agent Integration

HubSpot

HubSpot built a remote Model Context Protocol (MCP) server to enable AI agents like ChatGPT to interact with their CRM data. The challenge was to provide seamless, secure access to CRM objects (contacts, companies, deals) for ChatGPT's 500 million weekly users, most of whom aren't developers. In less than four weeks, HubSpot's team extended the Java MCP SDK to create a stateless, HTTP-based microservice that integrated with their existing REST APIs and RPC system, implementing OAuth 2.0 for authentication and user permission scoping. The solution made HubSpot the first CRM with an OpenAI connector, enabling read-only queries that allow customers to analyze CRM data through natural language interactions while maintaining enterprise-grade security and scale.

Implementing Multi-Agent RAG Architecture for Customer Care Automation

Doctolib

Doctolib evolved their customer care system from basic RAG to a sophisticated multi-agent architecture using LangGraph. The system employs a primary assistant for routing and specialized agents for specific tasks, incorporating safety checks and API integrations. While showing promise in automating customer support tasks like managing calendar access rights, they faced challenges with LLM behavior variance, prompt size limitations, and unstructured data handling, highlighting the importance of robust data structuration and API documentation for production deployment.

Implementing Product Comparison and Discovery Features with LLMs at Scale

idealo

idealo, a major European price comparison platform, implemented LLM-powered features to enhance product comparison and discovery. They developed two key applications: an intelligent product comparison tool that extracts and compares relevant attributes from extensive product specifications, and a guided product finder that helps users navigate complex product categories. The company focused on using LLMs as language interfaces rather than knowledge bases, relying on proprietary data to prevent hallucinations. They implemented thorough evaluation frameworks and A/B testing to measure business impact.

Implementing Question-Answering Over Sales Conversations with Deal Me at Gong

Gong

Gong developed "Deal Me", a natural language question-answering feature for sales conversations that allows users to query vast amounts of sales interaction data. The system processes thousands of emails and calls per deal, providing quick responses within 5 seconds. After initial deployment, they discovered that 70% of user queries matched existing structured features, leading to a hybrid approach combining direct LLM-based QA with guided navigation to pre-computed insights.

Implementing RAG and RagRails for Reliable Conversational AI in Insurance

GEICO

GEICO explored using LLMs for customer service chatbots through a hackathon initiative in 2023. After discovering issues with hallucinations and "overpromising" in their initial implementation, they developed a comprehensive RAG (Retrieval Augmented Generation) solution enhanced with their novel "RagRails" approach. This method successfully reduced incorrect responses from 12 out of 20 to zero in test cases by providing structured guidance within retrieved context, demonstrating how to safely deploy LLMs in a regulated insurance environment.

Implementing RAG for Call Center Operations with Hybrid Data Sources

Manulife

Manulife implemented a Retrieval Augmented Generation (RAG) system in their call center to help customer service representatives quickly access and utilize information from both structured and unstructured data sources. They developed an innovative approach combining document chunks and structured data embeddings, achieving an optimized response time of 7.33 seconds in production. The system successfully handles both policy documents and database information, using GPT-3.5 for answer generation with additional validation from Llama 3 or GPT-4.

Implementing RAG for Enhanced Customer Care at Scale

Doctolib

Doctolib, a European e-health company, implemented a RAG-based system to improve their customer care services. Using GPT-4 hosted on Azure OpenAI, combined with OpenSearch as a vector database and a custom reranking system, they achieved a 20% reduction in customer care cases. The system includes comprehensive evaluation metrics through the Ragas framework, and overcame significant latency challenges to achieve response times under 5 seconds. While successful, they identified limitations with complex queries that led them to explore agentic frameworks as a next step.

Improving AI Code Review Bot Comment Quality Through Vector Embeddings

Greptile

Greptile faced a challenge with their AI code review bot generating too many low-value "nit" comments, leading to user frustration and ignored feedback. After unsuccessful attempts with prompt engineering and LLM-based severity rating, they implemented a successful solution using vector embeddings to cluster and filter comments based on user feedback. This approach improved the percentage of addressed comments from 19% to 55+% within two weeks of deployment.

Improving AI Documentation Assistant Through Data Pipeline Reconstruction and LLM-Based Feedback Analysis

Mintlify

Mintlify's AI-powered documentation assistant was underperforming, prompting a week-long investigation to identify and address its weaknesses. The team rebuilt their feedback pipeline by migrating conversation data from PSQL to ClickHouse, enabling them to analyze thumbs-down events mapped to full conversation threads. Using an LLM to categorize 1,000 negative feedback conversations into eight buckets, they discovered that search quality across documentation was the assistant's primary weakness, while other response types were generally strong. Based on these findings, they enhanced their dashboard with LLM-categorized conversation insights for documentation owners, shipped UI improvements including conversation history and better mobile interactions, and identified areas for continued improvement despite a previous model upgrade to Claude Sonnet 3.5 showing limited impact on feedback patterns.

Improving Contextual Understanding in GitHub Copilot Through Advanced Prompt Engineering

Github

GitHub's machine learning team enhanced GitHub Copilot's contextual understanding through several key innovations: implementing Fill-in-the-Middle (FIM) paradigm, developing neighboring tabs functionality, and extensive prompt engineering. These improvements led to significant gains in suggestion accuracy, with FIM providing a 10% boost in completion acceptance rates and neighboring tabs yielding a 5% increase in suggestion acceptance.

Improving Error Handling for AI Agents in Production

fewsats

A case study exploring how fewsats improved their domain management AI agents by enhancing error handling in their HTTP SDK. They discovered that while different LLM models (Claude, Llama 3, Replit Agent) could interact with their domain management API, the agents often failed due to incomplete error information. By modifying their SDK to surface complete error details instead of just status codes, they enabled the AI agents to self-correct and handle API errors more effectively, demonstrating the importance of error visibility in production LLM systems.

Improving GitHub Copilot's Contextual Understanding Through Advanced Prompt Engineering and Retrieval

GitHub

GitHub's machine learning team worked to enhance GitHub Copilot's contextual understanding of code to provide more relevant AI-powered coding suggestions. The problem was that large language models could only process limited context (approximately 6,000 characters), making it challenging to leverage all relevant information from a developer's codebase. The solution involved sophisticated prompt engineering, implementing neighboring tabs to process multiple open files, introducing a Fill-In-the-Middle (FIM) paradigm to consider code both before and after the cursor, and experimenting with vector databases and embeddings for semantic code retrieval. These improvements resulted in measurable gains: neighboring tabs provided a 5% relative increase in suggestion acceptance, FIM yielded a 10% relative boost in performance, and the overall enhancements contributed to developers coding up to 55% faster when using GitHub Copilot.

Improving LLM Accuracy and Evaluation in Enterprise Customer Analytics

Various

Echo.ai and Log10 partnered to solve accuracy and evaluation challenges in deploying LLMs for enterprise customer conversation analysis. Echo.ai's platform analyzes millions of customer conversations using multiple LLMs, while Log10 provides infrastructure for improving LLM accuracy through automated feedback and evaluation. The partnership resulted in a 20-point F1 score increase in accuracy and enabled Echo.ai to successfully deploy large enterprise contracts with improved prompt optimization and model fine-tuning.

Improving LLM Food Tracking Accuracy through Systematic Evaluation and Few-Shot Learning

Taralli

A case study of Taralli's food tracking application that initially used a naive approach with GPT-4-mini for calorie and nutrient estimation, resulting in significant accuracy issues. Through the implementation of systematic evaluation methods, creation of a golden dataset, and optimization using DSPy's BootstrapFewShotWithRandomSearch technique, they improved accuracy from 17% to 76% while maintaining reasonable response times with Gemini 2.5 Flash.

Improving Multilingual Search with Few-Shot LLM Translations

Delivery Hero

Delivery Hero operates across 68 countries and faced significant challenges with multilingual search due to dialectal variations, transliterations, spelling errors, and multiple languages within single markets. Traditional machine translation systems struggled with user intent and contextual nuances, leading to poor search results. The company implemented a solution using Large Language Models (LLMs), specifically Gemini, with few-shot learning to provide context-aware translations that handle regional dialects, correct spelling mistakes, and understand transliterations. By combining LLM-generated translations with Elastic Search and Vector Search in a hybrid approach, they achieved over 90% translation accuracy for restaurant queries and demonstrated positive improvements in user engagement through A/B testing, with the solution being rolled out to their Talabat and Hungerstation brands.

Incremental LLM Adoption Strategy in Email Processing API Platform

Nylas

Nylas, an email/calendar/contacts API platform provider, implemented a systematic three-month strategy to integrate LLMs into their production systems. They started with development workflow automation using multi-agent systems, enhanced their annotation processes with LLMs, and finally integrated LLMs as a fallback mechanism in their core email processing product. This measured approach resulted in 90% reduction in bug tickets, 20x cost savings in annotation, and successful deployment of their own LLM infrastructure when usage reached cost-effective thresholds.

Inferring Grocery Preferences from Restaurant Order History Using LLMs

Doordash

DoorDash faced the classic cold start problem when trying to recommend grocery and convenience items to customers who had never shopped in those verticals before. To address this, they developed an LLM-based solution that analyzes customers' restaurant order histories to infer underlying preferences about culinary tastes, lifestyle habits, and dietary patterns. The system translates these implicit signals into explicit, personalized grocery recommendations, successfully surfacing relevant items like hot pot soup base, potstickers, and burritos based on restaurant ordering behavior. The approach combines statistical analysis with LLM inference capabilities to leverage the models' semantic understanding and world knowledge, creating a scalable, evaluation-driven pipeline that delivers relevant recommendations from the first interaction.

Infrastructure Challenges and Solutions for Agentic AI Systems in Production

Meta / Google / Monte Carlo / Microsoft

A panel discussion featuring experts from Meta, Google, Monte Carlo, and Microsoft examining the fundamental infrastructure challenges that arise when deploying autonomous AI agents in production environments. The discussion covers how agentic workloads differ from traditional software systems, requiring new approaches to networking, load balancing, caching, security, and observability, while highlighting specific challenges like non-deterministic behavior, massive search spaces, and the need for comprehensive evaluation frameworks to ensure reliable and secure AI agent operations at scale.

Insurance Policy Review Automation Using Retrieval-Augmented Generation and Prompt Engineering

Verisk

Verisk developed a generative AI companion for their Mozart platform to automate insurance policy document comparison and change detection. Using Amazon Bedrock, OpenSearch, and Anthropic's Claude 3 Sonnet model, they built a system that reduces policy review time from days to minutes. The solution combines embedding-based retrieval, sophisticated prompt engineering, and document chunking strategies to achieve over 90% accuracy in change summaries while maintaining cost efficiency and security compliance.

Integrating Foundation Models into the Modern Data Stack: Challenges and Solutions

Numbers Station

Numbers Station addresses the challenges of integrating foundation models into the modern data stack for data processing and analysis. They tackle key challenges including SQL query generation from natural language, data cleaning, and data linkage across different sources. The company develops solutions for common LLMOps issues such as scale limitations, prompt brittleness, and domain knowledge integration through techniques like model distillation, prompt ensembling, and domain-specific pre-training.

Integrating Gemini for Natural Language Analytics in IoT Fleet Management

Cox 2M

Cox 2M, facing challenges with a lean analytics team and slow insight generation (taking up to a week per request), partnered with Thoughtspot and Google Cloud to implement Gemini-powered natural language analytics. The solution reduced time to insights by 88% while enabling non-technical users to directly query complex IoT and fleet management data using natural language. The implementation includes automated insight generation, change analysis, and natural language processing capabilities.

Integrating Generative AI into Low-Code Platform Development with Amazon Bedrock

Mendix

Mendix, a low-code platform provider, faced the challenge of integrating advanced generative AI capabilities into their development environment while maintaining security and scalability. They implemented Amazon Bedrock to provide their customers with seamless access to various AI models, enabling features like text generation, summarization, and multimodal image generation. The solution included custom model training, robust security measures through AWS services, and cost-effective model selection capabilities.

Integrating Live-Staffed AI Chat with LLM-Powered Customer Service

Smith.ai

Smith.ai transformed their customer service platform by implementing a next-generation chat system powered by large language models (LLMs). The solution combines AI automation with human supervision, allowing the system to handle routine inquiries autonomously while enabling human agents to focus on complex cases. The system leverages website data for context-aware responses and seamlessly integrates structured workflows with free-flowing conversations, resulting in improved customer experience and operational efficiency.

Integrating LLMs and Diffusion Models for Website Design Automation

Wix

Wix is leveraging AI technologies, including LLMs and diffusion models, to automate and enhance the website building experience. Their AI group has developed the AI Text Creator suite using LLMs for content generation, integrated DALL-E for image creation, and introduced the Diffusion Layout Transformer (DLT) for automated layout generation. This comprehensive approach combines content generation with layout design, addressing the challenge of creating professional websites without requiring extensive design expertise.

Integrating Symbolic Reasoning with LLMs for AI-Native Telecom Infrastructure

Ericsson

Ericsson's System Comprehension Lab is exploring the integration of symbolic reasoning capabilities into telecom-oriented large language models to address critical limitations in current LLM architectures for telecommunications infrastructure management. The problem centers on LLMs' inability to provide deterministic, explainable reasoning required for telecom network optimization, security, and anomaly detection—domains where hallucinations, lack of logical consistency, and black-box behavior are unacceptable. The proposed solution involves hybrid neural-symbolic AI architectures that combine the pattern recognition strengths of transformer-based LLMs with rule-based reasoning engines, connected through techniques like symbolic chain-of-thought prompting, program-aided reasoning, and external solver integration. This approach aims to enable AI-native wireless systems for 6G infrastructure that can perform cross-layer optimization, real-time decision-making, and intent-driven network management while maintaining the explainability and logical rigor demanded by production telecom environments.

Intelligent Document Processing at Scale with AI-Powered Tax Compliance and Invoice Analysis

Syngenta

Syngenta, a global agricultural company processing over one million invoices annually across 90 countries, implemented "Wingman," an AI-powered intelligent document processing system to automate complex document analysis tasks. The solution leverages Amazon Bedrock Data Automation (BDA) for document parsing and LLMs (primarily Anthropic Claude) for intelligent content extraction and policy comparison. Starting with tax compliance in Argentina, where complex regional tax laws required manual verification of 4,000 invoices monthly, Wingman automatically extracts invoice content, compares it against tax policies, and identifies discrepancies with human-readable explanations. The system achieved near-perfect accuracy and is being scaled to additional use cases including indirect spend reduction, vendor master data accuracy, and expense compliance across multiple countries.

Intelligent Document Processing for Education Quality Assessment Reports

BQA

BQA, Bahrain's Education and Training Quality Authority, faced challenges with manual review of self-evaluation reports from educational institutions. They implemented a solution using Amazon Bedrock and other AWS services to automate and streamline the analysis of these reports. The system leverages the Amazon Titan Express model for intelligent document processing, combining document analysis, summarization, and compliance checking. The solution achieved 70% accuracy in standards-compliant report generation and reduced evidence analysis time by 30%.

Intelligent Document Processing for Mortgage Servicing Using Amazon Bedrock and Multimodal AI

Onity Group

Onity Group, a mortgage servicing company processing millions of pages annually across hundreds of document types, implemented an intelligent document processing solution using Amazon Bedrock foundation models to handle complex legal documents with verbose text, handwritten entries, and notarization verification. The solution combines Amazon Textract for basic OCR with Amazon Bedrock's multimodal models (Anthropic Claude Sonnet and Amazon Nova) for complex extraction tasks, using dynamic routing based on content complexity. This hybrid approach achieved a 50% reduction in document extraction costs while improving overall accuracy by 20% compared to their previous OCR and AI/ML solution, with some use cases like credit report processing achieving 85% accuracy.

Interactive AI-Powered Chess Tutoring System

Interweb Alchemy

A chess tutoring application that leverages LLMs and traditional chess engines to provide real-time analysis and feedback during gameplay. The system combines GPT-4 mini for move generation with Stockfish for position evaluation, offering features like positional help, outcome analysis, and real-time commentary. The project explores the practical application of different LLM models for chess tutoring, focusing on helping beginners improve their game through interactive feedback and analysis.

Internal AI Agent Platform for Enterprise Data Access and Product Development

Amplitude

Amplitude built an internal AI agent called "Moda" that provides company-wide access to enterprise data through Slack and web interfaces, enabling employees to query business information, generate insights, and create product requirements documents (PRDs) with prototypes. The tool was developed by engineers in their spare time over 3-4 weeks and achieved viral adoption across the company within a week of launch, demonstrating how organizations can rapidly build custom AI tools to accelerate product development workflows and democratize data access across teams.

Internal AI Orchestration and Automation Across Multiple Departments

Zapier

Zapier, a workflow automation platform company, faced the challenge of managing repetitive operational tasks across multiple departments while maintaining productivity and focus on strategic work. The company implemented a comprehensive AI and automation strategy using their own platform combined with LLM capabilities (primarily ChatGPT/OpenAI) to automate workflows across customer success, sales, HR, technical support, content creation, engineering, accounting, and revenue operations. The results demonstrate significant time savings through automated meeting transcriptions and summaries, AI-powered sentiment analysis of surveys, automated content generation and translation, chatbot-based internal support systems, and intelligent ticket routing and categorization, enabling teams to focus on higher-value strategic activities while maintaining operational efficiency.

Iterative Development Process for Production AI Features

Zapier

Zapier's journey in developing and deploying AI products demonstrates a pragmatic, iterative approach to LLMOps. Their methodology focuses on rapid prototyping with advanced models like GPT-4 Turbo and Claude Opus, followed by quick deployment of initial versions (even with sub-50% accuracy), systematic collection of user feedback, and establishment of comprehensive evaluation frameworks. This approach has enabled them to improve their AI products from sub-50% to over 90% accuracy within 2-3 months, while successfully managing costs and maintaining product quality.

Iterative Prompt Optimization and Model Selection for Nutritional Calorie Estimation

Taralli

Taralli, a calorie tracking application, demonstrates systematic LLM improvement through rigorous evaluation and prompt optimization. The developer addressed the challenge of accurate nutritional estimation by creating a 107-example evaluation dataset, testing multiple prompt optimization techniques (vanilla, few-shot bootstrapping, MIPROv2, and GEPA) across several models (Gemini 2.5 Flash, Gemini 3 Flash, and DeepSeek v3.2). Through this methodical approach, they achieved a 15% accuracy improvement by switching from Gemini 2.5 Flash to Gemini 3 Flash while using a few-shot learning approach with 16 examples, reaching 60% accuracy within a 10% calorie prediction threshold. The system was deployed with fallback model configurations and extended to support fully offline on-device inference for iOS.

Knowledge Graph Enhancement with LLMs for Content Understanding

Netflix

Netflix has developed a sophisticated knowledge graph system for entertainment content that helps understand relationships between movies, actors, and other entities. While initially focused on traditional entity matching techniques, they are now incorporating LLMs to enhance their graph by inferring new relationships and entity types from unstructured data. The system uses Metaflow for orchestration and supports both traditional and LLM-based approaches, allowing for flexible model deployment while maintaining production stability.

Knowledge Graph-Enhanced RAG for Customer Service Question Answering

Linkedin

LinkedIn's customer service team faced challenges with retrieving relevant past issue tickets to resolve customer inquiries efficiently. Traditional text-based retrieval-augmented generation (RAG) approaches treated historical tickets as plain text, losing crucial structural information and inter-issue relationships. LinkedIn developed a novel system that integrates RAG with knowledge graphs, constructing tree-structured representations of issue tickets while maintaining explicit and implicit connections between issues. The system uses GPT-4 for parsing and answer generation, E5 embeddings for semantic retrieval, and converts user queries into graph database queries for precise subgraph extraction. Deployed across multiple product lines, the system achieved a 77.6% improvement in MRR, a 0.32 increase in BLEU score, and reduced median issue resolution time by 28.6% over six months of production use.

LangSmith Implementation for Full Product Lifecycle Development and Monitoring

Wordsmith

Wordsmith, an AI legal assistant platform, implemented LangSmith to enhance their LLM operations across the entire product lifecycle. They tackled challenges in prototyping, debugging, and evaluating complex LLM pipelines by utilizing LangSmith's hierarchical tracing, evaluation datasets, monitoring capabilities, and experimentation features. This implementation enabled faster development cycles, confident model deployment, efficient debugging, and data-driven experimentation while managing multiple LLM providers including OpenAI, Anthropic, Google, and Mistral.

LangSmith Integration for Automated Feedback and Improved Iteration in SDLC

Factory

Factory AI implemented self-hosted LangSmith to address observability challenges in their SDLC automation platform, particularly for their Code Droid system. By integrating LangSmith with AWS CloudWatch logs and utilizing its Feedback API, they achieved comprehensive LLM pipeline monitoring, automated feedback collection, and streamlined prompt optimization. This resulted in a 2x improvement in iteration speed, 20% reduction in open-to-merge time, and 3x reduction in code churn.

Large Bank LLMOps Implementation: Lessons from Deutsche Bank and Others

Various

A discussion between banking technology leaders about their implementation of generative AI, focusing on practical applications, regulatory challenges, and strategic considerations. Deutsche Bank's CTO and other banking executives share their experiences in implementing gen AI across document processing, risk modeling, research analysis, and compliance use cases, while emphasizing the importance of responsible deployment and regulatory compliance.

Large Foundation Model for Unified Recommendation and Ranking at Scale

LinkedIn

LinkedIn developed a large foundation model called "Brew XL" with 150 billion parameters to unify all personalization and recommendation tasks across their platform, addressing the limitations of task-specific models that operate in silos. The solution involved training a massive language model on user interaction data through "promptification" techniques, then distilling it down to smaller, production-ready models (3B parameters) that could serve high-QPS recommendation systems with sub-second latency. The system demonstrated zero-shot capabilities for new tasks, improved performance on cold-start users, and achieved 7x latency reduction with 30x throughput improvement through optimization techniques including distillation, pruning, quantization, and sparsification.

Large Language Models for Game Player Sentiment Analysis and Retention

SEGA Europe

SEGA Europe faced challenges managing data from 50,000 events per second across 40 million players, making it difficult to derive actionable insights. They implemented a sentiment analysis LLM system on the Databricks platform that processes over 10,000 user reviews daily to identify and address gameplay issues. This led to up to 40% increase in player retention and significantly faster time to insight through AI-powered analytics.

Large Language Models for Retail Customer Feedback Analysis

Microsoft

A retail organization was facing challenges in analyzing large volumes of daily customer feedback manually. Microsoft implemented an LLM-based solution using Azure OpenAI to automatically extract themes, sentiments, and competitor comparisons from customer feedback. The system uses carefully engineered prompts and predefined themes to ensure consistent analysis, enabling the creation of actionable insights and reports at various organizational levels.

Large Language Models in Production Round Table Discussion: Latency, Cost and Trust Considerations

Various

A panel of experts from various companies and backgrounds discusses the challenges and solutions of deploying LLMs in production. They explore three main themes: latency considerations in LLM deployments, cost optimization strategies, and building trust in LLM systems. The discussion includes practical examples from Digits, which uses LLMs for financial document processing, and insights from other practitioners about model optimization, deployment strategies, and the evolution of LLM architectures.

Large Recommender Models: Adapting Gemini for YouTube Video Recommendations

Google / YouTube

YouTube developed Large Recommender Models (LRM) by adapting Google's Gemini LLM for video recommendations, addressing the challenge of serving personalized content to billions of users. The solution involved creating semantic IDs to tokenize videos, continuous pre-training to teach the model both English and YouTube-specific video language, and implementing generative retrieval systems. While the approach delivered significant improvements in recommendation quality, particularly for challenging cases like new users and fresh content, the team faced substantial serving cost challenges that required 95%+ cost reductions and offline inference strategies to make production deployment viable at YouTube's scale.

Large-Scale AI Assistant Deployment with Safety-First Evaluation Approach

Discord

Discord implemented Clyde AI, a chatbot assistant that was deployed to over 200 million users, focusing heavily on safety, security, and evaluation practices. The team developed a comprehensive evaluation framework using simple, deterministic tests and metrics, implemented through their open-source tool PromptFu. They faced unique challenges in preventing harmful content and jailbreaks, leading to innovative solutions in red teaming and risk assessment, while maintaining a balance between casual user interaction and safety constraints.

Large-Scale AI Red Teaming Competition Platform for Production Model Security

HackAPrompt, LearnPrompting

Sandra Fulof from HackAPrompt and LearnPrompting presents a comprehensive case study on developing the first AI red teaming competition platform and educational resources for prompt engineering in production environments. The case study covers the creation of LearnPrompting, an open-source educational platform that trained millions of users worldwide on prompt engineering techniques, and HackAPrompt, which ran the first prompt injection competition collecting 600,000 prompts used by all major AI companies to benchmark and improve their models. The work demonstrates practical challenges in securing LLMs in production, including the development of systematic prompt engineering methodologies, automated evaluation systems, and the discovery that traditional security defenses are ineffective against prompt injection attacks.

Large-Scale Analysis of AI Coding Tool Adoption and Productivity Impact Across 1,000 Companies

Jellyfish

Jellyfish, a software engineering analytics company, conducted a comprehensive study analyzing 20 million pull requests from 200,000 developers across 1,000 companies to understand real-world AI transformation patterns in software development. The study tracked adoption of AI coding tools (Copilot, Cursor, Claude Code) and autonomous agents (Devon, Codeex) from June 2024 onwards. Key findings include: median developer adoption rates grew from 22% to 90%, companies achieved approximately 2x gains in PR throughput with full AI adoption, cycle times decreased by 24%, and PR sizes increased by 18%. However, the study revealed that code architecture significantly impacts outcomes—centralized and balanced architectures saw 4x gains while highly distributed architectures showed minimal correlation between AI adoption and productivity, primarily due to context limitations across multiple repositories. Quality metrics showed no significant degradation, with bug resolution rates actually improving as teams used AI for well-scoped bug fixes.

Large-Scale Aviation Content Classification on Hacker News Using Small Language Models

Skysight

Skysight conducted a large-scale analysis of Hacker News content using small language models (SLMs) to classify aviation-related posts. The project processed 42 million items (10.7B input tokens) using a parallelized pipeline and cloud infrastructure. Through careful prompt engineering and model selection, they achieved efficient classification at scale, revealing that 0.62% of all posts and 1.13% of stories were aviation-related, with notable temporal trends in aviation content frequency.

Large-Scale Deployment of On-Device and Server Foundation Models for Consumer AI Features

Apple

Apple developed and deployed a comprehensive foundation model infrastructure consisting of a 3-billion parameter on-device model and a mixture-of-experts server model to power Apple Intelligence features across iOS, iPadOS, and macOS. The implementation addresses the challenge of delivering generative AI capabilities at consumer scale while maintaining privacy, efficiency, and quality across 15 languages. The solution involved novel architectural innovations including shared KV caches, parallel track mixture-of-experts design, and extensive optimization techniques including quantization and compression, resulting in production deployment across millions of devices with measurable performance improvements in text and vision tasks.

Large-Scale Enterprise Copilot Deployment: Lessons from Einstein Copilot Implementation

Salesforce

Salesforce shares their experience deploying Einstein Copilot, their conversational AI assistant for CRM, across their internal organization. The deployment process focused on starting simple with standard actions before adding custom capabilities, implementing comprehensive testing protocols, and establishing clear feedback loops. The rollout began with 100 sellers before expanding to thousands of users, resulting in significant time savings and improved user productivity.

Large-Scale LLM Batch Processing Platform for Millions of Prompts

Instacart

Instacart faced challenges processing millions of LLM calls required by various teams for tasks like catalog data cleaning, item enrichment, fulfillment routing, and search relevance improvements. Real-time LLM APIs couldn't handle this scale effectively, leading to rate limiting issues and high costs. To solve this, Instacart built Maple, a centralized service that automates large-scale LLM batch processing by handling batching, encoding/decoding, file management, retries, and cost tracking. Maple integrates with external LLM providers through batch APIs and an internal AI Gateway, achieving up to 50% cost savings compared to real-time calls while enabling teams to process millions of prompts reliably without building custom infrastructure.

Large-Scale LLM Infrastructure for E-commerce Applications

Coupang

Coupang, a major e-commerce platform operating primarily in South Korea and Taiwan, faced challenges in scaling their ML infrastructure to support LLM applications across search, ads, catalog management, and recommendations. The company addressed GPU supply shortages and infrastructure limitations by building a hybrid multi-region architecture combining cloud and on-premises clusters, implementing model parallel training with DeepSpeed, and establishing GPU-based serving using Nvidia Triton and vLLM. This infrastructure enabled production applications including multilingual product understanding, weak label generation at scale, and unified product categorization, with teams using patterns ranging from in-context learning to supervised fine-tuning and continued pre-training depending on resource constraints and quality requirements.

Large-Scale Personalization and Product Knowledge Graph Enhancement Through LLM Integration

DoorDash

DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.

Large-Scale Personalization System Using LLMs for Buyer Profile Generation

Etsy

Etsy tackled the challenge of personalizing shopping experiences for nearly 90 million buyers across 100+ million listings by implementing an LLM-based system to generate detailed buyer profiles from browsing and purchasing behaviors. The system analyzes user session data including searches, views, purchases, and favorites to create structured profiles capturing nuanced interests like style preferences and shopping missions. Through significant optimization efforts including data source improvements, token reduction, batch processing, and parallel execution, Etsy reduced profile generation time from 21 days to 3 days for 10 million users while cutting costs by 94% per million users, enabling economically viable large-scale personalization for search query rewriting and refinement pills.

Large-Scale Tax AI Assistant Implementation for TurboTax

Intuit

Intuit built a comprehensive LLM-powered AI assistant system called Intuit Assist for TurboTax to help millions of customers understand their tax situations, deductions, and refunds. The system processes 44 million tax returns annually and uses a hybrid approach combining Claude and GPT models for both static tax explanations and dynamic Q&A, supported by RAG systems, fine-tuning, and extensive evaluation frameworks with human tax experts. The implementation includes proprietary platform GenOS with safety guardrails, orchestration capabilities, and multi-phase evaluation systems to ensure accuracy in the highly regulated tax domain.

Large-Scale Test Framework Migration Using LLMs

AirBnB

AirBnB successfully migrated 3,500 React component test files from Enzyme to React Testing Library (RTL) using LLMs, reducing what was estimated to be an 18-month manual engineering effort to just 6 weeks. Through a combination of systematic automation, retry loops, and context-rich prompts, they achieved a 97% automated migration success rate, with the remaining 3% completed manually using the LLM-generated code as a baseline.

Launching an MCP Server for AI-Powered Debugging and Development

Multiplayer

Multiplayer, a provider of full-stack session recording and debugging tools, launched a Model Context Protocol (MCP) server to connect their platform's engineering context with AI coding agents like Cursor, Claude Code, and Windsurf. The challenge was enabling AI agents to access session recordings, backend server calls, and debugging data to provide contextually-aware assistance for bug fixes and feature development. By designing use-case-driven MCP tools that abstract multiple API calls, Multiplayer created a streamlined integration that has shown good adoption among developers. The gradual rollout to power users revealed best practices such as keeping tools minimal and scoped, focusing on read-only operations for security, and providing human-readable data formats to LLMs.

Lessons from Building a Production RAG System: Data Formatting and Prompt Engineering

Credal

A case study detailing lessons learned from processing over 250k LLM calls on 100k corporate documents at Credal. The team discovered that successful LLM implementations require careful data formatting and focused prompt engineering. Key findings included the importance of structuring data to maximize LLM understanding, especially for complex documents with footnotes and tables, and concentrating prompts on the most challenging aspects of tasks rather than trying to solve multiple problems simultaneously.

Lessons from Deploying an HR-Aware AI Assistant: Five Key Implementation Insights

Applaud

Applaud shares their experience implementing an AI assistant for HR service delivery, highlighting key challenges and solutions in areas including content management, personalization, testing methodologies, accuracy expectations, and continuous improvement. The case study explores practical solutions to common deployment challenges like content quality control, context-aware responses, testing for infinite possibilities, managing accuracy expectations, and post-deployment optimization.

Lessons from Enterprise LLM Deployment: Cross-functional Teams, Experimentation, and Security

Microsoft

A team of Microsoft engineers share their experiences helping strategic customers implement LLM solutions in production environments. They discuss the importance of cross-functional teams, continuous experimentation, RAG implementation challenges, and security considerations. The presentation emphasizes the need for proper LLMOps practices, including evaluation pipelines, guard rails, and careful attention to potential vulnerabilities like prompt injection and jailbreaking.

Lessons from Red Teaming 100+ Generative AI Products

Microsoft

Microsoft's AI Red Team (AIRT) conducted extensive red teaming operations on over 100 generative AI products to assess their safety and security. The team developed a comprehensive threat model ontology and leveraged both manual and automated testing approaches through their PyRIT framework. Through this process, they identified key lessons about AI system vulnerabilities, the importance of human expertise in red teaming, and the challenges of measuring responsible AI impacts. The findings highlight both traditional security risks and novel AI-specific attack vectors that need to be considered when deploying AI systems in production.

Lessons Learned from Deploying 30+ GenAI Agents in Production

Quic

Quic shares their experience deploying over 30 AI agents across various industries, focusing on customer experience and e-commerce applications. They developed a comprehensive approach to LLMOps that includes careful planning, persona development, RAG implementation, API integration, and robust testing and monitoring systems. The solution achieved 60% resolution of tier-one support issues with higher quality than human agents, while maintaining human involvement for complex cases.

Lessons Learned from Production AI Agent Deployments

Google / Vertex AI

A comprehensive overview of lessons learned from deploying AI agents in production at Google's Vertex AI division. The presentation covers three key areas: meta-prompting techniques for optimizing agent prompts, implementing multi-layered safety and guard rails, and the critical importance of evaluation frameworks. These insights come from real-world experience delivering hundreds of models into production with various developers, customers, and partners.

Leveraging LangSmith for Debugging Tools & Actions in Production LLM Applications

Mendable

Mendable.ai enhanced their enterprise AI assistant platform with Tools & Actions capabilities, enabling automated tasks and API interactions. They faced challenges with debugging and observability of agent behaviors in production. By implementing LangSmith, they successfully debugged agent decision processes, optimized prompts, improved tool schema generation, and built evaluation datasets, resulting in a more reliable and efficient system that has already achieved $1.3 million in savings for a major tech company client.

Leveraging RAG and LLMs for ESG Data Intelligence Platform

ESGPedia

ESGpedia faced challenges in managing complex ESG data across multiple platforms and pipelines. They implemented Databricks' Data Intelligence Platform to create a unified lakehouse architecture and leveraged Mosaic AI with RAG techniques to process sustainability data more effectively. The solution resulted in 4x cost savings in data pipeline management, improved time to insights, and enhanced ability to provide context-aware ESG insights to clients across APAC.

Linguistic-Informed Approach to Production LLM Systems

Mastercard

A lead data scientist at Mastercard presents a comprehensive approach to implementing LLMs in production by focusing on linguistic features rather than just metrics. The case study demonstrates how understanding and implementing linguistic principles (syntax, morphology, semantics, pragmatics, and phonetics) can significantly improve LLM performance. A practical example showed how using pragmatic instruction with Falcon 7B and the guidance framework improved biology question answering accuracy from 35% to 85% while drastically reducing inference time compared to vanilla ChatGPT.

LLM Applications in Drug Discovery and Call Center Analytics

QuantumBlack

QuantumBlack presented two distinct LLM applications: molecular discovery for pharmaceutical research and call center analytics for banking. The molecular discovery system used chemical language models and RAG to analyze scientific literature and predict molecular properties. The call center analytics solution processed audio files through a pipeline of diarization, transcription, and LLM analysis to extract insights from customer calls, achieving 60x performance improvement through domain-specific optimizations and efficient resource utilization.

LLM Applications in Education: Personalized Learning and Assessment Systems

Various

Multiple education technology organizations showcase their use of LLMs and LangChain to enhance learning experiences. Podzy develops a spaced repetition system with LLM-powered question generation and tutoring capabilities. The Learning Agency Lab creates datasets and competitions to develop LLM solutions for educational problems like automated writing evaluation. Vanderbilt's LEER Lab builds intelligent textbooks using LLMs for content summarization and question generation. All cases demonstrate the integration of LLMs with existing educational tools while addressing challenges of accuracy, personalization, and fairness.

LLM Evaluation Framework for Financial Crime Report Generation

Sumup

SumUp developed an LLM application to automate the generation of financial crime reports, along with a novel evaluation framework using LLMs as evaluators. The solution addresses the challenges of evaluating unstructured text output by implementing custom benchmark checks and scoring systems. The evaluation framework outperformed traditional NLP metrics and showed strong correlation with human reviewer assessments, while acknowledging and addressing potential LLM evaluator biases.

LLM Feature Extraction for Content Categorization and Search Query Understanding

Canva

Canva implemented LLMs as a feature extraction method for two key use cases: search query categorization and content page categorization. By replacing traditional ML classifiers with LLM-based approaches, they achieved higher accuracy, reduced development time from weeks to days, and lowered operational costs from $100/month to under $5/month for query categorization. For content categorization, LLM embeddings outperformed traditional methods in terms of balance, completion, and coherence metrics while simplifying the feature extraction process.

LLM Integration for Customer Support Automation and Enhancement

Airbnb

Airbnb implemented AI text generation models across three key customer support areas: content recommendation, real-time agent assistance, and chatbot paraphrasing. They leveraged large language models with prompt engineering to encode domain knowledge from historical support data, resulting in significant improvements in content relevance, agent efficiency, and user engagement. The implementation included innovative approaches to data preparation, model training with DeepSpeed, and careful prompt design to overcome common challenges like generic responses.

LLM Integration in EdTech: Lessons from Duolingo, Brainly, and SoloLearn

Various

Leaders from three major EdTech companies share their experiences implementing LLMs in production for language learning, coding education, and homework help. They discuss challenges around cost-effective scaling, fact generation accuracy, and content personalization, while highlighting successful approaches like retrieval-augmented generation, pre-generation of options, and using LLMs to create simpler production rules. The companies focus on using AI not just for content generation but for improving the actual teaching and learning experience.

LLM Observability for Enhanced Audience Segmentation Systems

Acxiom

Acxiom developed an AI-driven audience segmentation system using LLMs but faced challenges in scaling and debugging their solution. By implementing LangSmith, they achieved robust observability for their LangChain-based application, enabling efficient debugging of complex workflows involving multiple LLM calls, improved audience segment creation, and better token usage optimization. The solution successfully handled conversational memory, dynamic updates, and data consistency requirements while scaling to meet growing user demands.

LLM Production Case Studies: Consulting Database Search, Automotive Showroom Assistant, and Banking Development Tools

Globant

A collection of LLM implementation case studies detailing challenges and solutions in various industries. Key cases include: a consulting firm's semantic search implementation for financial data, requiring careful handling of proprietary data and similarity definitions; an automotive company's showroom chatbot facing challenges with data consistency and hallucination control; and a bank's attempt to create a custom code copilot, highlighting the importance of clear requirements and technical understanding in LLM projects.

LLM Security: Discovering and Mitigating Repeated Token Attacks in Production Models

Dropbox

Dropbox's security research team discovered vulnerabilities in OpenAI's GPT-3.5 and GPT-4 models where repeated tokens could trigger model divergence and extract training data. They identified that both single-token and multi-token repetitions could bypass OpenAI's initial security controls, leading to potential data leakage and denial of service risks. The findings were reported to OpenAI, who subsequently implemented improved filtering mechanisms and server-side timeouts to address these vulnerabilities.

LLM Testing Framework Using LLMs as Quality Assurance Agents

Various

Alaska Airlines and Bitra developed QARL (Quality Assurance Response Liaison), an innovative testing framework that uses LLMs to evaluate other LLMs in production. The system conducts automated adversarial testing of customer-facing chatbots by simulating various user personas and conversation scenarios. This approach helps identify potential risks and unwanted behaviors before deployment, while providing scalable testing capabilities through containerized architecture on Google Cloud Platform.

LLM Validation and Testing at Scale: GitLab's Comprehensive Model Evaluation Framework

Gitlab

GitLab developed a robust framework for validating and testing LLMs at scale for their GitLab Duo AI features. They created a Centralized Evaluation Framework (CEF) that uses thousands of prompts across multiple use cases to assess model performance. The process involves creating a comprehensive prompt library, establishing baseline model performance, iterative feature development, and continuous validation using metrics like Cosine Similarity Score and LLM Judge, ensuring consistent improvement while maintaining quality across all use cases.

LLM-as-a-Judge Framework for Automated LLM Evaluation at Scale

Booking.com

Booking.com developed a comprehensive framework to evaluate LLM-powered applications at scale using an LLM-as-a-judge approach. The solution addresses the challenge of evaluating generative AI applications where traditional metrics are insufficient and human evaluation is impractical. The framework uses a more powerful LLM to evaluate target LLM outputs based on carefully annotated "golden datasets," enabling continuous monitoring of production GenAI applications. The approach has been successfully deployed across multiple use cases at Booking.com, providing automated evaluation capabilities that significantly reduce the need for human oversight while maintaining evaluation quality.

LLM-as-Judge Framework for Production LLM Evaluation and Improvement

Segment

Twilio Segment developed a novel LLM-as-Judge evaluation framework to assess and improve their CustomerAI audiences feature, which uses LLMs to generate complex audience queries from natural language. The system achieved over 90% alignment with human evaluation for ASTs, enabled 3x improvement in audience creation time, and maintained 95% feature retention. The framework includes components for generating synthetic evaluation data, comparing outputs against ground truth, and providing structured scoring mechanisms.

LLM-Assisted Personalization Framework for Multi-Vertical Retail Discovery

DoorDash

DoorDash developed an LLM-assisted personalization framework to help customers discover products across their expanding catalog of hundreds of thousands of SKUs spanning multiple verticals including grocery, convenience, alcohol, retail, flowers, and gifting. The solution combines traditional machine learning approaches like two-tower embedding models and multi-task learning rankers with LLM capabilities for semantic understanding, collection generation, query rewriting, and knowledge graph augmentation. The framework balances three core consumer value dimensions—familiarity (showing relevant favorites), affordability (optimizing for price sensitivity and deals), and novelty (introducing new complementary products)—across the entire personalization stack from retrieval to ranking to presentation. While specific quantitative results are not provided, the case study presents this as a production system deployed across multiple discovery surfaces including category pages, checkout aisles, personalized carousels, and search.

LLM-Based Agents for User Story Quality Enhancement in Agile Development

Austrian Post Group

Austrian Post Group IT explored the use of LLM-based agents to automatically improve user story quality in their agile development teams. They developed and implemented an Autonomous LLM-based Agent System (ALAS) with specialized agent profiles for Product Owner and Requirements Engineer roles. Using GPT-3.5-turbo-16k and GPT-4 models, the system demonstrated significant improvements in user story clarity and comprehensibility, though with some challenges around story length and context alignment. The effectiveness was validated through evaluations by 11 professionals across six agile teams.

LLM-Based Dasher Support Automation with RAG and Quality Controls

Doordash

DoorDash implemented an LLM-based chatbot system to improve their Dasher support automation, replacing a traditional flow-based system. The solution uses RAG (Retrieval Augmented Generation) to leverage their knowledge base, along with sophisticated quality control systems including LLM Guardrail for real-time response validation and LLM Judge for quality monitoring. The system successfully handles thousands of support requests daily while achieving a 90% reduction in hallucinations and 99% reduction in compliance issues.

LLM-Driven Developer Experience and Code Migrations at Scale

Uber

Uber's Developer Platform team explored three major initiatives using LLMs in production: a custom IDE coding assistant (which was later abandoned in favor of GitHub Copilot), an AI-powered test generation system called Auto Cover, and an automated Java-to-Kotlin code migration system. The team combined deterministic approaches with LLMs to achieve significant developer productivity gains while maintaining code quality and safety. They found that while pure LLM approaches could be risky, hybrid approaches combining traditional software engineering practices with AI showed promising results.

LLM-Enhanced Search and Discovery for Grocery E-commerce

Instacart

Instacart's search and machine learning team implemented LLMs to transform their search and discovery capabilities in grocery e-commerce, addressing challenges with tail queries and product discovery. They used LLMs to enhance query understanding models, including query-to-category classification and query rewrites, by combining LLM world knowledge with Instacart-specific domain knowledge and user behavior data. The hybrid approach involved batch pre-computing results for head/torso queries while using real-time inference for tail queries, resulting in significant improvements: 18 percentage point increase in precision and 70 percentage point increase in recall for tail queries, along with substantial reductions in zero-result queries and enhanced user engagement with discovery-oriented content.

LLM-Enhanced Topic Modeling System for Qualitative Text Analysis

QualIT

QualIT developed a novel topic modeling system that combines large language models with traditional clustering techniques to analyze qualitative text data more effectively. The system uses LLMs to extract key phrases and employs a two-stage hierarchical clustering approach, demonstrating significant improvements over baseline methods with 70% topic coherence (vs 65% and 57% for benchmarks) and 95.5% topic diversity (vs 85% and 72%). The system includes safeguards against LLM hallucinations and has been validated through human evaluation.

LLM-Enhanced Trust and Safety Platform for E-commerce Content Moderation

Whatnot

Whatnot, a live shopping marketplace, implemented LLMs to enhance their trust and safety operations by moving beyond traditional rule-based systems. They developed a sophisticated system combining LLMs with their existing rule engine to detect scams, moderate content, and enforce platform policies. The system achieved over 95% detection rate of scam attempts with 96% precision by analyzing conversational context and user behavior patterns, while maintaining a human-in-the-loop approach for final decisions.

LLM-Generated Entity Profiles for Personalized Food Delivery Platform

DoorDash

DoorDash evolved from traditional numerical embeddings to LLM-generated natural language profiles for representing consumers, merchants, and food items to improve personalization and explainability. The company built an automated system that generates detailed, human-readable profiles by feeding structured data (order history, reviews, menu metadata) through carefully engineered prompts to LLMs, enabling transparent recommendations, editable user preferences, and richer input for downstream ML models. While the approach offers scalability and interpretability advantages over traditional embeddings, the implementation requires careful evaluation frameworks, robust serving infrastructure, and continuous iteration cycles to maintain profile quality in production.

LLM-Powered 3D Model Generation for 3D Printing

Build Great AI

Build Great AI developed a prototype application that leverages multiple LLM models to generate 3D printable models from text descriptions. The system uses various models including LLaMA 3.1, GPT-4, and Claude 3.5 to generate OpenSCAD code, which is then converted to STL files for 3D printing. The solution demonstrates rapid prototyping capabilities, reducing design time from hours to minutes, while handling the challenges of LLMs' spatial reasoning limitations through multiple simultaneous generations and iterative refinement.

LLM-Powered Analysis of User Bug Reports for Product Quality Improvement

Meta

Meta's Facebook product team faced challenges in analyzing large volumes of unstructured user bug reports at scale using traditional methods. They developed an LLM-based system that classifies user feedback into predefined categories, monitors trends through automated dashboards, and performs root cause analysis to identify product issues. Through iterative prompt engineering and integration with data pipelines, the system successfully detected major outages in real-time, identified less visible bugs that might have been missed, and contributed to reducing overall bug reports by double digits over several months by enabling targeted product improvements and cross-functional collaboration.

LLM-Powered Crisis Counselor Training and Conversation Simulation

Crisis Text Line

Crisis Text Line transformed their mental health support services by implementing LLM-based solutions on the Databricks platform. They developed a conversation simulator using fine-tuned Llama 2 models to train crisis counselors, and created a conversation phase classifier to maintain quality standards. The implementation helped centralize their data infrastructure, enhance volunteer training, and scale their crisis intervention services more effectively, supporting over 1.3 million conversations in the past year.

LLM-Powered Customer Service Agent Copilot for E-commerce Support

Wayfair

Wayfair developed Wilma, an LLM-based copilot system to assist customer service agents in responding to customer inquiries about product issues. The system uses models like Gemini and GPT to draft contextual messages that agents can review and edit before sending. Through an iterative evolution from a single monolithic prompt to over 40 specialized prompt templates and multiple coordinated LLM calls, Wilma helps agents respond 12% faster while improving policy adherence by 2-5% depending on issue type. The system pulls real-time customer, order, and product data from Wayfair's systems to generate appropriate responses, with particular sophistication in handling complex resolution negotiation scenarios through a multi-LLM routing and analysis framework.

LLM-Powered Customer Support Agent Handling 50% of Inbound Requests

Otter

Otter, a delivery-native restaurant hardware and software provider, built an in-house LLM-powered support agent called Otter Assistant to handle the high volume of customer support requests generated by their broad feature set and integrations. The company chose to build rather than buy after determining that existing vendors in Q1 2024 relied on hard-coded decision trees and lacked the deep integration flexibility required. Through an agentic architecture using function calling, runbooks, API integrations, confirmation widgets, and RAG-based research capabilities, Otter Assistant now autonomously resolves approximately 50% of inbound customer support requests while maintaining customer satisfaction and seamless escalation to human agents when needed.

LLM-Powered Data Classification System for Enterprise-Scale Metadata Generation

Grab

Grab developed an automated data classification system using LLMs to replace manual tagging of sensitive data across their PetaByte-scale data infrastructure. They built an orchestration service called Gemini that integrates GPT-3.5 to classify database columns and generate metadata tags, significantly reducing manual effort in data governance. The system successfully processed over 20,000 data entities within a month of deployment, with 80% user satisfaction and minimal need for tag corrections.

LLM-Powered Data Discovery and Documentation Platform

Grab

Grab faced challenges with data discovery across their 200,000+ tables in their data lake. They developed HubbleIQ, an LLM-powered chatbot integrated with their data discovery platform, to improve search capabilities and automate documentation generation. The solution included enhancing Elasticsearch, implementing GPT-4 for automated documentation generation, and creating a Slack-integrated chatbot. This resulted in documentation coverage increasing from 20% to 90% for frequently queried tables, with 73% of users reporting improved data discovery experience.

LLM-Powered Data Labeling Quality Assurance System

Uber

Uber AI Solutions developed a production LLM-based quality assurance system called Requirement Adherence to improve data labeling accuracy for their enterprise clients. The system addresses the costly and time-consuming problem of post-labeling rework by identifying quality issues during the labeling process itself. It works in two phases: first extracting atomic rules from client Standard Operating Procedure (SOP) documents using LLMs with reflection capabilities, then performing real-time validation during the labeling process by routing different rule types to appropriately-sized models with optimization techniques like prefix caching. This approach resulted in an 80% reduction in required audits, significantly improving timelines and reducing costs while maintaining data privacy through stateless, privacy-preserving LLM calls.

LLM-Powered GraphQL Mock Data Generation for Developer Productivity

Airbnb

Airbnb developed an innovative solution to address the persistent challenge of creating and maintaining realistic GraphQL mock data for testing and prototyping. Engineers traditionally spent significant time manually writing and updating mock data, which would often drift out of sync with evolving queries. Airbnb introduced the @generateMock directive, which combines GraphQL schema validation, product context (including design mockups), and LLMs (specifically Gemini 2.5 Pro) to automatically generate type-safe, realistic mock data. The solution integrates seamlessly into their existing code generation workflow (Niobe CLI), keeping engineers in their local development loops. A companion @respondWithMock directive enables client engineers to prototype features before server implementations are complete. Since deployment, Airbnb engineers have generated and merged over 700 mocks across iOS, Android, and Web platforms, significantly reducing manual effort and accelerating development cycles.

LLM-Powered In-Tool Quality Validation for Data Labeling

Uber

Uber AI Solutions developed a Requirement Adherence system to address quality issues in data labeling workflows, which traditionally relied on post-labeling checks that resulted in costly rework and delays. The solution uses LLMs in a two-phase approach: first extracting atomic rules from Standard Operating Procedure (SOP) documents and categorizing them by complexity, then performing real-time validation during the labeling process within their uLabel tool. By routing different rule types to appropriate LLM models (non-reasoning models for deterministic checks, reasoning models for subjective checks) and leveraging techniques like prefix caching and parallel execution, the system achieved an 80% reduction in required audits while maintaining data privacy through stateless, privacy-preserving LLM calls.

LLM-Powered Information Extraction from Pediatric Cardiac MRI Reports

UK National Health Service (NHS)

Great Ormond Street Hospital NHS Trust developed a solution to extract information from 15,000 unstructured cardiac MRI reports spanning 10 years. They implemented a hybrid approach using small LLMs for entity extraction and few-shot learning for table structure classification. The system successfully extracted patient identifiers and clinical measurements from heterogeneous reports, enabling linkage with structured data and improving clinical research capabilities. The solution demonstrated significant improvements in extraction accuracy when using contextual prompting with models like FLAN-T5 and RoBERTa, while operating within NHS security constraints.

LLM-Powered Investment Document Analysis and Processing

AngelList

AngelList transformed their investment document processing from manual classification to an automated system using LLMs. They initially used AWS Comprehend for news article classification but transitioned to OpenAI's models, which proved more accurate and cost-effective. They built Relay, a product that automatically extracts and organizes investment terms and company updates from documents, achieving 99% accuracy in term extraction while significantly reducing operational costs compared to manual processing.

LLM-Powered Migration of UI Component Libraries at Scale

Zalando

Zalando's Partner Tech team faced significant challenges maintaining two distinct in-house UI component libraries across 15 B2B applications, leading to inconsistent user experiences, duplicated efforts, and increased maintenance complexity. To address this technical debt, they explored using Large Language Models (LLMs) to automate the migration from one library to another. Through an iterative experimentation process involving five iterations of prompt engineering, they developed a Python-based migration tool using GPT-4o that achieved over 90% accuracy in component transformations. The solution proved highly cost-effective at under $40 per repository and significantly reduced manual migration effort, though it still required human oversight for visual verification and handling of complex edge cases.

LLM-Powered Multi-Tool Architecture for Oil & Gas Data Exploration

DXC

DXC developed an AI assistant to accelerate oil and gas data exploration by integrating multiple specialized LLM-powered tools. The solution uses a router to direct queries to specialized tools optimized for different data types including text, tables, and industry-specific formats like LAS files. Built using Anthropic's Claude on Amazon Bedrock, the system includes conversational capabilities and semantic search to help users efficiently analyze complex datasets, reducing exploration time from hours to minutes.

LLM-Powered Mutation Testing for Automated Compliance at Scale

Meta

Meta developed the Automated Compliance Hardening (ACH) tool to address the challenge of scaling compliance adherence across its products while maintaining developer velocity. Traditional compliance processes relied on manual, error-prone approaches that couldn't keep pace with rapid technology development. By leveraging LLMs for mutation-guided test generation, ACH generates realistic, problem-specific mutants (deliberately introduced faults) and automatically creates tests to catch them through plain-text prompts. During a trial from October to December 2024 across Facebook, Instagram, WhatsApp, and Meta's wearables platforms, privacy engineers accepted 73% of generated tests, with 36% judged as privacy-relevant. The system overcomes traditional barriers to mutation testing deployment including scalability issues, unrealistic mutants, equivalent mutants, computational costs, and testing overstretch.

LLM-Powered Mutation Testing for Automated Test Generation

Meta

Meta developed ACH (Automated Compliance Hardening), an LLM-powered system that revolutionizes software testing by combining mutation-guided test generation with large language models. Traditional mutation testing required manual test writing and generated unrealistic faults, creating a labor-intensive process with no guarantees of catching relevant bugs. ACH addresses this by allowing engineers to describe bug concerns in plain text, then automatically generating both realistic code mutations (faults) and the tests needed to catch them. The system has been deployed across Meta's platforms including Facebook Feed, Instagram, Messenger, and WhatsApp, particularly for privacy compliance testing, marking the first large-scale industrial deployment combining LLM-based mutant and test generation with verifiable assurances that generated tests will catch the specified fault types.

LLM-Powered Personalized Music Recommendations and AI DJ Commentary

Spotify

Spotify implemented LLMs to enhance their recommendation system by providing contextualized explanations for music recommendations and powering their AI DJ feature. They adapted Meta's Llama models through careful domain adaptation, human-in-the-loop training, and multi-task fine-tuning. The implementation resulted in up to 4x higher user engagement for recommendations with explanations, and a 14% improvement in Spotify-specific tasks compared to baseline Llama performance. The system was deployed at scale using vLLM for efficient serving and inference.

LLM-Powered Product Attribute Extraction from Unstructured Marketplace Data

Etsy

Etsy faced the challenge of understanding and categorizing over 100 million unique, handmade items listed by 5 million sellers, where most product information existed only as unstructured text and images rather than structured attributes. The company deployed large language models to extract product attributes at scale from listing titles, descriptions, and photos, transforming unstructured data into structured attributes that could power search filters and product comparisons. The implementation increased complete attribute coverage from 31% to 91% in target categories, improved engagement with search filters, and increased overall post-click conversion rates, while establishing robust evaluation frameworks using both human-annotated ground truth and LLM-generated silver labels.

LLM-Powered Product Catalogue Quality Control at Scale

Amazon

Amazon's product catalogue contains hundreds of millions of products with millions of listings added or edited daily, requiring accurate and appealing product data to help shoppers find what they need. Traditional specialized machine learning models worked well for products with structured attributes but struggled with nuanced or complex product descriptions. Amazon deployed large language models (LLMs) adapted through prompt tuning and catalogue knowledge integration to perform quality control tasks including recognizing standard attribute values, collecting synonyms, and detecting erroneous data. This LLM-based approach enables quality control across more product categories and languages, includes latest seller values within days rather than weeks, and saves thousands of hours in human review while extending reach into previously cost-prohibitive areas of the catalogue.

LLM-Powered Real Estate Search and Agent Matching

Zillow

Zillow's StreetEasy platform developed two LLM-powered features in 2024 to enhance the real estate experience for New York City users. The first feature, "Instant Answers," uses pre-generated AI responses to address frequently asked property questions, reducing user frustration and improving efficiency on listing pages where shoppers spend less than 61 seconds. The second feature, "Easy as PIE," creates personalized introductions between home buyers and agents by generating AI-powered bio summaries and highlighting relevant agent attributes based on deal history and user preferences. Both features were designed with cost-effectiveness, scalability, and ethical considerations in mind, leveraging techniques like BERTopic for topic modeling, chain-of-thought prompting to prevent hallucinations, and Fair Housing guardrails to ensure compliance. The implementation demonstrated the importance of data quality, human oversight, cross-functional collaboration, and iterative development in deploying production LLM systems.

LLM-Powered Requirements Generation and Virtual Testing for Automotive Software Development

Capgemini

Capgemini developed an accelerator called "amplifier" that transforms automotive software development by using LLMs deployed on AWS Bedrock to convert whiteboard sketches into structured requirements and test cases. The solution addresses the traditionally lengthy automotive development cycle by enabling rapid requirement generation, virtual testing, and scalable simulation environments. This approach reduces development time from weeks to hours while maintaining necessary safety and regulatory compliance, effectively bringing cloud-native development speeds to automotive software development.

LLM-Powered Search Evaluation System for Automated Result Quality Assessment

DoorDash

DoorDash developed AutoEval, a human-in-the-loop LLM-powered system for evaluating search result quality at scale. The system replaced traditional manual human annotations which were slow, inconsistent, and didn't scale. AutoEval combines LLMs, prompt engineering, and expert oversight to deliver automated relevance judgments, achieving a 98% reduction in evaluation turnaround time while matching or exceeding human rater accuracy. The system uses a custom Whole-Page Relevance (WPR) metric to evaluate entire search result pages holistically.

LLM-Powered Security Incident Response and Automation

Agoda

Agoda, a global travel platform processing sensitive data at scale, faced operational bottlenecks in security incident response due to high alert volumes, manual phishing email reviews, and time-consuming incident documentation. The security team implemented three LLM-powered workflows: automated triage for Level 1-2 security alerts using RAG to retrieve historical context, autonomous phishing email classification responding in under 25 seconds, and multi-source incident report generation reducing drafting time from 5-7 hours to 10 minutes. The solutions achieved 97%+ alignment with human analysts for alert triage, 99% precision in phishing classification with no false negatives, and 95% factual accuracy in report generation, while significantly reducing analyst workload and response times.

LLM-Powered Style Compatibility Labeling Pipeline for E-Commerce Catalog Curation

Wayfair

Wayfair addressed the challenge of identifying stylistic compatibility among millions of products in their catalog by building an LLM-powered labeling pipeline on Google Cloud. Traditional recommendation systems relied on popularity signals and manual annotation, which was accurate but slow and costly. By leveraging Gemini 2.5 Pro with carefully engineered prompts that incorporate interior design principles and few-shot examples, they automated the binary classification task of determining whether product pairs are stylistically compatible. This approach improved annotation accuracy by 11% compared to initial generic prompts and enables scalable, consistent style-aware curation that will be used to evaluate and ultimately improve recommendation algorithms, with plans for future integration into production search and personalization systems.

LLM-Powered User Feedback Analysis for Bug Report Classification and Product Improvement

Meta

Meta (Facebook) developed an LLM-based system to analyze unstructured user bug reports at scale, addressing the challenge of processing free-text feedback that was previously resource-intensive and difficult to analyze with traditional methods. The solution uses prompt engineering to classify bug reports into predefined categories, enabling automated monitoring through dashboards, trend detection, and root cause analysis. This approach successfully identified critical issues during outages, caught less visible bugs that might have been missed, and resulted in double-digit reductions in topline bug reports over several months by enabling cross-functional teams to implement targeted fixes and product improvements.

LLM-Powered Voice Assistant for Restaurant Operations and Personalized Alcohol Recommendations

Doordash

DoorDash implemented two major LLM-powered features during their 2025 summer intern program: a voice AI assistant for verifying restaurant hours and personalized alcohol recommendations with carousel generation. The voice assistant replaced rigid touch-tone phone systems with natural language conversations, allowing merchants to specify detailed hours information in advance while maintaining backward compatibility with legacy infrastructure through factory patterns and feature flags. The alcohol recommendation system leveraged LLMs to generate personalized product suggestions and engaging carousel titles using chain-of-thought prompting and a two-stage generation pipeline. Both systems were integrated into production using DoorDash's existing frameworks, with the voice assistant achieving structured data extraction through prompt engineering and webhook processing, while the recommendations carousel utilized the company's Carousel Serving Framework and Discovery SDK for rapid deployment.

LLMOps Best Practices and Success Patterns Across Multiple Companies

HumanLoop

A comprehensive analysis of successful LLM implementations across multiple companies including Duolingo, GitHub, Fathom, and others, highlighting key patterns in team composition, evaluation strategies, and tooling requirements. The study emphasizes the importance of domain experts in LLMOps, proper evaluation frameworks, and the need for comprehensive logging and debugging tools, showcasing concrete examples of companies achieving significant ROI through proper LLMOps implementation.

LLMOps Evolution: Scaling Wandbot from Monolith to Production-Ready Microservices

Weights & Biases

Weights & Biases presents a comprehensive case study of transforming their documentation chatbot Wandbot from a monolithic system into a production-ready microservices architecture. The transformation involved creating four core modules (ingestion, chat, database, and API), implementing sophisticated features like multilingual support and model fallback mechanisms, and establishing robust evaluation frameworks. The new architecture achieved significant metrics including 66.67% response accuracy and 88.636% query relevancy, while enabling easier maintenance, cost optimization through caching, and seamless platform integration. The case study provides valuable insights into practical LLMOps challenges and solutions, from vector store management to conversation history handling, making it a notable example of scaling LLM applications in production.

LLMOps Lessons from W&B's Wandbot: Manual Evaluation & Quality Assurance of Production LLM Systems

Weights & Biases

The case study details Weights & Biases' comprehensive evaluation of their production LLM system Wandbot, achieving a baseline accuracy of 66.67% through manual evaluation. The study offers valuable insights into LLMOps practices, demonstrating the importance of systematic evaluation, clear metrics, and expert annotation in production LLM systems. It highlights key challenges in areas like language handling, retrieval accuracy, and hallucination prevention, while also showcasing practical solutions using tools like Argilla.io for annotation management. The findings emphasize the need for continuous improvement cycles and the critical role of high-quality documentation in LLM system performance, providing a practical template for other organizations deploying LLMs in production.

LLMs for Cloud Incident Management and Root Cause Analysis

Microsoft

Microsoft Research explored using large language models (LLMs) to automate cloud incident management in Microsoft 365 services. The study focused on using GPT-3 and GPT-3.5 models to analyze incident reports and generate recommendations for root cause analysis and mitigation steps. Through rigorous evaluation of over 40,000 incidents across 1000+ services, they found that fine-tuned GPT-3.5 models significantly outperformed other approaches, with over 70% of on-call engineers rating the recommendations as useful (3/5 or better) in production settings.

LLMs for Enhanced Search Retrieval and Query Understanding

Doordash

Doordash implemented an advanced search system using LLMs to better understand and process complex food delivery search queries. They combined LLMs with knowledge graphs for query segmentation and entity linking, using retrieval-augmented generation (RAG) to constrain outputs to their controlled vocabulary. The system improved popular dish carousel trigger rates by 30%, increased whole page relevance by over 2%, and led to higher conversion rates while maintaining high precision in query understanding.

LLMs for Investigative Data Analysis in Journalism

ProPublica

ProPublica utilized LLMs to analyze a large database of National Science Foundation grants that were flagged as "woke" by Senator Ted Cruz's office. The AI helped journalists quickly identify patterns and assess why grants were flagged, while maintaining journalistic integrity through human verification. This approach demonstrated how AI can be used responsibly in journalism to accelerate data analysis while maintaining high standards of accuracy and accountability.

Long-Running Agent Harness for Multi-Context Software Development

Anthropic

Anthropic addressed the challenge of enabling AI coding agents to work effectively across multiple context windows when building complex software projects that span hours or days. The core problem was that agents would lose memory between sessions, leading to incomplete features, duplicated work, or premature project completion. Their solution involved a two-fold agent harness: an initializer agent that sets up structured environments (feature lists, git repositories, progress tracking files) on first run, and a coding agent that makes incremental progress session-by-session while maintaining clean code states. Combined with browser automation testing tools like Puppeteer, this approach enabled Claude to successfully build production-quality web applications through sustained, multi-session work.

Mainframe to Cloud Migration with AI-Powered Code Transformation

Mercedes-Benz

Mercedes-Benz faced the challenge of modernizing their Global Ordering system, a critical mainframe application handling over 5 million lines of code that processes every vehicle order and production request across 150 countries. The company partnered with Capgemini, AWS, and Rocket Software to migrate this system from mainframe to cloud using a hybrid approach: replatforming the majority of the application while using agentic AI (GenRevive tool) to refactor specific components. The most notable success was transforming 1.3 million lines of COBOL code in their pricing service to Java in just a few months, achieving faster performance, reduced mainframe costs, and a successful production deployment with zero incidents at go-live.

MCP Protocol Development and Agent AI Foundation Launch

Anthropic / OpenAI / Goose

This podcast transcript covers the one-year journey of the Model Context Protocol (MCP) from its initial launch by Anthropic through to its donation to the newly formed Agent AI Foundation. The discussion explores how MCP evolved from a local-only protocol to support remote servers, authentication, and long-running tasks, addressing the fundamental challenge of connecting AI agents to external tools and data sources in production environments. The case study highlights extensive production usage of MCP both within Anthropic's internal systems and across major technology companies including OpenAI, Microsoft, and Google, demonstrating widespread adoption with millions of requests at scale. The formation of the Agent AI Foundation with founding members including Anthropic, OpenAI, and Block represents a significant industry collaboration to standardize agentic system protocols and ensure neutral governance of critical AI infrastructure.

MCP Server for Natural Language Business Data Analytics

Ramp

Ramp built an open-source Model Context Protocol (MCP) server that enables natural language interaction with business financial data by creating a SQL interface over their developer API. The solution evolved from direct API querying to an in-memory SQLite database approach to handle scaling challenges, allowing Claude to analyze tens of thousands of spend events through natural language queries. While demonstrating strong potential for business intelligence applications, the implementation reveals both the promise and current limitations of agentic AI systems in production environments.

Medical AI Assistant for Battlefield Care Using LLMs

Johns Hopkins

Johns Hopkins Applied Physics Laboratory (APL) is developing CPG-AI, a conversational AI system using Large Language Models to provide medical guidance to untrained soldiers in battlefield situations. The system interprets clinical practice guidelines and tactical combat casualty care protocols into plain English guidance, leveraging APL's RALF framework for LLM application development. The prototype successfully demonstrates capabilities in condition inference, natural dialogue, and algorithmic care guidance for common battlefield injuries.

Medical Transcript Summarization Using Multiple LLM Models: An Evaluation Study

Oracle

A comparative study evaluating different LLM models (Claude, GPT-4, LLaMA, and Pi 3.1) for medical transcript summarization aimed at reducing administrative burden in healthcare. The study processed over 5,000 medical transcripts, comparing model performance using ROUGE scores and cosine similarity metrics. GPT-4 emerged as the top performer, followed by Pi 3.1, with results showing potential to reduce care coordinator preparation time by over 50%.

Mercury: Agentic AI Platform for LLM-Powered Recommendation Systems

eBay

eBay developed Mercury, an internal agentic framework designed to scale LLM-powered recommendation experiences across its massive marketplace of over two billion active listings. The platform addresses the challenge of transforming vast amounts of unstructured data into personalized product recommendations by integrating Retrieval-Augmented Generation (RAG) with a custom Listing Matching Engine that bridges the gap between LLM-generated text outputs and eBay's dynamic inventory. Mercury enables rapid development through reusable, plug-and-play components following object-oriented design principles, while its near-real-time distributed queue-based execution platform handles cost and latency requirements at industrial scale. The system combines multiple retrieval mechanisms, semantic search using embedding models, anomaly detection, and personalized ranking to deliver contextually relevant shopping experiences to hundreds of millions of users.

Migration of Credit AI RAG Application from Multi-Cloud to AWS Bedrock

Octus

Octus, a leading provider of credit market data and analytics, migrated their flagship generative AI product Credit AI from a multi-cloud architecture (OpenAI on Azure and other services on AWS) to a unified AWS architecture using Amazon Bedrock. The migration addressed challenges in scalability, cost, latency, and operational complexity associated with running a production RAG application across multiple clouds. By leveraging Amazon Bedrock's managed services for embeddings, knowledge bases, and LLM inference, along with supporting AWS services like Lambda, S3, OpenSearch, and Textract, Octus achieved a 78% reduction in infrastructure costs, 87% decrease in cost per question, improved document sync times from hours to minutes, and better development velocity while maintaining SOC2 compliance and serving thousands of concurrent users across financial services clients.

ML-Powered Interactive Voice Response System for Customer Support

Airbnb

Airbnb transformed their traditional button-based Interactive Voice Response (IVR) system into an intelligent, conversational AI-powered solution that allows customers to describe their issues in natural language. The system combines automated speech recognition, intent detection, LLM-based article retrieval and ranking, and paraphrasing models to understand customer queries and either provide relevant self-service resources via SMS/app notifications or route calls to appropriate agents. This resulted in significant improvements including a reduction in word error rate from 33% to 10%, sub-50ms intent detection latency, increased user engagement with help articles, and reduced dependency on human customer support agents.

MLflow's Production-Ready Agent Framework and LLM Tracing

MLflow

MLflow addresses the challenges of moving LLM agents from demo to production by introducing comprehensive tooling for tracing, evaluation, and experiment tracking. The solution includes LLM tracing capabilities to debug black-box agent systems, evaluation tools for retrieval relevance and prompt engineering, and integrations with popular agent frameworks like Autogen and LlamaIndex. This enables organizations to effectively monitor, debug, and improve their LLM-based applications in production environments.

MLOps Evolution and LLM Integration at a Major Bank

Barclays

Discussion of MLOps practices and the evolution towards LLM integration at Barclays, focusing on the transition from traditional ML to GenAI workflows while maintaining production stability. The case study highlights the importance of balancing innovation with regulatory requirements in financial services, emphasizing ROI-driven development and the creation of reusable infrastructure components.

MLOps Platform for Airline Operations with LLM Integration

LATAM Airlines

LATAM Airlines developed Cosmos, a vendor-agnostic MLOps framework that enables both traditional ML and LLM deployments across their business operations. The framework reduced model deployment time from 3-4 months to less than a week, supporting use cases from fuel efficiency optimization to personalized travel recommendations. The platform demonstrates how a traditional airline can transform into a data-driven organization through effective MLOps practices and careful integration of AI technologies.

Model Context Protocol (MCP) Server for Error Monitoring and AI Observability

Sentry

Sentry developed a Model Context Protocol (MCP) server to enable Large Language Models (LLMs) to access real-time error monitoring and application performance data directly within AI-powered development environments. The solution addresses the challenge of LLMs lacking current context about application issues by providing 16 different tool calls that allow AI assistants to retrieve project information, analyze errors, and even trigger their AI agent Seer for root cause analysis, ultimately enabling more informed debugging and issue resolution workflows within modern development environments.

Model Context Protocol (MCP): A Universal Standard for AI Application Extensions

Anthropic

Anthropic developed the Model Context Protocol (MCP) to solve the challenge of extending AI applications with plugins and external functionality in a standardized way. Inspired by the Language Server Protocol (LSP), MCP provides a universal connector that enables AI applications to interact with various tools, resources, and prompts through a client-server architecture. The protocol has gained significant community adoption and contributions from companies like Shopify, Microsoft, and JetBrains, demonstrating its potential as an open standard for AI application integration.

Model Context Protocol (MCP): Building Universal Connectivity for LLMs in Production

Anthropic

Anthropic developed and open-sourced the Model Context Protocol (MCP) to address the challenge of providing external context and tool connectivity to large language models in production environments. The protocol emerged from recognizing that teams were repeatedly reimplementing the same capabilities across different contexts (coding editors, web interfaces, and various services) where Claude needed to interact with external systems. By creating a universal standard protocol and open-sourcing it, Anthropic enabled developers to build integrations once and deploy them everywhere, while fostering an ecosystem that became what they describe as the fastest-growing open source protocol in history. The protocol has matured from requiring local server deployments to supporting remote hosted servers with a central registry, reducing friction for both developers and end users while enabling sophisticated production use cases across enterprise integrations and personal automation.

Modernizing DevOps with Generative AI: Challenges and Best Practices in Production

Various (Bundesliga, Harness, Trice)

A panel of experts from various organizations discusses the current state and challenges of integrating generative AI into DevOps workflows and production environments. The discussion covers how companies are balancing productivity gains with security concerns, the importance of having proper testing and evaluation frameworks, and strategies for successful adoption of AI tools in production DevOps processes while maintaining code quality and security.

Modernizing Software Development Lifecycle with MCP Servers and Agentic AI

Stack Overflow

HP, with over 4,000 developers, faced challenges in breaking down knowledge silos and providing enterprise context to AI coding agents. The company experimented with Stack Overflow's Model Context Protocol (MCP) server integrated with their Stack Internal knowledge base to bridge tribal knowledge barriers and enable agentic workflows. The MCP server proved successful as both a proof-of-concept for the MCP framework and a practical tool for bringing validated, contextual knowledge into developers' IDEs. This experimentation is paving the way for HP to transform their software development lifecycle into an AI-powered, "directive" model where developers guide multiple parallel agents with access to necessary enterprise context, aiming to dramatically increase productivity and reduce toil.

MongoDB Search Playground Chatbot Demo Builder for RAG Applications

MongoDB

MongoDB introduced the Chatbot Demo Builder within their Search Playground to enable developers to rapidly experiment with RAG-based chatbots without requiring an Atlas account, cluster, or collection. The tool addresses the common challenge of prototyping and testing vector search capabilities by allowing users to upload PDFs or paste text, automatically generate embeddings using Voyage AI models, configure chunking strategies, and query the data through a conversational interface. The solution provides immediate hands-on experience with MongoDB's vector search capabilities, enables sharing of demo configurations via snapshot URLs, and helps developers understand RAG architectures before committing to production deployments, though it comes with limitations including data size constraints, non-persistent environments, and lack of image processing support.

Multi-Agent AI Banking Assistant Using Amazon Bedrock

Bunq

Bunq, Europe's second-largest neobank serving 20 million users, faced challenges delivering consistent, round-the-clock multilingual customer support across multiple time zones while maintaining strict banking security and compliance standards. Traditional support models created frustrating bottlenecks and strained internal resources as users expected instant access to banking functions like transaction disputes, account management, and financial advice. The company built Finn, a proprietary multi-agent generative AI assistant using Amazon Bedrock with Anthropic's Claude models, Amazon ECS for orchestration, DynamoDB for session management, and OpenSearch Serverless for RAG capabilities. The solution evolved from a problematic router-based architecture to a flexible orchestrator pattern where primary agents dynamically invoke specialized agents as tools. Results include handling 97% of support interactions with 82% fully automated, reducing average response times to 47 seconds, translating the app into 38 languages, and deploying the system from concept to production in 3 months with a team of 80 people deploying updates three times daily.

Multi-Agent AI Development Assistant for Clinical Trial Data Analysis

AstraZeneca

AstraZeneca developed a "Development Assistant" - an interactive AI agent that enables researchers to query clinical trial data using natural language. The system evolved from a single-agent approach to a multi-agent architecture using Amazon Bedrock, allowing users across different R&D domains to access insights from their 3DP data platform. The solution went from concept to production MVP in six months, addressing the challenge of scaling AI initiatives beyond isolated proof-of-concepts while ensuring proper governance and user adoption through comprehensive change management practices.

Multi-Agent AI Platform for Customer Experience at Scale

Cisco

Cisco developed an agentic AI platform leveraging LangChain to transform their customer experience operations across a 20,000-person organization managing $26 billion in recurring revenue. The solution combines multiple specialized agents with a supervisor architecture to handle complex workflows across customer adoption, renewals, and support processes. By integrating traditional machine learning models for predictions with LLMs for language processing, they achieved 95% accuracy in risk recommendations and reduced operational time by 20% in just three weeks of limited availability deployment, while automating 60% of their 1.6-1.8 million annual support cases.

Multi-Agent AI Platform for Financial Workflow Automation

Moody’s

Moody's developed AI Studio, a multi-agent AI platform that automates complex financial workflows such as credit memo generation for loan underwriting processes. The solution reduced a traditionally 40-hour manual analyst task to approximately 2-3 minutes by deploying specialized AI agents that can perform multiple tasks simultaneously, accessing both proprietary Moody's data and third-party sources. The company has successfully commercialized this as a service for financial services customers while also implementing internal AI adoption across all 40,000 employees to improve efficiency and maintain competitive advantage.

Multi-Agent AI System for Automated Test Case Generation in Payment Systems

Amazon AMET Payments

Amazon AMET Payments team developed SAARAM, a multi-agent AI solution using Amazon Bedrock with Claude Sonnet and Strands Agents SDK to automate test case generation for payment features across five Middle Eastern and North African countries. The manual process previously required one week of QA engineer effort per feature, consuming approximately one full-time employee annually. By implementing a human-centric approach that mirrors how experienced testers analyze requirements through specialized agents, the team reduced test case generation time from one week to hours while improving test coverage by 40% and reducing QA effort from 1.0 FTE to 0.2 FTE for validation activities.

Multi-Agent AI System for Financial Intelligence and Risk Analysis

Moody’s

Moody's Analytics, a century-old financial institution serving over 1,500 customers across 165 countries, transformed their approach to serving high-stakes financial decision-making by evolving from a basic RAG chatbot to a sophisticated multi-agent AI system on AWS. Facing challenges with unstructured financial data (PDFs with complex tables, charts, and regulatory documents), context window limitations, and the need for 100% accuracy in billion-dollar decisions, they architected a serverless multi-agent orchestration system using Amazon Bedrock, specialized task agents, custom workflows supporting up to 400 steps, and intelligent document processing pipelines. The solution processes over 1 million tokens daily in production, achieving 60% faster insights and 30% reduction in task completion times while maintaining the precision required for credit ratings, risk intelligence, and regulatory compliance across credit, climate, economics, and compliance domains.

Multi-Agent AI System for Investment Thesis Validation Using Devil's Advocate

Linqalpha

LinqAlpha, a Boston-based AI platform serving over 170 institutional investors, developed Devil's Advocate, an AI agent that systematically pressure-tests investment theses by identifying blind spots and generating evidence-based counterarguments. The system addresses the challenge of confirmation bias in investment research by automating the manual process of challenging investment ideas, which traditionally required time-consuming cross-referencing of expert calls, broker reports, and filings. Using a multi-agent architecture powered by Claude Sonnet 3.7 and 4.0 on Amazon Bedrock, integrated with Amazon Textract, Amazon OpenSearch Service, Amazon RDS, and Amazon S3, the solution decomposes investment theses into assumptions, retrieves counterevidence from uploaded documents, and generates structured, citation-linked rebuttals. The system enables investors to conduct rigorous due diligence at 5-10 times the speed of traditional reviews while maintaining auditability and compliance requirements critical to institutional finance.

Multi-Agent AI Systems for IT Operations and Incident Management

Kolomolo / DeLaval / Arelion

Kolomolo, an AWS advanced partner, implemented two distinct AI-powered solutions for their customers DeLaval (dairy farm equipment manufacturer) and Arelion (global internet infrastructure provider). For DeLaval, they built Unity Ops, a multi-agent system that automates incident response and root cause analysis across 3,000+ connected dairy farms, processing alerts from monitoring systems and generating enriched incident tickets automatically. For Arelion, they developed a hybrid ML/LLM solution to classify and extract critical information from thousands of maintenance notification emails from over 100 vendors, reducing manual classification workload by 80%. Both solutions achieved over 95% accuracy while maintaining cost efficiency through strategic use of classical ML techniques combined with selective LLM invocation, demonstrating significant operational efficiency improvements and enabling engineering teams to focus on higher-value tasks rather than reactive incident management.

Multi-Agent Architecture for Addiction Recovery Support

OpenRecovery

OpenRecovery developed an AI-powered assistant for addiction recovery support using a sophisticated multi-agent architecture built on LangGraph. The system provides personalized, 24/7 support via text and voice, bridging the gap between expensive inpatient care and generic self-help programs. By leveraging LangGraph Platform for deployment, LangSmith for observability, and implementing human-in-the-loop features, they created a scalable solution that maintains empathy and accuracy in addiction recovery guidance.

Multi-Agent Architecture for Automated Advertising Media Planning

Spotify

Spotify faced a structural problem where multiple advertising buying channels (Direct, Self-Serve, Programmatic) relied on consolidated backend services but implemented fragmented, channel-specific workflow logic, creating duplicated decision-making and technical debt. To address this, they built "Ads AI," a multi-agent system using Google's Agent Development Kit (ADK) and Vertex AI that transforms media planning from a manual 15-30 minute process requiring 20+ form fields into a conversational interface that generates optimized, data-driven media plans in 5-10 seconds using 1-3 natural language messages. The system decomposes media planning into specialized agents (RouterAgent, GoalResolverAgent, AudienceResolverAgent, BudgetAgent, ScheduleAgent, and MediaPlannerAgent) that execute in parallel, leverage historical campaign performance data via function calling tools, and produce recommendations based on cost optimization, delivery rates, and budget matching heuristics.

Multi-Agent Architecture for Automating Commercial Real Estate Development Workflows

Build.inc

Build.inc developed a sophisticated multi-agent system called Dougie to automate complex commercial real estate development workflows, particularly for data center projects. Using LangGraph for orchestration, they implemented a hierarchical system of over 25 specialized agents working in parallel to perform land diligence tasks. The system reduces what traditionally took human consultants four weeks to complete down to 75 minutes, while maintaining high quality and depth of analysis.

Multi-Agent Copilot for Data Protection and Cyber Resilience

Druva

Druva, a data security solutions provider, collaborated with AWS to develop a generative AI-powered multi-agent copilot to simplify complex data protection operations for enterprise customers. The system leverages Amazon Bedrock, multiple LLMs (including Anthropic Claude and Amazon Nova models), and a sophisticated multi-agent architecture consisting of a supervisor agent coordinating specialized data, help, and action agents. The solution addresses challenges in managing comprehensive data security across large-scale deployments by providing natural language interfaces for troubleshooting, policy management, and operational support. Initial evaluation results showed 88-93% accuracy in API selection depending on the model used, with end-to-end testing achieving 3.3 out of 5 scores from expert evaluators during early development phases. The implementation promises to reduce investigation time from hours to minutes and enables 90% of routine data protection tasks through conversational interactions.

Multi-Agent Customer Support Automation Platform for Fintech

Gradient Labs

Gradient Labs, an AI-native startup founded after ChatGPT's release, built a comprehensive customer support automation platform for fintech companies featuring three coordinated AI agents: inbound, outbound, and back office. The company addresses the challenge that traditional customer support automation only handles the "tip of the iceberg" - frontline queries - while missing the complex back-office tasks like fraud disputes and KYC compliance that consume most human agent time. Their solution uses a modular agent architecture with natural language procedures, deterministic skill-based orchestration, multi-layer guardrails for regulatory compliance, and sophisticated state management to handle complex, multi-turn conversations across email, chat, and voice channels. This approach enables end-to-end automation where agents coordinate seamlessly, such as an inbound agent receiving a dispute claim, triggering a back-office agent to process it, and an outbound agent proactively following up with customers for additional information.

Multi-Agent Customer Support System for E-commerce

Minimal

Minimal developed a sophisticated multi-agent customer support system for e-commerce businesses using LangGraph and LangSmith, achieving 80%+ efficiency gains in ticket resolution. Their system combines three specialized agents (Planner, Research, and Tool-Calling) to handle complex support queries, automate responses, and execute order management tasks while maintaining compliance with business protocols. The system successfully automates up to 90% of support tickets, requiring human intervention for only 10% of cases.

Multi-Agent DBT Development Workflow for Data Engineering Consulting

Mammoth Growth

Mammoth Growth, a boutique data consultancy specializing in marketing and customer data, developed a multi-agent AI system to automate DBT development workflows in response to data teams struggling to deliver analytics at the speed of business. The solution employs a team of specialized AI agents (orchestrator, analyst, architect, and analytics engineer) that leverage the DBT Model Context Protocol (MCP) to autonomously write, document, and test production-grade DBT code from detailed specifications. The system enabled the delivery of a complete enterprise-grade data lineage with 15 data models and two gold-layer models in just 3 weeks for a pilot client, compared to an estimated 10 weeks using traditional manual development approaches, while maintaining code quality standards through human-led requirements gathering and mandatory code review before production deployment.

Multi-Agent Financial Research and Question Answering System

Yahoo! Finance

Yahoo! Finance built a production-scale financial question answering system using multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock Agent Core and employs a supervisor-subagent pattern where specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at costs of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.

Multi-Agent Framework for Automated Telecom Change Request Processing

Totogi

Totogi, an AI company serving the telecommunications industry, faced challenges with traditional Business Support Systems (BSS) that required lengthy change request processing—typically taking 7 days and involving costly, specialized engineering talent. To address this, Totogi developed BSS Magic, which combines a comprehensive telco ontology with a multi-agent AI framework powered by Anthropic Claude models on Amazon Bedrock. The solution orchestrates five specialized AI agents (Business Analyst, Technical Architect, Developer, QA, and Tester) through AWS Step Functions and Lambda, automating the entire software development lifecycle from requirements analysis to code generation and testing. In collaboration with the AWS Generative AI Innovation Center, Totogi achieved significant results: reducing change request processing time from 7 days to a few hours, achieving 76% code coverage in automated testing, and delivering production-ready telecom-grade code with minimal human intervention.

Multi-Agent GenAI System for Developer Support and Documentation

Northwestern Mutual

Northwestern Mutual implemented a GenAI-powered developer support system to address challenges with their internal developer support chat system, which suffered from long response times and repetitive basic queries. Using Amazon Bedrock Agents, they developed a multi-agent system that could automatically handle common developer support requests, documentation queries, and user management tasks. The system went from pilot to production in just three months and successfully reduced support engineer workload while maintaining strict compliance with internal security and risk management requirements.

Multi-Agent Investment Research Assistant with RAG and Human-in-the-Loop

J.P. Morgan Chase

J.P. Morgan Chase's Private Bank investment research team developed "Ask David," a multi-agent AI system to automate investment research processes that previously required manual database searches and analysis. The system combines structured data querying, RAG for unstructured documents, and proprietary analytics through specialized agents orchestrated by a supervisor agent. While the team claims significant efficiency gains and real-time decision-making capabilities, they acknowledge accuracy limitations requiring human oversight, especially for high-stakes financial decisions involving billions in assets.

Multi-Agent LLM System for Business Process Automation

Cognizant

Cognizant developed Neuro AI, a multi-agent LLM-based system that enables business users to create and deploy AI-powered decision-making workflows without requiring deep technical expertise. The platform allows agents to communicate with each other to handle complex business processes, from intranet search to process automation, with the ability to deploy either in the cloud or on-premises. The system includes features for opportunity identification, use case scoping, synthetic data generation, and automated workflow creation, all while maintaining explainability and human oversight.

Multi-Agent LLM System for Logistics Planning Optimization

Amazon Logistics

Amazon Logistics developed a multi-agent LLM system to optimize their package delivery planning process. The system addresses the challenge of processing over 10 million data points annually for delivery planning, which previously relied heavily on human planners' tribal knowledge. The solution combines graph-based analysis with LLM agents to identify causal relationships between planning parameters and automate complex decision-making, potentially saving up to $150 million in logistics optimization while maintaining promised delivery dates.

Multi-Agent LLM Systems: Implementation Patterns and Production Case Studies

Nimble Gravity, Hiflylabs

A research study conducted by Nimble Gravity and Hiflylabs examining GenAI adoption patterns across industries, revealing that approximately 28-30% of GenAI projects successfully transition from assessment to production. The study explores various multi-agent LLM architectures and their implementation in production, including orchestrator-based, agent-to-agent, and shared message pool patterns, demonstrating practical applications like automated customer service systems that achieved significant cost savings.

Multi-Agent Orchestration for Automated Sales Proposal Generation

Fujitsu

Fujitsu developed an AI-powered solution to automate sales proposal creation using Azure AI Agent Service and Semantic Kernel to orchestrate multiple specialized AI agents. The system integrates with existing tools and knowledge bases to retrieve and synthesize information from dispersed sources. The implementation resulted in a 67% increase in productivity for sales proposal creation, allowing sales teams to focus more on strategic customer engagement.

Multi-Agent Personalization Engine with Proactive Memory System for Batch Processing

Personize.ai

Personize.ai, a Canadian startup, developed a multi-agent personalization engine called "Cortex" to generate personalized content at scale for emails, websites, and product pages. The company faced challenges with traditional RAG and function calling approaches when processing customer databases autonomously, including inconsistency across agents, context overload, and lack of deep customer understanding. Their solution implements a proactive memory system that infers and synthesizes customer insights into standardized attributes shared across all agents, enabling centralized recall and compressed context. Early testing with 20+ B2B companies showed the system can perform deep research in 5-10 minutes and generate highly personalized, domain-specific content that matches senior-level quality without human-in-the-loop intervention.

Multi-Agent Property Investment Advisor with Continuous Evaluation

PropHero

PropHero, a property wealth management service, needed an AI-powered advisory system to provide personalized property investment insights for Spanish and Australian consumers. Working with AWS Generative AI Innovation Center, they built a multi-agent conversational AI system using Amazon Bedrock that delivers knowledge-grounded property investment advice through natural language conversations. The solution uses strategically selected foundation models for different agents, implements semantic search with Amazon Bedrock Knowledge Bases, and includes an integrated continuous evaluation system that monitors context relevance, response groundedness, and goal accuracy in real-time. The system achieved 90% goal accuracy, reduced customer service workload by 30%, lowered AI costs by 60% through optimal model selection, and enabled over 50% of users (70% of paid users) to actively engage with the AI advisor.

Multi-Agent RAG System for Enterprise Data Discovery

Wix

Wix developed an AI-powered data discovery system called Anna to address the challenges of finding relevant data across their data mesh architecture. The system combines multiple specialized AI agents with Retrieval-Augmented Generation (RAG) to translate natural language queries into structured data queries. Using semantic search with Vespa for vector storage and an innovative approach of matching business questions to business questions, they achieved 83% accuracy in data discovery, significantly improving data accessibility across the organization.

Multi-Agent System Architecture for Autonomous Recruiting Agents

LinkedIn

LinkedIn developed a multi-agent system called Hiring Assistant to help recruiters work more efficiently, launching in October 2024. The system comprises four specialized agents (intake, sourcing, evaluation, and outreach) coordinated by a supervisor agent, with personalization driven by a preference model trained on recruiter behaviors. The presentation focuses on the operational challenges of scaling from specialized multi-agent systems to truly autonomous agents, addressing critical production issues including memory isolation across users, tool discovery and validation, safety considerations for destructive tool calls, and computational efficiency through complexity classification to route simpler tasks to completion models rather than expensive reasoning models.

Multi-Agent System for Customer Success and Sales Orchestration

ServiceNow

ServiceNow, a digital workflow platform provider, faced significant challenges with agent fragmentation across their internal sales and customer success operations, lacking a unified orchestration layer to coordinate complex workflows spanning the entire customer lifecycle. To address this, they built a comprehensive multi-agent system using LangGraph for orchestration and LangSmith for observability, covering stages from lead qualification through post-sales adoption, renewal, and customer advocacy. The system uses specialized agents coordinated by a supervisor agent, with sophisticated evaluation frameworks using custom metrics and LLM-as-a-judge evaluators. Currently in the testing phase with QA engineers, the solution has enabled modular development with human-in-the-loop capabilities, granular tracing for debugging, and automated golden dataset creation for continuous quality assurance.

Multi-Agent System for Misinformation Detection and Correction at Scale

Meta

This case study presents a sophisticated multi-agent LLM system designed to identify, correct, and find the root causes of misinformation on social media platforms at scale. The solution addresses the limitations of pre-LLM era approaches (content-only features, no real-time information, low precision/recall) by deploying specialized agents including an Indexer (for sourcing authentic data), Extractor (adaptive retrieval and reranking), Classifier (discriminative misinformation categorization), Corrector (reasoning and correction generation), and Verifier (final validation). The system achieves high precision and recall by orchestrating these agents through a centralized coordinator, implementing comprehensive logging, evaluation at both individual agent and system levels, and optimization strategies including model distillation, semantic caching, and adaptive retrieval. The approach prioritizes accuracy over cost and latency given the high stakes of misinformation propagation on platforms.

Multi-Agent System for Prediction Market Resolution Using LangChain and LangGraph

Chaos Labs

Chaos Labs developed Edge AI Oracle, a decentralized multi-agent system built on LangChain and LangGraph for resolving queries in prediction markets. The system utilizes multiple LLM models from providers like OpenAI, Anthropic, and Meta to ensure objective and accurate resolutions. Through a sophisticated workflow of specialized agents including research analysts, web scrapers, and bias analysts, the system processes queries and provides transparent, traceable results with configurable consensus requirements.

Multi-Agent Web Research System with Dynamic Task Generation

Exa

Exa evolved from providing a search API to building a production-ready multi-agent web research system that processes hundreds of research queries daily, delivering structured results in 15 seconds to 3 minutes. Using LangGraph for orchestration and LangSmith for observability, their system employs a three-component architecture with a planner that dynamically generates parallel tasks, independent research units with specialized tools, and an observer maintaining full context across all components. The system intelligently balances between search snippets and full content retrieval to optimize token usage while maintaining research quality, ultimately providing structured JSON outputs specifically designed for API consumption.

Multi-Company Panel Discussion on Enterprise AI and Agentic AI Deployment Challenges

Glean / Deloitte / Docusign

This panel discussion at AWS re:Invent brings together practitioners from Glean, Deloitte, and DocuSign to discuss the practical realities of deploying AI and agentic AI systems in enterprise environments. The panelists explore challenges around organizational complexity, data silos, governance, agent creation and sharing, value measurement, and the tension between autonomous capabilities and human oversight. Key themes include the need for cross-functional collaboration, the importance of security integration from day one, the difficulty of measuring AI-driven productivity gains, and the evolution from individual AI experimentation to governed enterprise-wide agent deployment. The discussion emphasizes that successful AI transformation requires reimagining workflows rather than simply bolting AI onto legacy systems, and that business value should drive technical decisions rather than focusing solely on which LLM model to use.

Multi-Company Panel Discussion on Production LLM Frameworks and Scaling Challenges

Various (Thinking Machines, Yutori, Evolutionaryscale, Perplexity, Axiom)

This panel discussion features experts from multiple AI companies discussing the current state and future of agentic frameworks, reinforcement learning applications, and production LLM deployment challenges. The panelists from Thinking Machines, Perplexity, Evolutionary Scale AI, and Axiom share insights on framework proliferation, the role of RL in post-training, domain-specific applications in mathematics and biology, and infrastructure bottlenecks when scaling models to hundreds of GPUs, highlighting the gap between research capabilities and production deployment tools.

Multi-Company Panel on Production LLM Deployment Strategies and Small Language Model Optimization

Meta / AWS / NVIDIA / ConverseNow

This panel discussion features leaders from Meta, AWS, NVIDIA, and ConverseNow discussing real-world challenges and solutions for deploying LLMs in production environments. The conversation covers the trade-offs between small and large language models, with ConverseNow sharing their experience building voice AI systems for restaurants that require high accuracy and low latency. Key themes include the importance of fine-tuning small models for production use cases, the convergence of training and inference systems, optimization techniques like quantization and alternative architectures, and the challenges of building reliable, cost-effective inference stacks for mission-critical applications.

Multi-Company Showcase: AI-Powered Development Tools and Creative Applications

Tempo Labs / Zencoder / Diffusion / Bito / Gamma / Create

This case study presents six startups showcasing production deployments of Claude-powered applications across diverse domains at Anthropic's Code with Claude conference. Tempo Labs built a visual IDE enabling designers and PMs to collaborate on code generation, Zencoder extended AI coding assistance across the full software development lifecycle with custom agents, Gamma created an AI presentation builder leveraging Claude's web search capabilities, Bito developed an AI code review platform analyzing codebases for critical issues, Diffusion deployed Claude for song lyric generation in their music creation platform, and Create built a no-code platform for generating full-stack mobile and web applications. These companies demonstrated how Claude 3.5 and 3.7 Sonnet, along with features like tool use, web search, and prompt caching, enabled them to achieve rapid growth with hundreds of thousands to millions of users within 12 months.

Multi-Industry AI Deployment Strategies with Diverse Hardware and Sovereign AI Considerations

AMD / Somite AI / Upstage / Rambler AI

This panel discussion at AWS re:Invent features three companies deploying AI models in production across different industries: Somite AI using machine learning for computational biology and cellular control, Upstage developing sovereign AI with proprietary LLMs and OCR for document extraction in enterprises, and Rambler AI building vision language models for industrial task verification. All three leverage AMD GPU infrastructure (MI300 series) for training and inference, emphasizing the importance of hardware choice, open ecosystems, seamless deployment, and cost-effective scaling. The discussion highlights how smaller, domain-specific models can achieve enterprise ROI where massive frontier models failed, and explores emerging areas like physical AI, world models, and data collection for robotics.

Multi-Industry LLM Deployment: Building Production AI Systems Across Diverse Verticals

Caylent

Caylent, a development consultancy, shares their extensive experience building production LLM systems across multiple industries including environmental management, sports media, healthcare, and logistics. The presentation outlines their comprehensive approach to LLMOps, emphasizing the importance of proper evaluation frameworks, prompt engineering over fine-tuning, understanding user context, and managing inference economics. Through various client projects ranging from multimodal video search to intelligent document processing, they demonstrate key lessons learned about deploying reliable AI systems at scale, highlighting that generative AI is not a "magical pill" but requires careful engineering around inputs, outputs, evaluation, and user experience.

Multi-Label Red Flag Detection System for Fraud Prevention

Feedzai

Feedzai developed ScamAlert, a generative AI-based system that moves beyond traditional binary scam classification to identify specific red flags in suspected fraud attempts. The system addresses the limitations of binary classifiers that only output risk scores without explanation by using multimodal LLMs to analyze screenshots of suspected scams (emails, text messages, listings) and identify observable warning signs like suspicious links, urgency tactics, or unusual communication channels. The team created a comprehensive benchmarking framework to evaluate multiple commercial multimodal models across four dimensions: red flag detection accuracy (precision/recall/F1), instruction adherence, cost, and latency. Their results showed significant performance variations across models, with GPT-5, Gemini 3 Pro, and Gemini 2.5 Pro leading in accuracy, though with notable tradeoffs in cost and latency, while also revealing instruction-following issues in some models that generated hallucinated red flags not in the predefined taxonomy.

Multi-Layered LLM Evaluation Pipeline for Production Content Generation

Treater

Treater developed a comprehensive evaluation pipeline for production LLM workflows that combines deterministic rule-based checks, LLM-based evaluations, automatic rewriting systems, and human edit analysis to ensure high-quality content generation at scale. The system addresses the challenge of maintaining consistent quality in LLM-generated outputs by implementing a multi-layered defense approach that catches errors early, provides interpretable feedback, and continuously improves through human feedback loops, resulting in under 2% failure rates at the deterministic level and measurable improvements in content acceptance rates over time.

Multi-Lingual Voice Control System for AGV Management Using Edge LLMs

Addverb

Addverb developed an AI-powered voice control system for AGV (Automated Guided Vehicle) maintenance that enables warehouse workers to communicate with robots in their native language. The system uses a combination of edge-deployed Llama 3 and cloud-based ChatGPT to translate natural language commands from 98 different languages into AGV instructions, significantly reducing maintenance downtime and improving operational efficiency.

Multi-LLM Orchestration for Product Matching at Scale

Mercado Libre

Mercado Libre tackled the classic e-commerce product-matching challenge where sellers create listings with inconsistent titles, attributes, and identifiers, making it difficult to identify identical products across the platform. The team developed a sophisticated multi-LLM orchestration system that evolved from a simple 2-node architecture to a complex 7-node pipeline, incorporating adaptive prompts, context-aware decision-making, and collaborative consensus mechanisms. Through systematic iteration and careful orchestration alongside existing ML models and embedding systems, they achieved human-level performance with 95% precision and over 50% recall at a cost-effective rate of less than $0.001 per request, enabling scalable autonomous product matching across millions of items for critical use cases including pricing, personalization, and inventory optimization.

Multi-modal LLM Platform for Catalog Attribute Extraction at Scale

Instacart

Instacart faced significant challenges in extracting structured product attributes (flavor, size, dietary claims, etc.) from millions of SKUs using traditional SQL-based rules and text-only machine learning models. These approaches suffered from low quality, high development overhead, and inability to process image data. To address these limitations, Instacart built PARSE (Product Attribute Recognition System for E-commerce), a self-serve multi-modal LLM platform that enables teams to extract attributes from both text and images with minimal engineering effort. The platform reduced attribute extraction development time from weeks to days, achieved 10% higher recall through multi-modal reasoning compared to text-only approaches, and delivered 95% accuracy on simpler attributes with just one day of effort versus one week with traditional methods.

Multi-Model AI Strategy for Talent Marketplace Optimization

Upwork

Upwork, a global freelance talent marketplace, developed Uma (Upwork's Mindful AI) to streamline the hiring and matching processes between clients and freelancers. The company faced the challenge of serving a large, diverse customer base with AI solutions that needed both broad applicability and precision for specific marketplace use cases like discovery, search, and matching. Their solution involved a dual approach: leveraging pretrained models like GPT-4 for rapid deployment of features such as job post generation and chat assistance, while simultaneously developing custom, use case-specific smaller language models fine-tuned on proprietary platform data, synthetic data, and human-generated content from talented writers. This strategy resulted in significant improvements, including an 80% reduction in job post creation time and more accurate, contextually relevant assistance for both freelancers and clients across the platform.

Multi-Model LLM Orchestration with Rate Limit Management

Bito

Bito, an AI coding assistant startup, faced challenges with API rate limits while scaling their LLM-powered service. They developed a sophisticated load balancing system across multiple LLM providers (OpenAI, Anthropic, Azure) and accounts to handle rate limits and ensure high availability. Their solution includes intelligent model selection based on context size, cost, and performance requirements, while maintaining strict guardrails through prompt engineering.

Multi-Step GTM Agent for Sales Lead Processing and Account Intelligence

Langchain

LangChain built an end-to-end GTM (Go-To-Market) agent to automate outbound sales research and email drafting, addressing the problem of sales reps spending excessive time toggling between multiple systems and manually researching leads. The agent triggers on new Salesforce leads, performs multi-source research, checks contact history, and generates personalized email drafts with reasoning for rep approval via Slack. The solution increased lead-to-qualified-opportunity conversion by 250%, saved each sales rep 40 hours per month (1,320 hours team-wide), increased follow-up rates by 97% for lower-intent leads and 18% for higher-intent leads, and achieved 50% daily and 86% weekly active usage across the GTM team.

Multi-Tenant AI Chatbot Platform for Industrial Conglomerate Operating Companies

Capgemini

Capgemini and AWS developed "Fort Brain," a centralized AI chatbot platform for Fortive, an industrial technology conglomerate with 18,000 employees across 50 countries and multiple independently-operating subsidiary companies (OpCos). The platform addressed the challenge of disparate data sources and siloed chatbot development across operating companies by creating a unified, secure, and dynamically-updating system that could ingest structured data (RDS, Snowflake), unstructured documents (SharePoint), and software engineering repositories (GitLab). Built in 8 weeks as a POC using AWS Bedrock, Fargate, API Gateway, Lambda, and the Model Context Protocol (MCP), the solution enabled non-technical users to query live databases and documents through natural language interfaces, eliminating the need for manual schema remapping when data structures changed and providing real-time access to operational data across all operating companies.

Multi-Tenant MCP Server Authentication with Redis Session Management

BrainGrid

BrainGrid faced the challenge of transforming their Model Context Protocol (MCP) server from a local development tool into a production-ready, multi-tenant service that could be deployed to customers. The core problem was that serverless platforms like Cloud Run and Vercel don't maintain session state, causing users to re-authenticate repeatedly as instances scaled to zero or requests hit different instances. BrainGrid solved this by implementing a Redis-based session store with AES-256-GCM encryption, OAuth integration via WorkOS, and a fast-path/slow-path authentication pattern that caches validated JWT sessions. The solution reduced authentication overhead from 50-100ms per request to near-instantaneous for cached sessions, eliminated re-authentication fatigue, and enabled the MCP server to scale from single-user to multi-tenant deployment while maintaining security and performance.

Multilingual Document Processing Pipeline with Human-in-the-Loop Validation

A2I

A case study on implementing a robust multilingual document processing system that combines Amazon Bedrock's Claude models with human review capabilities through Amazon A2I. The solution addresses the challenge of processing documents in multiple languages by using LLMs for initial extraction and human reviewers for validation, enabling organizations to efficiently process and validate documents across language barriers while maintaining high accuracy.

Multimodal Healthcare Data Integration with Specialized LLMs

John Snow Labs

John Snow Labs developed a comprehensive healthcare data integration system that leverages multiple specialized LLMs to unify and analyze patient data from various sources. The system processes structured, unstructured, and semi-structured medical data (including EHR, PDFs, HL7, FHIR) to create complete patient journeys, enabling natural language querying while maintaining consistency, accuracy, and scalability. The solution addresses key healthcare challenges like terminology mapping, date normalization, and data deduplication, all while operating within secure environments and handling millions of patient records.

Multimodal LLM-as-a-Judge for Large-Scale Product Retrieval Evaluation

Zalando

Zalando, a major e-commerce platform, faced the challenge of evaluating product retrieval systems at scale across multiple languages and diverse customer queries. Traditional human relevance assessments required substantial time and resources, making large-scale continuous evaluation impractical. The company developed a novel framework leveraging Multimodal Large Language Models (MLLMs) that automatically generate context-specific annotation guidelines and conduct relevance assessments by analyzing both text and images. Evaluated on 20,000 examples, the approach achieved accuracy comparable to human annotators while being up to 1,000 times cheaper and significantly faster (20 minutes versus weeks for humans), enabling continuous monitoring of high-frequency search queries in production and faster identification of areas requiring improvement.

Multimodal RAG Architecture Optimization for Production

Microsoft

Microsoft explored optimizing a production Retrieval-Augmented Generation (RAG) system that incorporates both text and image content to answer domain-specific queries. The team conducted extensive experiments on various aspects of the system including prompt engineering, metadata inclusion, chunk structure, image enrichment strategies, and model selection. Key improvements came from using separate image chunks, implementing a classifier for image relevance, and utilizing GPT-4V for enrichment while using GPT-4o for inference. The resulting system achieved better search precision and more relevant LLM-generated responses while maintaining cost efficiency.

Multimodal RAG Solution for Oil and Gas Drilling Data Processing

Infosys

Infosys developed an advanced multimodal Retrieval-Augmented Generation (RAG) solution using Amazon Bedrock to process complex oil and gas drilling documentation containing text, images, charts, and technical diagrams. The solution addresses the challenge of extracting insights from thousands of technical documents including well completion reports, drilling logs, and lithology diagrams that traditional document processing methods struggle to handle effectively. Through iterative development exploring various chunking strategies, embedding models, and search approaches, the team ultimately implemented a hybrid search system with parent-child chunking hierarchy, achieving 92% retrieval accuracy, sub-2-second response times, and delivering significant operational efficiency gains including 40-50% reduction in manual document processing costs and 60% time savings for field engineers and geologists.

National-Scale AI Deployment in UK Public Sector: Contact Center Automation and Citizen Information Retrieval

Capita / UK Department of Science

Two UK government organizations, Capita and the Government Digital Service (GDS), deployed large-scale AI solutions to serve millions of citizens. Capita implemented AWS Connect and Amazon Bedrock with Claude to automate contact center operations handling 100,000+ daily interactions, achieving 35% productivity improvements and targeting 95% automation by 2027. GDS launched GOV.UK Chat, the UK's first national-scale RAG implementation using Amazon Bedrock, providing instant access to 850,000+ pages of government content for 67 million citizens. Both organizations prioritized safety, trust, and human oversight while scaling AI solutions to handle millions of interactions with zero tolerance for errors in this high-stakes public sector environment.

Native Image Generation with Multimodal Context in Gemini 2.5 Flash

Google DeepMind

Google DeepMind released an updated native image generation capability in Gemini 2.5 Flash that represents a significant quality leap over previous versions. The model addresses key production challenges including consistent character rendering across multiple angles, pixel-perfect editing that preserves scene context, and improved text rendering within images. Through interleaved generation, the model can maintain conversation context across multiple editing turns, enabling iterative creative workflows. The team tackled evaluation challenges by combining human preference data with specific technical metrics like text rendering quality, while incorporating real user feedback from social media to create comprehensive benchmarks that drive model improvements.

Natural Language Analytics Assistant Using Amazon Bedrock Agents

Skai

Skai, an omnichannel advertising platform, developed Celeste, an AI agent powered by Amazon Bedrock Agents, to transform how customers access and analyze complex advertising data. The solution addresses the challenge of time-consuming manual report generation (taking days or weeks) by enabling natural language queries that automatically collect data from multiple sources, synthesize insights, and provide actionable recommendations. The implementation reduced report generation time by 50%, case study creation by 75%, and transformed weeks-long processes into minutes while maintaining enterprise-grade security and privacy for sensitive customer data.

Natural Language Analytics with Snowflake Cortex for Self-Service BI

Gitlab

GitLab implemented conversational analytics using Snowflake Cortex to enable non-technical business users to query structured data using natural language, eliminating the traditional dependency on data analysts and reducing analytics backlog. The solution evolved from a basic proof-of-concept with 60% accuracy to a production system achieving 85-95% accuracy for simple queries and 75% for complex queries, utilizing semantic models, prompt engineering, verified query feedback loops, and role-based access controls. The implementation reduced analytics requests by approximately 50% for some teams, decreased time-to-insight from weeks to seconds, and democratized data access while maintaining enterprise-grade security through Snowflake's native governance features.

Natural Language Interface for Healthcare Data Analytics using LLMs

Aachen Uniklinik / Aurea Software

A UK-based NLQ (Natural Language Query) company developed an AI-powered interface for Aachen Uniklinik to make intensive care unit databases more accessible to healthcare professionals. The system uses a hybrid approach combining vector databases, large language models, and traditional SQL to allow non-technical medical staff to query complex patient data using natural language. The solution includes features for handling dirty data, intent detection, and downstream complication analysis, ultimately improving clinical decision-making processes.

Natural Language Interface to Business Intelligence Using RAG

Volvo

Volvo implemented a Retrieval Augmented Generation (RAG) system that allows non-technical users to query business intelligence data through a Slack interface using natural language. The system translates natural language questions into SQL queries for BigQuery, executes them, and returns results - effectively automating what was previously manual work done by data analysts. The system leverages DBT metadata and schema information to provide accurate responses while maintaining control over data access.

Natural Language Query Interface with Production LLM Integration

Honeycomb

Honeycomb implemented a natural language query interface for their observability platform to help users more easily analyze their production data. Rather than creating a chatbot, they focused on a targeted query translation feature using GPT-3.5, achieving a 94% success rate in query generation. The feature led to significant improvements in user activation metrics, with teams using the query assistant being 2-3x more likely to create complex queries and save them to boards.

Natural Language to SQL Query Generation at Scale

Uber

Uber developed QueryGPT to address the time-intensive process of SQL query authoring across its data platform, which handles 1.2 million interactive queries monthly. The system uses large language models, vector databases, and similarity search to generate complex SQL queries from natural language prompts, reducing query authoring time from approximately 10 minutes to 3 minutes. Starting from a hackathon prototype in May 2023, the system evolved through 20+ iterations into a production service featuring workspaces for domain-specific query generation, multiple specialized LLM agents (intent, table, and column pruning), and a comprehensive evaluation framework. The limited release achieved 300 daily active users with 78% reporting significant time savings, representing a major productivity gain particularly for Uber's Operations organization which contributes 36% of all queries.

Natural Language to SQL System with Production Safeguards for Contact Center Analytics

NICE

NICE implemented a system that allows users to query contact center metadata using natural language, which gets translated to SQL queries. The solution achieves 86% accuracy and includes critical production safeguards like tenant isolation, default time frames, data visualization, and context management for follow-up questions. The system also provides detailed explanations of query interpretations and results to users.

Neural Search and Conversational AI for Food Delivery and Restaurant Discovery

Swiggy

Swiggy implemented a neural search system powered by fine-tuned LLMs to enable conversational food and grocery discovery across their platforms. The system handles open-ended queries to provide personalized recommendations from over 50 million catalog items. They are also developing LLM-powered chatbots for customer service, restaurant partner support, and a Dineout conversational bot for restaurant discovery, demonstrating a comprehensive approach to integrating generative AI across their ecosystem.

No-Code Agentic Workflow Platform for Automated Code Changes

Duolingo

Duolingo developed an internal platform enabling employees across all roles to create and deploy AI coding agents without writing custom code, addressing the challenge of scaling AI-assisted development beyond individual use. The solution centers on a JSON-based workflow creator that allows users to define prompts, target repositories, and parameters, backed by a unified CodingAgent library supporting multiple LLM providers (Codex and Claude) and orchestrated through Temporal workflows. The platform has enabled rapid creation of agents for routine tasks like feature flag removal, experiment management, and infrastructure changes, with simple agents deployable in under five minutes and custom multi-step workflows buildable in 1-2 days, allowing engineers to focus on core product logic rather than repetitive coding tasks.

Observability Platform's Journey to Production GenAI Integration

New Relic

New Relic, a major observability platform processing 7 petabytes of data daily, implemented GenAI both internally for developer productivity and externally in their product offerings. They achieved a 15% increase in developer productivity through targeted GenAI implementations, while also developing sophisticated AI monitoring capabilities and natural language interfaces for their customers. Their approach balanced cost, accuracy, and performance through a mix of RAG, multi-model routing, and classical ML techniques.

One-Shot End-to-End Coding Agents for Developer Productivity

Stripe

Stripe developed "Minions," a system of one-shot, end-to-end coding agents designed to enhance developer productivity within their internal engineering workflows. The problem addressed is the time-consuming nature of routine coding tasks and the potential for AI to automate portions of the software development lifecycle. The solution involves deploying LLM-based coding agents that can handle complete coding tasks from start to finish in a single execution. While the provided text is limited in detail, it represents Stripe's investment in leveraging LLMs for internal tooling to improve engineering efficiency, with the blog post being part of a series documenting their approach to building and deploying these AI-powered development assistants.

One-Shot End-to-End Coding Agents for Developer Productivity

Stripe

Stripe developed "Minions," an internal system of one-shot, end-to-end coding agents designed to enhance developer productivity. While the provided source text is extremely limited and appears to be primarily metadata from a blog post header, it indicates that Stripe has deployed LLM-based coding agents that can autonomously handle complete coding tasks from start to finish in a single execution. The system aims to reduce developer toil and accelerate software engineering workflows at scale within Stripe's infrastructure, though specific implementation details, performance metrics, and concrete results are not available in the provided excerpt.

Open Source Code Generation Model Release and Production Deployment Considerations

Meta

Meta released Code Llama, a family of specialized large language models for code generation built on top of Llama 2, aiming to assist developers with coding tasks and lower barriers to entry for new programmers. The solution includes multiple model sizes (7B, 13B, 34B, and 70B parameters) with three variants: a foundational code model, a Python-specialized version, and an instruction-tuned variant, all trained on 500B-1T tokens of code and supporting up to 100,000 token contexts. Benchmark testing showed Code Llama 34B achieved 53.7% on HumanEval and 56.2% on MBPP, matching ChatGPT performance while being released under an open license for both research and commercial use, with extensive safety evaluations and red teaming conducted to address responsible AI concerns.

Open Source vs. Closed Source Agentic Stacks: Panel Discussion on Production Deployment Strategies

Various (Alation, GrottoAI, Nvidia, OLX)

This panel discussion brings together experts from Nvidia, OLX, Alation, and GrottoAI to discuss practical considerations for deploying agentic AI systems in production. The conversation explores when to choose open source versus closed source tooling, the challenges of standardizing agent frameworks across enterprise organizations, and the tradeoffs between abstraction levels in agent orchestration platforms. Key themes include starting with closed source models for rapid prototyping before transitioning to open source for compliance and cost reasons, the importance of observability across heterogeneous agent frameworks, the difficulty of enabling non-technical users to build agents, and the critical difference between internal tooling with lower precision requirements versus customer-facing systems demanding 95%+ accuracy.

Optimizing Agent Behavior and Support Operations with LangSmith Testing and Observability

Podium

Podium, a communication platform for small businesses, implemented LangSmith to improve their AI Employee agent's performance and support operations. Through comprehensive testing, dataset curation, and fine-tuning workflows, they achieved a 98.6% F1 score in response quality and reduced engineering intervention needs by 90%. The implementation enabled their Technical Product Specialists to troubleshoot issues independently and improved overall customer satisfaction.

Optimizing Agent Harness for OpenAI Codex Models in Production

Cursor

Cursor, an AI-powered code editor, details their approach to integrating OpenAI's GPT-5.1-Codex-Max model into their production agent harness. The problem involved adapting their existing agent framework to work optimally with Codex's specific training and behavioral patterns, which differed from other frontier models. Their solution included prompt engineering adjustments, tool naming conventions aligned with shell commands, reasoning trace preservation, strategic instructions to bias the model toward autonomous action, and careful message ordering to prevent contradictory instructions. The results demonstrated significant performance improvements, with their experiments showing that dropping reasoning traces caused a 30% performance degradation for Codex, highlighting the critical importance of their implementation decisions.

Optimizing Email Engagement Using LLMs and Rejection Sampling

Nextdoor

Nextdoor developed a novel system to improve email engagement by generating optimized subject lines using a combination of ChatGPT API and a custom reward model. The system uses prompt engineering to generate authentic subject lines without hallucination, and employs rejection sampling with a reward model to select the most engaging options. The solution includes robust engineering components for cost optimization and model performance maintenance, resulting in a 1% lift in sessions and 0.4% increase in Weekly Active Users.

Optimizing Medical Record Processing with Prompt Caching at Scale

Care Access

Care Access, a global health services and clinical research organization, faced significant operational challenges when processing 300-500+ medical records daily for their health screening program. Each medical record required multiple LLM-based analyses through Amazon Bedrock, but the approach of reprocessing substantial portions of medical data for each separate analysis question led to high costs and slower processing times. By implementing Amazon Bedrock's prompt caching feature—caching the static medical record content while varying only the analysis questions—Care Access achieved an 86% reduction in data processing costs (7x decrease) and 66% faster processing times (3x speedup), saving 4-8+ hours of processing time daily. This optimization enabled the organization to scale their health screening program efficiently while maintaining strict HIPAA compliance and privacy standards, allowing them to connect more participants with personalized health resources and clinical trial opportunities.

Optimizing Production LLM Chatbot Performance Through Multi-Model Classification

IDIADA

IDIADA developed AIDA, an intelligent chatbot powered by Amazon Bedrock, to assist their workforce with various tasks. To optimize performance, they implemented specialized classification pipelines using different approaches including LLMs, k-NN, SVM, and ANN with embeddings from Amazon Titan and Cohere models. The optimized system achieved 95% accuracy in request routing and drove a 20% increase in team productivity, handling over 1,000 interactions daily.

Optimizing Production Vision Pipelines for Planet Image Generation

Prem AI

At Prem AI, they tackled the challenge of generating realistic ethereal planet images at scale with specific constraints like aspect ratio and controllable parameters. The solution involved fine-tuning Stable Diffusion XL with a curated high-quality dataset, implementing custom upscaling pipelines, and optimizing performance through various techniques including LoRA fusion, model quantization, and efficient serving frameworks like Ray Serve.

Optimizing RAG Latency Through Model Racing and Self-Hosted Infrastructure

ElevenLabs

ElevenLabs faced significant latency challenges in their production RAG system, where query rewriting accounted for over 80% of RAG latency due to reliance on a single externally-hosted LLM. They redesigned their architecture to implement model racing, where multiple models (including self-hosted Qwen 3-4B and 3-30B-A3B models) process queries in parallel, with the first valid response winning. This approach reduced median RAG latency from 326ms to 155ms (a 50% improvement), while also improving system resilience by providing fallbacks during provider outages and reducing dependency on external services.

Optimizing RAG Systems: Lessons from Production

AWS GenAIIC

AWS GenAIIC shares comprehensive lessons learned from implementing Retrieval-Augmented Generation (RAG) systems across multiple industries. The case study covers key challenges in RAG implementation and provides detailed solutions for improving retrieval accuracy, managing context, and ensuring response reliability. Solutions include hybrid search techniques, metadata filtering, query rewriting, and advanced prompting strategies to reduce hallucinations.

Optimizing RAG-based Search Results for Production: A Journey from POC to Production

Statista

Statista, a global data platform, developed and optimized a RAG-based AI search system to enhance their platform's search capabilities. Working with Urial Labs and Talent Formation, they transformed a basic prototype into a production-ready system that improved search quality by 140%, reduced costs by 65%, and decreased latency by 10%. The resulting Research AI product has seen growing adoption among paying customers and demonstrates superior performance compared to general-purpose LLMs for domain-specific queries.

Optimizing Research Report Generation with LangChain Stack and LLM Observability

Athena Intelligence

Athena Intelligence developed an AI-powered enterprise analytics platform that generates complex research reports by leveraging LangChain, LangGraph, and LangSmith. The platform needed to handle complex data tasks and generate high-quality reports with proper source citations. Using LangChain for model abstraction and tool management, LangGraph for agent orchestration, and LangSmith for development iteration and production monitoring, they successfully built a reliable system that significantly improved their development speed and report quality.

Optimizing Security Incident Response with LLMs at Google

Google

Google implemented LLMs to streamline their security incident response workflow, particularly focusing on incident summarization and executive communications. They used structured prompts and careful input processing to generate high-quality summaries while ensuring data privacy and security. The implementation resulted in a 51% reduction in time spent on incident summaries and 53% reduction in executive communication drafting time, while maintaining or improving quality compared to human-written content.

Optimizing Security Threat Investigation with Multi-Model LLM Strategy

Trellix

Trellix implemented an AI-powered security threat investigation system using multiple foundation models on Amazon Bedrock to automate and enhance their security analysis workflow. By strategically combining Amazon Nova Micro with Anthropic's Claude Sonnet, they achieved 3x faster inference speeds and nearly 100x lower costs while maintaining investigation quality through a multi-pass approach with smaller models. The system uses RAG architecture with Amazon OpenSearch Service to process billions of security events and provide automated risk scoring.

Optimizing Text-to-SQL Pipeline Using Agent Experiments

IDInsight

Ask-a-Metric developed a WhatsApp-based AI data analyst that converts natural language questions to SQL queries. They evolved from a simple sequential pipeline to testing an agent-based approach using CrewAI, ultimately creating a hybrid "pseudo-agent" pipeline that combined the best aspects of both approaches. While the agent-based system achieved high accuracy, its high costs and slow response times led to the development of an optimized pipeline that maintained accuracy while reducing query response time to under 15 seconds and costs to less than $0.02 per query.

Overcoming LLM Production Deployment Challenges

Neeva

A comprehensive analysis of the challenges and solutions in deploying LLMs to production, presented by a machine learning expert from Neeva. The presentation covers both infrastructural challenges (speed, cost, API reliability, evaluation) and output-related challenges (format variability, reproducibility, trust and safety), along with practical solutions and strategies for successful LLM deployment, emphasizing the importance of starting with non-critical workflows and planning for scale.

Panel Discussion on Building Production LLM Applications

Various

A panel discussion featuring experts from Various companies discussing key aspects of building production LLM applications. The discussion covers critical topics including hallucination management, prompt engineering, evaluation frameworks, cost considerations, and model selection. Panelists share practical experiences and insights on deploying LLMs in production, highlighting the importance of continuous feedback loops, evaluation metrics, and the trade-offs between open source and commercial LLMs.

Panel Discussion on LLM Evaluation and Production Deployment Best Practices

Various

Industry experts from Gantry, Structured.ie, and NVIDIA discuss the challenges and approaches to evaluating LLMs in production. They cover the transition from traditional ML evaluation to LLM evaluation, emphasizing the importance of domain-specific benchmarks, continuous monitoring, and balancing automated and human evaluation methods. The discussion highlights how LLMs have lowered barriers to entry while creating new challenges in ensuring accuracy and reliability in production deployments.

Panel Discussion on LLMOps Challenges: Model Selection, Ethics, and Production Deployment

Google, Databricks,

A panel discussion featuring leaders from various AI companies discussing the challenges and solutions in deploying LLMs in production. Key topics included model selection criteria, cost optimization, ethical considerations, and architectural decisions. The discussion highlighted practical experiences from companies like Interact.ai's healthcare deployment, Inflection AI's emotionally intelligent models, and insights from Google and Databricks on responsible AI deployment and tooling.

Panel Discussion: Real-World LLM Production Use Cases

Various

A panel discussion featuring multiple companies and consultants sharing their experiences with LLMs in production. Key highlights include Resides using LLMs to improve property management customer service (achieving 95-99% question answering rates), applications in sales optimization with 30% improvement in sales through argument analysis, and insights on structured outputs and validation for executive coaching use cases.

Panel Discussion: Scaling Generative AI in Enterprise - Challenges and Best Practices

Various

A panel discussion featuring leaders from Google Cloud AI, Symbol AI, Chain ML, and Deloitte discussing the adoption, scaling, and implementation challenges of generative AI across different industries. The panel explores key considerations around model selection, evaluation frameworks, infrastructure requirements, and organizational readiness while highlighting practical approaches to successful GenAI deployment in production.

Parallel Asynchronous AI Coding Agents for Development Workflows

Google

Google Labs introduced Jules, an asynchronous coding agent designed to execute development tasks in parallel in the background while developers focus on higher-value work. The product addresses the challenge of serial development workflows by enabling developers to spin up multiple cloud-based agents simultaneously to handle tasks like SDK updates, testing, accessibility audits, and feature development. Launched two weeks prior to the presentation, Jules had already generated 40,000 public commits. The demonstration showcased how a developer could parallelize work on a conference schedule website by simultaneously running multiple test framework implementations, adding features like calendar integration and AI summaries, while conducting accessibility and security audits—all managed through a VM-based cloud infrastructure powered by Gemini 2.5 Pro.

PerfInsights: AI-Powered Performance Optimization for Go Services

Uber

Uber developed PerfInsights to address the unsustainable compute costs of their Go services, where the top 10 services alone accounted for multi-million dollars in monthly compute spend. The solution combines runtime profiling with GenAI-powered static analysis to automatically detect performance antipatterns in Go code, validate findings through LLM juries and rule-based checking (LLMCheck), and generate optimization recommendations. Results include a 93% reduction in time required to detect and fix performance issues (from 14.5 hours to 1 hour), over 80% reduction in false positives, hundreds of merged optimization diffs, and a 33.5% reduction in detected antipatterns over four months, translating to approximately 3,800 hours of engineering time saved annually.

Personalized Meal Plan Generator with LLM-Powered Recommendations

Cherrypick

Cherrypick, a meal planning service, launched an LLM-powered meal generator to create personalized meal plans with natural language explanations for recipe selections. The company faced challenges around cost management, interface design, and output reliability when moving from a traditional rule-based system to an LLM-based approach. By carefully constraining the problem space, avoiding chatbot interfaces in favor of structured interactions, implementing multi-layered evaluation frameworks, and working with rather than against model randomness, they achieved significant improvements: customers changed their plans 30% less and used plans in their baskets 14% more compared to the previous system.

Pitfalls and Best Practices for Production LLM Applications

Humanloop

A comprehensive overview from Human Loop's experience helping hundreds of companies deploy LLMs in production. The talk covers key challenges and solutions around evaluation, prompt management, optimization strategies, and fine-tuning. Major lessons include the importance of objective evaluation, proper prompt management infrastructure, avoiding premature optimization with agents/chains, and leveraging fine-tuning effectively. The presentation emphasizes taking lessons from traditional software engineering while acknowledging the unique needs of LLM applications.

Pivoting from GPU Infrastructure to Building an AI-Powered Development Environment

Windsurf

Windsurf began as a GPU virtualization company but pivoted in 2022 when they recognized the transformative potential of large language models. They developed an AI-powered development environment that evolved from a VS Code extension to a full-fledged IDE, incorporating advanced code understanding and generation capabilities. The product now serves hundreds of thousands of daily active users, including major enterprises, and has achieved significant success in automating software development tasks while maintaining high precision through sophisticated evaluation systems.

Platform-Centric AI-Assisted Code Generation with Context-Aware Systems

Intuit

Intuit developed a platform-centric approach to AI-assisted code generation to improve developer productivity across its 8,000+ engineering organization serving 100M customers. While off-the-shelf IDE extensions initially showed promise, they lacked awareness of Intuit-specific APIs, architectural conventions, and compliance requirements, leading to declining usage. Intuit's solution involved creating "golden repositories" containing curated, high-quality code examples that embed organizational context into AI code generation systems through context-enriched query pipelines. This approach enabled vendor-agnostic AI integration while ensuring generated code aligns with Intuit's standards. Results included 58% of AI-generated tests used without modification, 56% faster PR merge times, 3× faster backend code generation, and over 10× improvement in frontend generation tasks.

Plus One: Internal LLM Platform for Cross-Company AI Adoption

Prosus

Prosus developed Plus One, an internal LLM platform accessible via Slack, to help companies across their group explore and implement AI capabilities. The platform serves thousands of users, handling over half a million queries across various use cases from software development to business tasks. Through careful monitoring and optimization, they reduced hallucination rates to below 2% and significantly lowered operational costs while enabling both technical and non-technical users to leverage AI capabilities effectively.

Policy Search and Response System Using LLMs in Higher Education

NDUS

The North Dakota University System (NDUS) implemented a generative AI solution to tackle the challenge of searching through thousands of policy documents, state laws, and regulations. Using Databricks' Data Intelligence Platform on Azure, they developed a "Policy Assistant" that leverages LLMs (specifically Llama 2) to provide instant, accurate policy search results with proper references. This transformation reduced their time-to-market from one year to six months and made policy searches 10-20x faster, while maintaining proper governance and security controls.

Post-Training and Production LLM Systems at Scale

OpenAI

This case study explores OpenAI's approach to post-training and deploying large language models in production environments, featuring insights from a post-training researcher working on reasoning models. The discussion covers the operational complexities of reinforcement learning from human feedback at scale, the evolution from non-thinking to thinking models, and production challenges including model routing, context window optimization, token efficiency improvements, and interruptability features. Key developments include the shopping model release, improvements from GPT-4.1 to GPT-5.1, and the operational realities of managing complex RL training runs with multiple grading setups and infrastructure components that require constant monitoring and debugging.

Practical Challenges in Building Production RAG Systems

Prolego

A detailed technical discussion between Prolego engineers about the practical challenges of implementing Retrieval Augmented Generation (RAG) systems in production. The conversation covers key challenges including document processing, chunking strategies, embedding techniques, and evaluation methods. The team shares real-world experiences about how RAG implementations differ from tutorial examples, particularly in handling complex document structures and different data formats.

Practical Implementation of LLMs for Automated Test Case Generation

Cesar

A case study exploring the application of LLMs (specifically GPT-3.5 Turbo) in automated test case generation for software applications. The research developed a semi-automated approach using prompt engineering and LangChain to generate test cases from software specifications. The study evaluated the quality of AI-generated test cases against manually written ones for the Da.tes platform, finding comparable quality metrics between AI and human-generated tests, with AI tests scoring slightly higher (4.31 vs 4.18) across correctness, consistency, and completeness factors.

Practical Lessons from Deploying LLMs in Production at Scale

Mercado Libre

Mercado Libre explored multiple production applications of Large Language Models across their e-commerce and technology platform, tackling challenges in knowledge retrieval, documentation generation, and natural language processing. The company implemented a RAG system for developer documentation using Llama Index, automated documentation generation for thousands of database tables, and built natural language input interpretation systems using function calling. Through iterative development, they learned critical lessons about the importance of underlying data quality, prompt engineering iteration, quality assurance for generated outputs, and the necessity of simplifying tasks for LLMs through proper data preprocessing and structured output formats.

Practical Lessons Learned from Building and Deploying GenAI Applications

Bolbeck

A comprehensive overview of lessons learned from building GenAI applications over 1.5 years, focusing on the complexities and challenges of deploying LLMs in production. The presentation covers key aspects of LLMOps including model selection, hosting options, ensuring response accuracy, cost considerations, and the importance of observability in AI applications. Special attention is given to the emerging role of AI agents and the critical balance between model capability and operational costs.

Practical LLM Deployment: From Evaluation to Fine-tuning

Parlance Labs

A comprehensive discussion of LLM deployment challenges and solutions across multiple industries, focusing on practical aspects like evaluation, fine-tuning, and production deployment. The case study covers experiences from GitHub's Copilot development, real estate CRM implementation, and consulting work at Parlance Labs, highlighting the importance of rigorous evaluation, data inspection, and iterative development in LLM deployments.

Pragmatic Product-Led Approach to LLM Integration and Prompt Engineering

LinkedIn

Pan Cha, Senior Product Manager at LinkedIn, shares insights on integrating LLMs into products effectively. He advocates for a pragmatic approach: starting with simple implementations using existing LLM APIs to validate use cases, then iteratively improving through robust prompt engineering and evaluation. The focus is on solving real user problems rather than adding AI for its own sake, with particular attention to managing user trust and implementing proper evaluation frameworks.

Privacy-Preserving LLM Usage Analysis System for Production AI Safety

Anthropic

Anthropic developed Clio, a privacy-preserving analysis system to understand how their Claude AI models are used in production while maintaining strict user privacy. The system performs automated clustering and analysis of conversations to identify usage patterns, detect potential misuse, and improve safety measures. Initial analysis of 1 million conversations revealed insights into usage patterns across different languages and domains, while helping identify both false positives and negatives in their safety systems.

Private Equity AI Transformation: Lessons from Portfolio Companies

PwC / Warburg Pincus / Abrigo

This panel discussion featuring executives from PwC, Warburg Pincus, Abrigo (a Carlyle portfolio company), and AWS explores the practical implementation of generative AI and LLMs in production across private equity portfolio companies. The conversation covers the journey from the ChatGPT launch in late 2022 through 2025, addressing real-world challenges including prioritization, talent gaps, data readiness, and organizational alignment. Key themes include starting with high-friction business problems rather than technology-first approaches, the importance of leadership alignment over technical infrastructure, rapid experimentation cycles, and the shift from viewing AI as optional to mandatory in investment diligence. The panelists emphasize practical successes such as credit memo generation, fraud alert summarization, loan workflow optimization, and e-commerce catalog enrichment, while cautioning against over-hyped transformation projects and highlighting the need for organizational cultural change alongside technical implementation.

Product Attribute Normalization and Sorting Using DSPy for Large-Scale E-commerce

Zoro UK

Zoro UK, an e-commerce subsidiary of Grainger with 3.5 million products from 300+ suppliers, faced challenges normalizing and sorting product attributes across 75,000 different attribute types. Using DSPy (a framework for optimizing LLM prompts programmatically), they built a production system that automatically determines whether attributes require alpha-numeric sorting or semantic sorting. The solution employs a two-tier architecture: Mistral 8B for initial classification and GPT-4 for complex semantic sorting tasks. The DSPy approach eliminated manual prompt engineering, provided LLM-agnostic compatibility, and enabled automated prompt optimization using genetic algorithm-like iterations, resulting in improved product discoverability and search experience for their 1 million monthly active users.

Production Agent Platform Architecture for Multi-Agent Systems

LinkedIn

LinkedIn faced the challenge of scaling agentic AI adoption across their organization while maintaining production reliability. They transitioned from Java to Python for generative AI applications, built a standardized framework using LangChain and LangGraph, and developed a comprehensive agent platform with messaging infrastructure, multi-layered memory systems, and a centralized skill registry. Their first production agent, LinkedIn Hiring Assistant, automates recruiter workflows using a supervisor multi-agent architecture, demonstrating the ambient agent pattern with asynchronous processing capabilities.

Production Agents: Real-world Implementations of LLM-powered Autonomous Systems

Various

A panel discussion featuring three practitioners implementing LLM-powered agents in production: Sam's personal assistant with real-time feedback and router agents, Div's browser automation system Melton with reliability and monitoring features, and Devin's GitHub repository assistant that helps with code understanding and feature requests. Each presenter shared their architecture choices, testing strategies, and approaches to handling challenges like latency, reliability, and model selection in production environments.

Production Agents: Routing, Testing and Browser Automation Case Studies

Various

Three practitioners share their experiences deploying LLM agents in production: Sam discusses building a personal assistant with real-time user feedback and router agents, Div presents a browser automation assistant called Milton that can control web applications, and Devin explores using LLMs to help engineers with non-coding tasks by navigating codebases. Each case study highlights different approaches to routing between agents, handling latency, testing strategies, and model selection for production deployment.

Production AI Agents for Accounting Automation: Engineering Process Daemons at Scale

Digits

Digits, an AI-native accounting platform, shares their experience running AI agents in production for over 2 years, addressing real-world challenges in deploying LLM-based systems. The team reframes "agents" as "process daemons" to set appropriate expectations and details their implementation across three use cases: vendor data enrichment, client onboarding, and complex query handling. Their solution emphasizes building lightweight custom infrastructure over dependency-heavy frameworks, reusing existing APIs as agent tools, implementing comprehensive observability with OpenTelemetry, and establishing robust guardrails. The approach has enabled reliable automation while maintaining transparency, security, and performance through careful engineering rather than relying on framework abstractions.

Production AI Agents for Insurance Policy Management with Amazon Bedrock

CDL

CDL, a UK-based insurtech company, has developed a comprehensive AI agent system using Amazon Bedrock to handle insurance policy management tasks in production. The solution includes a supervisor agent architecture that routes customer intents to specialized domain agents, enabling customers to manage their insurance policies through conversational AI interfaces available 24/7. The implementation addresses critical production concerns through rigorous model evaluation processes, guardrails for safety, and comprehensive monitoring, while preparing their APIs to be AI-ready for future digital assistant integrations.

Production AI Agents with Dynamic Planning and Reactive Evaluation

Hex

Hex successfully implemented AI agents in production for data science notebooks by developing a unique approach to agent orchestration. They solved key challenges around planning, tool usage, and latency by constraining agent capabilities, building a reactive DAG structure, and optimizing context windows. Their success came from iteratively developing individual capabilities before combining them into agents, keeping humans in the loop, and maintaining tight feedback cycles with users.

Production AI Deployment: Lessons from Real-World Agentic AI Systems

Databricks / Various

This case study presents lessons learned from deploying generative AI applications in production, with a specific focus on Flo Health's implementation of a women's health chatbot on the Databricks platform. The presentation addresses common failure points in GenAI projects including poor constraint definition, over-reliance on LLM autonomy, and insufficient engineering discipline. The solution emphasizes deterministic system architecture over autonomous agents, comprehensive observability and tracing, rigorous evaluation frameworks using LLM judges, and proper DevOps practices. Results demonstrate that successful production deployments require treating agentic AI as modular system architectures following established software engineering principles rather than monolithic applications, with particular emphasis on cost tracking, quality monitoring, and end-to-end deployment pipelines.

Production AI Systems for News Personalization and Journalistic Workflows

Bonnier News

Bonnier News, a major Swedish media publisher with over 200 brands including Expressen and local newspapers, has deployed AI and machine learning systems in production to solve content personalization and newsroom automation challenges. The company's data science team, led by product manager Hans Yell (PhD in computational linguistics) and head of architecture Magnus Engster, has built white-label personalization engines using embedding-based recommendation systems that outperform manual content curation while scaling across multiple brands. They leverage vector similarity and user reading patterns rather than traditional metadata, achieving significant engagement lifts. Additionally, they're developing LLM-powered tools for journalists including headline generation, news aggregation summaries, and trigger questions for articles. Through a WASP-funded PhD collaboration, they're working on domain-adapted Swedish language models via continued pre-training of Llama models with Bonnier's extensive text corpus, focusing on capturing brand tone and improving journalistic workflows while maintaining data sovereignty.

Production Deployment Challenges and Infrastructure Gaps for Multi-Agent AI Systems

GetOnStack

GetOnStack's team deployed a multi-agent LLM system for market data research that initially cost $127 weekly but escalated to $47,000 over four weeks due to an infinite conversation loop between agents running undetected for 11 days. This experience exposed critical gaps in production infrastructure for multi-agent systems using Agent-to-Agent (A2A) communication and Anthropic's Model Context Protocol (MCP). In response, the company spent six weeks building comprehensive production infrastructure including message queues, monitoring, cost controls, and safeguards. GetOnStack is now developing a platform to provide one-command deployment and production-ready infrastructure specifically designed for multi-agent systems, aiming to help other teams avoid similar costly production failures.

Production Deployment of Toqan Data Analyst Agent: From Prototype to Production Scale

Toqan

Toqan developed and deployed a data analyst agent that allows users to ask questions in natural language and receive SQL-generated answers with visualizations. The team faced significant challenges transitioning from a working prototype to a production system serving hundreds of users, including behavioral inconsistencies, infinite loops, and unreliable outputs. They solved these issues through four key approaches: implementing deterministic workflows for predictable behaviors, leveraging domain experts for setup and monitoring, building resilient systems to handle edge cases and abuse, and optimizing agent tools to reduce complexity. The result was a stable production system that successfully scaled to serve hundreds of users with improved reliability and user experience.

Production Evolution of an AI-Powered Medical Consultation Assistant

Doctolib

Doctolib developed and deployed an AI-powered consultation assistant for healthcare professionals that combines speech recognition, summarization, and medical content codification. Through a comprehensive approach involving simulated consultations, extensive testing, and careful metrics tracking, they evolved from MVP to production while maintaining high quality standards. The system achieved widespread adoption and positive feedback through iterative improvements based on both explicit and implicit user feedback, combining short-term prompt engineering optimizations with longer-term model and data improvements.

Production GenAI for User Safety and Enhanced Matching Experience

Tinder

Tinder implemented two production GenAI applications to enhance user safety and experience: a username detection system using fine-tuned Mistral 7B to identify social media handles in user bios with near-perfect recall, and a personalized match explanation feature using fine-tuned Llama 3.1 8B to help users understand why recommended profiles are relevant. Both systems required sophisticated LLMOps infrastructure including multi-model serving with LoRA adapters, GPU optimization, extensive monitoring, and iterative fine-tuning processes to achieve production-ready performance at scale.

Production Intent Recognition System for Enterprise Chatbots

FeedYou

FeedYou developed a sophisticated intent recognition system for their enterprise chatbot platform, addressing challenges in handling complex conversational flows and out-of-domain queries. They experimented with different NLP approaches before settling on a modular architecture using NLP.js, implementing hierarchical intent recognition with local and global intents, and integrating generative models for handling edge cases. The system achieved a 72% success rate for local intent matching and effectively handled complex conversational scenarios across multiple customer deployments.

Production Lessons from Building and Deploying AI Agents

Rasgo

Rasgo's journey in building and deploying AI agents for data analysis reveals key insights about production LLM systems. The company developed a platform enabling customers to use standard data analysis agents and build custom agents for specific tasks, with focus on database connectivity and security. Their experience highlights the importance of agent-computer interface design, the critical role of underlying model selection, and the significance of production-ready infrastructure over raw agent capabilities.

Production LLM Systems at Scale - Lessons from Financial Services, Legal Tech, and ML Infrastructure

Nubank, Harvey AI, Galileo and Convirza

A panel discussion featuring leaders from Nubank, Harvey AI, Galileo, and Convirza discussing their experiences implementing LLMs in production. The discussion covered key challenges and solutions around model evaluation, cost optimization, latency requirements, and the transition from large proprietary models to smaller fine-tuned models. Participants shared insights on modularizing LLM applications, implementing human feedback loops, and balancing the tradeoffs between model size, cost, and performance in production environments.

Production LLM Systems: Document Processing and Real Estate Agent Co-pilot Case Studies

Various

A comprehensive webinar featuring two case studies of LLM systems in production. First, Docugami shared their experience building a document processing pipeline that leverages hierarchical chunking and semantic understanding, using custom LLMs and extensive testing infrastructure. Second, Reet presented their development of Lucy, a real estate agent co-pilot, highlighting their journey with OpenAI function calling, testing frameworks, and preparing for fine-tuning while maintaining production quality.

Production Monitoring and Issue Discovery for AI Agents

Raindrop

Raindrop's CTO Ben presents a comprehensive framework for building reliable AI agents in production, addressing the challenge that traditional offline evaluations cannot capture the full complexity of real-world user behavior. The core problem is that AI agents fail in subtle ways without concrete errors, making issues difficult to detect and fix. Raindrop's solution centers on a "discover, track, and fix" loop that combines explicit signals like thumbs up/down with implicit signals detected semantically in conversations, such as user frustration, task failures, and agent forgetfulness. By clustering these signals with user intents and tracking them over time, teams can identify the most impactful issues and systematically improve their agents. The approach emphasizes experimentation and production monitoring over purely offline testing, drawing parallels to how traditional software engineering shifted from extensive QA to tools like Sentry for error monitoring.

Production RAG Best Practices: Implementation Lessons at Scale

Kapa.ai

Based on experience with over 100 technical teams including Docker, CircleCI, and Reddit, this case study examines key challenges and solutions in implementing production-grade RAG systems. The analysis covers critical aspects from data curation and refresh pipelines to evaluation frameworks and security practices, highlighting how most RAG implementations fail at the POC stage while providing concrete guidance for successful production deployments.

Production RAG Stack Development Through 37 Iterations for Financial Services

jonfernandes

Independent AI engineer Jonathan Fernandez shares his experience developing a production-ready RAG (Retrieval Augmented Generation) stack through 37 failed iterations, focusing on building solutions for financial institutions. The case study demonstrates the evolution from a naive RAG implementation to a sophisticated system incorporating query processing, reranking, and monitoring components. The final architecture uses LlamaIndex for orchestration, Qdrant for vector storage, open-source embedding models, and Docker containerization for on-premises deployment, achieving significantly improved response quality for document-based question answering.

Production Vector Search and Retrieval System Optimization at Scale

Superlinked

SuperLinked, a company focused on vector search infrastructure, shares production insights from deploying information retrieval systems for e-commerce and enterprise knowledge management with indexes up to 2 terabytes. The presentation addresses challenges in relevance, latency, and cost optimization when deploying vector search systems at scale. Key solutions include avoiding vector pooling/averaging, implementing late interaction models, fine-tuning embeddings for domain-specific needs, combining sparse and dense representations, leveraging graph embeddings, and using template-based query generation instead of unconstrained text-to-SQL. Results demonstrate 5%+ precision improvements through targeted fine-tuning, significant latency reductions through proper database selection and query optimization, and improved relevance through multi-encoder architectures that combine text, graph, and metadata signals.

Production-Ready Agent Behavior: Identity, Intent, and Governance

Oso

Oso, a SaaS company that governs actions in B2B applications, presents a comprehensive framework for productionizing AI agents through three critical stages: prototype to QA, QA to production, and running in production. The company addresses fundamental challenges including agent identity (requiring user, agent, and session context), intent-based tool filtering to prevent unwanted behaviors like prompt injection attacks, and real-time governance mechanisms for monitoring and quarantining misbehaving agents. Using LangChain 1.0 middleware capabilities, Oso demonstrates how to implement deterministic guardrails that wrap both tool calls and model calls, preventing data exfiltration scenarios and ensuring agents only execute actions aligned with user intent. The solution enables security teams and product managers to dynamically control agent behavior in production without code changes, limiting blast radius when agents misbehave.

Production-Ready LLM Integration Using Retrieval-Augmented Generation and Custom ReAct Implementation

Buzzfeed

BuzzFeed Tech tackled the challenges of integrating LLMs into production by addressing dataset recency limitations and context window constraints. They evolved from using vanilla ChatGPT with crafted prompts to implementing a sophisticated retrieval-augmented generation system. After exploring self-hosted models and LangChain, they developed a custom "native ReAct" implementation combined with an enhanced Nearest Neighbor Search Architecture using Pinecone, resulting in a more controlled, cost-efficient, and production-ready LLM system.

Production-Scale Document Parsing with Vision-Language Models and Specialized OCR

Reducto

Reducto has built a production document parsing system that processes over 1 billion documents by combining specialized vision-language models, traditional OCR, and layout detection models in a hybrid pipeline. The system addresses critical challenges in document parsing including hallucinations from frontier models, dense tables, handwritten forms, and complex charts. Their approach uses a divide-and-conquer strategy where different models are routed to different document regions based on complexity, achieving higher accuracy than AWS Textract, Microsoft Azure Document Intelligence, and Google Cloud OCR on their internal benchmarks. The company has expanded beyond parsing to offer extraction with pixel-level citations and an edit endpoint for automated form filling.

Productionizing Generative AI Applications: From Exploration to Scale

LinkedIn

A LinkedIn product manager shares insights on bringing LLMs to production, focusing on their implementation of various generative AI features across the platform. The case study covers the complete lifecycle from idea exploration to production deployment, highlighting key considerations in prompt engineering, GPU resource management, and evaluation frameworks. The presentation emphasizes practical approaches to building trust-worthy AI products while maintaining scalability and user focus.

Productionizing LLM-Powered Data Governance with LangChain and LangSmith

Grab

Grab enhanced their LLM-powered data governance system (Metasense V2) by improving model performance and operational efficiency. The team tackled challenges in data classification by splitting complex tasks, optimizing prompts, and implementing LangChain and LangSmith frameworks. These improvements led to reduced misclassification rates, better collaboration between teams, and streamlined prompt experimentation and deployment processes while maintaining robust monitoring and safety measures.

Quantitative Framework for Production LLM Evaluation in Security Applications

Elastic

Elastic developed a comprehensive framework for evaluating and improving GenAI features in their security products, including an AI Assistant and Attack Discovery tool. The framework incorporates test scenarios, curated datasets, tracing capabilities using LangGraph and LangSmith, evaluation rubrics, and a scoring mechanism to ensure quantitative measurement of improvements. This systematic approach enabled them to move from manual to automated evaluations while maintaining high quality standards for their production LLM applications.

RAG System for Investment Policy Search and Advisory at RBC

Arcane

RBC developed an internal RAG (Retrieval Augmented Generation) system called Arcane to help financial advisors quickly access and interpret complex investment policies and procedures. The system addresses the challenge of finding relevant information across semi-structured documents, reducing the time specialists spend searching through documentation. The solution combines advanced parsing techniques, vector databases, and LLM-powered generation with a chat interface, while implementing robust evaluation methods to ensure accuracy and prevent hallucinations.

RAG-based Chatbot for Utility Operations and Customer Service

Xcel Energy

Xcel Energy implemented a RAG-based chatbot system to streamline operations including rate case reviews, legal contract analysis, and earnings call report processing. Using Databricks' Data Intelligence Platform, they developed a production-grade GenAI system incorporating Vector Search, MLflow, and Foundation Model APIs. The solution reduced rate case review times from 6 months to 2 weeks while maintaining strict security and governance requirements for sensitive utility data.

RAG-Based Dasher Support Automation with LLM Guardrails and Quality Monitoring

Doordash

DoorDash developed an LLM-based chatbot system to automate support for Dashers (delivery contractors) who encounter issues during deliveries. The existing flow-based automated support system could only handle a limited subset of issues, and while a knowledge base existed, it was difficult to navigate, time-consuming to parse, and only available in English. The solution involved implementing a RAG (Retrieval Augmented Generation) system that retrieves relevant information from knowledge base articles and generates contextually appropriate responses. To address LLM challenges including hallucinations, context summarization accuracy, language consistency, and latency, DoorDash built three key systems: an LLM Guardrail for real-time response validation, an LLM Judge for quality monitoring and evaluation, and a quality improvement pipeline. The system now autonomously assists thousands of Dashers daily, reducing hallucinations by 90% and compliance issues by 99%, while allowing human agents to focus on more complex support scenarios.

RAG-Based Industry Classification System for Customer Segmentation

Ramp

Ramp faced challenges with inconsistent industry classification across teams using homegrown taxonomies that were inaccurate, too generic, and not auditable. They solved this by building an in-house RAG (Retrieval-Augmented Generation) system that migrated all industry classification to standardized NAICS codes, featuring a two-stage process with embedding-based retrieval and LLM-based selection. The system improved data quality, enabled consistent cross-team communication, and provided interpretable results with full control over the classification process.

RAG-Based Industry Classification System for Financial Services

Ramp

Ramp, a financial services company, replaced their fragmented homegrown industry classification system with a standardized NAICS-based taxonomy powered by an in-house RAG model. The old system relied on stitched-together third-party data and multiple non-auditable sources of truth, leading to inconsistent, overly broad, and sometimes incorrect business categorizations. By building a custom RAG system that combines embeddings-based retrieval with LLM-based re-ranking, Ramp achieved significant improvements in classification accuracy (up to 60% in retrieval metrics and 5-15% in final prediction accuracy), gained full control over the model's behavior and costs, and enabled consistent cross-team usage of industry data for compliance, risk assessment, sales targeting, and product analytics.

RAG-Based System for Climate Finance Document Analysis

ClimateAligned

ClimateAligned, an early-stage startup, developed a RAG-based system to analyze climate-related financial documents and assess their "greenness." Starting with a small team of 2-3 engineers, they built a solution that combines LLMs, hybrid search, and human-in-the-loop processes to achieve 99% accuracy in document analysis. The system reduced analysis time from 2 hours to 20 minutes per company, even with human verification, and successfully evolved from a proof-of-concept to serving their first users while maintaining high accuracy standards.

RAG-Enhanced Code Review Bot Using Historical Incident Data

PayPay

PayPay, a rapidly growing fintech company, developed GBB RiskBot to address the challenge of scaling code review processes across an expanding engineering organization. The system leverages historical postmortem and incident data combined with RAG (Retrieval-Augmented Generation) to automatically analyze pull requests and identify potential risks based on past incidents. When developers open pull requests, the bot uses OpenAI embeddings and ChromaDB to perform semantic similarity searches against a vector database of historical incidents, then employs GPT-4o-mini to generate contextual comments highlighting relevant risks. The system operates at remarkably low cost (approximately $0.59 USD monthly for 380+ analyses across 12 repositories) while addressing critical challenges including knowledge silos, manual knowledge sharing inefficiencies, and inconsistent risk assessment across teams.

RAG-Powered Agent Assist Tool for Insurance Contact Centers

Allianz Direct

Allianz Direct implemented a GenAI-powered agent assist tool using RAG to help contact center agents quickly and accurately answer customer questions about insurance policies. Built on the Databricks Data Intelligence Platform using Mosaic AI tools, the solution improved answer accuracy by 10-15% compared to their previous system, while allowing agents to focus more on customer relationships rather than searching through documentation.

RAG-Powered Chatbot for Sports Team Roster Management

Philadelphia Union

Philadelphia Union implemented a GenAI chatbot using Databricks Data Intelligence Platform to simplify complex MLS roster management. The solution uses RAG architecture with Databricks Vector Search and DBRX Instruct model to provide instant interpretations of roster regulations. The chatbot, deployed through Databricks Apps, enables quick decision-making and helps the front office maintain compliance with MLS guidelines while focusing on strategic tasks.

RAG-Powered Customer Support Enhancement Using GPT-4

Thomson Reuters

Thomson Reuters implemented a Retrieval-Augmented Generation (RAG) system to enhance customer support for their legal and tax domain products. The challenge involved customer support agents experiencing cognitive overload while navigating hundreds of thousands of knowledge base articles across complex product lines like Westlaw, Practical Law, and Checkpoint. By building a RAG architecture combining dense retrieval systems (using Milvus vector database and sentence transformers) with GPT-4, Thomson Reuters created a conversational interface that provides agents with relevant, accurate solutions from their curated knowledge base. The solution reduced resolution times and improved the accuracy of support responses by grounding GPT-4's outputs in company-specific documentation, avoiding hallucinations common in standalone LLM deployments.

RAG-powered Decision Intelligence Platform for Manufacturing Knowledge Management

Circuitry.ai

Circuitry.ai addressed the challenge of managing complex product information for manufacturers by developing an AI-powered decision intelligence platform. Using Databricks' infrastructure, they implemented RAG chatbots to process and serve proprietary customer data, resulting in a 60-70% reduction in information search time. The solution integrated Delta Lake for data management, Unity Catalog for governance, and custom knowledge bases with Llama and DBRX models for accurate response generation.

RAG-Powered LLM System for Automated Analytics and Fraud Investigation

Grab

Grab's Integrity Analytics team developed a comprehensive LLM-based solution to automate routine analytical tasks and fraud investigations. The system combines an internal LLM tool (Spellvault) with a custom data middleware (Data-Arks) to enable automated report generation and fraud investigation assistance. By implementing RAG instead of fine-tuning, they created a scalable, cost-effective solution that reduced report generation time by 3-4 hours per report and streamlined fraud investigations to minutes.

RAG-Powered Terraform Support Slackbot

Benchling

Benchling developed a Slackbot to help engineers navigate their complex Terraform Cloud infrastructure by implementing a RAG-based system using Amazon Bedrock. The solution combines documentation from Confluence, public Terraform docs, and past Slack conversations to provide instant, relevant answers to infrastructure questions, eliminating the need to search through lengthy FAQs or old Slack threads. The system successfully demonstrates a practical application of LLMs in production for internal developer support.

RAG-Powered Virtual Assistant for Retail Store Operations

Co-op

Co-op, a major UK retailer, developed a GenAI-powered virtual assistant to help store employees quickly access essential operational information from over 1,000 policy and procedure documents. Using RAG and the Databricks Data Intelligence Platform, the solution aims to handle 50,000-60,000 weekly queries more efficiently than their previous keyword-based search system. The project, currently in proof-of-concept stage, demonstrates promising results in improving information retrieval speed and reducing support center workload.

Rapid Development and Deployment of Enterprise LLM Features Through Centralized LLM Service Architecture

PagerDuty

PagerDuty successfully developed and deployed multiple GenAI features in just two months by implementing a centralized LLM API service architecture. They created AI-powered features including runbook generation, status updates, postmortem reports, and an AI assistant, while addressing challenges of rapid development with new technology. Their solution included establishing clear processes, role definitions, and a centralized LLM service with robust security, monitoring, and evaluation frameworks.

Rapid Development of AI-Powered Video Interview Analysis System

Vericant

Vericant, an educational testing company, developed and deployed an AI-powered video interview analysis system in just 30 days. The solution automatically processes 15-minute admission interview videos to generate summaries, key points, and topic analyses, enabling admissions teams to review interviews in 20-30 seconds instead of watching full recordings. The implementation was achieved through iterative prompt engineering and a systematic evaluation framework, without requiring significant engineering resources or programming expertise.

Rapid Integration of Advanced AI Models through Modular Architecture and Workflow Orchestration

Harvey

Harvey, a legal AI platform, demonstrated their ability to rapidly integrate new AI capabilities by incorporating OpenAI's Deep Research feature into their production system within 12 hours of its API release. This achievement was enabled by their AI-native architecture featuring a modular Workflow Engine, composable AI building blocks, transparent "thinking states" for user visibility, and a culture of rapid prototyping using AI-assisted development tools. The case study showcases how purpose-built infrastructure and engineering practices can accelerate the deployment of complex AI features while maintaining enterprise-grade reliability and user transparency in legal workflows.

Rapid Prototyping and Scaling AI Applications Using Open Source Models

Hassan El Mghari

Hassan El Mghari, a developer relations leader at Together AI, demonstrates how to build and scale AI applications to millions of users using open source models and a simplified architecture. Through building approximately 40 AI apps over four years (averaging one per month), he developed a streamlined approach that emphasizes simplicity, rapid iteration, and leveraging the latest open source models. His applications, including commit message generators, text-to-app builders, and real-time image generators, have collectively served millions of users and generated tens of millions of outputs, proving that simple architectures with single API calls can achieve significant scale when combined with good UI design and viral sharing mechanics.

Real-time AI Agent Assistance in Contact Center Operations

US Bank

US Bank implemented a generative AI solution to enhance their contact center operations by providing real-time assistance to agents handling customer calls. The system uses Amazon Q in Connect and Amazon Bedrock with Anthropic's Claude model to automatically transcribe conversations, identify customer intents, and provide relevant knowledge base recommendations to agents in real-time. While still in production pilot phase with limited scope, the solution addresses key challenges including reducing manual knowledge base searches, improving call handling times, decreasing call transfers, and automating post-call documentation through conversation summarization.

Real-Time AI Chief of Staff for Product Teams

Earmark

Earmark built a productivity suite for product teams that transforms meeting conversations into finished work in real-time, addressing the problem of endless context-switching and manual follow-up work that plagues modern product development. Founded by Mark Barb and Sandon, who both came from the product management SaaS space, Earmark uses live transcription and multiple parallel AI agents to generate product specs, tickets, summaries, and other artifacts during meetings rather than after them. The company pivoted from an Apple Vision Pro communication training tool to a web-based real-time meeting assistant after discovering through 60 customer interviews that few people actually prepare for presentations. With 78% of survey respondents saying they'd be "super bummed" if the product disappeared, Earmark has achieved strong product-market fit by focusing specifically on product managers, engineering leaders, and adjacent roles who spend most of their time in back-to-back meetings with different audiences and deliverables.

Real-Time Generative AI for Immersive Theater Performance

University of California Los Angeles

The University of California Los Angeles (UCLA) Office of Advanced Research Computing (OARC) partnered with UCLA's Center for Research and Engineering in Media and Performance (REMAP) to build an AI-powered system for an immersive production of the musical "Xanadu." The system enabled up to 80 concurrent audience members and performers to create sketches on mobile phones, which were processed in near real-time (under 2 minutes) through AWS generative AI services to produce 2D images and 3D meshes displayed on large LED screens during live performances. Using a serverless-first architecture with Amazon SageMaker AI endpoints, Amazon Bedrock foundation models, and AWS Lambda orchestration, the system successfully supported 7 performances in May 2025 with approximately 500 total audience members, demonstrating that cloud-based generative AI can reliably power interactive live entertainment experiences.

Real-time Question-Answering System with Two-Stage LLM Architecture for Sales Content Recommendations

Microsoft

Microsoft developed a real-time question-answering system for their MSX Sales Copilot to help sellers quickly find and share relevant sales content from their Seismic repository. The solution uses a two-stage architecture combining bi-encoder retrieval with cross-encoder re-ranking, operating on document metadata since direct content access wasn't available. The system was successfully deployed in production with strict latency requirements (few seconds response time) and received positive feedback from sellers with relevancy ratings of 3.7/5.

Real-World LLM Implementation: RAG, Documentation Generation, and Natural Language Processing at Scale

Mercado Libre

Mercado Libre implemented three major LLM use cases: a RAG-based documentation search system using Llama Index, an automated documentation generation system for thousands of database tables, and a natural language processing system for product information extraction and service booking. The project revealed key insights about LLM limitations, the importance of quality documentation, prompt engineering, and the effective use of function calling for structured outputs.

Rebuilding a Production Chatbot with Direct API Access and Multi-Agent Architecture

Langchain

LangChain rebuilt their public documentation chatbot after discovering their support engineers preferred using their own internal workflow over the existing tool. The original chatbot used traditional vector embedding retrieval, which suffered from fragmented context, constant reindexing, and vague citations. The solution involved building two distinct architectures: a fast CreateAgent for simple documentation queries delivering sub-15-second responses, and a Deep Agent with specialized subgraphs for complex queries requiring codebase analysis. The new approach replaced vector embeddings with direct API access to structured content (Mintlify for docs, Pylon for knowledge base, and ripgrep for codebase search), enabling the agent to search iteratively like a human. Results included dramatically faster response times, precise citations with line numbers, elimination of reindexing overhead, and internal adoption by support engineers for complex troubleshooting.

Rebuilding an AI SDR Agent with Multi-Agent Architecture for Enterprise Sales Automation

11x

11x rebuilt their AI Sales Development Representative (SDR) product Alice from scratch in just 3 months, transitioning from a basic campaign creation tool to a sophisticated multi-agent system capable of autonomous lead sourcing, research, and email personalization. The team experimented with three different agent architectures - React, workflow-based, and multi-agent systems - ultimately settling on a hierarchical multi-agent approach with specialized sub-agents for different tasks. The rebuilt system now processes millions of leads and messages with a 2% reply rate comparable to human SDRs, demonstrating the evolution from simple AI tools to true digital workers in production sales environments.

Red Teaming AI Agents: Uncovering Security Vulnerabilities in Production Systems

Casco

Casco, a Y Combinator company specializing in red teaming AI agents and applications, conducted a security assessment of 16 live production AI agents, successfully compromising 7 of them within 30 minutes each. The research identified three critical security vulnerabilities common across production AI agents: cross-user data access through insecure direct object references (IDOR), arbitrary code execution through improperly secured code sandboxes leading to lateral movement across infrastructure, and server-side request forgery (SSRF) enabling credential theft from private repositories. The findings demonstrate that agent security extends far beyond LLM-specific concerns like prompt injection, requiring developers to apply traditional web application security principles including proper authentication and authorization, input/output sanitization, and use of enterprise-grade code sandboxes rather than custom implementations.

Reducing False Positives in AI Code Review Agents Through Architecture Refinement

cubic

cubic, an AI-native GitHub platform, developed an AI code review agent that initially suffered from excessive false positives and low-value comments, causing developers to lose trust in the system. Through three major architecture revisions and extensive offline testing, the team implemented explicit reasoning logs, streamlined tooling, and specialized micro-agents instead of a single monolithic agent. These changes resulted in a 51% reduction in false positives without sacrificing recall, significantly improving the agent's precision and usefulness in production code reviews.

Refining Input Guardrails for Safer LLM Applications Through Chain-of-Thought Fine-Tuning

Capital One

Capital One developed enhanced input guardrails to protect LLM-powered conversational assistants from adversarial attacks and malicious inputs. The company used chain-of-thought prompting combined with supervised fine-tuning (SFT) and alignment techniques like Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) to improve the accuracy of LLM-as-a-Judge moderation systems. Testing on four open-source models (Mistral 7B, Mixtral 8x7B, Llama2 13B, and Llama3 8B) showed significant improvements in F1 scores and attack detection rates of over 50%, while maintaining low false positive rates, demonstrating that effective guardrails can be achieved with small training datasets and minimal computational resources.

Reinforcement Learning for Code Generation and Agent-Based Development Tools

Cursor

This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.

Responsible AI Implementation for Healthcare Form Automation

WellSky

WellSky, serving over 2,000 hospitals and handling 100 million forms annually, partnered with Google Cloud to address clinical documentation burden and clinician burnout. They developed an AI-powered solution focusing on form automation, implementing a comprehensive responsible AI framework with emphasis on evidence citation, governance, and technical foundations. The project aimed to reduce "pajama time" - where 75% of nurses complete documentation after hours - while ensuring patient safety through careful AI deployment.

Retrieval Augmented LLMs for Real-time CRM Account Linking

Schneider Electric

Schneider Electric partnered with AWS Machine Learning Solutions Lab to automate their CRM account linking process using Retrieval Augmented Generation (RAG) with Flan-T5-XXL model. The solution combines LangChain, Google Search API, and SEC-10K data to identify and maintain up-to-date parent-subsidiary relationships between customer accounts, improving accuracy from 55% to 71% through domain-specific prompt engineering.

Revamping Query Understanding with LLMs in E-commerce Search

Instacart

Instacart transformed their query understanding (QU) system from multiple independent traditional ML models to a unified LLM-based approach to better handle long-tail, specific, and creatively-phrased search queries. The solution employed a layered strategy combining retrieval-augmented generation (RAG) for context engineering, post-processing guardrails, and fine-tuning of smaller models (Llama-3-8B) on proprietary data. The production system achieved significant improvements including 95%+ query rewrite coverage with 90%+ precision, 6% reduction in scroll depth for tail queries, 50% reduction in complaints for poor tail query results, and sub-300ms latency through optimizations like adapter merging, H100 GPU upgrades, and autoscaling.

Revenue Intelligence Platform with Ambient AI Agents

Tabs

Tabs, a vertical AI company in the finance space, has built a revenue intelligence platform for B2B companies that uses ambient AI agents to automate financial workflows. The company extracts information from sales contracts to create a "commercial graph" and deploys AI agents that work autonomously in the background to handle billing, collections, and reporting tasks. Their approach moves beyond traditional guided AI experiences toward fully ambient agents that monitor communications and trigger actions automatically, with the goal of creating "beautiful operational software that no one ever has to go into."

RoBERTa for Large-Scale Merchant Classification

Square

Square developed and deployed a RoBERTa-based merchant classification system to accurately categorize millions of merchants across their platform. The system replaced unreliable self-selection methods with an ML approach that combines business names, self-selected information, and transaction data to achieve a 30% improvement in accuracy. The solution runs daily predictions at scale using distributed GPU infrastructure and has become central to Square's business metrics and strategic decision-making.

Running LLM Agents in Production for Accounting Automation

Digits

Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.

Scalable Intelligent Document Processing with Multi-Tenant Serverless Architecture

Ricoh

Ricoh USA faced significant scalability challenges in their healthcare document processing operations, where each new customer implementation required 40-60 hours of custom engineering work involving unique prompt engineering, model fine-tuning, and integration testing. To address anticipated sevenfold growth in document volume (from 10,000 to 70,000 documents monthly), Ricoh partnered with AWS to implement the GenAI IDP Accelerator using a serverless architecture combining Amazon Textract for OCR and Amazon Bedrock foundation models for intelligent classification and extraction. The solution reduced customer onboarding time from 4-6 weeks to 2-3 days, decreased engineering hours per deployment by over 90% (from ~80 hours to <5 hours), and created a reusable, multi-tenant framework that maintains strict healthcare compliance standards (HITRUST, HIPAA, SOC 2) while enabling effective human-in-the-loop workflows through confidence scoring mechanisms.

Scaling a High-Traffic LLM Chat Application to 30,000 Messages Per Second

Character.ai

Character.ai scaled their open-domain conversational AI platform from 300 to over 30,000 generations per second within 18 months, becoming the third most-used generative AI application globally. They tackled unique engineering challenges around data volume, cost optimization, and connection management while maintaining performance. Their solution involved custom model architectures, efficient GPU caching strategies, and innovative prompt management tools, all while balancing performance, latency, and cost considerations at scale.

Scaling Agent-Based Architecture for Legal AI Assistant

Harvey

Harvey, a legal AI platform provider, transitioned their Assistant product from bespoke orchestration to a fully agentic framework to enable multiple engineering teams to scale feature development collaboratively. The company faced challenges with feature discoverability, complex retrieval integrations, and limited pathways for new capabilities, leading them to adopt an agent architecture in mid-2025. By implementing three core principles—eliminating custom orchestration through the OpenAI Agent SDK, creating Tool Bundles for modular capabilities with partial system prompt control, and establishing eval gates with leave-one-out validation—Harvey successfully scaled in-thread feature development from one to four teams while maintaining quality and enabling emergent feature combinations across retrieval, drafting, review, and third-party integrations.

Scaling Agentic AI for Digital Accessibility and Content Intelligence

Siteimprove

Siteimprove, a SaaS platform provider for digital accessibility, analytics, SEO, and content strategy, embarked on a journey from generative AI to production-scale agentic AI systems. The company faced the challenge of processing up to 100 million pages per month for accessibility compliance while maintaining trust, speed, and adoption. By leveraging AWS Bedrock, Amazon Nova models, and developing a custom AI accelerator architecture, Siteimprove built a multi-agent system supporting batch processing, conversational remediation, and contextual image analysis. The solution achieved 75% cost reduction on certain workloads, enabled autonomous multi-agent orchestration across accessibility, analytics, SEO, and content domains, and was recognized as a leader in Forrester's digital accessibility platforms assessment. The implementation demonstrated how systematic progression through human-in-the-loop, human-on-the-loop, and autonomous stages can bridge the prototype-to-production chasm while delivering measurable business value.

Scaling Agentic AI Systems for Real Estate Due Diligence: Managing Prompt Tax at Production Scale

Orbital

Orbital, a real estate technology company, developed an agentic AI system called Orbital Co-pilot to automate legal due diligence for property transactions. The system processes hundreds of pages of legal documents to extract key information traditionally done manually by lawyers. Over 18 months, they scaled from zero to processing 20 billion tokens monthly and achieved multiple seven figures in annual recurring revenue. The presentation focuses on their concept of "prompt tax" - the hidden costs and complexities of continuously upgrading AI models in production, including prompt migration, regression risks, and the operational challenges of shipping at the AI frontier.

Scaling AI Agent Deployment Across a Global E-commerce Organization

Prosus

Prosus, a global e-commerce and technology company operating in 100 countries, deployed approximately 30,000 AI agents across their organization to transform both customer-facing experiences and internal operations. The company developed an internal tool called Toqan to enable employees across all departments—from sales and marketing to HR and logistics—to create their own AI agents without requiring engineering expertise. The solution addressed the challenge of moving from occasional AI assistants to trusted, domain-specific agents that could execute end-to-end tasks. Results include significant productivity gains (such as one agent doing the work of 30 full-time employees), improved quality of service, increased independence for employees, and greater agility across the organization. The deployment scaled rapidly through organizational change management, including competitions, upskilling programs, and democratization of agent creation.

Scaling AI Agents Across Enterprise Sales and Customer Service Operations

Salesforce

Salesforce deployed its Agentforce platform across the entire organization as "Customer Zero," learning critical lessons about agent deployment, testing, data quality, and human-AI collaboration over the course of one year. The company scaled AI agents across sales and customer service operations, with their service agent handling over 1.5 million support requests, the SDR agent generating $1.7 million in new pipeline from dormant leads after working on 43,000+ leads, and agents in Slack saving employees 500,000 hours annually. Early challenges included high "I don't know" response rates (30%), overly restrictive guardrails that prevented legitimate customer interactions, and data inconsistency issues across 650+ data streams, which were addressed through iterative refinement, data governance improvements using Salesforce Data Cloud, and a shift from prescriptive instructions to goal-oriented agent design.

Scaling AI Agents to Production: A Blueprint for Autonomous Customer Service

Cox Automotive

Cox Automotive, a dominant player in the automotive software industry with visibility into 5.1 trillion vehicle insights, faced the challenge of moving AI agents from prototype to production at scale. In response to an aggressive 5-week deadline set in summer 2024, the company launched five agentic AI products using Amazon Bedrock Agent Core and the Strands framework. The flagship product was a fully automated virtual assistant for dealership customer conversations that operates autonomously after hours without human oversight. By establishing foundational infrastructure with Agent Core, implementing comprehensive red teaming practices, designing both hard and soft guardrails, automating evaluation with LLM-as-judge techniques, and setting circuit breakers for cost and conversation limits, Cox Automotive successfully deployed three products to production beta, with dealers reporting that customers receive timely responses both during business hours and after hours.

Scaling AI Applications with LLMs: Dynamic Context Injection and Few-Shot Learning for Order Processing

Choco

Choco built a comprehensive AI system to automate food supply chain order processing, addressing challenges with diverse order formats across text messages, PDFs, and voicemails. The company developed a production LLM system using few-shot learning with dynamically retrieved examples, semantic embedding-based retrieval, and context injection techniques to improve information extraction accuracy. Their approach prioritized prompt-based improvements over fine-tuning, enabling faster iteration and model flexibility while building towards more autonomous AI systems through continuous learning from human annotations.

Scaling AI Assistants Across Swedish Government Offices Through Rapid Experimentation and Business-Led Innovation

Government of Sweden

The Government of Sweden's offices embarked on an ambitious AI transformation initiative starting in early 2023, deploying over 30 AI assistants across various departments to cognitively enhance civil servants rather than replace them. By adopting a "fail fast" approach centered on business-driven innovation rather than IT-led technology push, they achieved significant efficiency gains including reducing company analysis workflows from 24 weeks to 6 weeks and streamlining citizen inquiry analysis. The initiative prioritized early adopters, transparent sharing of both successes and failures, and maintained human accountability throughout all processes while rapidly testing assistants at scale using cloud-based platforms like Intric that provide access to multiple LLM providers.

Scaling AI Coding Assistant Adoption Across Engineering Organization

Hubspot

HubSpot scaled AI coding assistant adoption from experimental use to near-universal deployment (over 90%) across their engineering organization over a two-year period starting in summer 2023. The company began with a GitHub Copilot proof of concept backed by executive support, ran a large-scale pilot with comprehensive measurement, and progressively removed adoption barriers while establishing a dedicated Developer Experience AI team in October 2024. Through strategic enablement, data-driven validation showing no correlation between AI adoption and production incidents, peer validation mechanisms, and infrastructure investments including local MCP servers with curated configurations, HubSpot achieved widespread adoption while maintaining code quality and ultimately made AI fluency a baseline hiring expectation for engineers.

Scaling AI Coding Tool Adoption Across Engineering Teams

Plaid

Plaid, a fintech company operating in the regulated consumer finance space, faced the challenge of transforming hundreds of highly effective engineers into AI power users without disrupting existing workflows. Over six months, they developed a comprehensive strategy that achieved over 75% adoption of advanced AI coding tools through streamlined procurement processes, dedicated ownership of adoption metrics, creation of in-house content demonstrating tools on their actual codebase, and positioning AI tools as complements rather than replacements to existing IDEs. The initiative culminated in a company-wide AI Day with 80%+ engineering participation and 90%+ satisfaction scores, though they continue to address challenges around cost controls, benchmarking, and code review processes adapted for AI-generated code.

Scaling AI Development with DGX Cloud: ServiceNow and SLB Production Deployments

Nvidia

ServiceNow and SLB (formerly Schlumberger) leveraged Nvidia DGX Cloud on AWS to develop and deploy foundation models for their respective industries. ServiceNow focused on building efficient small language models (5B-15B parameters) for enterprise process automation and agentic systems that match frontier model performance at a fraction of the cost and size, achieving nearly 100% GPU utilization through Run AI orchestration. SLB developed domain-specific multi-modal foundation models for seismic and petrophysical data to assist geoscientists and engineers in the energy sector, accelerating time-to-market for two major product releases over two years. Both organizations benefited from the fully optimized, turnkey infrastructure stack combining high-performance GPUs, networking, Lustre storage, EKS optimization, and enterprise-grade support, enabling them to focus on model development rather than infrastructure management while achieving zero or near-zero downtime.

Scaling AI Evaluation for Legal AI Systems Through Multi-Modal Assessment

Harvey

Harvey, a legal AI company, developed a comprehensive evaluation strategy for their production AI systems that handle complex legal queries, document analysis, and citation generation. The solution combines three core pillars: expert-led reviews involving direct collaboration with legal professionals from prestigious law firms, automated evaluation pipelines for continuous monitoring and rapid iteration, and dedicated data services for secure evaluation data management. The system addresses the unique challenges of evaluating AI in high-stakes legal environments, achieving over 95% accuracy in citation verification and demonstrating statistically significant improvements in model performance through structured A/B testing and expert feedback loops.

Scaling AI Infrastructure for Legal AI Applications at Enterprise Scale

Harvey

Harvey, a legal AI platform company, developed a comprehensive AI infrastructure system to handle millions of daily requests across multiple AI models for legal document processing and analysis. The company built a centralized Python library that manages model deployments, implements load balancing, quota management, and real-time monitoring to ensure reliability and performance. Their solution includes intelligent model endpoint selection, distributed rate limiting using Redis-backed token bucket algorithms, a proxy service for developer access, and comprehensive observability tools, enabling them to process billions of prompt tokens while maintaining high availability and seamless scaling for their legal AI products.

Scaling AI Product Development with Rigorous Evaluation and Observability

Notion

Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Brain Trust as their evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.

Scaling AI-Assisted Developer Tools and Agentic Workflows at Scale

Slack

Slack's Developer Experience team embarked on a multi-year journey to integrate generative AI into their internal development workflows, moving from experimental prototypes to production-grade AI assistants and agentic systems. Starting with Amazon SageMaker for initial experimentation, they transitioned to Amazon Bedrock for simplified infrastructure management, achieving a 98% cost reduction. The team rolled out AI coding assistants using Anthropic's Claude Code and Cursor integrated with Bedrock, resulting in 99% developer adoption and a 25% increase in pull request throughput. They then evolved their internal knowledge bot (Buddybot) into a sophisticated multi-agent system handling over 5,000 escalation requests monthly, using AWS Strands as an orchestration framework with Claude Code sub-agents, Temporal for workflow durability, and MCP servers for standardized tool access. The implementation demonstrates a pragmatic approach to LLMOps, prioritizing incremental deployment, security compliance (FedRAMP), observability through OpenTelemetry, and maintaining model agnosticism while scaling to millions of tokens per minute.

Scaling AI-Powered Code Generation in Browser and Enterprise Environments

Qodo / Stackblitz

The case study examines two companies' approaches to deploying LLMs for code generation at scale: Stackblitz's Bolt.new achieving over $8M ARR in 2 months with their browser-based development environment, and Qodo's enterprise-focused solution handling complex deployment scenarios across 96 different configurations. Both companies demonstrate different approaches to productionizing LLMs, with Bolt.new focusing on simplified web app development for non-developers and Qodo targeting enterprise testing and code review workflows.

Scaling AI-Powered Student Support Chatbots Across Campus

UC Santa Barbara

UC Santa Barbara implemented an AI-powered chatbot platform called "Story" (powered by Gravity's Ivy and Ocelot services) to address challenges in student support after COVID-19, particularly helping students navigate campus services and reducing staff workload. Starting with a pilot of five departments in 2022, UCSB scaled to 19 chatbot instances across diverse student services over two and a half years. The implementation resulted in nearly 40,000 conversations, with 30% occurring outside business hours, significantly reducing phone and email volume to departments while enabling staff to focus on more complex student inquiries. The university took a phased cohort approach, training departments in groups over 10-week periods, with student testers providing crucial feedback on language and expectations before launch.

Scaling an AI-Powered Conversational Shopping Assistant to 250 Million Users

Rufus

Amazon built Rufus, an AI-powered shopping assistant that serves over 250 million customers with conversational shopping experiences. Initially launched using a custom in-house LLM specialized for shopping queries, the team later adopted Amazon Bedrock to accelerate development velocity by 6x, enabling rapid integration of state-of-the-art foundation models including Amazon Nova and Anthropic's Claude Sonnet. This multi-model approach combined with agentic capabilities like tool use, web grounding, and features such as price tracking and auto-buy resulted in monthly user growth of 140% year-over-year, interaction growth of 210%, and a 60% increase in purchase completion rates for customers using Rufus.

Scaling an Autonomous AI Customer Support Agent from Demo to Production

Intercom

Intercom developed Finn, an autonomous AI customer support agent, evolving it from early prototypes with GPT-3.5 to a production system using GPT-4 and custom architecture. Initially hampered by hallucinations and safety concerns, the system now successfully resolves 58-59% of customer support conversations, up from 25% at launch. The solution combines multiple AI processes including disambiguation, ranking, and summarization, with careful attention to brand voice control and escalation handling.

Scaling an MCP Server for Error Monitoring to 60 Million Monthly Requests

Sentry

Sentry, an error monitoring platform, built a Model Context Protocol (MCP) server to improve the workflow where developers would copy error details from Sentry's UI and paste them into AI coding assistants like Cursor. The MCP server provides direct integration with 10-15 tools, including retrieving issue details and triggering automated fix attempts through Sentry's AI agent. The implementation scaled from 30 million to 60 million requests per month, with over 5,000 organizations using it. The company learned critical lessons about treating MCP servers as production services, implementing comprehensive observability, managing context pollution, and taking responsibility for agent behavior through careful prompt engineering and tool description design.

Scaling and Operating Large Language Models at the Frontier

Anthropic

This case study examines Anthropic's journey in scaling and operating large language models, focusing on their transition from GPT-3 era training to current state-of-the-art systems like Claude. The company successfully tackled challenges in distributed computing, model safety, and operational reliability while growing 10x in revenue. Key innovations include their approach to constitutional AI, advanced evaluation frameworks, and sophisticated MLOps practices that enable running massive training operations with hundreds of team members.

Scaling Audio Content Generation with LLMs and TTS for Language Learning

Duolingo

Duolingo tackled the challenge of scaling their DuoRadio feature, a podcast-like audio learning experience, by implementing an AI-driven content generation pipeline. They transformed a labor-intensive manual process into an automated system using LLMs for script generation and evaluation, coupled with Text-to-Speech technology. This allowed them to expand from 300 to 15,000+ episodes across 25+ language courses in under six months, while reducing costs by 99% and growing daily active users from 100K to 5.5M.

Scaling Chatbot Platform with Hybrid LLM and Custom Model Approach

Voiceflow

Voiceflow, a chatbot and voice assistant platform, integrated large language models into their existing infrastructure while maintaining custom language models for specific tasks. They used OpenAI's API for generative features but kept their custom NLU model for intent/entity detection due to superior performance and cost-effectiveness. The company implemented extensive testing frameworks, prompt engineering, and error handling while dealing with challenges like latency variations and JSON formatting issues.

Scaling Contact Center Operations with AI Agents in Fintech and Travel Industries

Propel Holdings / Xanterra Travel Collection

Propel Holdings (fintech) and Xanterra Travel Collection (travel/hospitality) implemented Cresta's AI agent solutions to address scaling challenges and operational efficiency in their contact centers. Both organizations started with agent assist capabilities before deploying conversational AI agents for chat and voice channels. Propel Holdings needed to support 40% year-over-year growth without proportionally scaling human agents, while Xanterra sought to reduce call volume for routine inquiries and provide 24/7 coverage. Starting with FAQ-based use cases and later integrating APIs for transactional capabilities, both companies achieved significant results: Propel Holdings reached 58% chat containment after API integration, while Xanterra achieved 60-90% containment on chat and 20-30% on voice channels. Within five months, Xanterra deployed 12 AI agents across different properties and channels, demonstrating rapid scaling capability while maintaining customer satisfaction and redeploying human agents to higher-value interactions.

Scaling Content Production and Fan Engagement with Gen AI

Bundesliga

Bundesliga (DFL), Germany's premier soccer league, deployed multiple Gen AI solutions to address two key challenges: scaling content production for over 1 billion global fans across 200 countries, and enhancing personalized fan engagement to reduce "second screen chaos" during live matches. The organization implemented three main production-scale solutions: automated match report generation that saves editors 90% of their time, AI-powered story creation from existing articles that reduces production time by 80%, and on-demand video localization that cuts processing time by 75% while reducing costs by 3.5x. Additionally, they developed MatchMade, an AI-powered fan companion featuring dynamic text-to-SQL workflows and proactive content nudging. By leveraging Amazon Nova for cost-performance optimization alongside other models like Anthropic's Claude, Bundesliga achieved a 70% cost reduction in image assignment tasks, 35% cost reduction through dynamic routing, and scaled personalized content delivery by 5x per user while serving over 100,000 fans in production.

Scaling Custom AI Application Development Through Modular LLM Framework

BlackRock

BlackRock developed an internal framework to accelerate AI application development for investment operations, reducing development time from 3-8 months to a couple of days. The solution addresses challenges in document extraction, workflow automation, Q&A systems, and agentic systems by providing a modular sandbox environment for domain experts to iterate on prompt engineering and LLM strategies, coupled with an app factory for automated deployment. The framework emphasizes human-in-the-loop processes for compliance in regulated financial environments and enables rapid prototyping through configurable extraction templates, document management, and low-code transformation workflows.

Scaling Customer Support AI Chatbot to Production with Multiple LLM Providers

Intercom

Intercom developed Fin, an AI customer support chatbot that resolves up to 86% of conversations instantly. They faced challenges scaling from proof-of-concept to production, particularly around reliability and cost management. The team successfully improved their system from 99% to 99.9%+ reliability by implementing cross-region inference, strategic use of streaming, and multiple model fallbacks while using Amazon Bedrock and other LLM providers. The solution has processed over 13 million conversations for 4,000+ customers with most achieving over 50% automated resolution rates.

Scaling Customer Support with an LLM-Powered Conversational Chatbot

Coinbase

Coinbase faced the challenge of handling tens of thousands of monthly customer support queries that scaled unpredictably during high-traffic events like crypto bull runs. To address this, they developed the Conversational Coinbase Chatbot (CBCB), an LLM-powered system that integrates knowledge bases, real-time account APIs, and domain-specific logic through a multi-stage architecture. The solution enables the chatbot to deliver context-aware, personalized, and compliant responses while reducing reliance on human agents, allowing customer experience teams to focus on complex issues. CBCB employs multiple components including query rephrasing, semantic retrieval with ML-based ranking, response styling, and comprehensive guardrails to ensure accuracy, compliance, and scalability.

Scaling Customer Support, Compliance, and Developer Productivity with Gen AI

Coinbase

Coinbase, a cryptocurrency exchange serving millions of users across 100+ countries, faced challenges scaling customer support amid volatile market conditions, managing complex compliance investigations, and improving developer productivity. They built a comprehensive Gen AI platform integrating multiple LLMs through standardized interfaces (OpenAI API, Model Context Protocol) on AWS Bedrock to address these challenges. Their solution includes AI-powered chatbots handling 65% of customer contacts automatically (saving ~5 million employee hours annually), compliance investigation tools that synthesize data from multiple sources to accelerate case resolution, and developer productivity tools where 40% of daily code is now AI-generated or influenced. The implementation uses a multi-layered agentic architecture with RAG, guardrails, memory systems, and human-in-the-loop workflows, resulting in significant cost savings, faster resolution times, and improved quality across all three domains.

Scaling Document Processing with LLMs and Human Review

Vendr / Extend

Vendr partnered with Extend to extract structured data from SaaS order forms and contracts using LLMs. They implemented a hybrid approach combining LLM processing with human review to achieve high accuracy in entity recognition and data extraction. The system successfully processed over 100,000 documents, using techniques such as document embeddings for similarity clustering, targeted human review, and robust entity mapping. This allowed Vendr to unlock valuable pricing insights for their customers while maintaining high data quality standards.

Scaling Email Content Extraction Using LLMs in Production

Yahoo

Yahoo Mail faced challenges with their existing ML-based email content extraction system, hitting a coverage ceiling of 80% for major senders while struggling with long-tail senders and slow time-to-market for model updates. They implemented a new solution using Google Cloud's Vertex AI and LLMs, achieving 94% coverage for standard domains and 99% for tail domains, with 51% increase in extraction richness and 16% reduction in tracking API errors. The implementation required careful consideration of hybrid infrastructure, cost management, and privacy compliance while processing billions of daily messages.

Scaling ESG Compliance Analysis with RAG and Vector Search

IntellectAI

IntellectAI developed Purple Fabric, a platform-as-a-service that processes and analyzes ESG compliance data for a major sovereign wealth fund. Using MongoDB Atlas and Vector Search, they transformed the manual analysis of 100-150 companies into an automated system capable of processing over 8,000 companies' data across multiple languages, achieving over 90% accuracy in compliance assessments. The system processes 10 million documents in 30+ formats, utilizing RAG to provide real-time investment decision insights.

Scaling Finance Operations with Agentic AI in a High-Growth EV Manufacturer

Lucid Motors

Lucid Motors, a software-defined electric vehicle manufacturer, partnered with PWC and AWS to implement agentic AI solutions across their finance organization to prepare for massive growth with the launch of their mid-size vehicle platform. The company developed 14 proof-of-concept use cases in just 10 weeks, spanning demand forecasting, investor analytics, treasury, accounting, and internal audit functions. By leveraging AWS Bedrock and PWC's Agent OS orchestration layer, along with access to diverse data sources across SAP, Redshift, and Salesforce, Lucid is transforming finance from a traditional reporting function into a strategic competitive advantage that provides real-time predictive analytics and enables data-driven decision making at sapphire speed.

Scaling Financial Software with GenAI and Production ML

Ramp

Ramp, a financial technology company, has integrated AI and ML throughout their operations, from their core financial products to their sales and customer service. They evolved from traditional ML use cases like fraud detection and underwriting to more advanced generative AI applications. Their Ramp Intelligence suite now includes features like automated price comparison, expense categorization, and an experimental AI agent that can guide users through the platform's interface. The company has achieved significant productivity gains, with their sales development representatives booking 3-4x more meetings than competitors through AI augmentation.

Scaling Game Content Production with LLMs and Data Augmentation

Ubisoft

Ubisoft leveraged AI21 Labs' LLM capabilities to automate tedious scriptwriting tasks and generate training data for their internal models. By implementing a writer-in-the-loop workflow for NPC dialogue generation and using AI21's models for data augmentation, they successfully scaled their content production while maintaining creative control. The solution included optimized token pricing for extensive prompt experimentation and resulted in significant efficiency gains in their game development process.

Scaling Generative AI Features to Millions of Users with Infrastructure Optimization and Quality Evaluation

Slack

Slack faced significant challenges in scaling their generative AI features (Slack AI) to millions of daily active users while maintaining security, cost efficiency, and quality. The company needed to move from a limited, provisioned infrastructure to a more flexible system that could handle massive scale (1-5 billion messages weekly) while meeting strict compliance requirements. By migrating from SageMaker to Amazon Bedrock and implementing sophisticated experimentation frameworks with LLM judges and automated metrics, Slack achieved over 90% reduction in infrastructure costs (exceeding $20 million in savings), 90% reduction in cost-to-serve per monthly active user, 5x increase in scale, and 15-30% improvements in user satisfaction across features—all while maintaining quality and enabling experimentation with over 15 different LLMs in production.

Scaling Generative AI for Manufacturing Operations with RAG and Multi-Model Architecture

Georgia-Pacific

Georgia-Pacific, a forest products manufacturing company with 30,000+ employees and 140+ facilities, deployed generative AI to address critical knowledge transfer challenges as experienced workers retire and new employees struggle with complex equipment. The company developed an "Operator Assistant" chatbot using AWS Bedrock, RAG architecture, and vector databases to provide real-time troubleshooting guidance to factory operators. Starting with a 6-8 week MVP deployment in December 2023, they scaled to 45 use cases across multiple facilities within 7-8 months, serving 500+ users daily with improved operational efficiency and reduced waste.

Scaling Generative AI in Gaming: From Safety to Creation Tools

Roblox

Roblox has implemented a comprehensive suite of generative AI features across their gaming platform, addressing challenges in content moderation, code assistance, and creative tools. Starting with safety features using transformer models for text and voice moderation, they expanded to developer tools including AI code assistance, material generation, and specialized texture creation. The company releases new AI features weekly, emphasizing rapid iteration and public testing, while maintaining a balance between automation and creator control. Their approach combines proprietary solutions with open-source contributions, demonstrating successful large-scale deployment of AI in a production gaming environment serving 70 million daily active users.

Scaling LLM Application Observability Through Automated Conversation Clustering and Analysis

Manus

This case study presents a methodology for understanding and improving LLM applications at scale when manual review of conversations becomes infeasible. The core problem addressed is that traditional logging misses critical issues in AI applications, and teams face data paralysis when dealing with millions of complex, multi-turn agent conversations across multiple languages. The solution involves using LLMs themselves to automatically summarize, cluster, and analyze user conversations at scale, following a framework inspired by Anthropic's CLEO (Claude Language Insights and Observations) system. The presenter demonstrates this through Kura, an open-source library that summarizes conversations, generates embeddings, performs hierarchical clustering, and creates classifiers for ongoing monitoring. The approach enabled identification of high-leverage fixes (like adding two-line prompt changes for upselling that yielded 20-30% revenue increases) and helped Anthropic launch their educational product by analyzing patterns in one million student conversations. Results show that this systematic approach allows teams to prioritize fixes based on volume and impact, track improvements quantitatively, and scale their analysis capabilities beyond manual review limitations.

Scaling LLM Applications in Telecommunications: Learnings from Verizon and Industry Partners

Various

A panel discussion featuring Verizon, Anthropic, and Infosys executives sharing their experiences implementing LLM applications in telecommunications. The discussion covers multiple use cases including content generation, software development lifecycle enhancement, and customer service automation. Key challenges discussed include accuracy requirements, ROI justification, user adoption, and the need for proper evaluation frameworks when moving from proof of concept to production.

Scaling LLM-Powered Financial Insights with Continuous Evaluation

Fintool

Fintool, an AI equity research assistant, faced the challenge of processing massive amounts of financial data (1.5 billion tokens across 70 million document chunks) while maintaining high accuracy and trust for institutional investors. They implemented a comprehensive LLMOps evaluation workflow using Braintrust, combining automated LLM-based evaluation, golden datasets, format validation, and human-in-the-loop oversight to ensure reliable and accurate financial insights at scale.

Scaling LLMs for Product Knowledge and Search in E-commerce

Doordash

Doordash leverages LLMs to enhance their product knowledge graph and search capabilities as they expand into new verticals beyond food delivery. They employ LLM-assisted annotations for attribute extraction, use RAG for generating training data, and implement LLM-based systems for detecting catalog inaccuracies and understanding search intent. The solution includes distributed computing frameworks, model optimization techniques, and careful consideration of latency and throughput requirements for production deployment.

Scaling Local News Coverage with AI-Powered Newsletter Generation

Patch

Patch transformed its local news coverage by implementing AI-powered newsletter generation, enabling them to expand from 1,100 to 30,000 communities while maintaining quality and trust. The system combines curated local data sources, weather information, event calendars, and social media content, processed through AI to create relevant, community-specific newsletters. This approach resulted in over 400,000 new subscribers and a 93.6% satisfaction rating, while keeping costs manageable and maintaining editorial standards.

Scaling Meta AI's Feed Deep Dive from Launch to Product-Market Fit

Meta

Meta launched Feed Deep Dive as an AI-powered feature on Facebook in April 2024 to address information-seeking and context enrichment needs when users encounter posts they want to learn more about. The challenge was scaling from launch to product-market fit while maintaining high-quality responses at Meta scale, dealing with LLM hallucinations and refusals, and providing more value than users would get from simply scrolling Facebook Feed. Meta's solution involved evolving from traditional orchestration to agentic models with planning, tool calling, and reflection capabilities; implementing auto-judges for online quality evaluation; using smart caching strategies focused on high-traffic posts; and leveraging ML-based user cohort targeting to show the feature to users who derived the most value. The results included achieving product-market fit through improved quality and engagement, with the team now moving toward monetization and expanded use cases.

Scaling Multi-Agent Autonomous Coding Systems

Cursor

Cursor experimented with running hundreds of concurrent LLM-based coding agents autonomously for weeks on large-scale software projects. The problem was that single agents work well for focused tasks but struggle with complex projects requiring months of work. Their solution evolved from flat peer-to-peer coordination (which failed due to locking bottlenecks and risk-averse behavior) to a hierarchical planner-worker architecture where planner agents create tasks and worker agents execute them independently. Results included agents successfully building a web browser from scratch (1M+ lines of code over a week), completing a 3-week React migration (266K additions/193K deletions), optimizing video rendering by 25x, and running multiple other ambitious projects with thousands of commits and millions of lines of code.

Scaling Open-Ended Customer Service Analysis with Foundation Models

MaestroQA

MaestroQA enhanced their customer service quality assurance platform by integrating Amazon Bedrock to analyze millions of customer interactions at scale. They implemented a solution that allows customers to ask open-ended questions about their service interactions, enabling sophisticated analysis beyond traditional keyword-based approaches. The system successfully processes high volumes of transcripts across multiple regions while maintaining low latency, leading to improved compliance detection and customer sentiment analysis for their clients across various industries.

Scaling Order Processing Automation Using Modular LLM Architecture

Choco

Choco developed an AI system to automate the order intake process for food and beverage distributors, handling unstructured orders from various channels (email, voicemail, SMS, WhatsApp). By implementing a modular LLM architecture with specialized components for transcription, information extraction, and product matching, along with comprehensive evaluation pipelines and human feedback loops, they achieved over 95% prediction accuracy. One customer reported 60% reduction in manual order entry time and 50% increase in daily order processing capacity without additional staffing.

Scaling Parallel Agent Operations with LangChain and LangSmith Monitoring

Paradigm

Paradigm (YC24) built an AI-powered spreadsheet platform that runs thousands of parallel agents for data processing tasks. They utilized LangChain for rapid agent development and iteration, while leveraging LangSmith for comprehensive monitoring, operational insights, and usage-based pricing optimization. This enabled them to build task-specific agents for schema generation, sheet naming, task planning, and contact lookup while maintaining high performance and cost efficiency.

Scaling Privacy Infrastructure for GenAI Product Innovation

Meta

Meta addresses the challenge of maintaining user privacy while deploying GenAI-powered products at scale, using their AI glasses as a primary example. The company developed Privacy Aware Infrastructure (PAI), which integrates data lineage tracking, automated policy enforcement, and comprehensive observability across their entire technology stack. This infrastructure automatically tracks how user data flows through systems—from initial collection through sensor inputs, web processing, LLM inference calls, data warehousing, to model training—enabling Meta to enforce privacy controls programmatically while accelerating product development. The solution allows engineering teams to innovate rapidly with GenAI capabilities while maintaining auditable, verifiable privacy guarantees across thousands of microservices and products globally.

Scaling Product Categorization from Manual Tagging to LLM-Based Classification

GetYourGuide

GetYourGuide, a global marketplace for travel experiences, evolved their product categorization system from manual tagging to an LLM-based solution to handle 250,000 products across 600 categories. The company progressed through rule-based systems and semantic NLP models before settling on a hybrid approach using OpenAI's GPT-4-mini with structured outputs, combined with embedding-based ranking and batch processing with early stopping. This solution processes one product-category pair at a time, incorporating reasoning and confidence fields to improve decision quality. The implementation resulted in significant improvements: Matthew's Correlation Coefficient increased substantially, 50 previously excluded categories were reintroduced, 295 new categories were enabled, and A/B testing showed a 1.3% increase in conversion rate, improved quote rate, and reduced bounce rate.

Scaling Product Categorization with Batch Inference and Prompt Engineering

GoDaddy

GoDaddy sought to improve their product categorization system that was using Meta Llama 2 for generating categories for 6 million products but faced issues with incomplete/mislabeled categories and high costs. They implemented a new solution using Amazon Bedrock's batch inference capabilities with Claude and Llama 2 models, achieving 97% category coverage (exceeding their 90% target), 80% faster processing time, and 8% cost reduction while maintaining high quality categorization as verified by subject matter experts.

Scaling RAG Accuracy from 49% to 86% in Finance Q&A Assistant

Amazon Finance

Amazon Finance Automation developed a RAG-based Q&A chat assistant using Amazon Bedrock to help analysts quickly retrieve answers to customer queries. Through systematic improvements in document chunking, prompt engineering, and embedding model selection, they increased the accuracy of responses from 49% to 86%, significantly reducing query response times from days to minutes.

Scaling Search Query Understanding with LLMs: From POC to Production

Yelp

Yelp implemented LLMs to enhance their search query understanding capabilities, focusing on query segmentation and review highlights. They followed a systematic approach from ideation to production, using a combination of GPT-4 for initial development, creating fine-tuned smaller models for scale, and implementing caching strategies for head queries. The solution successfully improved search relevance and user engagement, while managing costs and latency through careful architectural decisions and gradual rollout strategies.

Scaling Trust and Safety Using LLMs at Tinder

Tinder

Tinder implemented a comprehensive LLM-based trust and safety system to combat various forms of harmful content at scale. The solution involves fine-tuning open-source LLMs using LoRA (Low-Rank Adaptation) for different types of violation detection, from spam to hate speech. Using the Lorax framework, they can efficiently serve multiple fine-tuned models on a single GPU, achieving real-time inference with high precision and recall while maintaining cost-effectiveness. The system demonstrates superior generalization capabilities against adversarial behavior compared to traditional ML approaches.

Scientific Intent Translation System for Healthcare Analytics Using Amazon Bedrock

Aetion

Aetion developed a Measures Assistant to help healthcare professionals translate complex scientific queries into actionable analytics measures using generative AI. By implementing Amazon Bedrock with Claude 3 Haiku and a custom RAG system, they created a production system that allows users to express scientific intent in natural language and receive immediate guidance on implementing complex healthcare data analyses. This reduced the time required to implement measures from days to minutes while maintaining high accuracy and security standards.

Security Learnings from LLM Production Deployments

NVIDIA

Based on a year of experience with NVIDIA's product security and AI red team, this case study examines real-world security challenges in LLM deployments, particularly focusing on RAG systems and plugin architectures. The study reveals common vulnerabilities in production LLM systems, including data leakage through RAG, prompt injection risks, and plugin security issues, while providing practical mitigation strategies for each identified threat vector.

Self-Improving Agentic Systems Using DSPy for Production Email Generation

Relevance AI

Relevance AI implemented DSPy-powered self-improving AI agents for outbound sales email composition, addressing the challenge of building truly adaptive AI systems that evolve with real-world usage. The solution integrates DSPy's optimization framework with a human-in-the-loop feedback mechanism, where agents pause for approval at critical checkpoints and incorporate corrections into their training data. Through this approach, the system achieved emails matching human-written quality 80% of the time and exceeded human performance in 6% of cases, while reducing agent development time by 50% through elimination of manual prompt tuning. The system demonstrates continuous improvement through automated collection of human-approved examples that feed back into DSPy's optimization algorithms.

Self-Learning Generative AI System for Product Catalog Enrichment

Amazon

Amazon's Catalog Team faced the challenge of extracting structured product attributes and generating quality content at massive scale while managing the tradeoff between model accuracy and computational costs. They developed a self-learning system using multiple smaller models working in consensus to process routine cases, with a supervisor agent using more capable models to investigate disagreements and generate reusable learnings stored in a dynamic knowledge base. This architecture, implemented with Amazon Bedrock, resulted in continuously declining error rates and reduced costs over time, as accumulated learnings prevented entire classes of future disagreements without requiring model retraining.

Semantic Data Processing at Scale with AI-Powered Query Optimization

DocETL

Shreyaa Shankar presents DocETL, an open-source system for semantic data processing that addresses the challenges of running LLM-powered operators at scale over unstructured data. The system tackles two major problems: how to make semantic operator pipelines scalable and cost-effective through novel query optimization techniques, and how to make them steerable through specialized user interfaces. DocETL introduces rewrite directives that decompose complex tasks and data to improve accuracy and reduce costs, achieving up to 86% cost reduction while maintaining target accuracy. The companion tool Doc Wrangler provides an interactive interface for iteratively authoring and debugging these pipelines. Real-world applications include public defenders analyzing court transcripts for racial bias and medical analysts extracting information from doctor-patient conversations, demonstrating significant accuracy improvements (2x in some cases) compared to baseline approaches.

Semantic Relevance Evaluation and Enhancement Framework for E-commerce Search

Etsy

Etsy's Search Relevance team developed a comprehensive Semantic Relevance Evaluation and Enhancement Framework to address the limitations of engagement-based search models that favored popular listings over semantically relevant ones. The solution employs a three-tier cascaded distillation approach: starting with human-curated "golden" labels, scaling with an LLM annotator (o3 model) to generate training data, fine-tuning a teacher model (Qwen 3 VL 4B) for efficient large-scale evaluation, and distilling to a lightweight BERT-based student model for real-time production inference. The framework integrates semantic relevance signals into search through filtering, feature enrichment, loss weighting, and relevance boosting. Between August and October 2025, the percentage of fully relevant listings increased from 58% to 62%, demonstrating measurable improvements in aligning search results with buyer intent while addressing the cold-start problem for smaller sellers.

Semantic Search for Aviation Safety Reports Using Embeddings and Hybrid Search

Beams

Beams, a startup operating in aviation safety, built a semantic search system to help airlines analyze thousands of safety reports written daily by pilots and ground crew. The problem they addressed was the manual, time-consuming process of reading through unstructured, technical, jargon-filled free-text reports to identify trends and manage risks. Their solution combined vector embeddings (using Azure OpenAI's text-embedding-3-large model) with PostgreSQL and PG Vector for similarity search, alongside a two-stage retrieval and reranking pipeline. They also integrated structured filtering with semantic search to create a hybrid search system. The system was deployed on AWS using Lambda functions, RDS with PostgreSQL, and SQS for event-driven orchestration. Results showed that users could quickly search through hundreds of thousands of reports using natural language queries, finding semantically similar incidents even when terminology varied, significantly improving efficiency in safety analysis workflows.

Six Principles for Building Production AI Agents

App.build

App.build shared six empirical principles learned from building production AI agents that help overcome common challenges in agentic system development. The principles focus on investing in system prompts with clear instructions, splitting context to manage costs and attention, designing straightforward tools with limited parameters, implementing feedback loops with actor-critic patterns, using LLMs for error analysis, and recognizing that frustrating agent behavior often indicates system design issues rather than model limitations. These guidelines emerged from practical experience in developing software engineering agents and emphasize systematic approaches to building reliable, recoverable agents that fail gracefully.

Smart Business Analyst: Automating Competitor Analysis in Medical Device Procurement

Philips

A procurement team developed an advanced LLM-powered system called "Smart Business Analyst" to automate competitor analysis in the medical device industry. The system addresses the challenge of gathering and analyzing competitor data across multiple dimensions, including features, pricing, and supplier relationships. Unlike general-purpose LLMs like ChatGPT, this solution provides precise numerical comparisons and leverages multiple data sources to deliver accurate, industry-specific insights, significantly reducing the time required for competitive analysis from hours to seconds.

Source-Grounded LLM Assistant with Multi-Modal Output Capabilities

Google / NotebookLLM

Google's NotebookLM tackles the challenge of making large language models more focused and personalized by introducing source grounding - allowing users to upload their own documents to create a specialized AI assistant. The system combines Gemini 1.5 Pro with sophisticated audio generation to create human-like podcast-style conversations about user content, complete with natural speech patterns and disfluencies. The solution includes built-in safety features, privacy protections through transient context windows, and content watermarking, while enabling users to generate insights from personal documents without contributing to model training data.

Specialized Language Models for Contact Center Transformation

Accenture

Accenture partnered with Databricks to transform a client's customer contact center by implementing specialized language models (SLMs) that go beyond simple prompt engineering. The client faced challenges with high call volumes, impersonal service, and missed revenue opportunities. Using Databricks' MLOps platform and GPU infrastructure, they developed and deployed fine-tuned language models that understand industry-specific context, cultural nuances, and brand styles, resulting in improved customer experience and operational efficiency. The solution includes real-time monitoring and multimodal capabilities, setting a new standard for AI-driven customer service operations.

Specialized Text Editing LLM Development through Instruction Tuning

Grammarly

Grammarly developed CoEdIT, a specialized text editing LLM that outperforms larger models while being up to 60 times smaller. Through targeted instruction tuning on a carefully curated dataset of text editing tasks, they created models ranging from 770M to 11B parameters that achieved state-of-the-art performance on multiple editing benchmarks, outperforming models like GPT-3-Edit (175B parameters) and ChatGPT in both automated and human evaluations.

SQL Generation and RAG for Financial Data Q&A Chatbot

Q4

Q4 Inc. developed a chatbot for Investor Relations Officers to query financial data using Amazon Bedrock and RAG with SQL generation. The solution addresses challenges with numerical and structured datasets by using LLMs to generate SQL queries rather than traditional RAG approaches, achieving high accuracy and single-digit second response times. The system uses multiple foundation models through Amazon Bedrock for different tasks (SQL generation, validation, summarization) optimized for performance and cost.

SQL Query Agent for Data Democratization

Prosus

Prosus developed a SQL-generating agent called "Token Data Analyst" to help democratize data access across their portfolio companies. The agent serves as a first-line support for data queries, allowing non-technical users to get insights from databases through natural language questions in Slack. The system achieved a 74% reduction in query response time and significantly increased the total number of data insights generated, while maintaining high accuracy through careful prompt engineering and context management.

Strategic Framework for Generative AI Implementation in Food Delivery Platform

Doordash

DoorDash outlines a comprehensive strategy for implementing Generative AI across five key areas: customer assistance, interactive discovery, personalized content generation, information extraction, and employee productivity enhancement. The company aims to revolutionize its delivery platform while maintaining strong considerations for data privacy and security, focusing on practical applications ranging from automated cart building to SQL query generation.

Strategic Implementation of Generative AI at Scale

TomTom

TomTom implemented a comprehensive generative AI strategy across their organization, using a hub-and-spoke model to democratize AI innovation. They successfully deployed multiple AI applications including a ChatGPT location plugin, an in-car AI assistant (Tommy), and internal tools for mapmaking and development, all without significant additional investment. The strategy focused on responsible AI use, workforce upskilling, and strategic partnerships with cloud providers, resulting in 30-60% task performance improvements.

Strategic LLM Implementation in Chemical Manufacturing with Focus on Documentation and Virtual Agents

Chevron Philips Chemical

Chevron Phillips Chemical is implementing generative AI with a focus on virtual agents and document processing, taking a measured approach to deployment. They formed a cross-functional team including legal, IT security, and data science to educate leadership and identify appropriate use cases. The company is particularly focusing on processing unstructured documents and creating virtual agents for specific topics, while carefully considering bias, testing challenges, and governance in their implementation strategy.

Streamlining Background Check Classification with Fine-tuned Small Language Models

Checkr

Checkr tackled the challenge of classifying complex background check records by implementing a fine-tuned small language model (SLM) solution. They moved from using GPT-4 to fine-tuning Llama-2 models on Predibase, achieving 90% accuracy for their most challenging cases while reducing costs by 5x and improving response times to 0.15 seconds. This solution helped automate their background check adjudication process, particularly for the 2% of complex cases that required classification into 230 distinct categories.

Streamlining Clinical Trial Documentation Generation with RAG and LLMs

Clario

Clario, a clinical trials endpoint data solutions provider, transformed their time-consuming manual documentation process by implementing a generative AI solution using Amazon Bedrock. The system automates the generation of business requirement specifications from medical imaging charter documents using RAG architecture with Amazon OpenSearch for vector storage and Claude 3.7 Sonnet for text generation. The solution improved accuracy, reduced manual errors, and significantly streamlined their documentation workflow while maintaining security and compliance requirements.

Streamlining Corporate Audits with GenAI-Powered Document Processing

Hapag-Lloyd

Hapag-Lloyd faced challenges with time-consuming manual corporate audit processes. They implemented a GenAI solution using Databricks Mosaic AI to automate audit finding generation and executive summary creation. By fine-tuning the DBRX model and implementing a RAG-based chatbot, they achieved a 66% decrease in time spent creating new findings and a 77% reduction in executive summary review time, significantly improving their audit efficiency.

Structured AI Workflow Orchestration for Developer Productivity at Scale

Shopify

Shopify's Augmented Engineering team developed Roast, an open-source workflow orchestration framework that structures AI agents to solve developer productivity challenges like flaky tests and low test coverage. The team discovered that breaking complex AI tasks into discrete, structured steps was essential for reliable performance at scale, leading them to create a convention-over-configuration tool that combines deterministic code execution with AI-powered analysis, enabling reproducible and testable AI workflows that can be version-controlled and integrated into development processes.

Structured LLM Conversations for Language Learning Video Calls

Duolingo

Duolingo implemented an AI-powered video call feature called "Video Call with Lily" that enables language learners to practice speaking with an AI character. The system uses carefully structured prompts, conversational blueprints, and dynamic evaluations to ensure appropriate difficulty levels and natural interactions. The implementation includes memory management to maintain conversation context across sessions and separate processing steps to prevent LLM overload, resulting in a personalized and effective language learning experience.

Structured Workflow Orchestration for Large-Scale Code Operations with Claude

Shopify

Shopify's augmented engineering team developed ROAST, an open-source workflow orchestration tool designed to address challenges of maintaining developer productivity at massive scale (5,000+ repositories, 500,000+ PRs annually, millions of lines of code). The team recognized that while agentic AI tools like Claude Code excel at exploratory tasks, deterministic structured workflows are better suited for predictable, repeatable operations like test generation, coverage optimization, and code migrations. By interleaving Claude Code's non-deterministic agentic capabilities with ROAST's deterministic workflow orchestration, Shopify created a bidirectional system where ROAST can invoke Claude Code as a tool within workflows, and Claude Code can execute ROAST workflows for specific steps. The solution has rapidly gained adoption within Shopify, reaching 500 daily active users and 250,000 requests per second at peak, with developers praising the combination for minimizing instruction complexity at each workflow step and reducing entropy accumulation in multi-step processes.

Student Innovation with Claude: Multi-Domain AI Applications from Education to National Security

Various

This case study presents four distinct student-led projects that leverage Claude (Anthropic's LLM) through API credits provided to thousands of students. The projects span multiple domains: Isabelle from Stanford developed a computational simulation using CERN's Geant4 software to detect nuclear weapons in space via X-ray inspection systems for national security verification; Mason from UC Berkeley learned to code through a top-down approach with Claude, building applications like CalGPT for course scheduling and GetReady for codebase visualization; Rohill from UC Berkeley created SideQuest, a system where AI agents hire humans for physical tasks using computer vision verification; and Daniel from USC developed Claude Cortex, a multi-agent system that dynamically creates specialized agents for parallel reasoning and enhanced decision-making. These projects demonstrate Claude's capabilities in education, enabling students to tackle complex problems ranging from nuclear non-proliferation to AI-human collaboration frameworks.

Supervised Fine-Tuning for AI-Powered Travel Recommendations

Booking.com

Booking.com built an AI Trip Planner to handle unstructured, natural language queries from travelers seeking personalized recommendations. The challenge was combining LLMs' ability to understand conversational requests with years of structured behavioral data (searches, clicks, bookings). Instead of relying solely on prompt engineering with external APIs, they used supervised fine-tuning on open-source LLMs with parameter-efficient methods. This approach delivered superior recommendation metrics while achieving 3x faster inference compared to prompt-based solutions, while maintaining data privacy and security by keeping all processing internal.

Supply Chain Intelligence Platform Using Compound AI Systems

Altana

Altana, a global supply chain intelligence company, faced challenges in efficiently deploying and managing multiple GenAI models for diverse customer use cases. By implementing Databricks Mosaic AI platform, they transformed their ML lifecycle management, combining custom deep learning models with fine-tuned LLMs and RAG workflows. This led to 20x faster model deployment times and 20-50% performance improvements, while maintaining data privacy and governance requirements across their global operations.

Swarm-Coding with Multiple Background Agents for Large-Scale Code Maintenance

Faire

Faire implemented "swarm-coding" using GitHub Copilot's background agents to automate tedious engineering tasks like cleaning up expired feature flags and migrating test infrastructure. By coordinating multiple autonomous AI agents working in parallel, they enabled non-engineers to land simple code changes and freed up engineering teams to focus on innovation rather than maintenance work. Within the first month of deployment, 18% of the engineering team adopted the approach, merging over 500 Copilot pull requests with an average time savings of 39.6 minutes per PR and a 25% increase in overall PR volume among users. The company enhanced the background agents through custom instructions, MCP (Model Context Protocol) servers, and programmatic task assignment to create specialized agent profiles for common workflows.

Synthetic Consumer Survey Generation Using LLMs with Semantic Similarity Response Mapping

Colgate

PyMC Labs partnered with Colgate to address the limitations of traditional consumer surveys for product testing by developing a novel synthetic consumer methodology using large language models. The challenge was that standard approaches of asking LLMs to provide numerical ratings (1-5) resulted in biased, middle-of-the-road responses that didn't reflect real consumer behavior. The solution involved allowing LLMs to provide natural text responses which were then mapped to quantitative scales using embedding similarity to reference responses. This approach achieved 90% of the maximum achievable correlation with real survey data, accurately reproduced demographic effects including age and income patterns, eliminated positivity bias present in human surveys, and provided richer qualitative feedback while being faster and cheaper than traditional surveys.

Synthetic Data Generation for Privacy-Preserving Search Evaluation

Canva

Canva faced the challenge of evaluating and improving their private design search functionality for 200M monthly active users while maintaining strict privacy constraints that prevented viewing actual user designs or queries. The company developed a novel solution using GPT-4o to generate entirely synthetic but realistic test datasets, including design content, titles, and queries at various difficulty levels. This LLM-powered approach enabled engineers to run reproducible offline evaluations in under 10 minutes using local testcontainers, achieving 300x faster iteration cycles compared to traditional A/B testing while maintaining strong correlation with online experiment results, all without compromising user privacy.

System Prompt Learning for Coding Agents Using LLM-as-Judge Evaluation

Arize

This case study explores how Arize applied "system prompt learning" to improve the performance of production coding agents (Claude and Cline) without model fine-tuning. The problem addressed was that coding agents rely heavily on carefully crafted system prompts that require continuous iteration, but traditional reinforcement learning approaches are sample-inefficient and resource-intensive. Arize's solution involved an iterative process using LLM-as-judge evaluations to generate English-language feedback on agent failures, which was then fed into a meta-prompt to automatically generate improved system prompt rules. Testing on the SWEBench benchmark with just 150 examples, they achieved a 5% improvement in GitHub issue resolution for Claude and 15% for Cline, demonstrating that well-engineered evaluation prompts can efficiently optimize agent performance with minimal training data compared to approaches like DSPy's MIPRO optimizer.

Systematic AI Application Improvement Through Evaluation-Driven Development

Ragas, Various

This case study presents Ragas' comprehensive approach to improving AI applications through systematic evaluation practices, drawn from their experience working with various enterprises and early-stage startups. The problem addressed is the common challenge of AI engineers making improvements to LLM applications without clear measurement frameworks, leading to ineffective iteration cycles and poor user experiences. The solution involves a structured evaluation methodology encompassing dataset curation, human annotation, LLM-as-judge scaling, error analysis, experimentation, and continuous feedback loops. The results demonstrate that teams can move from subjective "vibe checks" to objective, data-driven improvements that systematically enhance AI application performance and user satisfaction.

Systematic Analysis of Prompt Templates in Production LLM Applications

Uber, Microsoft

The research analyzes real-world prompt templates from open-source LLM-powered applications to understand their structure, composition, and effectiveness. Through analysis of over 2,000 prompt templates from production applications like those from Uber and Microsoft, the study identifies key components, patterns, and best practices for template design. The findings reveal that well-structured templates with specific patterns can significantly improve LLMs' instruction-following abilities, potentially enabling weaker models to achieve performance comparable to more advanced ones.

Systematic Approach to Building Reliable LLM Data Processing Pipelines Through Iterative Development

DocETL

UC Berkeley researchers studied how organizations struggle with building reliable LLM pipelines for unstructured data processing, identifying two critical gaps: data understanding and intent specification. They developed DocETL, a research framework that helps users systematically iterate on LLM pipelines by first understanding failure modes in their data, then clarifying prompt specifications, and finally applying accuracy optimization strategies, moving beyond the common advice of simply "iterate on your prompts."

Systematic LLM Evaluation Framework for Content Generation

Canva

Canva developed a systematic framework for evaluating LLM outputs in their design transformation feature called Magic Switch. The framework focuses on establishing clear success criteria, codifying these into measurable metrics, and using both rule-based and LLM-based evaluators to assess content quality. They implemented a comprehensive evaluation system that measures information preservation, intent alignment, semantic order, tone appropriateness, and format consistency, while also incorporating regression testing principles to ensure prompt improvements don't negatively impact other metrics.

T-RAG: Tree-Based RAG Architecture for Question Answering Over Organizational Documents

Qatar Computing Research Institute

Qatar Computing Research Institute developed a novel question-answering system for organizational documents combining RAG, finetuning, and a tree-based entity structure. The system, called T-RAG, handles confidential documents on-premise using open source LLMs and achieves 73% accuracy on test questions, outperforming baseline approaches while maintaining robust entity tracking through a custom tree structure.

Test-Driven Vibe Development: Integrating Quality Engineering with AI Code Generation

Asos

ASOS, a major e-commerce retailer, developed Test-Driven Vibe Development (TDVD), a novel methodology that combines test-first quality engineering practices with LLM-driven code generation to address the quality and reliability challenges of "vibe coding." The company applied this approach to build an internal stock discrepancy reporting system, using AI agents to generate both tests and code in a structured workflow that prioritizes acceptance test-driven development (ATDD), behavior-driven development (BDD), and test-driven development (TDD). With a team of effectively 2.5 people working part-time, they delivered a full-stack MVP (backend API, Azure Functions, React frontend) in 4 weeks—representing a 7-10x acceleration compared to traditional development estimates—while maintaining quality through continuous validation against predefined test requirements and catching hallucinations early in the development cycle.

Testing and Evaluation Strategies for AI-Powered Code Editor with Agentic Editing

Zed

Zed, an AI-enabled code editor built from scratch in Rust, implemented comprehensive testing and evaluation strategies to ensure reliable agentic editing capabilities. The company faced the challenge of maintaining their rigorous empirical testing approach while dealing with the non-deterministic nature of LLM outputs. They developed a multi-layered approach combining stochastic testing with deterministic unit tests, addressing issues like streaming edits, XML tag parsing, indentation handling, and escaping behaviors. Through statistical testing methods running hundreds of iterations and setting pass/fail thresholds, they successfully deployed reliable AI-powered code editing features that work effectively with frontier models like Claude 4.

Text-to-Floor Plan Generation Using LLMs with Prompt Engineering and Fine-Tuning

ZURU

ZURU Tech, a construction technology company, collaborated with AWS to develop a text-to-floor plan generator that allows users to create building designs using natural language descriptions. The project aimed to improve upon existing GPT-2 baseline results by implementing both prompt engineering with Claude 3.5 Sonnet on Amazon Bedrock and fine-tuning approaches with Llama models on Amazon SageMaker. Through careful dataset preparation, dynamic few-shot prompting, and comprehensive evaluation frameworks, the team achieved a 109% improvement in instruction adherence accuracy compared to their baseline model, with fine-tuning also delivering a 54% improvement in mathematical correctness for spatial relationships and dimensions.

Text-to-SQL AI Agent for Democratizing Data Access in Slack

Salesforce

Salesforce built Horizon Agent, an internal text-to-SQL Slack agent, to address a data access gap where engineers and data scientists spent dozens of hours weekly writing custom SQL queries for non-technical users. The solution combines Large Language Models with Retrieval-Augmented Generation (RAG) to allow users to ask natural language questions in Slack and receive SQL queries, answers, and explanations within seconds. After launching in Early Access in August 2024 and reaching General Availability in January 2025, the system freed technologists from routine query work and enabled non-technical users to self-serve data insights in minutes instead of waiting hours or days, transforming the role of technical staff from data gatekeepers to guides.

Text-to-SQL Solution for Data Democratization in Food Delivery Operations

Swiggy

Swiggy, a food delivery and quick commerce company, developed Hermes, a text-to-SQL solution that enables non-technical users to query company data using natural language through Slack. The problem addressed was the significant time and technical expertise required for teams to access specific business metrics, creating bottlenecks in decision-making. The solution evolved from a basic GPT-3.5 implementation (V1) to a sophisticated RAG-based architecture with GPT-4o (V2) that compartmentalizes business units into "charters" with dedicated metadata and knowledge bases. Results include hundreds of users across the organization answering several thousand queries with average turnaround times under 2 minutes, dramatically improving data accessibility for product managers, data scientists, and analysts while reducing dependency on technical resources.

Text-to-SQL System for Complex Healthcare Database Queries

MSD

MSD collaborated with AWS Generative Innovation Center to implement a text-to-SQL solution using Amazon Bedrock and Anthropic's Claude models to translate natural language queries into SQL for complex healthcare databases. The system addresses challenges like coded columns, non-intuitive naming, and complex medical code lists through custom lookup tools and prompt engineering, significantly reducing query time from hours to minutes while democratizing data access for non-technical staff.

Text-to-SQL System with RAG-Enhanced Table Selection

Pinterest

Pinterest developed a Text-to-SQL system to help data analysts convert natural language questions into SQL queries. The system evolved through two iterations: first implementing a basic LLM-powered SQL generator integrated into their Querybook tool, then enhancing it with RAG-based table selection to help users identify relevant tables from their vast data warehouse. The implementation showed a 35% improvement in task completion speed for SQL query writing, with first-shot acceptance rates improving from 20% to over 40% as the system matured.

Text-to-SQL System with Structured RAG and Comprehensive Evaluation

ICE / NYSE

ICE/NYSE developed a text-to-SQL application using structured RAG to enable business users to query financial data without needing SQL knowledge. The system leverages Databricks' Mosaic AI stack including Unity Catalog, Vector Search, Foundation Model APIs, and Model Serving. They implemented comprehensive evaluation methods using both syntactic and execution matching, achieving 77% syntactic accuracy and 96% execution match across approximately 50 queries. The system includes continuous improvement through feedback loops and few-shot learning from incorrect queries.

The Hidden Complexities of Building Production LLM Features: Lessons from Honeycomb's Query Assistant

Honeycomb

Honeycomb shares candid insights from building Query Assistant, their natural language to query interface, revealing the complex reality behind LLM-powered product development. Key challenges included managing context window limitations with large schemas, dealing with LLM latency (2-15+ seconds per query), navigating prompt engineering without established best practices, balancing correctness with usefulness, addressing prompt injection vulnerabilities, and handling legal/compliance requirements. The article emphasizes that successful LLM implementation requires treating models as feature engines rather than standalone products, and argues that early access programs often fail to reveal real-world implementation challenges.

Thinking Machines' Tinker: Low-Level Fine-Tuning API for Production LLM Training

Thinking Machines

Thinking Machines, a new AI company founded by former OpenAI researcher John Schulman, has developed Tinker, a low-level fine-tuning API designed to enable sophisticated post-training of language models without requiring teams to manage GPU infrastructure or distributed systems complexity. The product aims to abstract away infrastructure concerns while providing low-level primitives for expressing nearly all post-training algorithms, allowing researchers and companies to build custom models without developing their own training infrastructure. The company plans to release their own models and expand Tinker's capabilities to include multimodal functionality and larger-scale training jobs, while making the platform more accessible to non-experts through higher-level tooling.

Tool Masking for Enterprise Agentic AI Systems at Scale

Databook

Databook, which automates sales processes for large tech companies like Microsoft, Salesforce, and AWS, faced challenges running reliable agentic AI workflows at enterprise scale. The primary problem was that connecting services through Model Context Protocol (MCP) exposed entire APIs to LLMs, polluting execution with irrelevant data, increasing tokens and costs, and reducing reliability through "choice entropy." Their solution involved implementing "tool masks"—a configuration layer between agents and tool handlers that filters and reshapes input/output schemas, customizes tool interfaces per agent context, and enables prompt engineering of tools themselves. This approach resulted in cleaner, faster, more reliable agents with reduced costs, better self-correction capabilities, and the ability to rapidly adapt to customer requirements without code deployments.

Training and Deploying AI Coding Agents at Scale with GPT-5 Codex

OpenAI

OpenAI's Bill and Brian discuss their work on GPT-5 Codex and Codex Max, AI coding agents designed for production use. The team focused on training models with specific "personalities" optimized for pair programming, including traits like communication, planning, and self-checking behaviors. They trained separate model lines: Codex models optimized specifically for their agent harness with strong opinions about tool use (particularly terminal tools), and mainline GPT-5 models that are more general and steerable across different tooling environments. The result is a coding agent that OpenAI employees trust for production work, with approximately 50% of OpenAI staff using it daily, and some engineers like Brian claiming they haven't written code by hand in months. The team emphasizes the shift toward shipping complete agents rather than just models, with abstractions moving upward to enable developers to build on top of pre-configured agentic systems.

Transforming a Late-Stage SaaS Company into an AI-First Agent Business

Intercom

Intercom successfully pivoted from a struggling traditional customer support SaaS business facing near-zero growth to an AI-first agent-based company through the development and deployment of Fin, their AI customer service agent. CEO Eoghan McCabe implemented a top-down transformation strategy involving strategic focus, cultural overhaul, aggressive cost-cutting, and significant investment in AI talent and infrastructure. The company went from low single-digit growth to becoming one of the fastest-growing B2B software companies, with Fin projected to surpass $100 million ARR within three quarters and growing at over 300% year-over-year.

Transforming a Voice Assistant from Scripted Commands to Generative AI Conversation at Scale

AWS (Alexa)

AWS (Alexa) faced the challenge of evolving their voice assistant from scripted, command-based interactions to natural, generative AI-powered conversations while serving over 600 million devices and maintaining complete backward compatibility with existing integrations. The team completely rearchitected Alexa using large language models (LLMs) to create Alexa Plus, which supports conversational interactions, complex multi-step planning, and real-world action execution. Through extensive experimentation with prompt engineering, multi-model architectures, speculative execution, prompt caching, API refactoring, and fine-tuning, they achieved the necessary balance between accuracy, latency (sub-2-second responses), determinism, and model flexibility required for a production voice assistant serving hundreds of millions of users daily.

Transforming Agent and Customer Experience with Generative AI in Health Insurance

nib

nib, an Australian health insurance provider covering approximately 2 million people, transformed both customer and agent experiences using AWS generative AI capabilities. The company faced challenges around contact center efficiency, agent onboarding time, and customer service scalability. Their solution involved deploying a conversational AI chatbot called "Nibby" built on Amazon Lex, implementing call summarization using large language models to reduce after-call work, creating an internal knowledge-based GPT application for agents, and developing intelligent document processing for claims. These initiatives resulted in approximately 60% chat deflection, $22 million in savings from Nibby alone, and a reported 50% reduction in after-call work time through automated call summaries, while significantly improving agent onboarding and overall customer experience.

Transforming HR Operations with AI-Powered Solutions at Scale

Nubank

Nubank, a rapidly growing fintech company with over 8,000 employees across multiple countries, faced challenges in managing HR operations at scale while maintaining employee experience quality. The company deployed multiple AI and LLM-powered solutions to address these challenges: AskNu, a Slack-based AI assistant for instant access to internal information; generative AI for analyzing thousands of open-ended employee feedback comments from engagement surveys; time-series forecasting models for predicting employee turnover; machine learning models for promotion budget planning; and AI quality scoring for optimizing their internal knowledge base (WikiPeople). These initiatives resulted in measurable improvements including 14 percentage point increase in turnover prediction accuracy, faster insights from employee feedback, more accurate promotion forecasting, and enhanced knowledge accessibility across the organization.

Transforming Insurance Agent Support with RAG-Powered Chat Assistant

InsuranceDekho

InsuranceDekho addressed the challenge of slow response times in insurance agent queries by implementing a RAG-based chat assistant using Amazon Bedrock and Anthropic's Claude Haiku. The solution eliminated the need for constant SME consultation, cached frequent responses using Redis, and leveraged OpenSearch for vector storage, resulting in an 80% reduction in response times for customer queries about insurance plans.

Troubleshooting and Optimizing RAG Pipelines: Lessons from Production

Lemonade

A comprehensive analysis of common challenges and solutions in implementing RAG (Retrieval Augmented Generation) pipelines at Lemonade, an insurance technology company. The case study covers issues ranging from missing content and retrieval problems to reranking challenges, providing practical solutions including data cleaning, prompt engineering, hyperparameter tuning, and advanced retrieval strategies.

Tuning RAG Search for Production Customer Support Chatbot

Elastic

Elastic's Field Engineering team developed and improved a customer support chatbot using RAG and LLMs. They faced challenges with search relevance, particularly around CVE and version-specific queries, and implemented solutions including hybrid search strategies, AI-generated summaries, and query optimization techniques. Their improvements resulted in a 78% increase in search relevance for top-3 results and generated over 300,000 AI summaries for future applications.

UI/UX Design Considerations for Production GenAI Chatbots

Elastic

Elastic's Field Engineering team developed a customer support chatbot, focusing on crucial UI/UX design considerations for production deployment. The case study details how they tackled challenges including streaming response handling, timeout management, context awareness, and user engagement through carefully designed animations. The team created a custom chat interface using their EUI component library, implementing innovative solutions for handling long-running LLM requests and managing multiple types of contextual information in a user-friendly way.

Unified Data Foundation for AI-Fueled Mortgage and Home Ownership Platform

Rocket

Rocket Companies, America's largest mortgage provider serving 1 in 6 mortgages, transformed its fragmented data landscape into a unified data foundation to support AI-driven home ownership services. The company consolidated 10+ petabytes of data from 12+ OLTP systems into a single S3-based data lake using open table formats like Apache Iceberg and Parquet, creating standardized data products (Customer 360, Mortgage 360, Transaction 360) accessible via APIs. This foundation enabled 210+ machine learning models running in full automation, reduced mortgage approval times from weeks to under 8 minutes, and powered production agentic AI applications that provide real-time business intelligence to executives. The integration of acquired companies (Redfin and Mr. Cooper) resulted in a 20% increase in refinance pipeline, 3x industry recapture rate, 10% lift in conversion rates, and 9-point improvement in banker follow-ups.

Unified Property Management Search and Digital Assistant Using Amazon Bedrock

CBRE

CBRE, the world's largest commercial real estate services firm, faced challenges with fragmented property data scattered across 10 distinct sources and four separate databases, forcing property management professionals to manually search through millions of documents and switch between multiple systems. To address this, CBRE partnered with AWS to build a next-generation unified search and digital assistant experience within their PULSE system using Amazon Bedrock, Amazon OpenSearch Service, and other AWS services. The solution combines retrieval augmented generation (RAG), multiple foundation models (Amazon Nova Pro for SQL generation and Claude Haiku for document interaction), and advanced prompt engineering to provide natural language query capabilities across both structured and unstructured data. The implementation achieved significant results including a 67% reduction in SQL query generation time (from 12 seconds to 4 seconds with Amazon Nova Pro), 80% improvement in database query performance, 60% reduction in token usage through optimized prompt architecture, and 95% accuracy in search results, ultimately enhancing operational efficiency and enabling property managers to make faster, more informed decisions.

Unlocking Patient Population Insights Using Smart Subgroups and LLMs

Aetion

Aetion developed a system to help healthcare researchers discover patterns in patient populations using natural language queries. The solution combines unsupervised machine learning for patient clustering with Amazon Bedrock and Claude 3 LLMs to enable natural language interaction with the data. This allows users unfamiliar with real-world healthcare data to quickly discover patterns and generate hypotheses, reducing analysis time from days to minutes while maintaining scientific rigor.

Usability Challenges in Commercial AI Agent Systems: A Study of Industry Aspirations vs. User Realities

Carnegie Mellon

This research study addresses the gap between how AI agents are marketed by the technology industry and how end-users actually experience them in practice. Researchers from Carnegie Mellon conducted a systematic review of 102 commercial AI agent products to understand industry positioning, identifying three core use case categories: orchestration (automating GUI tasks), creation (generating structured documents), and insight (providing analysis and recommendations). They then conducted a usability study with 31 participants attempting representative tasks using popular commercial agents (Operator and Manus), revealing five critical usability barriers: misalignment between agent capabilities and user mental models, premature trust assumptions, inflexible collaboration styles, overwhelming communication overhead, and lack of meta-cognitive abilities. While users generally succeeded at assigned tasks and were impressed with the technology, these barriers significantly impacted the user experience and highlighted the disconnect between marketed capabilities and practical usability.

User Journey Identification Using LLMs for Personalized Recommendations

Pinterest

Pinterest sought to evolve from a simple content recommendation platform to an inspiration-to-realization platform by understanding users' underlying, long-term goals through identifying "user journeys" - sequences of interactions centered on particular interests and intents. To address the challenge of limited training data, Pinterest built a hybrid system that dynamically extracts keywords from user activities, performs hierarchical clustering to identify journey candidates, and then applies specialized models for journey ranking, stage prediction, naming, and expansion. The team leveraged pretrained foundation models and increasingly incorporated LLMs for tasks like journey naming, expansion, and relevance evaluation. Initial experiments with journey-aware notifications demonstrated substantial improvements, including an 88% higher email click rate and 32% higher push open rate compared to interest-based notifications, along with a 23% increase in positive user feedback.

Using Evaluation Systems and Inference-Time Scaling for Beautiful, Scannable QR Code Generation

Modal

Modal's engineering team tackled the challenge of generating aesthetically pleasing QR codes that consistently scan by implementing comprehensive evaluation systems and inference-time compute scaling. The team developed automated evaluation pipelines that measured both scan rate and aesthetic quality, using human judgment alignment to validate their metrics. They applied inference-time compute scaling by generating multiple QR codes in parallel and selecting the best candidates, achieving a 95% scan rate service-level objective while maintaining aesthetic quality and returning results in under 20 seconds.

Using GenAI to Automatically Fix Java Resource Leaks

Uber

Uber developed FixrLeak, a framework combining generative AI and Abstract Syntax Tree (AST) analysis to automatically detect and fix resource leaks in Java code. The system processes resource leaks identified by SonarQube, analyzes code safety through AST, and uses GPT-4 to generate appropriate fixes. When tested on 124 resource leaks in Uber's codebase, FixrLeak successfully automated fixes for 93 out of 102 eligible cases, significantly reducing manual intervention while maintaining code quality.

Using LLMs for Automated Opinion Summary Evaluation in E-commerce

Flipkart

Flipkart faced the challenge of evaluating AI-generated opinion summaries of customer reviews, where traditional metrics like ROUGE failed to align with human judgment and couldn't comprehensively assess summary quality across multiple dimensions. The company developed OP-I-PROMPT, a novel single-prompt framework that uses LLMs as evaluators across seven critical dimensions (fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, and specificity), along with SUMMEVAL-OP, a new benchmark dataset with 2,912 expert annotations. The solution achieved a 0.70 Spearman correlation with human judgments, significantly outperforming previous approaches especially on open-source models like Mistral-7B, while demonstrating that high-quality summaries directly impact business metrics like conversion rates and product return rates.

Using LLMs to Combat Health Insurance Claim Denials

Fight Health Insurance

Fight Health Insurance is an open-source project that uses fine-tuned large language models to help people appeal denied health insurance claims in the United States. The system processes denial letters, extracts relevant information, and generates appeal letters based on training data from independent medical review boards. The project addresses the widespread problem of insurance claim denials by automating the complex and time-consuming process of crafting effective appeals, making it accessible to individuals who lack the resources or knowledge to navigate the appeals process themselves. The tool is available both as an open-source Python package and as a free hosted service, though the sustainability model is still being developed.

Using LLMs to Enhance Search Discovery and Recommendations

Instacart

Instacart integrated LLMs into their search stack to enhance product discovery and user engagement. They developed two content generation techniques: a basic approach using LLM prompting and an advanced approach incorporating domain-specific knowledge from query understanding models and historical data. The system generates complementary and substitute product recommendations, with content generated offline and served through a sophisticated pipeline. The implementation resulted in significant improvements in user engagement and revenue, while addressing challenges in content quality, ranking, and evaluation.

Using LLMs to Scale Insurance Operations at a Small Company

Anzen

Anzen, a small insurance company with under 20 people, leveraged LLMs to compete with larger insurers by automating their underwriting process. They implemented a document classification system using BERT and AWS Textract for information extraction, achieving 95% accuracy in document classification. They also developed a compliance document review system using sentence embeddings and question-answering models to provide immediate feedback on legal documents like offer letters.

Using RAG to Improve Industry Classification Accuracy

Ramp

Ramp tackled the challenge of inconsistent industry classification by developing an in-house Retrieval-Augmented Generation (RAG) system to migrate from a homegrown taxonomy to standardized NAICS codes. The solution combines embedding-based retrieval with a two-stage LLM classification process, resulting in improved accuracy, better data quality, and more precise customer understanding across teams. The system includes comprehensive logging and monitoring capabilities, allowing for quick iterations and performance improvements.

Using Token Log-Probabilities to Detect and Filter LLM Hallucinations in Customer Support

Gusto

Gusto developed a method to improve the reliability of their LLM-based customer support system by using token log-probabilities as a confidence metric. The approach monitors sequence log-probability scores to identify and filter out potentially hallucinated or low-quality LLM responses. In their case study, they found a 69% relative difference in accuracy between high and low confidence responses, with the highest confidence responses achieving 76% accuracy compared to 45% for the lowest confidence responses.

Variable Aggression Code Autocomplete with Fine-Tuned LLMs

Windsurf

Windsurf developed Tab v2, an AI-powered code autocomplete system that addresses the challenge of balancing prediction frequency, accuracy, and code length in developer tooling. The team reimagined their LLM-based autocomplete by focusing on total keystrokes saved rather than just acceptance rate, implementing extensive context engineering to reduce prompt length by 76%, and using reinforcement learning to train models with different "aggression" levels. The result was a 54% average increase in characters per prediction and 25-75% more accepted code, with user-selectable aggression parameters allowing developers to customize behavior based on personal preferences.

Video Content Summarization and Metadata Enrichment for Streaming Platform

Paramount+

Paramount+ partnered with Google Cloud Consulting to develop two key AI use cases: video summarization and metadata extraction for their streaming platform containing over 50,000 videos. The project used Gen AI jumpstarts to prototype solutions, implementing prompt chaining, embedding generation, and fine-tuning approaches. The system was designed to enhance content discoverability and personalization while reducing manual labor and third-party costs. The implementation included a three-component architecture handling transcription creation, content generation, and personalization integration.

Voice AI Agent Development and Production Challenges

Various (Canonical, Prosus, DeepMind)

Panel discussion with experts from various companies exploring the challenges and solutions in deploying voice AI agents in production. The discussion covers key aspects of voice AI development including real-time response handling, emotional intelligence, cultural adaptation, and user retention. Experts shared experiences from e-commerce, healthcare, and tech sectors, highlighting the importance of proper testing, prompt engineering, and understanding user interaction patterns for successful voice AI deployments.