ZenML LLMOps Database

Tag: google_gcp (304 entries)


A Practical Blueprint for Evaluating Conversational AI at Scale

Dropbox

Dropbox shares their comprehensive approach to building and evaluating Dropbox Dash, their conversational AI product. Ad-hoc testing had led to unpredictable regressions: a change to any part of the LLM pipeline (intent classification, retrieval, ranking, prompt construction, or inference) could cause previously correct answers to fail. They responded with an evaluation-first methodology that treats every experimental change like production code, requiring rigorous testing before merging. Their solution involved curating diverse datasets (both public and internal), defining actionable metrics using LLM-as-judge approaches that outperformed traditional metrics like BLEU and ROUGE, adopting the Braintrust evaluation platform, and automating evaluation throughout the development-to-production pipeline. The result is a robust system with layered gates that catch regressions early, continuous live-traffic scoring for production monitoring, and a feedback loop for continuous improvement that significantly improved reliability and deployment safety.
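
A minimal sketch of the LLM-as-judge gate described above, assuming a generic OpenAI-compatible client and an invented rubric and threshold (Dropbox runs this through the Braintrust platform; none of the names below are theirs):

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with a single integer score from 1 (wrong) to 5 (fully correct)."""

def judge(question: str, reference: str, candidate: str) -> int:
    # Deterministic judging: temperature 0, single-integer rubric.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

def regression_gate(cases: list[dict], threshold: float = 4.0) -> bool:
    """Block the merge if the mean judge score drops below the threshold."""
    scores = [judge(c["q"], c["ref"], c["out"]) for c in cases]
    return sum(scores) / len(scores) >= threshold
```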

Abstractive Conversation Summarization for Google Chat Spaces

Google

Google deployed an abstractive summarization system to automatically generate conversation summaries in Google Chat Spaces to address information overload from unread messages, particularly in hybrid work environments. The solution leveraged the Pegasus transformer model fine-tuned on a custom ForumSum dataset of forum conversations, then distilled into a hybrid transformer-encoder/RNN-decoder architecture for lower latency. The system surfaces summaries through cards when users enter Spaces with unread messages, with quality controls including heuristics for triggering, detection of low-quality summaries, and ephemeral caching of pre-generated summaries to reduce latency, ultimately delivering production value to premium Google Workspace business customers.

Agent Registry and Dynamic Prompt Management for AI Feature Development

GitLab

GitLab faced challenges with delivering prompt improvements for their AI-powered issue description generation feature, particularly for self-managed customers who don't update frequently. They developed an Agent Registry system within their AI Gateway that abstracts provider models, prompts, and parameters, allowing for rapid prompt updates and model switching without requiring monolith changes or new releases. This system enables faster iteration on AI features and seamless provider switching while maintaining a clean separation of concerns.

Agent-Based Workflow Automation in Spreadsheets for Non-Technical Users

Otto

Otto, founded by Suli Omar, addresses the challenge of making AI agents accessible to non-technical users by embedding agent workflows directly into spreadsheet interfaces. The company transforms unstructured data processing tasks into spreadsheet-based workflows where each cell acts as an autonomous agent capable of executing tasks, waiting for dependencies, and outputting structured results. By leveraging the familiar spreadsheet UX instead of traditional chatbot interfaces, Otto enables finance teams, accountants, and other business users to harness agent capabilities without requiring technical expertise. The solution involves sophisticated model selection across three tiers (workhorse, middle-tier, and heavy reasoning models) to optimize cost and performance, continuous evaluation through customer usage patterns, and iterative model testing to maintain service quality as new LLM capabilities emerge.

Agent-First AI Development Platform with Multi-Surface Orchestration

Google Deepmind

Google DeepMind launched Antigravity, an agent-first AI development platform powered by Gemini 3 Pro and designed to handle increasingly complex, long-running software development tasks. The platform addresses the challenge of managing AI agents operating across multiple surfaces (editor, browser, and agent manager) by introducing "artifacts": dynamic representations that help organize agent outputs and enable asynchronous feedback. The solution emerged from close collaboration between product and research teams at DeepMind, creating a feedback loop where internal dogfooding identified model gaps and drove improvements. The initial launch experienced capacity constraints due to high demand, but users who accessed the product reported significant workflow improvements from the multi-surface agent orchestration approach.

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

Agentic RAG Implementation for Retail Personalization and Customer Support

MongoDB

MongoDB and Dataworkz partnered to implement an agentic RAG (Retrieval Augmented Generation) solution for retail and e-commerce applications. The solution combines MongoDB Atlas's vector search capabilities with Dataworkz's RAG builder to create a scalable system that integrates operational data with unstructured information. This enables personalized customer experiences through intelligent chatbots, dynamic product recommendations, and enhanced search functionality, while maintaining context-awareness and real-time data access.
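
The retrieval step of such an agentic RAG flow can be sketched against MongoDB Atlas directly; the connection string, index, collection, field names, and embedding model below are assumptions, not details from the case study:

```python
from pymongo import MongoClient
from openai import OpenAI

openai_client = OpenAI()
coll = MongoClient("mongodb+srv://...")["retail"]["products"]  # your Atlas URI

def retrieve(question: str, k: int = 5) -> list[dict]:
    # Embed the question, then run an Atlas $vectorSearch aggregation.
    qvec = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    return list(coll.aggregate([{
        "$vectorSearch": {
            "index": "product_embeddings",  # Atlas vector search index name
            "path": "embedding",
            "queryVector": qvec,
            "numCandidates": 100,
            "limit": k,
        }
    }]))

docs = retrieve("waterproof hiking boots under $150")
```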

AI Agent Development and Evaluation Platform for Insurance Underwriting

Snorkel

Snorkel developed a comprehensive benchmark dataset and evaluation framework for AI agents in commercial insurance underwriting, working with Chartered Property and Casualty Underwriters (CPCUs) to create realistic scenarios for small business insurance applications. The system leverages LangGraph and Model Context Protocol to build ReAct agents capable of multi-tool reasoning, database querying, and user interaction. Evaluation across multiple frontier models revealed significant challenges in tool use accuracy (36% error rate), hallucination issues where models introduced domain knowledge not present in guidelines, and substantial variance in performance across different underwriting tasks, with accuracy ranging from single digits to 80% depending on the model and task complexity.

AI Agent for Real Estate Legal Document Analysis and Lease Reporting

Orbital

Orbital Witness developed Orbital Copilot, an AI agent specifically designed for real estate legal work, to address the time-intensive nature of legal due diligence and lease reporting. The solution evolved from classical machine learning models through LLM-based approaches to a sophisticated agentic architecture that combines planning, memory, and tool use capabilities. The system analyzes hundreds of pages across multiple legal documents, answers complex queries by following information trails across documents, and provides transparent reasoning with source citations. Deployed with prestigious law firms including BCLP, Clifford Chance, and others, Orbital Copilot demonstrated up to 70% time savings on lease reporting tasks, translating to significant cost reductions for complex property analyses that typically require 2-10+ hours of lawyer time.

AI Agents and Intelligent Observability for DevOps Modernization

HRS Group / Netflix / Harness

This panel discussion brings together engineering leaders from HRS Group, Netflix, and Harness to explore how AI is transforming DevOps and SRE practices. The panelists address the challenge of teams spending excessive time on reactive monitoring, alert triage, and incident response, often wading through thousands of logs and ambiguous signals. The solution involves integrating AI agents and generative models into CI/CD pipelines, observability workflows, and incident management to enable predictive analysis, intelligent rollouts, automated summarization, and faster root cause analysis. Results include dramatically reduced mean time to resolution (from hours to minutes), elimination of low-level toil, improved context-aware decision making, and the ability to move from reactive monitoring to proactive, machine-speed remediation while maintaining human accountability for critical business decisions.

AI Agents for Interpretability Research: Experimenter Agents in Production

Goodfire

Goodfire, an AI interpretability research company, deployed AI agents extensively for conducting experiments in their research workflow over several months. They distinguish between "developer agents" (for software development) and "experimenter agents" (for research and discovery), identifying key architectural differences needed for the latter. Their solution, code-named Scribe, leverages Jupyter notebooks with interactive, stateful access via MCP (Model Context Protocol), enabling agents to iteratively run experiments across domains like genomics, vision transformers, and diffusion models. Results showed agents successfully discovering features in genomics models, performing circuit analysis, and executing complex interpretability experiments, though validation, context engineering, and preventing reward hacking remain significant challenges that require human oversight and critic systems.

AI Agents in Production: Multi-Enterprise Implementation Strategies

Canva / KPMG / Autodesk / Lightspeed

This comprehensive case study examines how multiple enterprises (Autodesk, KPMG, Canva, and Lightspeed) are deploying AI agents in production to transform their go-to-market operations. The companies faced challenges around scaling AI from proof-of-concept to production, managing agent quality and accuracy, and driving adoption across diverse teams. Using the Relevance AI platform, these organizations built multi-agent systems for use cases including personalized marketing automation, customer outreach, account research, data enrichment, and sales enablement. Results include significant time savings (tasks taking hours reduced to minutes), improved pipeline generation, increased engagement rates, faster customer onboarding, and the successful scaling of AI agents across multiple departments while maintaining data security and compliance standards.

AI Managed Services and Agent Operations at Enterprise Scale

PricewaterhouseCoopers

PricewaterhouseCoopers (PwC) addresses the challenge of deploying and maintaining AI systems in production through their managed services practice focused on data analytics and AI. The organization has developed frameworks for deploying AI agents in enterprise environments, particularly in healthcare and back-office operations, using their Agent OS framework built on Python. Their approach emphasizes process standardization, human-in-the-loop validation, continuous model tuning, and comprehensive measurement through evaluations to ensure sustainable AI operations at scale. Results include successful deployments in healthcare pre-authorization processes and the establishment of specialized AI managed services teams comprising MLOps engineers and data scientists who continuously optimize production models.

AI Strategy and LLM Application Development in Swedish Public Sector

Swedish Tax Authority

The Swedish Tax Authority (Skatteverket) has been on a multi-decade digitalization journey, progressively incorporating AI and large language models into production systems to automate and enhance tax services. The organization has developed various NLP applications including text categorization, transcription, OCR pipelines, and question-answering systems using RAG architectures. They have tested both open-source models (Llama 3.1, Mixtral 8x7B, Cohere) and commercial solutions (GPT-3.5), finding that open-source models perform comparably on simpler queries while commercial models excel at complex questions. The Authority operates within a regulated environment requiring on-premise deployment for sensitive data, adopting Agile/SAFe methodologies and building reusable AI infrastructure components that can serve multiple business domains across different public sector silos.

AI-Assisted Database Debugging Platform at Scale

Databricks

Databricks built an agentic AI platform to help engineers debug thousands of OLTP database instances across hundreds of regions on AWS, Azure, and GCP. The platform addresses the problem of fragmented tooling and dispersed expertise by unifying metrics, logs, and operational workflows into a single intelligent interface with a chat assistant. The solution reduced debugging time by up to 90%, enabled new engineers to start investigations in under 5 minutes, and has achieved company-wide adoption, fundamentally changing how engineers interact with their infrastructure.

AI-Driven Collateral Allocation Optimization in Fintech

Mercado Libre

Mercado Pago, the fintech arm of Mercado Libre, faced the challenge of optimizing collateral allocation across billions of dollars in credit lines secured from major banks, requiring daily selection from millions of loans with complex contractual constraints. The company developed Enigma, a solution leveraging linear programming via Google OR-Tools combined with a custom grouping heuristic to handle scalability challenges. While the article primarily focuses on traditional optimization techniques rather than LLMs, it hints at future AI agent exploration for enhanced analytics, strategic constraint proposals, and automated translation of contractual conditions into mathematical constraints, representing a potential future evolution toward LLM integration in financial operations.
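
The core of the approach, selecting loans subject to contractual constraints via mathematical optimization, can be illustrated with a toy OR-Tools model; the balances, collateral target, and objective below are invented for illustration:

```python
from ortools.linear_solver import pywraplp

loans = [120.0, 80.0, 200.0, 150.0, 60.0]  # loan balances (hypothetical)
target = 400.0                              # collateral required by the bank

solver = pywraplp.Solver.CreateSolver("SCIP")
x = [solver.BoolVar(f"x{i}") for i in range(len(loans))]  # pledge loan i?

# Pledge at least the required collateral...
solver.Add(sum(x[i] * loans[i] for i in range(len(loans))) >= target)
# ...while minimizing over-collateralization (pledging as little as possible).
solver.Minimize(sum(x[i] * loans[i] for i in range(len(loans))))

if solver.Solve() == pywraplp.Solver.OPTIMAL:
    picked = [i for i in range(len(loans)) if x[i].solution_value() > 0.5]
    print("pledged loans:", picked)
```

Enigma's grouping heuristic exists because the real problem has millions of loans and many constraints; the sketch above only shows the shape of the formulation.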

AI-Driven Documentation Generation for dbt Data Models

Loblaw Digital

Loblaw Digital addressed the challenge of maintaining comprehensive documentation for over 3,000 dbt data models across their analytics engineering infrastructure. Manual documentation proved labor-intensive and often led to incomplete or outdated documentation that confused business users. The team implemented an LLM-based solution using the open-source dbt-documentor tool integrated with Google Cloud's Vertex AI platform, which automatically generates descriptions for models and their columns by ingesting dbt's manifest.json files without accessing actual data. This automation significantly improved documentation coverage and productivity while maintaining data security, enabling analysts to better understand model purposes and dependencies through the dbt documentation website.
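
A hedged sketch of the metadata-only generation step: read a model's entry from dbt's manifest.json and ask a Vertex AI model for a description. The project, region, model name, and manifest node id are assumptions:

```python
import json

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")
model = GenerativeModel("gemini-1.5-flash")

# manifest.json contains model metadata and compiled SQL, but no table data.
manifest = json.load(open("target/manifest.json"))
node = manifest["nodes"]["model.analytics.orders"]  # hypothetical model id

prompt = (
    "Write a one-paragraph business-facing description of this dbt model.\n"
    f"Name: {node['name']}\n"
    f"SQL:\n{node.get('raw_code', '')}\n"
    f"Columns: {list(node.get('columns', {}))}"
)
print(model.generate_content(prompt).text)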

AI-Powered Automated Issue Resolution Achieving State-of-the-Art Performance on SWE-bench

Trae

Trae developed an AI engineering system that achieved 70.6% accuracy on the SWE-bench Verified benchmark, setting a new state-of-the-art record for automated software issue resolution. The solution combines multiple large language models (Claude 3.7, Gemini 2.5 Pro, and OpenAI o4-mini) in a sophisticated multi-stage pipeline featuring generation, filtering, and voting mechanisms. The system uses specialized agents including a Coder agent for patch generation, a Tester agent for regression testing, and a Selector agent that employs both syntax-based voting and multi-selection voting to identify the best solution from multiple candidate patches.

AI-Powered Background Coding Agents for Large-Scale Software Maintenance

Spotify

Spotify faced the challenge of scaling complex code migrations and maintenance tasks across thousands of repositories, where their existing Fleet Management system handled simple transformations well but required specialized expertise for complex changes. They integrated AI coding agents into their Fleet Management platform, allowing engineers to define fleet-wide code changes using natural language prompts instead of writing complex AST manipulation scripts. Since February 2025, this approach has generated over 1,500 merged pull requests handling complex tasks like language modernization, breaking API changes, and UI component migrations, achieving 60-90% time savings compared to manual implementation while expanding to ad hoc background coding tasks accessible via Slack and GitHub.

AI-Powered Call Center Agents for Healthcare Operations

HeyRevia

HeyRevia has developed an AI call center solution specifically for healthcare operations, where over 30% of operations run on phone calls. Their system uses AI agents to handle complex healthcare-related calls, including insurance verifications, prior authorizations, and claims processing. The solution incorporates real-time audio processing, context understanding, and sophisticated planning capabilities to achieve performance that reportedly exceeds human operators while maintaining compliance with healthcare regulations.

AI-Powered Chatbot Automation with Hybrid NLU and LLM Approach

Scotiabank

Scotiabank developed a hybrid chatbot system combining traditional NLU with modern LLM capabilities to handle customer service inquiries. They created an innovative "AI for AI" approach using three ML models (nicknamed Luigi, Eva, and Peach) to automate the review and improvement of chatbot responses, resulting in 80% time savings in the review process. The system includes LLM-powered conversation summarization to help human agents quickly understand customer contexts, marking the bank's first production use of generative AI features.

AI-Powered Contact Center Copilot: From Research to Enterprise-Scale Production

Cresta / OpenAI

Cresta, founded in 2017 by Stanford PhD students with OpenAI research experience, developed an AI copilot system for contact center agents that provides real-time suggestions during customer conversations. The company tackled the challenge of transforming academic NLP and reinforcement learning research into production-grade enterprise software by building domain-specific models fine-tuned on customer conversation data. Starting with Intuit as their first customer through an unconventional internship arrangement, they demonstrated measurable ROI through A/B testing, showing improved conversion rates and agent productivity. The solution evolved from custom LSTM and transformer models to leveraging pre-trained foundation models like GPT-3/4 with fine-tuning, ultimately serving Fortune 500 customers across telecommunications, airlines, and banking with demonstrated value including a pilot generating $100 million in incremental revenue.

AI-Powered Content Understanding and Ad Targeting Platform

Dotdash

Dotdash Meredith, a major digital publisher, developed an AI-powered system called Decipher that understands user intent from content consumption to deliver more relevant advertising. Through a strategic partnership with OpenAI, they enhanced their content understanding capabilities and expanded their targeting platform across the premium web. The system outperforms traditional cookie-based targeting while maintaining user privacy, proving that high-quality content combined with AI can drive better business outcomes.

AI-Powered Customer Interest Generation for Personalized E-commerce Recommendations

Wayfair

Wayfair developed a GenAI-powered system to generate nuanced, free-form customer interests that go beyond traditional behavioral models and fixed taxonomies. Using Google's Gemini LLM, the system processes customer search queries, product views, cart additions, and purchase history to infer deep insights about preferences, functional needs, and lifestyle values. These LLM-generated interests power personalized product carousels on the homepage and product detail pages, driving measurable engagement and revenue gains while enabling more transparent and adaptable personalization at scale across millions of customers.

AI-Powered Customer Service and Call Center Transformation with Multi-Agent Systems

Fastweb / Vodafone

Fastweb / Vodafone, a major European telecommunications provider serving 9.5 million customers in Italy, transformed their customer service operations by building two AI agent systems to address the limitations of traditional customer support. They developed Super TOBi, a customer-facing agentic chatbot system, and Super Agent, an internal tool that empowers call center consultants with real-time diagnostics and guidance. Built on LangGraph and LangChain with Neo4j knowledge graphs and monitored through LangSmith, the solution achieved a 90% correctness rate, 82% resolution rate, 5.2/7 Customer Effort Score for Super TOBi, and over 86% One-Call Resolution rate for Super Agent, delivering faster response times and higher customer satisfaction while reducing agent workload.

AI-Powered Data Copilot for Autonomous Analysis in IDEs

BlaBlaCar

BlaBlaCar developed an AI-powered Data Copilot to address the inefficient workflow between Software Engineers and Data Analysts, where engineers lacked data warehouse access and analysts were overwhelmed with repetitive queries. The solution embeds an LLM-powered assistant directly in VS Code that connects to BigQuery, provides contextual business logic from curated queries, generates SQL and Python code with unit tests, and enables engineers to perform their own analyses with data health checks as guardrails. The tool leverages a "zero-infrastructure" RAG approach using VS Code's native capabilities and GitHub Copilot, treating analyses as code artifacts in pull requests that analysts review, resulting in faster question resolution (from weeks to minutes) and freeing analysts to focus on high-value modeling work.

AI-Powered Food Image Generation System at Scale

Delivery Hero

Delivery Hero built a comprehensive AI-powered image generation system to address the problem that 86% of food products lacked images, which significantly impacted conversion rates. The solution involved implementing both text-to-image generation and image inpainting workflows using Stable Diffusion models, with extensive optimization for cost efficiency and quality assurance. The system successfully generated over 100,000 production images, achieved 6-8% conversion rate improvements, and reduced costs to under $0.003 per image through infrastructure optimization and model fine-tuning.
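
The inpainting half of the workflow might look like the following diffusers sketch; the checkpoint, prompt, and mask handling are assumptions rather than Delivery Hero's actual configuration:

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# The mask marks the region to regenerate (white = inpaint, black = keep).
image = Image.open("dish.png").convert("RGB").resize((512, 512))
mask = Image.open("plate_mask.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a fresh margherita pizza on a white plate, studio food photography",
    image=image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]
result.save("generated_dish.png")
```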

AI-Powered Home Loan Guardian for Mortgage Refinancing

Lendi

Lendi, an Australian FinTech company, developed Guardian, an agentic AI application to transform the home loan refinancing experience. The company identified that homeowners lacked visibility into their mortgage positions and faced cumbersome refinancing processes, while brokers spent excessive time on administrative tasks. Using Amazon Bedrock's foundation models, Lendi built a multi-agent system deployed on Amazon EKS that monitors loan competitiveness, tracks equity positions in real-time, and streamlines refinancing through conversational AI. The solution was developed in 16 weeks and has already settled millions in home loans with significantly reduced refinance cycle times, enabling customers to complete refinancing in as little as 10 minutes through the Rate Radar feature.

AI-Powered Image Generation for Customizable Grocery Products

Instacart

Instacart's FoodStorm Order Management System faced the challenge of providing high-quality product images for countless customizable grocery items like deli sandwiches, cakes, and prepared foods, where professional photography for every configuration was impractical and costly. The solution involved integrating generative AI image generation capabilities through Instacart's internal Pixel service (which provides access to Google Imagen and other models) directly into FoodStorm's user interface, allowing grocery retailers to create product images on-demand with customizable prompts. Through multiple design iterations, the system evolved from simple one-click generation to a sophisticated interface where users can fine-tune prompts, preview multiple variations, and inspect details for quality control, ultimately enabling retailers to efficiently produce images for ingredients, toppings, promotional banners, and category thumbnails across the Instacart platform.

AI-Powered Natural Language Flight Search Implementation

Alaska Airlines

Alaska Airlines implemented a natural language destination search system powered by Google Cloud's Gemini LLM to transform their flight booking experience. The system moves beyond traditional flight search by allowing customers to describe their desired travel experience in natural language, considering multiple constraints and preferences simultaneously. The solution integrates Gemini with Alaska Airlines' existing flight data and customer information, ensuring recommendations are grounded in actual available flights and pricing.

AI-Powered Personal Health Coach Using Gemini Models

Fitbit

Fitbit developed an AI-powered personal health coach to address the fragmented and generic nature of traditional health and fitness guidance. Using Gemini models within a multi-agent framework, the system provides proactive, personalized, and adaptive coaching grounded in behavioral science and individual health metrics such as sleep and activity data. The solution employs a conversational agent for orchestration, a data science agent for numerical reasoning on physiological time series, and domain expert agents for specialized guidance. The system underwent extensive validation through the SHARP evaluation framework, involving over 1 million human annotations and 100k hours of expert evaluation across multiple health disciplines. The health coach entered public preview for eligible US-based Fitbit Premium users, providing personalized insights, goal setting, and adaptive plans to build sustainable health habits.

AI-Powered Personalized Content Recommendations for Sports and Entertainment Venue

Golden State Warriors

The Golden State Warriors implemented a recommendation engine powered by Google Cloud's Vertex AI to personalize content delivery for their fans across multiple platforms. The system integrates event data, news content, game highlights, retail inventory, and user analytics to provide tailored recommendations for both sports events and entertainment content at Chase Center. The solution enables personalized experiences for 18,000+ venue seats while operating with limited technical resources.

AI-Powered Similar Issues Detection for Project Management

Linear

Linear developed a Similar Issues matching feature to address the persistent challenge of duplicate issues and backlog management in large team workflows. The solution uses large language models to generate vector embeddings that capture the semantic meaning of issue descriptions, enabling accurate detection of related or duplicate issues across their project management platform. The feature integrates at multiple touchpoints—during issue creation, in the Triage inbox, and within support integrations like Intercom—allowing teams to identify duplicates before they enter the system. The implementation uses PostgreSQL with pgvector on Google Cloud Platform for vector storage and search, with partitioning strategies to handle tens of millions of issues at scale.
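
A minimal sketch of the lookup, assuming an embeddings API and a pgvector column; Linear's actual schema, embedding model, and partitioning strategy are not spelled out in this summary:

```python
import psycopg2
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def similar_issues(title: str, body: str, limit: int = 5):
    vec = embed(f"{title}\n{body}")
    conn = psycopg2.connect("dbname=issues")  # hypothetical database
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, title, embedding <=> %s::vector AS distance
            FROM issues          -- <=> is pgvector's cosine distance
            ORDER BY distance
            LIMIT %s
            """,
            (str(vec), limit),  # pgvector accepts '[0.1, 0.2, ...]' literals
        )
        return cur.fetchall()
```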

Architecture Patterns for Production AI Systems: Lessons from Building and Failing with Generative AI Products

Outropy

Phil Calçado shares a post-mortem analysis of Outropy, a failed AI productivity startup that served thousands of users, revealing why most AI products struggle in production. Despite having superior technology compared to competitors like Salesforce's Slack AI, Outropy failed commercially but provided valuable insights into building production AI systems. Calçado argues that successful AI products require treating agents as objects and workflows as data pipelines, applying traditional software engineering principles rather than falling into "Twitter-driven development" or purely data science approaches.

Auto-generated Document Summaries Using Abstractive Summarization

Google

Google Docs implemented automatic document summary generation to help users manage the volume of documents they receive daily. The challenge was to create concise, high-quality summaries that capture document essence while maintaining writer control over the final output. Google developed a solution based on Pegasus, a Transformer-based abstractive summarization model with custom pre-training, combined with careful data curation focusing on quality over quantity, knowledge distillation to optimize serving efficiency (distilling to a Transformer encoder + RNN decoder hybrid), and TPU-based serving infrastructure. The feature was launched for Google Workspace business customers, providing 1-2 sentence suggestions that writers can accept, edit, or ignore, helping both document creators and readers navigate content more efficiently.

Automated Inventory Counting with Multimodal LLMs in Grocery Fulfillment

Picnic

Picnic, an online grocery delivery company, implemented a multimodal LLM-based computer vision system to automate inventory counting in their automated warehouse. The manual stock counting process was time-consuming at scale, and traditional approaches like weighing scales proved unreliable due to measurement variance. The solution involved deploying camera setups to capture high-quality images of grocery totes, using Google Gemini's multimodal models with carefully crafted prompts and supply chain reference images to count products. Through fine-tuning, they achieved performance comparable to expensive pro-tier models using cost-effective flash models, deployed via a Fast API service with LiteLLM as a proxy layer for model interchangeability, and implemented continuous validation through selective manual checks.
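
A sketch of a single counting call through LiteLLM's provider-agnostic interface, which is what makes swapping flash and pro models cheap; the prompt and model alias are assumptions:

```python
import base64

import litellm

with open("tote.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = litellm.completion(
    model="gemini/gemini-1.5-flash",  # swappable behind the proxy layer
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Count the products in this grocery tote. "
                     "Answer with a single integer."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```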

Automated LLM Evaluation and Quality Monitoring in Customer Support Analytics

Echo AI

Echo AI, leveraging Log10's platform, developed a system for analyzing customer support interactions at scale using LLMs. They faced the challenge of maintaining accuracy and trust while processing high volumes of customer conversations. The solution combined Echo AI's conversation analysis capabilities with Log10's automated feedback and evaluation system, resulting in a 20-point F1 score improvement in accuracy and the ability to automatically evaluate LLM outputs across various customer-specific use cases.

Automated Product Classification and Attribute Extraction Using Vision LLMs

Shopify

Shopify tackled the challenge of automatically understanding and categorizing millions of products across their platform by implementing a multi-step Vision LLM solution. The system extracts structured product information including categories and attributes from product images and descriptions, enabling better search, tax calculation, and recommendations. Through careful fine-tuning, evaluation, and cost optimization, they scaled the solution to handle tens of millions of predictions daily while maintaining high accuracy and managing hallucinations.

Automating Community Conference Operations with AI Coding Agents

PyCon

A volunteer-run conference organization (PyData/PyConDE) with events serving up to 1,500 attendees faced significant operational overhead in managing tickets, marketing, video production, and community engagement. Over a three-month period, the team experimented with various AI coding agents (Claude, Gemini, Qwen Coder Plus, Codex) to automate tasks including LinkedIn scraping for social media content, automated video cutting using computer vision, ticket management integration, and multi-step workflow automation. The results were mixed: while AI agents proved valuable for well-documented API integration, boilerplate code generation, and specific automation tasks like screenshot capture and video processing, they struggled with multi-step procedural workflows, data normalization, and maintaining code quality without close human oversight. The team concluded that AI agents work best when kept on a "short leash" with narrow use cases, frequent commits, and human validation, delivering time savings for generalist tasks but requiring careful expectation management and not delivering the "10x productivity" improvements often claimed.

Automating Search Engine Marketing Ad Generation with Multi-Stage LLM Pipeline

Thumbtack

Thumbtack faced significant challenges with their manual Search Engine Marketing (SEM) ad creation process, where 80% of ad assets were generic templates across all ad groups, leading to suboptimal performance and requiring extensive manual effort. They developed a multi-stage LLM-powered solution that automates the generation, review, and grouping of Google Responsive Search Ads (RSAs) headlines and descriptions, incorporating specific keywords and value propositions for each ad group. The implementation was rolled out in four phases, with initial proof-of-concept showing 20% increase in traffic and 10% increase in conversions, and the final phase demonstrating statistically significant improvements in click-through rates and conversion value using Google's Drafts and Experiments feature for robust measurement.

Automating Supplier Ticket Management with LLM Agents

Wayfair

Wayfair developed Wilma, an LLM-based ticket automation system, to automate the manual triage of supplier support tickets in their SupportHub JIRA-based system. The solution uses LangGraph to orchestrate LLM calls and tool interactions for intent classification, language detection, and supplier ID lookup through a ReAct agent with BigQuery access. The system achieved better-than-human performance with 93% accuracy on question type identification (vs. 75% human accuracy), 98% on language detection, and 88% on supplier ID identification, while reducing processing time and allowing associates to focus on higher-value work.
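
A hedged sketch of the orchestration pattern named above: a LangGraph ReAct agent with one BigQuery lookup tool. The table, query, and model are invented; Wayfair's actual tools and prompts are not shown in the summary:

```python
from google.cloud import bigquery
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

bq = bigquery.Client()

@tool
def lookup_supplier_id(supplier_name: str) -> str:
    """Return the supplier ID for a supplier name, if one exists."""
    job = bq.query(
        "SELECT supplier_id FROM `support.suppliers` "
        "WHERE LOWER(name) = LOWER(@name) LIMIT 1",
        job_config=bigquery.QueryJobConfig(query_parameters=[
            bigquery.ScalarQueryParameter("name", "STRING", supplier_name)]),
    )
    for row in job.result():
        return str(row[0])
    return "no supplier found"

# The ReAct loop decides when to call the tool while triaging a ticket.
agent = create_react_agent(ChatOpenAI(model="gpt-4o-mini"),
                           tools=[lookup_supplier_id])
result = agent.invoke({"messages": [
    ("user", "Which supplier ID belongs to 'Acme Home Goods'?")]})
print(result["messages"][-1].content)
```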

Autonomous Coding Agent Evolution: From Short-Burst to Extended Runtime Operations

Replit

Replit evolved their AI coding agent from V1 (running autonomously for only a couple of minutes) to V2 (running for 10-15 minutes of productive work) through significant rearchitecting and leveraging new frontier models. The company focuses on enabling non-technical users to build complete applications without writing code, emphasizing performance and cost optimization over latency while maintaining comprehensive observability through tools like Langsmith to manage the complexity of production AI agents at scale.

Autonomous SRE Agent for Cloud Infrastructure Monitoring Using FastMCP

FuzzyLabs

FuzzyLabs developed an autonomous Site Reliability Engineering (SRE) agent using Anthropic's Model Context Protocol (MCP) with FastMCP to automate the diagnosis of production incidents in cloud-native applications. The agent integrates with Kubernetes, GitHub, and Slack to automatically detect issues, analyze logs, identify root causes in source code, and post diagnostic summaries to development teams. While the proof-of-concept successfully demonstrated end-to-end incident response automation using a custom MCP client with optimizations like tool caching and filtering, the project raises important questions about effectiveness measurement, security boundaries, and cost optimization that require further research.
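
A minimal FastMCP server in the spirit of the agent's Kubernetes integration; the single tool below, which shells out to kubectl for pod logs, is an illustrative simplification:

```python
import subprocess

from fastmcp import FastMCP

mcp = FastMCP("sre-tools")

@mcp.tool()
def get_pod_logs(pod: str, namespace: str = "default", tail: int = 100) -> str:
    """Return the last `tail` log lines for a Kubernetes pod."""
    out = subprocess.run(
        ["kubectl", "logs", pod, "-n", namespace, f"--tail={tail}"],
        capture_output=True, text=True, timeout=30,
    )
    return out.stdout or out.stderr

if __name__ == "__main__":
    mcp.run()  # serves the tool over MCP (stdio transport by default)
```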

Background Coding Agents for Large-Scale Software Maintenance and Migrations

Spotify

Spotify faced challenges in scaling complex code transformations across thousands of repositories despite having a successful Fleet Management system that automated simple, repetitive maintenance tasks. The company integrated AI coding agents into their existing Fleet Management infrastructure, allowing engineers to define fleet-wide code changes using natural language prompts instead of writing complex transformation scripts. Since February 2025, this approach has generated over 1,500 merged pull requests handling complex tasks like language modernization, breaking-change upgrades, and UI component migrations, achieving 60-90% time savings compared to manual approaches while expanding the system's use to ad-hoc development tasks through IDE and chat integrations.

Benchmarking AI Agents for Software Bug Detection and Maintenance Tasks

Bismuth

Bismuth, a startup focused on software agents, developed SM-100, a comprehensive benchmark to evaluate AI agents' capabilities in software maintenance tasks, particularly bug detection and fixing. The benchmark revealed significant limitations in existing popular agents, with most achieving only 7% accuracy in finding complex bugs and exhibiting high false positive rates (90%+). While agents perform well on feature development benchmarks like SWE-bench, they struggle with real-world maintenance tasks that require deep system understanding, cross-file reasoning, and holistic code evaluation. Bismuth's own agent achieved better performance (10 out of 100 bugs found vs. 7 for the next best), demonstrating that targeted improvements in model architecture, prompting strategies, and navigation techniques can enhance bug detection capabilities in production software maintenance scenarios.

Blueprint for Scalable and Reliable Enterprise LLM Systems

Various

A panel discussion featuring leaders from Bank of America, NVIDIA, Microsoft, and IBM discussing best practices for deploying and scaling LLM systems in enterprise environments. The discussion covers key aspects of LLMOps including business alignment, production deployment, data management, monitoring, and responsible AI considerations. The panelists share insights on the evolution from traditional ML deployments to LLM systems, highlighting unique challenges around testing, governance, and the increasing importance of retrieval and agent-based architectures.

BM25 vs Vector Search for Large-Scale Code Repository Search

GitHub

GitHub faces the challenge of providing efficient search across 100+ billion documents while maintaining low latency and supporting diverse search use cases. They chose BM25 over vector search due to its computational efficiency, zero-shot capabilities, and ability to handle diverse query types. The solution involves careful optimization of search infrastructure, including strategic data routing and field-specific indexing approaches, resulting in a system that effectively serves GitHub's massive scale while keeping costs manageable.
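
Part of BM25's appeal at this scale is that scoring reduces to cheap term statistics with no embedding inference at query time. A toy illustration with the rank_bm25 package (the corpus is invented):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "def parse_config(path): ...",
    "class ConfigParser reads ini files",
    "HTTP client with retry and backoff",
]
# BM25 only needs tokenized documents; no model inference is involved.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "config parser".split()
print(bm25.get_scores(query))              # per-document BM25 scores
print(bm25.get_top_n(query, corpus, n=2))  # best-matching documents
```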

Building a Collaborative Multi-Agent AI Ecosystem for Enterprise Knowledge Access

DoorDash

DoorDash developed an internal agentic AI platform to address the challenge of fragmented knowledge spread across experimentation platforms, metrics hubs, dashboards, wikis, and team communications. The solution evolved from deterministic workflows through single agents to hierarchical deep agents and exploratory agent swarms, built on foundational capabilities including hybrid vector search with RRF-based re-ranking, schema-aware SQL generation with pre-cached examples, multi-stage zero-data query validation, and LLM-as-judge evaluation frameworks. The platform integrates with Slack and Cursor to meet users in their existing workflows, enabling business teams and developers to access complex data and insights without context-switching, democratizing data access across the organization while maintaining rigorous guardrails and provenance tracking.
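
Reciprocal Rank Fusion itself is compact enough to show inline: each ranked list contributes 1/(k + rank) per document, and documents are re-ranked by the sum. A sketch, with k = 60 as the conventional default rather than a DoorDash-confirmed value:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings into one by summed reciprocal-rank scores."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a keyword ranking with a vector-similarity ranking:
fused = rrf_fuse([
    ["doc_a", "doc_b", "doc_c"],  # BM25 / keyword results
    ["doc_c", "doc_a", "doc_d"],  # embedding-similarity results
])
print(fused)  # documents appearing high in both lists rise to the top
```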

Building a Comprehensive LLM Platform for Healthcare Applications

IncludedHealth

IncludedHealth built Wordsmith, a comprehensive platform for GenAI applications in healthcare, starting in early 2023. The platform includes a proxy service for multi-provider LLM access, model serving capabilities, training and evaluation libraries, and prompt engineering tools. This enabled multiple production applications including automated documentation, coverage checking, and clinical documentation, while maintaining security and compliance in a regulated healthcare environment.

Building a Customer Support AI Assistant: From PoC to Production

Elastic

Elastic's Field Engineering team developed a generative AI solution to improve customer support operations by automating case summaries and drafting initial replies. Starting with a proof of concept using Google Cloud's Vertex AI, they achieved a 15.67% positive response rate, leading them to identify the need for better input refinement and knowledge integration. This resulted in a decision to develop a unified chat interface with RAG architecture leveraging Elasticsearch for improved accuracy and response relevance.

Building a Global Product Catalogue with Multimodal LLMs at Scale

Shopify

Shopify addressed the challenge of fragmented product data across millions of merchants by building a Global Catalogue using multimodal LLMs to standardize and enrich billions of product listings. The system processes over 10 million product updates daily through a four-layer architecture involving product data foundation, understanding, matching, and reconciliation. By fine-tuning open-source vision language models and implementing selective field extraction, they achieve 40 million LLM inferences daily with 500ms median latency while reducing GPU usage by 40%. The solution enables improved search, recommendations, and conversational commerce experiences across Shopify's ecosystem.

Building a Hyper-Personalized Food Ordering Agent for E-commerce at Scale

iFood

iFood, Brazil's largest food delivery platform with 160 million monthly orders and 55 million users, built ISO, an AI agent designed to address the paradox of choice users face when ordering food. The agent uses hyper-personalization based on user behavior, interprets complex natural language intents, and autonomously takes actions like applying coupons, managing carts, and processing payments. Deployed on both the iFood app and WhatsApp, ISO handles millions of users while maintaining sub-10 second P95 latency through aggressive prompt optimization, context window management, and intelligent tool routing. The team achieved this by moving from a 30-second to a 10-second P95 latency through techniques including asynchronous processing, English-only prompts to avoid tokenization penalties, and deflating bloated system prompts by improving tool naming conventions.

Building a Knowledge as a Service Platform with LLMs and Developer Community Data

Stack Overflow

Stack Overflow addresses the challenges of LLM brain drain, answer quality, and trust by transforming their extensive developer Q&A platform into a Knowledge as a Service offering. They've developed API partnerships with major AI companies like Google, OpenAI, and GitHub, integrating their 40 billion tokens of curated technical content to improve LLM accuracy by up to 20%. Their approach combines AI capabilities with human expertise while maintaining social responsibility and proper attribution.

Building a Multi-Agent Healthcare Analytics Assistant with LLM-Powered Natural Language Queries

Komodo Health

Komodo Health, a company with a large database of anonymized American patient medical events, developed an AI assistant over two years to answer complex healthcare analytics queries through natural language. The system evolved from a simple chaining architecture with fine-tuned models to a sophisticated multi-agent system using a supervisor pattern, where an intelligent agent-based supervisor routes queries to either deterministic workflows or sub-agents as needed. The architecture prioritizes trust by ensuring raw database outputs are presented directly to users rather than LLM-generated content, with LLMs primarily handling natural language to structured query conversion and explanations. The production system balances autonomous AI capabilities with control, avoiding the cost and latency issues of pure agentic approaches while maintaining flexibility for unexpected user queries.

Building a Multi-Agent LLM Platform for Customer Service Automation

Deutsche Telekom

Deutsche Telekom developed a comprehensive multi-agent LLM platform to automate customer service across multiple European countries and channels. They built their own agent computing platform called LMOS to manage agent lifecycles, routing, and deployment, moving away from traditional chatbot approaches. The platform successfully handled over 1 million customer queries with an 89% acceptable answer rate and showed 38% better performance compared to vendor solutions in A/B testing.

Building a Multi-Agent Research System for Complex Information Tasks

Anthropic

Anthropic developed a production multi-agent system for their Claude Research feature that uses multiple specialized AI agents working in parallel to conduct complex research tasks across web and enterprise sources. The system employs an orchestrator-worker architecture where a lead agent coordinates and delegates to specialized subagents that operate simultaneously, achieving 90.2% performance improvement over single-agent systems on internal evaluations. The implementation required sophisticated prompt engineering, robust evaluation frameworks, and careful production engineering to handle the stateful, non-deterministic nature of multi-agent interactions at scale.

Building a Multi-Model AI Platform and Agent Marketplace

Quora

Quora built Poe as a unified platform providing consumer access to multiple large language models and AI agents through a single interface and subscription. Starting with experiments using GPT-3 for answer generation on Quora, the company recognized the paradigm shift toward chat-based AI interactions and developed Poe to serve as a "web browser for AI" - enabling users to access diverse models, create custom agents through prompting or server integrations, and monetize AI applications. The platform has achieved significant scale with creators earning millions annually while supporting various modalities including text, image, and voice models.

Building a Multi-Model LLM API Marketplace and Infrastructure Platform

OpenRouter

OpenRouter was founded in early 2023 to address the fragmented landscape of large language models by creating a unified API marketplace that aggregates over 400 models from 60+ providers. The company identified that the LLM inference market would not be winner-take-all, and built infrastructure to normalize different model APIs, provide intelligent routing, caching, and uptime guarantees. Their platform enables developers to switch between models with near-zero switching costs while providing better prices, uptime, and choice compared to using individual model providers directly.

Building a Multi-Model LLM Marketplace and Routing Platform

OpenRouter

OpenRouter was founded in 2023 to address the challenge of choosing between rapidly proliferating language models by creating a unified API marketplace that aggregates over 400 models from 60+ providers. The platform solves the problem of model selection, provider heterogeneity, and high switching costs by providing normalized access, intelligent routing, caching, and real-time performance monitoring. Results include 10-100% month-over-month growth, sub-30ms latency, improved uptime through provider aggregation, and evidence that the AI inference market is becoming multi-model rather than winner-take-all.
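
Because OpenRouter exposes an OpenAI-compatible endpoint, switching models is mostly a matter of changing the model string; the model ids below are examples from the public catalog and may change:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter API key
)

# Same client code, different providers: the normalization is the product.
for model in ["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "One-line summary of RAG?"}],
    )
    print(model, "->", resp.choices[0].message.content)
```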

Building a Multi-Provider GenAI Gateway for Enterprise-Scale LLM Access

Grab

Grab developed an AI Gateway to provide centralized, secure access to multiple GenAI providers (including OpenAI, Azure, AWS Bedrock, and Google VertexAI) for their internal developers. The gateway handles authentication, cost management, auditing, and rate limiting while providing a unified API interface. Since its launch in 2023, it has enabled over 300 unique use cases across the organization, from real-time audio analysis to content moderation, while maintaining security and cost efficiency through centralized management.

Building a Next-Generation AI-Powered Code Editor

Cursor

Cursor, founded by MIT graduates, developed an AI-powered code editor that goes beyond simple code completion to reimagine how developers interact with AI while coding. By focusing on innovative features like instructed edits and codebase indexing, along with developing custom models for specific tasks, they achieved rapid growth to $100M in revenue. Their success demonstrates how combining frontier LLMs with custom-trained models and careful UX design can transform developer productivity.

Building a Production RAG-based Customer Support Assistant with Elasticsearch

Elastic

Elastic's Field Engineering team developed a customer support chatbot using RAG instead of fine-tuning, leveraging Elasticsearch for document storage and retrieval. They created a knowledge library of over 300,000 documents from technical support articles, product documentation, and blogs, enriched with AI-generated summaries and embeddings using ELSER. The system uses hybrid search combining semantic and BM25 approaches to provide relevant context to the LLM, resulting in more accurate and trustworthy responses.
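
The hybrid query can be sketched as one bool query combining an ELSER text_expansion clause (semantic) with a BM25 match clause (keyword); the index, field names, and model id below are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def hybrid_search(question: str, size: int = 5):
    return es.search(
        index="support-knowledge",
        size=size,
        query={
            "bool": {
                "should": [
                    {   # semantic relevance via ELSER sparse embeddings
                        "text_expansion": {
                            "ml.tokens": {
                                "model_id": ".elser_model_2",
                                "model_text": question,
                            }
                        }
                    },
                    {   # classic BM25 keyword relevance
                        "match": {"body": question}
                    },
                ]
            }
        },
    )

hits = hybrid_search("How do I resize an Elasticsearch cluster?")
```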

Building a Production RAG-Based Slackbot for Developer Support

Vespa

Vespa developed an intelligent Slackbot to handle increasing support queries in their community Slack channel. The solution combines RAG (Retrieval-Augmented Generation) with Vespa's search capabilities and OpenAI, leveraging both past conversations and documentation. The bot features user consent management, feedback mechanisms, and automated user anonymization, while continuously learning from new interactions to improve response quality.

Building a Production-Ready AI Phone Call Assistant with Multi-Modal Processing

RealChar

RealChar is developing an AI assistant that can handle customer service phone calls on behalf of users, addressing the frustration of long wait times and tedious interactions. The system uses a complex architecture combining traditional ML and generative AI, running multiple models in parallel through an event bus system, with fallback mechanisms for reliability. The solution draws inspiration from self-driving car systems, implementing real-time processing of multiple input streams and maintaining millisecond-level observability.

Building a Scalable LLM Gateway for E-commerce Recommendations

Mercado Libre

Mercado Libre developed a centralized LLM gateway to handle large-scale generative AI deployments across their organization. The gateway manages multiple LLM providers, handles security, monitoring, and billing, while supporting 50,000+ employees. A key implementation was a product recommendation system that uses LLMs to generate personalized recommendations based on user interactions, supporting multiple languages across Latin America.

Building a Search Engine for AI Agents: Infrastructure, Product Development, and Production Deployment

Exa.ai

Exa.ai has built the first search engine specifically designed for AI agents rather than human users, addressing the fundamental problem that existing search engines like Google are optimized for consumer clicks and keyword-based queries rather than semantic understanding and agent workflows. The company trained its own models, built its own index, and invested heavily in compute infrastructure (including purchasing their own GPU cluster) to enable meaning-based search that returns raw, primary data sources rather than listicles or summaries. Their solution includes both an API for developers building AI applications and an agentic search tool called Websets that can find and enrich complex, multi-criteria queries. The results include serving hundreds of millions of queries across use cases like sales intelligence, recruiting, market research, and research paper discovery, with 95% inbound growth and expanding from 7 to 28+ employees within a year.

Building a Systematic SNAP Benefits LLM Evaluation Framework

Propel

Propel is developing a comprehensive evaluation framework for testing how well different LLMs handle SNAP (food stamps) benefit-related queries. The project aims to assess model accuracy, safety, and appropriateness in handling complex policy questions while balancing strict accuracy with practical user needs. They've built a testing infrastructure including a Slackbot called Hydra for comparing multiple LLM outputs, and plan to release their evaluation framework publicly to help improve AI models' performance on SNAP-related tasks.

Building AI-Assisted Development at Scale with Cursor

Cursor

This case study explores how Cursor's solutions team has observed enterprise companies successfully deploying AI-assisted coding in production environments. The problem addressed is helping developers leverage LLMs effectively for coding tasks while avoiding common pitfalls like context window bloat, over-reliance on AI, and hallucinations. The solution involves teaching developers to break down problems into appropriately-sized tasks, maintain clean context windows, use semantic search for brownfield codebases, and build deterministic harnesses around non-deterministic LLM outputs. Results include significant productivity gains when developers learn proper prompt engineering, context management, and maintain responsibility for AI-generated code, with specific improvements like bench scores jumping from 45% to 65% through harness optimization.

Building AI-Native Platforms: Agentic Systems, Infrastructure Evolution, and Production LLM Deployment

Delphi / Seam AI / APIsec

This panel discussion features three AI-native companies—Delphi (personal AI profiles), Seam AI (sales/marketing automation agents), and APIsec (API security testing)—discussing their journeys building production LLM systems over three years. The companies address infrastructure evolution from single-shot prompting to fully agentic systems, the shift toward serverless and scalable architectures, managing costs at scale (including burning through a trillion OpenAI tokens), balancing deterministic workflows with model autonomy, and measuring ROI through outcome-based metrics rather than traditional productivity gains. Key technical themes include moving away from opinionated architectures to let models reason autonomously, implementing state machines for high-confidence decisions, using tools like Pydantic AI and Logfire for instrumentation, and leveraging Pinecone for vector search at scale.

Building Alfred: Production-Ready Agentic Orchestration Layer for E-commerce

Loblaws

Loblaws Digital, the technology arm of one of Canada's largest retail companies, developed Alfred—a production-ready orchestration layer for running agentic AI workflows across their e-commerce, pharmacy, and loyalty platforms. The system addresses the challenge of moving agent prototypes into production at enterprise scale by providing a reusable template-based architecture built on LangGraph, FastAPI, and Google Cloud Platform components. Alfred enables teams across the organization to quickly deploy conversational commerce applications and agentic workflows (such as recipe-based shopping) while handling critical enterprise requirements including security, privacy, PII masking, observability, and integration with 50+ platform APIs through their Model Context Protocol (MCP) ecosystem.

Building an Agentic DevOps Copilot for Infrastructure Automation

Qovery

Qovery developed an agentic DevOps copilot to automate infrastructure tasks and eliminate repetitive DevOps work. The solution evolved through four phases: from basic intent-to-tool mapping, to a dynamic agentic system that plans tool sequences, then adding resilience and recovery mechanisms, and finally incorporating conversation memory. The copilot now handles complex multi-step workflows like deployments, infrastructure optimization, and configuration management, currently using Claude Sonnet 3.7 with plans for self-hosted models and improved performance.

Building an Agentic Enterprise with AI Agents in Production

Salesforce

Salesforce transformed itself into what it calls an "agentic enterprise" by deploying AI agents (branded as Agentforce) across sales, service, and marketing operations to address capacity constraints where demand exceeded headcount. The company deployed agents that autonomously handled over 2 million customer service conversations, followed up with previously untouched leads (75% of total leads), and provided 24/7 multilingual support. Key results included over $100 million in annualized cost savings from the service agent implementation, increased lead engagement leading to new revenue opportunities, and the ability to scale operations without proportional headcount increases. The initiative required significant iteration, data unification through their Data 360 platform, continuous testing and tuning of agent performance, cross-functional collaboration breaking down traditional departmental silos, and process redesigns to enable human-AI collaboration.

Building an AI API Gateway for Streamlined GenAI Service Development

DeliveryHero

DeliveryHero's Woowa Brothers division developed an AI API Gateway to address the challenges of managing multiple GenAI providers and streamlining development processes. The gateway serves as a central infrastructure component to handle credential management, prompt management, and system stability while supporting various GenAI services like AWS Bedrock, Azure OpenAI, and GCP Imagen. The initiative was driven by extensive user interviews and aims to democratize AI usage across the organization while maintaining security and efficiency.

Building an AI Co-pilot for Product Strategy with LLM Integration Patterns

Thoughtworks

Thoughtworks built Boba, an experimental AI co-pilot for product strategy and ideation, to explore effective patterns for LLM-powered applications beyond simple chat interfaces. The team developed and documented key patterns including templated prompts, structured responses, real-time progress streaming, context management, and external knowledge integration. The case study provides detailed implementation insights for building sophisticated LLM applications with better user experiences.

Building an AI Legal Assistant: From Early Testing to Production Deployment

Casetext

Casetext transformed their legal research platform into an AI-powered legal assistant called CoCounsel using GPT-4, leading to a $650M acquisition by Thomson Reuters. The company shifted their entire 120-person team to focus on building this AI assistant after early access to GPT-4 showed promising results. Through rigorous testing, prompt engineering, and a test-driven development approach, they created a reliable AI system that could perform complex legal tasks like document review and research that previously took lawyers days to complete. The product achieved rapid market acceptance and true product-market fit within months of launch.

Building an AI-Generated Movie Quiz Game with RAG and Real-Time Multiplayer

Datastax

Datastax developed UnReel, a multiplayer movie trivia game that combines AI-generated questions with real-time gaming. The system uses RAG to generate movie-related questions and fake movie quotes, implemented through Langflow, with data storage in Astra DB and real-time multiplayer functionality via PartyKit. The project demonstrates practical challenges in production AI deployment, particularly in fine-tuning LLM outputs for believable content generation and managing distributed system state.

Building an AI-Native Code Editor in a Competitive Market

Cursor

Cursor, an AI-powered code editor startup, entered an extremely competitive market dominated by Microsoft's GitHub Copilot and well-funded competitors like Poolside, Augment, and Magic.dev. Despite initial skepticism from advisors about competing against Microsoft's vast resources and distribution, Cursor succeeded by focusing on the right short-term product decisions—specifically deep IDE integration through forking VS Code and delivering immediate value through "Cursor Tab" code completion. The company differentiated itself through rapid iteration, concentrated talent, bottom-up adoption among developers, and eventually building their own fast agent models. Cursor demonstrated that startups can compete against tech giants by moving quickly, dog-fooding their own product, and correctly identifying what developers need in the near term rather than betting solely on long-term agent capabilities.

Building an AI-Powered Browser Extension for Product Documentation with RAG and Chain-of-Thought

Reforge

Reforge developed a browser extension to help product professionals draft and improve documents like PRDs by integrating expert knowledge directly into their workflow. The team evolved from simple RAG (retrieval-augmented generation) to a sophisticated Chain-of-Thought approach that classifies document types, generates tailored suggestions, and filters content based on context. Operating with a lean team of 2-3 people, they built the extension through rapid prototyping and iterative development, integrating into popular tools like Google Docs, Notion, and Confluence. The extension uses OpenAI models with Pinecone for vector storage, emphasizing privacy by not storing user data, and leverages innovative testing approaches like analyzing course recommendation distributions and reference counts to optimize model performance without accessing user content.

Building an AI-Powered Help Desk with RAG and Model Evaluation

Vimeo

Vimeo developed a prototype AI help desk chat system that leverages RAG (Retrieval Augmented Generation) to provide accurate customer support responses using their existing Zendesk help center content. The system uses vector embeddings to store and retrieve relevant help articles, integrates with various LLM providers through Langchain, and includes comprehensive testing of different models (Google Vertex AI Chat Bison, GPT-3.5, GPT-4) for performance and cost optimization. The prototype demonstrates successful integration of modern LLMOps practices including prompt engineering, model evaluation, and production-ready architecture considerations.

Building an Enterprise AI Productivity Platform: From Slack Bot to Integrated AI Workforce

Toqan

Prosus developed Toqan, an internal AI productivity platform that evolved from a simple Slack bot to a comprehensive enterprise AI system serving 30,000+ employees across 100+ portfolio companies. The platform addresses the challenge of enterprise AI adoption by providing access to multiple LLMs through conversational interfaces, APIs, and system integrations, while measuring success through user engagement metrics like daily active users and "super users" who ask 5+ questions per day. The solution demonstrates how large organizations can systematically deploy AI tools across diverse business functions while maintaining security and enabling bottom-up adoption through hands-on training and cultural change management.

Building an Internal AI-Powered Customer Reference Discovery Platform

Databricks

Databricks faced a significant challenge in helping sales and marketing teams discover and utilize their vast collection of over 2,400 customer stories scattered across multiple platforms including YouTube, LinkedIn, internal documents, and their website. The tribal knowledge problem meant that finding the right customer reference at the right time was difficult, leading to overused references, missed opportunities, and inefficient manual searching. To solve this, they built Reffy—a full-stack agentic application using RAG (Retrieval-Augmented Generation), Vector Search, AI Functions, and Lakebase on the Databricks platform. Since its launch in December 2025, over 1,800 employees have executed more than 7,500 queries, resulting in faster campaign execution, more relevant storytelling, and democratized access to customer proof points that were previously siloed in tribal knowledge.

Building an Internal ChatGPT for Enterprise: From Failed Support Bot to Company-Wide AI Tool

Grab

Grab's ML Platform team was overwhelmed with support inquiries in Slack channels, prompting an engineer to experiment with building an LLM-powered chatbot for platform documentation. After the initial attempt failed due to token limitations and poor embedding search results, the project pivoted to creating GrabGPT—an internal ChatGPT-like tool for all employees. Deployed over a weekend with Google authentication and leveraging Grab's existing model-serving infrastructure (Catwalk), GrabGPT rapidly grew from 300 users on day one to becoming nearly universally adopted across the company, with over 3,000 users and 600 daily active users within three months. The success was attributed to data security controls, global accessibility (especially in regions where ChatGPT is blocked), model-agnostic architecture supporting multiple LLM providers, and full auditability for governance.

Building an Internal ChatGPT-like Tool for Enterprise-wide AI Access

Grab

Grab's ML Platform team faced overwhelming support channel inquiries that consumed engineering time with repetitive questions. An engineer initially attempted to build a RAG-based chatbot for platform documentation but encountered context window limitations with GPT-3.5-turbo and scalability issues. Pivoting from this failed experiment, the engineer built GrabGPT, an internal ChatGPT-like tool accessible to all employees, deployed over a weekend using existing frameworks and Grab's model-serving platform. The tool rapidly scaled to nearly company-wide adoption, with over 3,000 users within three months and 600 daily active users, providing secure, auditable, and globally accessible LLM capabilities across multiple providers' models, including OpenAI's GPT models, Anthropic's Claude, and Google's Gemini.

Building and Automating Comprehensive LLM Evaluation Framework for SNAP Benefits

Propel

Propel developed a sophisticated evaluation framework for testing and benchmarking LLM performance in handling SNAP (food stamp) benefit inquiries. The company created two distinct evaluation approaches: one for benchmarking current base models on SNAP topics, and another for product development. They implemented automated testing using Promptfoo and developed innovative ways to evaluate model responses, including using AI models as judges for assessing response quality and accessibility.

Building and Debugging Web Automation Agents with LangChain Ecosystem

Airtop

Airtop developed a web automation platform that enables AI agents to interact with websites through natural language commands. They leveraged the LangChain ecosystem (LangChain, LangSmith, and LangGraph) to build flexible agent architectures, integrate multiple LLM models, and implement robust debugging and testing processes. The platform successfully enables structured information extraction and real-time website interactions while maintaining reliability and scalability.

Building and Deploying Enterprise-Grade LLMs: Lessons from Mistral

Mistral

Mistral, a European AI company, evolved from developing academic LLMs to building and deploying enterprise-grade language models. They started with the successful launch of Mistral-7B in September 2023, which became one of the top 10 most downloaded models on Hugging Face. The company focuses not just on model development but on providing comprehensive solutions for enterprise deployment, including custom fine-tuning, on-premise deployment infrastructure, and efficient inference optimization. Their approach demonstrates the challenges and solutions in bringing LLMs from research to production at scale.

Building and Evaluating a Financial Earnings Call Summarization System

Aiera

Aiera, an investor intelligence platform, developed a system for automated summarization of earnings call transcripts. They created a custom dataset from their extensive collection of earnings call transcriptions, using Claude 3 Opus to extract targeted insights. The project involved comparing different evaluation metrics including ROUGE and BERTScore, ultimately finding Claude 3.5 Sonnet performed best for their specific use case. Their evaluation process revealed important insights about the trade-offs between different scoring methodologies and the challenges of evaluating generative AI outputs in production.

Building and Evaluating Production AI Agents: From Function Calling to Complex Multi-Agent Systems

Google Deepmind

This case study explores the evolution of LLM-based systems in production through discussions with Ravin Kumar from Google DeepMind about building products like NotebookLM and Project Mariner and about working with the Gemini and Gemma model families. The conversation covers the rapid progression from simple function calling to complex agentic systems capable of multi-step reasoning, the critical importance of evaluation harnesses as competitive advantages, and practical considerations around context engineering, tool orchestration, and model selection. Key insights include how model improvements are causing teams to repeatedly rebuild agent architectures, the importance of shipping products quickly to learn from real users, and strategies for evaluating increasingly complex multi-modal agentic systems across different scales from edge devices to cloud-based deployments.

Building and Managing Taxonomies for Effective AI Systems

Adobe

Adobe's Information Architect Jessica Talisman discusses how to build and maintain taxonomies for AI and search systems. The case study explores the challenges and best practices in creating taxonomies that bridge the gap between human understanding and machine processing, covering everything from metadata extraction to ontology development. The approach emphasizes the importance of human curation in AI systems and demonstrates how well-structured taxonomies can significantly improve search relevance, content categorization, and business operations.

Building and Scaling Conversational Voice AI Agents for Enterprise Go-to-Market

Thoughtly / Gladia

Thoughtly, a voice AI platform founded in late 2023, provides conversational AI agents for enterprise sales and customer support operations. The company orchestrates speech-to-text, large language models, and text-to-speech systems to handle millions of voice calls with sub-second latency requirements. By optimizing every layer of their stack—from telephony providers to LLM inference—and implementing sophisticated caching, conditional navigation, and evaluation frameworks, Thoughtly delivers 3x conversion rates over traditional methods and 15x ROI for customers. The platform serves enterprises with HIPAA and SOC 2 compliance while handling both inbound customer support and outbound lead activation at massive scale across multiple languages and regions.

Building and Scaling Internal Data Agents and AI-Powered Frontend Development Tools

Vercel

Vercel developed two significant production AI applications: DZ, an internal text-to-SQL data agent that enables employees to query Snowflake using natural language in Slack, and V0, a public-facing AI tool for generating full-stack web applications. The company initially built DZ as a traditional tool-based agent but completely rebuilt it as a coding-style agent with simplified architecture (just two tools: bash and SQL execution), dramatically improving performance by leveraging models' native coding capabilities. V0 evolved from a 2023 prototype targeting frontend engineers into a comprehensive full-stack development tool as models improved, finding strong product-market fit with tech-adjacent users and enabling significant internal productivity gains. Both products demonstrate Vercel's philosophy that building custom agents is straightforward and preferable to buying off-the-shelf solutions, with the company successfully deploying these AI systems at scale while maintaining reliability and supporting their core infrastructure business.

Building and Testing a Production LLM-Powered Quiz Application

Google

A case study of transforming a traditional trivia quiz application into an LLM-powered system using Google's Vertex AI platform. The team evolved from using static quiz data to leveraging PaLM and later Gemini models for dynamic quiz generation, addressing challenges in prompt engineering, validation, and testing. They achieved significant improvements in quiz accuracy from 70% with Gemini Pro to 91% with Gemini Ultra, while implementing robust validation methods using LLMs themselves to evaluate quiz quality.

Building Claude Code: Scaling AI-Powered Development from Terminal Prototype to Production

Anthropic

Anthropic's Boris Cherny, creator of Claude Code, describes the journey from an accidental terminal prototype in September 2024 to a production coding tool used by 70% of startups and responsible for 4% of all public commits globally. Starting as a simple API testing tool, Claude Code evolved through continuous user feedback and rapid iteration, with the entire codebase rewritten every few months to adapt to improving model capabilities. The tool achieved remarkable productivity gains at Anthropic itself, with engineers seeing 70% productivity increases per capita despite the team doubling in size, and total productivity improvements of 150% since launch. The development philosophy centered on building for future model capabilities rather than current ones, anticipating improvements 6 months ahead, and minimizing scaffolding that would become obsolete with each new model release.

Building Deep Research: A Production AI Research Assistant Agent

Google Deepmind

Google Deepmind developed Deep Research, a feature that acts as an AI research assistant using Gemini to help users learn about any topic in depth. The system takes a query, browses the web for about 5 minutes, and outputs a comprehensive research report that users can review and ask follow-up questions about. The system uses iterative planning, transparent research processes, and a sophisticated orchestration backend to manage long-running autonomous research tasks.

Building Economic Infrastructure for AI with Foundation Models and Agentic Commerce

Stripe

Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.

Building Enterprise-Grade GenAI Platform with Multi-Cloud Architecture

Coinbase

Coinbase developed CB-GPT, an enterprise GenAI platform, to address the challenges of deploying LLMs at scale across their organization. Initially focused on optimizing cost versus accuracy, they discovered that enterprise-grade LLM deployment requires solving for latency, availability, trust and safety, and adaptability to the rapidly evolving LLM landscape. Their solution was a multi-cloud, multi-LLM platform that provides unified access to models across AWS Bedrock, GCP VertexAI, and Azure, with built-in RAG capabilities, guardrails, semantic caching, and both API and no-code interfaces. The platform now serves dozens of internal use cases and powers customer-facing applications including a conversational chatbot launched in June 2024 serving all US consumers.

Building Enterprise-Ready AI Development Infrastructure from Day One

Windsurf

Codeium's journey in building their AI-powered development tools showcases how investing early in enterprise-ready infrastructure, including containerization, security, and comprehensive deployment options, enabled them to scale from individual developers to large enterprise customers. Their "go slow to go fast" approach in building proprietary infrastructure for code completion, retrieval, and agent-based development culminated in Windsurf IDE, demonstrating how thoughtful early architectural decisions can create a more robust foundation for AI tools in production.

Building Evaluation Frameworks for AI Product Managers: A Workshop on Production LLM Testing

Arize

This workshop, presented by Aman, an AI product manager at Arize, addresses the challenge of shipping reliable AI applications in production by establishing evaluation frameworks specifically designed for product managers. The problem identified is that LLMs inherently hallucinate and are non-deterministic, making traditional software testing approaches insufficient. The solution involves implementing "LLM as a judge" evaluation systems, building comprehensive datasets, running experiments with prompt variations, and establishing human-in-the-loop validation workflows. The approach demonstrates how product managers can move from "vibe coding" to "thrive coding" by using data-driven evaluation methods, prompt playgrounds, and continuous monitoring. Results show that systematic evaluation can catch issues like mismatched tone, missing features, and hallucinations before production deployment, though the workshop candidly acknowledges that evaluations themselves require validation and iteration.
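
As a rough illustration of the "LLM as a judge" pattern the workshop centers on, the sketch below grades an answer with a second model call. The judge model, rubric wording, and pass/fail protocol are assumptions for illustration, not Arize's implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Does the answer address the question without inventing facts?
Reply with exactly one word: pass or fail."""

def judge(question: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("pass")
```

As the workshop notes, a judge like this must itself be validated against human-labeled examples before it gates anything.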

Building Foundation Models for Computer Use Agents

Tzafon

Tzafon, a research lab focused on training foundation models for computer use agents, tackled the challenge of enabling LLMs to autonomously interact with computers through visual understanding and action execution. The company identified fundamental limitations in existing models' ability to ground visual information and coordinate actions, leading them to develop custom infrastructure (Waypoint) for data generation at scale, fine-tune vision encoders on screenshot data, and ultimately pre-train models from scratch with specialized computer interaction capabilities. While initial approaches using supervised fine-tuning and reinforcement learning on successful trajectories showed limited generalization, their focus on solving the grounding problem through improved vision-language integration and domain-specific pre-training has positioned them to release models and desktop applications for autonomous computer use, though performance on benchmarks like OS World remains a challenge across the industry.

Building Gemini Deep Research: An Agentic Research Assistant with Custom-Tuned Models

Google Deepmind

Google DeepMind developed Gemini Deep Research, an AI-powered research assistant that autonomously browses the web for 5-10 minutes to generate comprehensive research reports with citations. The product addresses the challenge of users wanting to go from "zero to 50" on new topics quickly, automating what would typically require opening dozens of browser tabs and hours of manual research. The team solved key technical challenges around agentic planning, transparent UX design with editable research plans, asynchronous orchestration, and post-training custom models (initially Gemini 1.5 Pro, moving toward 2.0 Flash) to reliably perform iterative web search and synthesis. The product launched in December 2024 and has been widely praised as potentially the most useful public-facing AI agent to date, with users reporting it can compress hours or days of research work into minutes.

Building Internal LLM Tools with Security and Privacy Focus

Wealthsimple

Wealthsimple developed an internal LLM Gateway and suite of generative AI tools to enable secure and privacy-preserving use of LLMs across their organization. The gateway includes features like PII redaction, multi-model support, and conversation checkpointing. They achieved significant adoption with over 50% of employees using the tools, primarily for programming support, content generation, and information retrieval. The platform also enabled operational improvements like automated customer support ticket triaging using self-hosted models.

Building ISO: A Hyperpersonalized AI Food Ordering Agent for Millions of Users

iFood

iFood, Brazil's largest food delivery company, built ISO, an AI-powered food ordering agent to address the decision paralysis users face when choosing what to eat from overwhelming options. The agent operates both within the iFood app and on WhatsApp, providing hyperpersonalized recommendations based on user behavior, handling complex intents beyond simple search, and autonomously taking actions like applying coupons, managing carts, and facilitating payments. Through careful context management, latency optimization (reducing P95 from 30 to 10 seconds), and sophisticated evaluation frameworks, the team deployed ISO to millions of users in Brazil, demonstrating significant improvements in user experience through proactive engagement and intelligent personalization.

Building Low-Latency Voice AI Agents for Home Services

Elyos AI

Elyos AI built end-to-end voice AI agents for home services companies (plumbers, electricians, HVAC installers) to handle customer calls, emails, and messages 24/7. The company faced challenges achieving human-like conversation latency (targeting sub-400ms response times) while maintaining reliability and accuracy for complex workflows including appointment booking, payment processing, and emergency dispatch. Through careful orchestration, they optimized speech-to-text, LLM, and text-to-speech components, implemented just-in-time context engineering, state machine-based workflows, and parallel monitoring streams to achieve consistent performance with approximately 85% call automation (15% requiring human involvement).

Building Modular and Scalable RAG Systems with Hybrid Batch/Incremental Processing

Bell

Bell developed a sophisticated hybrid RAG (Retrieval Augmented Generation) system combining batch and incremental processing to handle both static and dynamic knowledge bases. The solution addresses challenges in managing constantly changing documentation while maintaining system performance. They created a modular architecture using Apache Beam, Cloud Composer (Airflow), and GCP services, allowing for both scheduled batch updates and real-time document processing. The system has been successfully deployed for multiple use cases including HR policy queries and dynamic Confluence documentation management.
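
For readers unfamiliar with the batch half of such a hybrid pipeline, here is a minimal Apache Beam sketch of scheduled re-embedding of a static corpus; the bucket path, chunker, and embedding stub are placeholders rather than Bell's actual code.

```python
import apache_beam as beam

def chunk(doc: str) -> list[str]:
    # placeholder chunker: fixed-size character windows
    return [doc[i:i + 1000] for i in range(0, len(doc), 1000)]

def embed(text: str) -> tuple[str, list[float]]:
    # placeholder embedding call; swap in a real model client
    return (text, [0.0] * 768)

with beam.Pipeline() as p:
    (p
     | "ReadDocs" >> beam.io.ReadFromText("gs://kb-bucket/docs/*.txt")  # assumed path
     | "Chunk" >> beam.FlatMap(chunk)
     | "Embed" >> beam.Map(embed)
     | "Sink" >> beam.Map(print))  # stand-in for a vector-store writer
```

The incremental path would reuse the same transforms but trigger per document event (e.g., a Confluence page update) instead of per scheduled run.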

Building Multi-Agent Systems with MCP and Pydantic AI for Document Processing

Deepsense

Deepsense AI built a multi-agent system for a customer who operates a document processing platform that handles various file types and data sources at scale. The problem was to create both an MCP (Model Context Protocol) server for the platform's internal capabilities and a demonstration multi-agent system that could structure data on demand from documents. Using Pydantic AI as the core agent framework and Anthropic's Claude models, the team developed a solution where users specify goals for document processing, and the system automatically extracts structured information into tables. The implementation involved creating custom MCP servers, integrating with Databricks MCP, and applying 10 key lessons learned around tool design, token optimization, model selection, observability, testing, and security. The result was a modular, scalable system that demonstrates practical patterns for building production-ready agentic applications.

Building Open-Source RL Environments from Real-World Coding Tasks for Model Training

Cline

Cline's head of AI presents their experience operating a model-agnostic AI coding agent platform, arguing that the industry has over-invested in "clever scaffolding" like RAG and tool-calling frameworks when frontier models can succeed with simpler approaches. The real bottleneck to progress, they contend, isn't prompt engineering or agent architecture but rather the quality of benchmarks and RL environments used to train models. Cline developed an automated "RL environments factory" system that transforms real-world coding tasks captured from actual user interactions into standardized, containerized training environments. They announce Cline Bench, an open-source benchmark derived from genuine software development work, inviting the community to contribute by simply working on open-source projects with Cline and opting into the initiative, thereby creating a shared substrate for improving frontier models.

Building Personalized Financial and Gardening Experiences with LLMs

Bud Financial / Scotts Miracle-Gro

This case study explores how Bud Financial and Scotts Miracle-Gro leverage Google Cloud's AI capabilities to create personalized customer experiences. Bud Financial developed a conversational AI solution for personalized banking interactions, while Scotts Miracle-Gro implemented an AI assistant called MyScotty for gardening advice and product recommendations. Both companies utilize various Google Cloud services including Vertex AI, GKE, and AI Search to deliver contextual, regulated, and accurate responses to their customers.

Building Production AI Agents for E-commerce and Food Delivery at Scale

Prosus

This case study explores how Prosus builds and deploys AI agents across e-commerce and food delivery businesses serving two billion customers globally. The discussion covers critical lessons learned from deploying conversational agents in production, with a particular focus on context engineering as the most important factor for success—more so than model selection or prompt engineering alone. The team found that successful production deployments require hybrid approaches combining semantic and keyword search, generative UI experiences that mix chat with dynamic visual components, and sophisticated evaluation frameworks. They emphasize that technology has advanced faster than user adoption, leading to failures when pure chatbot interfaces were tested, and success only came through careful UI/UX design, contextual interventions, and extensive testing with both synthetic and real user data.

Building Production AI Agents for Enterprise HR, IT, and Finance Platform

Rippling

Rippling, an enterprise platform providing HR, payroll, IT, and finance solutions, has evolved its AI strategy from simple content summarization to building complex production agents that assist administrators and employees across their entire platform. Led by Anker, their head of AI, the company has developed agents that handle payroll troubleshooting, sales briefing automation, interview transcript summarization, and talent performance calibration. They've transitioned from deterministic workflow-based approaches to more flexible deep agent paradigms, leveraging LangChain and LangSmith for development and tracing. The company maintains a dual focus: embedding AI capabilities within their product for customers running businesses on their platform, and deploying AI internally to increase productivity across all teams. Early results show promise in handling complex, context-dependent queries that traditional rule-based systems couldn't address.

Building Production AI Agents with Advanced Testing, Voice Architecture, and Multi-Model Orchestration

Sierra

Sierra, an AI agent platform company, discusses their comprehensive approach to deploying LLMs in production for customer service automation across voice and chat channels. The company addresses fundamental challenges in productionizing AI agents including non-deterministic behavior, latency requirements, and quality assurance through novel solutions like simulation-based testing that runs thousands of parallel test scenarios, speculative execution for voice latency optimization, and constellation-based multi-model orchestration where 10-20 different models handle various aspects of each conversation. Their outcome-based pricing model aligns incentives with customer success, while their hybrid no-code/code platform enables both business and technical teams to collaboratively build, test, and deploy agents. The platform serves large enterprise customers across multiple industries, with agents handling millions of customer interactions in production environments.

Building Production AI Coding Assistants and Agents at Scale

Sourcegraph

Sourcegraph's CTO discusses the evolution from their code search engine to building Cody, an enterprise AI coding assistant, and AMP, a coding agent released in 2024. The company serves hundreds of Fortune 500 companies and government agencies, deploying LLM-powered tools that achieve 30-60% developer productivity gains. Their approach emphasizes multi-model architectures, rapid iteration without traditional code review processes, and building application scaffolds around frontier models to generate training data for next-generation systems. The discussion explores the transition from chat-based LLM applications (requiring sophisticated RAG systems) to agentic architectures (using simple tool-calling loops), the challenges of scaling in enterprise environments, and philosophical debates about whether pure model scaling will lead to AGI or whether alternating between application development and model training is necessary for continued progress.

Building Production AI Products: A Framework for Continuous Calibration and Development

OpenAI / Various

AI practitioners Aishwarya Raanti and Kiti Bottom, who have collectively supported over 50 AI product deployments across major tech companies and enterprises, present their framework for successfully building AI products in production. They identify that building AI products differs fundamentally from traditional software due to non-determinism on both input and output sides, and the agency-control tradeoff inherent in autonomous systems. Their solution involves a phased approach called Continuous Calibration Continuous Development (CCCD), which recommends starting with high human control and low AI agency, then gradually increasing autonomy as trust is built through behavior calibration. This iterative methodology, combined with a balanced approach to evaluation metrics and production monitoring, has helped companies avoid common pitfalls like premature full automation, inadequate reliability, and user trust erosion.

Building Production LLM Applications with DSPy Framework

AlixPartners

A technical consultant presents a comprehensive workshop on using DSPy, a declarative framework for building modular LLM-powered applications in production. The presenter demonstrates how DSPy enables rapid iteration on LLM applications by treating LLMs as first-class citizens in Python programs, with built-in support for structured outputs, type guarantees, tool calling, and automatic prompt optimization. Through multiple real-world use cases including document classification, contract analysis, time entry correction, and multi-modal processing, the workshop shows how DSPy's core primitives—signatures, modules, tools, adapters, optimizers, and metrics—allow teams to build production-ready systems that are transferable across models, optimizable without fine-tuning, and maintainable at scale.
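
To make the signature/module primitives concrete, here is a short DSPy sketch of a document-classification program in the spirit of the workshop's first use case; the field names and model string are assumptions.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # assumed model

class ClassifyDoc(dspy.Signature):
    """Classify a document into one of the given categories."""
    document: str = dspy.InputField()
    categories: list[str] = dspy.InputField()
    label: str = dspy.OutputField(desc="exactly one of the categories")

# a module compiles the signature into prompts with structured outputs
classify = dspy.Predict(ClassifyDoc)
result = classify(document="Invoice #4021, net 30 days ...",
                  categories=["invoice", "contract", "memo"])
print(result.label)
```

Because the program is declared rather than hand-prompted, swapping models or attaching an optimizer leaves this code unchanged, which is the transferability argument the workshop makes.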

Building Production Multi-Agent Research Systems with Claude

Anthropic

Anthropic developed a production-grade multi-agent research system for their Claude Research feature that uses multiple LLM agents working in parallel to explore complex topics across web, Google Workspace, and integrated data sources. The system employs an orchestrator-worker pattern where a lead agent coordinates specialized subagents that search and filter information simultaneously, addressing challenges in agent coordination, evaluation, and reliability. Internal evaluations showed the multi-agent approach with Claude Opus 4 and Sonnet 4 outperformed single-agent Claude Opus 4 by 90.2% on research tasks, with token usage explaining 80% of performance variance, though the architecture consumes approximately 15× more tokens than standard chat interactions, requiring careful consideration of economic viability and deployment strategies.
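
The orchestrator-worker pattern can be sketched in a few lines of asyncio: a lead agent fans subtasks out to parallel subagents and synthesizes their findings. The subtask decomposition and placeholder model call below are illustrative, not Anthropic's code.

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    # placeholder: a real subagent would call Claude with a focused
    # search-and-filter prompt and its own tools
    await asyncio.sleep(0.1)
    return f"findings for: {subtask}"

async def lead_agent(topic: str) -> str:
    # the lead agent decomposes the topic and fans workers out in parallel
    subtasks = [f"{topic}: background", f"{topic}: recent work", f"{topic}: critiques"]
    findings = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    return "\n".join(findings)  # a real lead agent would synthesize, not join

print(asyncio.run(lead_agent("agent evaluation")))
```

The token-economics caveat in the case study follows directly from this shape: every parallel worker carries its own context, so cost scales with fan-out.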

Building Production-Grade Agentic AI Analytics: Lessons from Real-World Deployment

Tellius

Tellius shares hard-won lessons from building their agentic analytics platform that transforms natural language questions into trustworthy SQL-based insights. The core problem addressed is that chat-based analytics requires far more than simple text-to-SQL conversion—it demands deterministic planning, governed semantic layers, ambiguity management, multi-step consistency, transparency, performance engineering, and comprehensive observability. Their solution architecture separates language understanding from execution through typed plan artifacts that validate against schemas and policies before execution, implements clarification workflows for ambiguous queries, maintains plan/result fingerprinting for consistency, provides inline transparency with preambles and lineage, enforces latency budgets across execution hops, and treats feedback as governed policy changes. The result is a production system that achieves determinism, explainability, and sub-second interactive performance while avoiding the common pitfalls that cause 95% of AI pilot failures.

Building Production-Grade AI Agents with Guardrails, Context Management, and Security

Portia / Riff / Okta

This panel discussion features founders from Portia AI and Riff (formerly Databutton) discussing the challenges of moving AI agents from proof-of-concept to production. The speakers address critical production concerns including guardrails for agent reliability, context engineering strategies, security and access control challenges, human-in-the-loop patterns, and identity management. They share real-world customer examples ranging from custom furniture makers to enterprise CRM enrichment, emphasizing that while approximately 40% of companies experimenting with AI have agents in production, the journey requires careful attention to trust, security, and supportability. Key solutions include conditional example-based prompting, sandboxed execution environments, role-based access controls, and keeping context windows smaller for better precision rather than utilizing maximum context lengths.

Building Production-Ready AI Agent Systems: Multi-Agent Orchestration and LLMOps at Scale

Galileo / Crew AI

This podcast discussion between Galileo and Crew AI leadership explores the challenges and solutions for deploying AI agents in production environments at enterprise scale. The conversation covers the technical complexities of multi-agent systems, the need for robust evaluation and observability frameworks, and the emergence of new LLMOps practices specifically designed for non-deterministic agent workflows. Key topics include authentication protocols, custom evaluation metrics, governance frameworks for regulated industries, and the democratization of agent development through no-code platforms.

Building Production-Ready AI Agents for Internal Workflow Automation

Vercel

Vercel, a web hosting and deployment platform, addressed the challenge of identifying and implementing successful AI agent projects across their organization by focusing on employee pain points—specifically repetitive, boring tasks that humans disliked. The company deployed three internal production agents: a lead processing agent that automated sales qualification and research (saving hundreds of days of manual work), an anti-abuse agent that accelerated content moderation decisions by 59%, and a data analyst agent that automated SQL query generation for business intelligence. Their methodology centered on asking employees "What do you hate most about your job?" to identify tasks that were repetitive enough for current AI models to handle reliably while still delivering high business impact.

Building Production-Ready Customer Support AI Agents: Challenges and Solutions

Gradient Labs

Gradient Labs shares their experience building and deploying AI agents for customer support automation in production. While prototyping with LLMs is relatively straightforward, deploying agents to production introduces complex challenges around state management, knowledge integration, tool usage, and handling race conditions. The company developed a state machine-based architecture with durable execution engines to manage these challenges, successfully handling hundreds of conversations per day with high customer satisfaction.
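
A toy version of the state-machine idea: make the conversation's legal transitions explicit so the agent cannot wander into undefined states. The states and transition table here are invented for illustration, not Gradient Labs' actual design.

```python
from enum import Enum, auto

class State(Enum):
    AWAITING_USER = auto()
    RETRIEVING_KNOWLEDGE = auto()
    DRAFTING_REPLY = auto()
    ESCALATED = auto()
    RESOLVED = auto()

# which moves the agent is allowed to make from each state
TRANSITIONS = {
    State.AWAITING_USER: {State.RETRIEVING_KNOWLEDGE, State.ESCALATED},
    State.RETRIEVING_KNOWLEDGE: {State.DRAFTING_REPLY, State.ESCALATED},
    State.DRAFTING_REPLY: {State.AWAITING_USER, State.RESOLVED, State.ESCALATED},
}

def transition(current: State, nxt: State) -> State:
    # illegal moves fail loudly instead of silently corrupting the conversation
    if nxt not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

A durable execution engine then persists the current state, which is what makes race conditions (e.g., a customer replying mid-draft) tractable.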

Building Reliable AI Agent Systems with Effect TypeScript Framework

14.ai

14.ai, an AI-native customer support platform, uses Effect, a TypeScript framework, to manage the complexity of building reliable LLM-powered agent systems that interact directly with end users. The company built a comprehensive architecture using Effect across their entire stack to handle unreliable APIs, non-deterministic model outputs, and complex workflows through strong type guarantees, dependency injection, retry mechanisms, and structured error handling. Their approach enables reliable agent orchestration with fallback strategies between LLM providers, real-time streaming capabilities, and comprehensive testing through dependency injection, resulting in more predictable and resilient AI systems.

Building Reliable AI Agents Through Production Monitoring and Intent Discovery

Raindrop

Raindrop, a monitoring platform for AI products, addresses the challenge of building reliable AI agents in production where traditional offline evaluations fail to capture real-world usage patterns. The company developed a "Sentry for AI products" approach that emphasizes experimentation, production monitoring, and discovering user intents through clustering and signal detection. Their solution combines explicit signals (like thumbs up/down, regenerations) and implicit signals (detecting refusals, task failures, user frustration) to identify issues that don't manifest as traditional software errors. The platform trains custom models to detect issues across production data at scale, enabling teams to discover unknown problems, track their impact on users, and fix them systematically without breaking existing functionality.

Building Reliable AI DevOps Agents: Engineering Practices for Nondeterministic LLM Output

Trunk

Trunk developed an AI DevOps agent to handle root cause analysis (RCA) for test failures in CI pipelines, facing challenges with nondeterministic LLM outputs. They applied traditional software engineering principles adapted for LLMs, including starting with narrow use cases, switching between models (Claude to Gemini) for better tool calling, implementing comprehensive testing with mocked LLM responses, and establishing feedback loops through internal usage and user feedback collection. The approach resulted in a more reliable agent that performs well on specific tasks like analyzing test failures and posting summaries to GitHub PRs.
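
The mocked-response testing approach can look as simple as the hypothetical pytest sketch below, where the agent's post-processing logic is exercised against canned model output so the test stays deterministic; the function names are illustrative.

```python
from unittest.mock import MagicMock

def summarize_failure(llm, log: str) -> str:
    # the agent keeps only the headline of the model's root-cause analysis
    raw = llm.complete(f"Root-cause this CI failure:\n{log}")
    return raw.strip().splitlines()[0]

def test_summarize_failure_keeps_headline():
    llm = MagicMock()
    llm.complete.return_value = "Flaky network test\n...details..."
    assert summarize_failure(llm, "ConnectionError in test_api") == "Flaky network test"
```

This isolates the deterministic scaffolding from the nondeterministic model, which is exactly the separation the case study argues for.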

Building Resilient Multi-Provider AI Agent Infrastructure for Financial Services

Gradient Labs

Gradient Labs built an AI agent that handles customer interactions for financial services companies, requiring high reliability in production. The company architected a sophisticated failover system that spans multiple LLM providers (OpenAI, Anthropic, Google) and hosting platforms (native APIs, Azure, AWS, GCP), enabling both traffic distribution across rate limits and automatic failover during errors, rate limiting, or latency spikes. They use Temporal for durable execution to checkpoint progress across long-running agentic workflows, and have implemented both provider-level and model-level failover strategies with tailored prompts for backup models, ensuring continuous operation even during catastrophic provider outages.
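
Stripped of the Temporal checkpointing and rate-limit-aware routing, the provider-level failover logic reduces to something like the following sketch; the client stubs, error handling, and per-model prompt templates are assumptions, not Gradient Labs' code.

```python
def call_openai(prompt: str) -> str:
    raise NotImplementedError  # placeholder provider client

def call_anthropic(prompt: str) -> str:
    raise NotImplementedError  # placeholder provider client

def call_google(prompt: str) -> str:
    raise NotImplementedError  # placeholder provider client

# each backup provider gets a prompt template tailored to its model
PROVIDERS = [
    ("openai", call_openai, "You are a support agent.\n{task}"),
    ("anthropic", call_anthropic, "Act as a support agent.\n\n{task}"),
    ("google", call_google, "Support agent instructions: {task}"),
]

def complete_with_failover(task: str) -> str:
    last_err = None
    for name, call, template in PROVIDERS:
        try:
            return call(template.format(task=task))
        except Exception as err:  # rate limits, timeouts, outages
            last_err = err
    raise RuntimeError("all providers failed") from last_err
```

The tailored templates matter because, as the case study notes, a prompt tuned for one model family rarely transfers verbatim to a backup model.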

Building Robust Enterprise Search with LLMs and Traditional IR

Glean

Glean tackles enterprise search by combining traditional information retrieval techniques with modern LLMs and embeddings. Rather than relying solely on AI techniques, they emphasize the importance of rigorous ranking algorithms, personalization, and hybrid approaches that combine classical IR with vector search. The company has achieved unicorn status and serves major enterprises by focusing on holistic search solutions that include personalization, feed recommendations, and cross-application integrations.

Building Stateful AI Agents with In-Context Learning and Memory Management

Letta

Letta addresses the fundamental limitation of current LLM-based agents: their inability to learn and retain information over time, leading to degraded performance as context accumulates. The platform enables developers to build stateful agents that learn by updating their context windows rather than model parameters, making learning interpretable and model-agnostic. The solution includes a developer platform with memory management tools, context window controls, and APIs for creating production agents that improve over time. Real-world deployments include a support agent that has been learning from Discord interactions for a month and recommendation agents for Bilt Rewards, demonstrating that agents with persistent memory can achieve performance comparable to fine-tuned models while remaining flexible and debuggable.

Building Synthetic Filesystems for AI Agent Navigation Across Enterprise Data Sources

Dust.tt

Dust.tt observed that their AI agents were attempting to navigate company data using filesystem-like syntax, prompting them to build synthetic filesystems that map disparate data sources (Notion, Slack, Google Drive, GitHub) into Unix-inspired navigable structures. They implemented five filesystem commands (list, find, cat, search, locate_in_tree) that allow agents to both structurally explore and semantically search across organizational data, transforming agents from search engines into knowledge workers capable of complex multi-step information tasks.
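
A minimal sketch of such a command surface, using the five verbs named in the case study; the dispatch mechanism and stubbed implementations are illustrative.

```python
from typing import Callable

def ls(path: str) -> list[str]: ...               # children of a node
def find(path: str, name: str) -> list[str]: ...  # match by name under a subtree
def cat(path: str) -> str: ...                    # read a document's content
def search(query: str) -> list[str]: ...          # semantic search across sources
def locate_in_tree(path: str) -> list[str]: ...   # where a node sits in the tree

COMMANDS: dict[str, Callable] = {
    "list": ls, "find": find, "cat": cat,
    "search": search, "locate_in_tree": locate_in_tree,
}

def run_tool(name: str, **kwargs):
    # the agent picks one verb per step; unknown verbs are rejected
    if name not in COMMANDS:
        raise ValueError(f"unknown command: {name}")
    return COMMANDS[name](**kwargs)
```

The key design point is that structural navigation (list, find, locate_in_tree) and semantic retrieval (search) share one namespace, so the agent can interleave both in a single multi-step task.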

Building Unified API Infrastructure for AI Integration at Scale

Merge

Merge, a unified API provider founded in 2020, helps companies offer native integrations across multiple platforms (HR, accounting, CRM, file storage, etc.) through a single API. As AI and LLMs emerged, Merge adapted by launching Agent Handler, an MCP-based product that enables live API calls for agentic workflows while maintaining their core synced data product for RAG-based use cases. The company serves major LLM providers including Mistral and Perplexity, enabling them to access customer data securely for both retrieval-augmented generation and real-time agent actions. Internally, Merge has adopted AI tools across engineering, support, recruiting, and operations, leading to increased output and efficiency while maintaining their core infrastructure focus on reliability and enterprise-grade security.

Building Voice-Enabled AI Assistants with Real-Time Processing

Bee

A detailed exploration of building real-time voice-enabled AI assistants, featuring multiple approaches from different companies and developers. The case study covers how to achieve low-latency voice processing, transcription, and LLM integration for interactive AI assistants. Solutions demonstrated include both commercial services like Deepgram and open-source implementations, with a focus on achieving sub-second latency, high accuracy, and cost-effective deployment.

Challenges in Building Enterprise Chatbots with LLMs: A Banking Case Study

Invento Robotics

A bank's attempt to implement a customer support chatbot using GPT-4 and RAG reveals the complexities and challenges of deploying LLMs in production. What was initially estimated as a three-month project struggled to deliver after a year, highlighting key challenges in domain knowledge management, retrieval effectiveness, conversation flow design, state management, latency, and regulatory compliance.

Comprehensive LLM Evaluation Framework for Production AI Code Assistants

Github

Github describes their robust evaluation framework for testing and deploying new LLM models in their Copilot product. The team runs over 4,000 offline tests, including automated code quality assessments and chat capability evaluations, before deploying any model changes to production. They use a combination of automated metrics, LLM-based evaluation, and manual testing to assess model performance, quality, and safety across multiple programming languages and frameworks.

Context Engineering and Agent Development at Scale: Building Open Deep Research

LangChain

Lance Martin from LangChain discusses the emerging discipline of "context engineering" through his experience building Open Deep Research, a deep research agent that evolved over a year to become the best-performing open-source solution on Deep Research Bench. The conversation explores how managing context in production agent systems—particularly across dozens to hundreds of tool calls—presents challenges distinct from simple prompt engineering, requiring techniques like context offloading, summarization, pruning, and multi-agent isolation. Martin's iterative development journey illustrates the "bitter lesson" for AI engineering: structured workflows that work well with current models can become bottlenecks as models improve, requiring engineers to continuously remove structure and embrace more general approaches to capture exponential model improvements.

Context Engineering for AI-Assisted Employee Onboarding

Etsy

Etsy explored using prompt engineering as an alternative to fine-tuning for AI-assisted employee onboarding, focusing on Travel & Entertainment policy questions and community forum support. They implemented a RAG-style approach using embeddings-based search to augment prompts with relevant Etsy-specific documents. The system achieved 86% accuracy on T&E policy questions and 72% on community forum queries, with various prompt engineering techniques like chain-of-thought reasoning and source citation helping to mitigate hallucinations and improve reliability.
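
The core of this RAG-style augmentation fits in a few lines: embed the question, retrieve the closest policy passages, and prepend them to the prompt with a citation instruction. The embedding stub and prompt wording below are assumptions, not Etsy's code.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    return np.zeros(768)  # placeholder: swap in a real embedding model

def top_k(query: str, docs: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    scores = [float(np.dot(q, embed(d))) for d in docs]  # cosine if normalized
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [d for _, d in ranked[:k]]

def build_prompt(question: str, docs: list[str]) -> str:
    context = "\n\n".join(top_k(question, docs))
    return ("Answer using only the policy excerpts below and cite your source.\n\n"
            f"{context}\n\nQuestion: {question}")
```

The "cite your source" instruction mirrors the source-citation technique the team used to suppress hallucinations.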

Context Engineering for Production AI Agents at Scale

Manus

Manus, a general AI agent platform, addresses the challenge of context explosion in long-running autonomous agents that can accumulate hundreds of tool calls during typical tasks. The company developed a comprehensive context engineering framework encompassing five key dimensions: context offloading (to file systems and sandbox environments), context reduction (through compaction and summarization), context retrieval (using file-based search tools), context isolation (via multi-agent architectures), and context caching (for KV cache optimization). This approach has been refined through five major refactors since launch in March, with the system supporting typical tasks requiring around 50 tool calls while maintaining model performance and managing token costs effectively through their layered action space architecture.

Context Engineering for Production AI Assistants at Scale

Shopify

Shopify developed Sidekick, an AI assistant serving millions of merchants on their commerce platform. The challenge was managing context windows effectively while maintaining performance, latency, and cost efficiency for an agentic system operating at massive scale. Their solution involved sophisticated "context engineering" techniques including aggressive token management (removing processed tool messages, trimming old conversation turns), a three-tier memory system (explicit user preferences, implicit user profiles, and episodic memory via RAG), and just-in-time instruction injection that collocates instructions with tool outputs. These techniques reportedly improved instruction adherence by 5-10% while reducing jailbreak likelihood and maintaining acceptable latency despite the system managing over 20 tools and handling complex multi-step agentic workflows.
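
As a toy example of the token-management side of this context engineering, the sketch below drops already-processed tool messages and trims old turns to a fixed budget; the message schema and budget are assumptions, not Sidekick's actual logic.

```python
def trim_context(messages: list[dict], max_turns: int = 20) -> list[dict]:
    # drop tool results that have already been folded into assistant replies
    kept = [m for m in messages if m["role"] != "tool"]
    # keep the system prompt, then only the most recent conversation turns
    system = [m for m in kept if m["role"] == "system"]
    rest = [m for m in kept if m["role"] != "system"]
    return system + rest[-max_turns:]
```

The memory tiers described above compensate for what trimming discards: durable preferences and episodic memory are re-injected on demand rather than carried in every turn.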

Context Engineering Platform for Multi-Domain RAG and Agentic Systems

Contextual

Contextual has developed an end-to-end context engineering platform designed to address the challenges of building production-ready RAG and agentic systems across multiple domains including e-commerce, code generation, and device testing. The platform combines multimodal ingestion, hierarchical document processing, hybrid search with reranking, and dynamic agents to enable effective reasoning over large document collections. In a recent context engineering hackathon, Contextual's dynamic agent achieved competitive results on a retail dataset of nearly 100,000 documents, demonstrating the value of constrained sub-agents, turn limits, and intelligent tool selection including MCP server management.

Context Rot: Evaluating LLM Performance Degradation with Increasing Input Tokens

ChromaDB

ChromaDB's technical report examines how large language models (LLMs) experience performance degradation as input context length increases, challenging the assumption that models process context uniformly. Through evaluation of 18 state-of-the-art models including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 across controlled experiments, the research reveals that model reliability decreases significantly with longer inputs, even on simple tasks like retrieval and text replication. The study demonstrates that factors like needle-question similarity, presence of distractors, haystack structure, and semantic relationships all impact performance non-uniformly as context length grows, suggesting that current long-context benchmarks may not adequately reflect real-world performance challenges.
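
A stripped-down harness for this kind of experiment places a "needle" fact at a controlled depth inside filler text and checks whether the model surfaces it; the sketch below illustrates the shape of such a trial, not the report's actual methodology.

```python
def ask_model(prompt: str) -> str:
    return ""  # placeholder: wire up a real LLM client here

def needle_trial(needle: str, question: str, filler: str, depth: float) -> bool:
    words = filler.split()
    cut = int(len(words) * depth)  # depth in [0, 1]: where the needle is buried
    haystack = " ".join(words[:cut] + [needle] + words[cut:])
    answer = ask_model(f"{haystack}\n\nQuestion: {question}")
    return needle.lower() in answer.lower()  # crude retrieval check
```

Sweeping depth and haystack length while adding distractors is what exposes the non-uniform degradation the report documents.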

Context-Seeking Conversational AI for Health Information Navigation

Google

Google Research developed a "Wayfinding AI" prototype based on Gemini to address the challenge of people struggling to find relevant, personalized health information online. Through formative user research with 33 participants and iterative design, they created an AI agent that proactively asks clarifying questions to understand user goals and context before providing answers. In a randomized study with 130 participants, the Wayfinding AI was significantly preferred over a baseline Gemini model across multiple dimensions including helpfulness, relevance, goal understanding, and tailoring, demonstrating that a context-seeking, conversational approach creates more empowering health information experiences than traditional question-answering systems.

Conversational AI Data Agent for Financial Analytics

Uber

Uber developed Finch, a conversational AI agent integrated into Slack, to address the inefficiencies of traditional financial data retrieval processes where analysts had to manually navigate multiple platforms, write complex SQL queries, or wait for data science team responses. The solution leverages generative AI, RAG, and self-querying agents to transform natural language queries into structured data retrieval, enabling real-time financial insights while maintaining enterprise-grade security through role-based access controls. The system reportedly reduces query response times from hours or days to seconds, though the text lacks quantified performance metrics or third-party validation of claimed benefits.

Cost Reduction Through Fine-tuning: Healthcare Chatbot and E-commerce Product Classification

Airtrain

Two case studies demonstrate significant cost reduction through LLM fine-tuning. A healthcare company reduced costs and improved privacy by fine-tuning Mistral-7B to match GPT-3.5's performance for patient intake, while an e-commerce unicorn improved product categorization accuracy from 47% to 94% using a fine-tuned model, reducing costs by 94% compared to using GPT-4.

Cost-Effective LLM Transaction Categorization for Business Banking

ANNA

ANNA, a UK business banking provider, implemented LLMs to automate transaction categorization for tax and accounting purposes across diverse business types. They achieved this by combining traditional ML with LLMs, particularly focusing on context-aware categorization that understands business-specific nuances. Through strategic optimizations including offline predictions, improved context utilization, and prompt caching, they reduced their LLM costs by 75% while maintaining high accuracy in their AI accountant system.
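
On the prompt-caching point specifically, the usual trick is to keep the long, unchanging instructions as a stable prefix so the provider can reuse its cached computation, appending only a short per-transaction suffix. The sketch below is illustrative; the categories and wording are invented.

```python
STATIC_PREFIX = (
    "You are an AI accountant. Categorize UK business transactions for tax "
    "and accounting purposes using the category list below.\n"
    "Categories: travel, software, salaries, office_costs, other\n"
    "Rules: (long, unchanging guidance goes here so the prefix is cacheable)\n"
)

def categorization_prompt(merchant: str, amount: float, business_type: str) -> str:
    # only this short suffix varies per transaction, maximizing cache hits
    return (f"{STATIC_PREFIX}\nBusiness type: {business_type}\n"
            f"Transaction: {merchant} £{amount:.2f}\nCategory:")
```

Combined with batching predictions offline, a stable prefix is a plausible route to the kind of cost reduction ANNA reports.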

Customer Service Transformation with AI-Based Email Automation and Chatbot Implementation

Sixt

Sixt, a mobility service provider with over €4 billion in revenue, transformed their customer service operations using generative AI to handle the complexity of multiple product lines across 100+ countries. The company implemented "Project AIR" (AI-based Replies) to automate email classification, generate response proposals, and deploy chatbots across multiple channels. Within five months of ideation, they moved from proof-of-concept to production, achieving over 90% classification accuracy using Amazon Bedrock with Anthropic Claude models (up from 70% with out-of-the-box solutions), while reducing classification costs by 70%. The solution now handles customer inquiries in multiple languages, integrates with backend reservation systems, and has expanded from email automation to messaging and chatbot services deployed across all corporate countries by Q1 2025.

Data and AI Governance Integration in Enterprise GenAI Adoption

Various

A panel discussion featuring leaders from Mercado Libre, ATB Financial, the retailer Loblaw, and Collibra discussing how they are implementing data and AI governance in the age of generative AI. The organizations are leveraging Google Cloud's Dataplex and other tools to enable comprehensive data governance, while also exploring GenAI applications for automating governance tasks, improving data discovery, and enhancing data quality management. Their approaches range from careful regulated adoption in finance to rapid e-commerce implementation, all emphasizing the critical connection between solid data governance and successful AI deployment.

Deploying AI Agents for Scalable Immigration Automation

Navismart AI

Navismart AI developed a multi-agent AI system to automate complex immigration processes that traditionally required extensive human expertise. The platform addresses challenges including complex sequential workflows, varying regulatory compliance across different countries, and the need for human oversight in high-stakes decisions. Built on a modular microservices architecture with specialized agents handling tasks like document verification, form filling, and compliance checks, the system uses Kubernetes for orchestration and scaling. The solution integrates REST APIs for inter-agent communication, implements end-to-end encryption for security, and maintains human-in-the-loop capabilities for critical decisions. The team started with US immigration processes due to their complexity and is expanding to other countries and domains like education.

Deploying AI Coding Agents in Highly Regulated Environments with Secure Infrastructure

ONA

ONA addresses the challenge faced by companies in highly regulated sectors (finance, government) that need to leverage AI coding assistants while maintaining strict data security and compliance requirements. The problem stems from the fact that many organizations initially ban AI tools like ChatGPT due to data leakage concerns, but employees use them anyway (with surveys showing 45% admit using banned AI tools and 58% sending sensitive data to public AI services). ONA's solution is a software engineering agent platform that runs entirely within the customer's own virtual private cloud (VPC), using isolated disposable development environments (virtual machines with dev containers), providing admin controls and audit logs, and ensuring all data remains within the customer's network with client-side encryption. The platform enables secure AI-assisted development with direct connections to customers' Git providers and LLM services without ONA accessing any code or sensitive data.

Deploying Generative AI at Scale Across 5,000 Developers

Liberty IT

Liberty IT, the technology division of Fortune 100 insurance company Liberty Mutual, embarked on a large-scale deployment of generative AI tools across their global workforce of over 5,000 developers and 50,000+ employees. The initiative involved rolling out custom GenAI platforms including Liberty GPT (an internal ChatGPT variant) to 70% of employees and GitHub Copilot to over 90% of IT staff within the first year. The company faced challenges including rapid technology evolution, model availability constraints, cost management, RAG implementation complexity, and achieving true adoption beyond basic usage. Through building a centralized AI platform with governance controls, implementing comprehensive learning programs across six streams, supporting 28 different models optimized for various use cases, and developing custom dashboards for cost tracking and observability, Liberty IT successfully navigated these challenges while maintaining enterprise security and compliance requirements.

Domain-Specific Agentic AI for Personalized Korean Skincare Recommendations

Glowe / Weaviate

Glowe, developed by Weaviate, addresses the challenge of finding effective skincare product combinations by building a domain-specific AI agent that understands Korean skincare science. The solution leverages dual embedding strategies with TF-IDF weighting to capture product effects from 94,500 user reviews, uses Weaviate's vector database for similarity search, and employs Gemini 2.5 Flash for routine generation. The system includes an agentic chat interface powered by Elysia that provides real-time personalized guidance, resulting in scientifically-grounded skincare recommendations based on actual user experiences rather than marketing claims.
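
To make the dual-embedding idea concrete, here is a minimal sketch that TF-IDF-weights a small effect vocabulary over review text and ranks products by cosine similarity to a desired effect profile; the vocabulary, reviews, and product names are invented for illustration and are not Glowe's actual pipeline.

```python
# Minimal sketch: TF-IDF-weight a small effect vocabulary over review text,
# then rank products by cosine similarity to a desired effect profile.
# Vocabulary, reviews, and product names are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

EFFECT_TERMS = ["hydrating", "brightening", "soothing", "exfoliating"]

product_reviews = {
    "snail-mucin-essence": "very hydrating and soothing, calmed my redness",
    "vitamin-c-serum": "clear brightening effect after two weeks of use",
    "aha-toner": "gently exfoliating, skin felt smoother and brighter",
}

vectorizer = TfidfVectorizer(vocabulary=EFFECT_TERMS)
names = list(product_reviews)
effect_matrix = vectorizer.fit_transform(product_reviews.values())  # products x effects

# A user's desired effect profile, expressed over the same vocabulary.
query = vectorizer.transform(["hydrating and soothing routine for sensitive skin"])

scores = cosine_similarity(query, effect_matrix).ravel()
for name, score in sorted(zip(names, scores), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```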

Domain-Specific LLMs for Drug Discovery Biomarker Identification

BenchSci

BenchSci developed an AI platform for drug discovery that combines domain-specific LLMs with extensive scientific data processing to assist scientists in understanding disease biology. They implemented a RAG architecture that integrates their structured biomedical knowledge base with Google's Med-PaLM model to identify biomarkers in preclinical research, resulting in a reported 40% increase in productivity and reduction in processing time from months to days.

Dutch YouTube Interface Localization and Content Management

Tastewise

This appears to be the Dutch footer section of YouTube's interface, showcasing the platform's localization and content management system. However, without more context about specific LLMOps implementation details, we can only infer that YouTube likely employs language models for content translation, moderation, and user interface localization.

Emotionally Aware AI Tutoring Agents with Multimodal Affect Detection

GlowingStar

GlowingStar Inc. develops emotionally aware AI tutoring agents that detect and respond to learner emotional states in real-time to provide personalized learning experiences. The system addresses the gap in current AI agents that focus solely on cognitive processing without emotional attunement, which is critical for effective learning and engagement. By incorporating multimodal affect detection (analyzing tone of voice, facial expressions, interaction patterns, latency, and silence) into an expanded agent architecture, the platform aims to deliver world-class personalized education while navigating significant challenges around emotional data privacy, cross-cultural generalization, and ethical deployment in sensitive educational contexts.

End-to-End LLM Observability for RAG-Powered AI Assistant

Splunk

Splunk built an AI Assistant leveraging Retrieval-Augmented Generation (RAG) to answer FAQs using curated public content from .conf24 materials. The system was developed in a hackathon-style sprint using their internal CIRCUIT platform. To operationalize this LLM-powered application at scale, Splunk integrated comprehensive observability across the entire RAG pipeline—from prompt handling and document retrieval to LLM generation and output evaluation. By instrumenting structured logs, creating unified dashboards in Splunk Observability Cloud, and establishing proactive alerts for quality degradation, hallucinations, and cost overruns, they achieved full visibility into response quality, latency, source document reliability, and operational health. This approach enabled rapid iteration, reduced mean time to resolution for quality issues, and established reproducible governance practices for production LLM deployments.
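
A hedged sketch of the per-stage structured logging described above: each RAG stage emits one JSON event with a shared trace id, latency, and quality fields that a log backend such as Splunk could index and alert on. The field names and pipeline stubs are assumptions.

```python
# Sketch: structured, per-stage logging for a RAG pipeline so a log backend
# can index stage, status, latency, and quality fields per trace.
import json
import logging
import time
import uuid
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("rag")

@contextmanager
def stage(trace_id: str, name: str, **fields):
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        log.info(json.dumps({
            "trace_id": trace_id,
            "stage": name,
            "status": status,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
            **fields,
        }))

trace_id = str(uuid.uuid4())
with stage(trace_id, "retrieval", k=5):
    docs = ["chunk about .conf24 schedules"]          # stand-in retrieval
with stage(trace_id, "generation", model="stub-llm"):
    answer = f"Based on {len(docs)} documents: ..."   # stand-in generation
with stage(trace_id, "evaluation", judge_score=0.92):
    pass  # an LLM-as-judge score would be attached here for alerting
```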

Engineering Principles and Practices for Production LLM Systems

Langchain

This case study captures insights from Lance Martin, ML engineer at Langchain, discussing the evolution from traditional ML to LLM-based systems and the emerging engineering discipline of building production GenAI applications. The discussion covers key challenges including the shift from model training to model orchestration, the need to continuously rearchitect systems as foundation models rapidly improve, and the critical importance of context engineering to manage token usage and prevent context degradation. Solutions explored include workflow versus agent architectures, the three-part context engineering playbook (reduce, offload, isolate), and evaluation strategies that emphasize user feedback and tracing over static benchmarks. Results demonstrate that teams like Manus have rearchitected their systems five times since March 2025, and that simpler approaches with proper observability often outperform complex architectures, with the understanding that today's solutions must be rebuilt as models improve.

Enhanced Agentic RAG for On-Call Engineering Support

Uber

Uber developed Genie, an internal on-call copilot that uses an enhanced agentic RAG (EAg-RAG) architecture to provide real-time support for engineering security and privacy queries through Slack. The system addressed significant accuracy issues in traditional RAG approaches by implementing LLM-powered agents for query optimization, source identification, and context refinement, along with enriched document processing that improved table extraction and metadata enhancement. The enhanced system achieved a 27% relative improvement in acceptable answers and a 60% relative reduction in incorrect advice, enabling deployment across critical security and privacy channels while reducing the support load on subject matter experts and on-call engineers.

Enhanced Agentic-RAG for Internal On-Call Support Copilot

Uber

Uber developed Genie, an internal on-call copilot powered by LLMs, to provide real-time support for engineering queries in Slack. When initial testing revealed significant accuracy issues with responses in the engineering security and privacy domain, the team transitioned from traditional RAG to an Enhanced Agentic RAG (EAg-RAG) architecture. This involved enriched document processing with custom Google Docs loaders and LLM-powered content formatting, plus pre- and post-processing agents for query optimization, source identification, and context refinement. The improvements resulted in a 27% relative increase in acceptable answers and a 60% relative reduction in incorrect advice, enabling deployment across critical security and privacy channels while reducing the support load on subject matter experts.

Enhancing E-commerce Search with Vector Embeddings and Generative AI

Mercado Libre / Grupo Boticario

Mercado Libre, Latin America's largest e-commerce platform, addressed the challenge of handling complex search queries by implementing vector embeddings and Google's Vector Search database. Their traditional word-matching search system struggled with contextual queries, leading to irrelevant results. The new system significantly improved search quality for complex queries, which constitute about half of all search traffic, resulting in increased click-through and conversion rates.

Enterprise Agent Orchestration Platform for Secure LLM Deployment

Airia

This case study explores how Airia developed an orchestration platform to help organizations deploy AI agents in production environments. The problem addressed is the significant complexity and security challenges that prevent businesses from moving beyond prototype AI agents to production-ready systems. The solution involves a comprehensive platform that provides agent building capabilities, security guardrails, evaluation frameworks, red teaming, and authentication controls. Results include successful deployments across multiple industries including hospitality (customer profiling across hotel chains), HR, legal (contract analysis), marketing (personalized content generation), and operations (real-time incident response through automated data aggregation), with customers reporting significant efficiency gains while maintaining enterprise security standards.

Enterprise Agentic AI Deployment: Panel Discussion on Production Realities and Technical Bottlenecks

Various

This panel discussion features leaders from Writer, You.com, Glean, and Google discussing the current state of deploying agentic AI systems in enterprise environments. The panelists address the gap between prototype development (which can now take 90 seconds) and production-ready systems that Fortune 500 companies can rely on. They identify key technical bottlenecks including data quality and governance issues, information retrieval challenges, function calling limitations, security vulnerabilities, and the difficulty of verifying agent actions. The consensus is that while every large enterprise has built some AI agents adding business value, they are far from having 50% of enterprise work handled by AI, with action agents for larger enterprises likely requiring several more years for major adoption.

Enterprise AI Agent Development: Lessons from Production Deployments

IBM, The Zig, Augmented AI Labs

This panel discussion features three companies - IBM, The Zig, and Augmented AI Labs - sharing their experiences building and deploying AI agents in enterprise environments. The panelists discuss the challenges of scaling AI agents, including cost management, accuracy requirements, human-in-the-loop implementations, and the gap between prototype demonstrations and production realities. They emphasize the importance of conservative approaches, proper evaluation frameworks, and the need for human oversight in high-stakes environments, while exploring emerging standards like agent communication protocols and the evolving landscape of enterprise AI adoption.

Enterprise AI Platform Deployment for Multi-Company Productivity Enhancement

Payfit, Alan

This case study presents the deployment of Dust.tt's AI platform across multiple companies including Payfit and Alan, focusing on enterprise-wide productivity improvements through LLM-powered assistants. The companies implemented a comprehensive AI strategy involving both top-down leadership support and bottom-up adoption, creating custom assistants for various workflows including sales processes, customer support, performance reviews, and content generation. The implementation achieved significant productivity gains of approximately 20% across teams, with some specific use cases reaching 50% improvements, while addressing challenges around security, model selection, and user adoption through structured rollout processes and continuous iteration.

Enterprise AI Platform Integration for Secure Production Deployment

Rubrik

Predibase, a fine-tuning and model serving platform, announced its acquisition by Rubrik, a data security and governance company, with the goal of combining Predibase's generative AI capabilities with Rubrik's secure data infrastructure. The integration aims to address the critical challenge that over 50% of AI pilots never reach production due to issues with security, model quality, latency, and cost. By combining Predibase's post-training and inference capabilities with Rubrik's data security posture management, the merged platform seeks to provide an end-to-end solution that enables enterprises to deploy generative AI applications securely and efficiently at scale.

Enterprise Challenges and Opportunities in Large-Scale LLM Deployment

Barclays

A senior leader in industry discusses the key challenges and opportunities in deploying LLMs at enterprise scale, highlighting the differences between traditional MLOps and LLMOps. The presentation covers critical aspects including cost management, infrastructure needs, team structures, and organizational adaptation required for successful LLM deployment, while emphasizing the importance of leveraging existing MLOps practices rather than completely reinventing the wheel.

Enterprise Document Data Extraction Using Agentic AI Workflows

Box

Box, an enterprise content platform serving over 115,000 customers including two-thirds of the Fortune 500, transformed their document data extraction capabilities by evolving from simple single-shot LLM prompting to sophisticated agentic AI workflows. Initially successful with basic document extraction using off-the-shelf models like GPT, Box encountered significant challenges when customers demanded extraction from complex 300-page documents with hundreds of fields, multilingual content, and poor OCR quality. The company implemented an agentic architecture using directed graphs that orchestrate multiple AI models, tools for validation and cross-checking, and iterative refinement processes. This approach dramatically improved accuracy and reliability while maintaining the flexibility to handle diverse document types and complex extraction requirements across their enterprise customer base.

Enterprise Infrastructure Challenges for Agentic AI Systems in Production

Various (Meta / Google / Monte Carlo / Azure)

A panel discussion featuring engineers from Meta, Google, Monte Carlo, and Microsoft Azure explores the fundamental infrastructure challenges that arise when deploying autonomous AI agents in production environments. The discussion reveals that agentic workloads differ dramatically from traditional software systems, requiring complete reimagining of reliability, security, networking, and observability approaches. Key challenges include non-deterministic behavior leading to incidents like chatbots selling cars for $1, massive scaling requirements as agents work continuously, and the need for new health checking mechanisms, semantic caching, and comprehensive evaluation frameworks to manage systems where 95% of outcomes are unknown unknowns.

Enterprise LLM Deployment with Multi-Cloud Data Platform Integration

Databricks

This presentation by Databricks' Product Management lead addresses the challenges large enterprises face when deploying LLMs into production, particularly around data governance, evaluation, and operational control. The talk centers on two primary case studies: FactSet's transformation of their query language translation system (improving from 59% to 85% accuracy while reducing latency from 15 to 6 seconds), and Databricks' internal use of Claude for automating analyst questionnaire responses. The solution involves decomposing complex prompts into multi-step agentic workflows, implementing granular governance controls across data and model access, and establishing rigorous evaluation frameworks to achieve production-grade reliability in high-risk enterprise environments.

Enterprise Neural Machine Translation at Scale

DeepL

DeepL, a translation company founded in 2017, has built a successful enterprise-focused business using neural machine translation models to tackle the language barrier problem at scale. The company handles hundreds of thousands of customers by developing specialized neural translation models that balance accuracy and fluency, training them on curated parallel and monolingual corpora while leveraging context injection rather than per-customer fine-tuning for scalability. By building their own GPU infrastructure early on and developing custom frameworks for inference optimization, DeepL maintains a competitive edge over general-purpose LLMs and established players like Google Translate, demonstrating strong product-market fit in high-stakes enterprise use cases where translation quality directly impacts legal compliance, customer experience, and business operations.

Enterprise-Scale AI-First Translation Platform with Agentic Workflows

Smartling

Smartling operates an enterprise-scale AI-first agentic translation delivery platform serving major corporations like Disney and IBM. The company addresses challenges around automation, centralization, compliance, brand consistency, and handling diverse content types across global markets. Their solution employs multi-step agentic workflows where different model functions validate each other's outputs, combining neural machine translation with large language models, RAG for accessing validated linguistic assets, sophisticated prompting, and automated post-editing for hyper-localization. The platform demonstrates measurable improvements in throughput (from 2,000 to 6,000-7,000 words per day), cost reduction (4-10x cheaper than human translation), and quality approaching 70% human parity for certain language pairs and content types, while maintaining enterprise requirements for repeatability, compliance, and brand voice consistency.

Enterprise-Scale Data Product AI Agent for Multi-Domain Knowledge Discovery

Bosch

Bosch, a global manufacturing and technology company with over 400,000 employees across 60+ countries, faced the challenge of accessing and understanding its vast distributed data ecosystem spanning automotive, consumer goods, power tools, and industrial equipment divisions. The company developed DPAI (Data Product AI Agent), an enterprise AI platform that enables natural language interaction with Bosch's data by combining a data mesh architecture, a centralized data marketplace, and generative AI capabilities. The solution integrates semantic understanding through ontologies, data catalogs, and Bosch-specific context to provide accurate, business-relevant answers across divisions. While still in development with an estimated one to two years until full completion, the platform demonstrates how large enterprises can overcome data fragmentation and contextual complexity to make organizational knowledge accessible through conversational AI.

Enterprise-Scale Deployment of AI Ambient Scribes Across Multiple Healthcare Systems

Memorial Sloan Kettering / McLeod Health / UCLA

This panel discussion features three major healthcare systems—McLeod Health, Memorial Sloan Kettering Cancer Center, and UCLA Health—discussing their experiences deploying generative AI-powered ambient clinical documentation (AI scribes) at scale. The organizations faced challenges in vendor evaluation, clinician adoption, and demonstrating ROI while addressing physician burnout and documentation burden. Through rigorous evaluation processes including randomized controlled trials, head-to-head vendor comparisons, and structured pilots, these systems successfully deployed AI scribes to hundreds to thousands of physicians. Results included significant reductions in burnout (20% at UCLA), improved patient satisfaction scores (5-6% increases at McLeod), time savings of 1.5-2 hours per day, and positive financial ROI through improved coding and RVU capture. Key learnings emphasized the importance of robust training, encounter-based pricing models, workflow integration, and managing expectations that AI scribes are not a universal solution for all specialties and clinicians.

Enterprise-Scale GenAI and Agentic AI Deployment in B2B Supply Chain Operations

Wesco

Wesco, a B2B supply chain and industrial distribution company, presents a comprehensive case study on deploying enterprise-grade AI applications at scale, moving from POC to production. The company faced challenges in transitioning from traditional predictive analytics to cognitive intelligence using generative AI and agentic systems. Their solution involved building a composable AI platform with proper governance, MLOps/LLMOps pipelines, and multi-agent architectures for use cases ranging from document processing and knowledge retrieval to fraud detection and inventory management. Results include deployment of 50+ use cases, significant improvements in employee productivity through "everyday AI" applications, and quantifiable ROI through transformational AI initiatives in supply chain optimization, with emphasis on proper observability, compliance, and change management to drive adoption.

Enterprise-Scale LLM Deployment with Licensed Content for Business Intelligence

Factiva

Factiva, a Dow Jones business intelligence platform, implemented a secure, enterprise-scale LLM solution for their content aggregation service. They developed "Smart Summaries" that allows natural language querying across their vast licensed content database of nearly 3 billion articles. The implementation required securing explicit GenAI licensing agreements from thousands of publishers, ensuring proper attribution and royalty tracking, and deploying a secure cloud infrastructure using Google's Gemini model. The solution successfully launched in November 2023 with 4,000 publishers, growing to nearly 5,000 publishers by early 2024.

Enterprise-Scale LLM Platform with Multi-Model Support and Copilot Customization

Telus

Telus developed Fuel X, an enterprise-scale LLM platform that provides centralized management of multiple AI models and services. The platform enables creation of customized copilots for different use cases, with over 30,000 custom copilots built and 35,000 active users. Key features include flexible model switching, enterprise security, RAG capabilities, and integration with workplace tools like Slack and Google Chat. Results show significant impact, including 46% self-resolution rate for internal support queries and 21% reduction in agent interactions.

Enterprise-Wide Generative AI Implementation for Marketing Content Generation and Translation

Bosch

Bosch, a global industrial and consumer goods company, implemented a centralized generative AI platform called "Gen playground" to address their complex marketing content needs across 3,500+ websites and numerous social media channels. The solution enables their 430,000+ associates to create text content, generate images, and perform translations without relying on external agencies, significantly reducing costs and turnaround time from 6-12 weeks to near-immediate results while maintaining brand consistency and quality standards.

Enterprise-Wide Virtual Assistant for Employee Knowledge Access

BNY Mellon

BNY Mellon implemented an LLM-based virtual assistant to help their 50,000 employees efficiently access internal information and policies across the organization. Starting with small pilot deployments in specific departments, they scaled the solution enterprise-wide using Google's Vertex AI platform, while addressing challenges in document processing, chunking strategies, and context-awareness for location-specific policies.

Evaluation Patterns for Deep Agents in Production

Langchain

LangChain built and deployed four production applications powered by "Deep Agents" - stateful, long-running AI agents capable of complex tasks including coding, email assistance, and agent building. The challenge was developing comprehensive evaluation strategies for these agents that went beyond traditional LLM evaluation approaches. Their solution involved five key patterns: bespoke test logic for each datapoint with custom assertions, single-step evaluations for validating specific decision points, full agent turn testing for end-to-end behavior, multi-turn conversations with conditional logic to simulate realistic interactions, and proper environment setup with clean, reproducible test conditions. Using LangSmith's Pytest and Vitest integrations, they implemented flexible evaluation frameworks that could assess agent trajectories, final responses, and state artifacts while maintaining fast, debuggable test suites through techniques like API mocking and containerized environments.
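
The "bespoke test logic for each datapoint" pattern can be sketched with plain pytest: every case carries its own assertion function rather than one generic metric. The agent stub and cases below are invented; LangSmith's pytest integration would wrap the same structure with tracing.

```python
# Sketch: each eval datapoint carries its own check function, so assertions
# can inspect answers, state artifacts, or side effects per case.
import pytest

def run_agent(prompt: str) -> dict:
    # Stand-in for a full agent turn; returns final answer plus state.
    return {"answer": f"echo: {prompt}", "files_written": []}

CASES = [
    ("say hello", lambda out: "hello" in out["answer"]),
    ("do not write files", lambda out: out["files_written"] == []),
]

@pytest.mark.parametrize("prompt,check", CASES, ids=[c[0] for c in CASES])
def test_agent_case(prompt, check):
    out = run_agent(prompt)
    assert check(out), f"bespoke assertion failed for: {prompt!r}"
```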

Evolution from Task-Specific Models to Multi-Agent Orchestration Platform

AI21

AI21 Labs evolved their production AI systems from task-specific models (2022-2023) to RAG-as-a-Service, and ultimately to Maestro, a multi-agent orchestration platform. The company identified that while general-purpose LLMs demonstrated impressive capabilities, they weren't optimized for specific business use cases that enterprises actually needed, such as contextual question answering and summarization. AI21 developed smaller language models fine-tuned for specific tasks, wrapped them with pre- and post-processing operations (including hallucination filters), and eventually built a comprehensive RAG system when customers struggled to identify relevant context from large document corpora. The Maestro platform emerged to handle complex multi-hop queries by automatically breaking them into subtasks, parallelizing execution, and orchestrating multiple agents and tools, achieving dramatically improved quality with full traceability for enterprise requirements.

Evolution of AI Agents: From Manual Workflows to End-to-End Training

OpenAI

OpenAI's journey in developing agentic products showcases the evolution from manually designed workflows with LLMs to end-to-end trained agents. The company has developed three main agentic products - Deep Research, Operator, and Codex CLI - each addressing different use cases from web research to code generation. These agents demonstrate how end-to-end training with reinforcement learning enables better error recovery and more natural interaction compared to traditional manually designed workflows.

Evolution of AI Systems and LLMOps from Research to Production: Infrastructure Challenges and Application Design

NVIDIA / Lepton

This lecture transcript from Yangqing Jia, VP at NVIDIA and founder of Lepton AI (acquired by NVIDIA), explores the evolution of AI system design from an engineer's perspective. The talk covers the progression from research frameworks (Caffe, TensorFlow, PyTorch) to production AI infrastructure, examining how LLM applications are built and deployed at scale. Jia discusses the emergence of "neocloud" infrastructure designed specifically for AI workloads, the challenges of GPU cluster management, and practical considerations for building consumer and enterprise LLM applications. Key insights include the trade-offs between open-source and closed-source models, the importance of RAG and agentic AI patterns, infrastructure design differences between conventional cloud and AI-specific platforms, and the practical challenges of operating LLMs in production, including supply chain management for GPUs and cost optimization strategies.

Evolution of an Internal AI Platform from No-Code LLM Apps to Agentic Systems

Grab

Grab developed SpellVault, an internal no-code AI platform that evolved from a simple RAG-based LLM app builder into a sophisticated agentic system supporting thousands of apps across the organization. Initially designed to democratize AI access for non-technical users through knowledge integrations and plugins, the platform progressively incorporated advanced capabilities including workflow orchestration, ReAct agent execution, unified tool frameworks, and Model Context Protocol (MCP) compatibility. This evolution enabled SpellVault to transform from supporting static question-answering apps into powering dynamic AI agents capable of reasoning, acting, and interacting with internal and external systems, while maintaining its core mission of accessibility and ease of use.

Evolution of Code Evaluation Benchmarks: From Single-Line Completion to Full Codebase Translation

Cursor

This research presentation details four years of work developing evaluation methodologies for coding LLMs across varying time horizons, from second-level code completions to hour-long codebase translations. The speaker addresses critical challenges in evaluating production coding AI systems including data contamination, insufficient test suites, and difficulty calibration. Key solutions include LiveCodeBench's dynamic evaluation approach with periodically updated problem sets, automated test generation using LLM-driven approaches, and novel reward hacking detection systems for complex optimization tasks. The work demonstrates how evaluation infrastructure must evolve alongside model capabilities, incorporating intermediate grading signals, latency-aware metrics, and LLM-as-judge approaches to detect non-idiomatic coding patterns that pass traditional tests but fail real-world quality standards.

Evolution of Industrial AI: From Traditional ML to Multi-Agent Systems

Hitachi

Hitachi's journey in implementing AI across industrial applications showcases the evolution from traditional machine learning to advanced generative AI solutions. The case study highlights how they transformed from focused applications in maintenance, repair, and operations to a more comprehensive approach integrating LLMs, focusing particularly on reliability, small data scenarios, and domain expertise. Key implementations include repair recommendation systems for fleet management and fault tree extraction from manuals, demonstrating the practical challenges and solutions in industrial AI deployment.

Financial Transaction Categorization at Scale Using LLMs and Custom Embeddings

Mercado Libre

Mercado Libre (MELI) faced the challenge of categorizing millions of financial transactions across Latin America in multiple languages and formats as Open Finance unlocked access to customer financial data. Starting with a brittle regex-based system in 2021 that achieved only 60% accuracy and was difficult to maintain, they evolved through three generations: first implementing GPT-3.5 Turbo in 2023 to achieve 80% accuracy with 75% cost reduction, then transitioning to GPT-4o-mini in 2024, and finally developing custom BERT-based semantic embeddings trained on regional financial text to reach 90% accuracy with an additional 30% cost reduction. This evolution enabled them to scale from processing tens of millions of transactions per quarter to tens of millions per week, while enabling near real-time categorization that powers personalized financial insights across their ecosystem.
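
A minimal sketch of the embedding-based categorization stage: transaction descriptors are embedded and matched to per-category centroids. The embed() function below is a stand-in for a domain-tuned BERT encoder, and the category labels and strings are invented.

```python
# Sketch: nearest-centroid categorization over transaction embeddings.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a fine-tuned BERT encoder trained on regional financial text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

labeled = {
    "groceries": ["MERCADO PAGO SUPERMERCADO", "CARREFOUR SP"],
    "transport": ["UBER TRIP", "99 APP VIAGEM"],
}
centroids = {
    cat: np.mean([embed(t) for t in texts], axis=0)
    for cat, texts in labeled.items()
}

def categorize(descriptor: str) -> str:
    q = embed(descriptor)
    return max(centroids, key=lambda c: float(q @ centroids[c]))

print(categorize("UBER BR VIAGEM 123"))
```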

Fine-Tuning and Quantizing LLMs for Dynamic Attribute Extraction

Mercari

Mercari tackled the challenge of extracting dynamic attributes from user-generated marketplace listings by fine-tuning a 2B parameter LLM using QLoRA. The team successfully created a model that outperformed GPT-3.5-turbo while being 95% smaller and 14 times more cost-effective. The implementation included careful dataset preparation, parameter efficient fine-tuning, and post-training quantization using llama.cpp, resulting in a production-ready model with better control over hallucinations.
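
For orientation, a hedged QLoRA setup with Hugging Face transformers and peft looks roughly like the following; the base model id and hyperparameters are placeholders, not Mercari's reported configuration, and running it requires a CUDA GPU with bitsandbytes installed.

```python
# Sketch: 4-bit quantized base weights with small trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",                       # placeholder ~2B base model
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # a small fraction of the 2B params

# Training would proceed with a standard SFT loop; after merging adapters,
# llama.cpp can quantize the result (e.g., to GGUF) for cheap serving.
```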

Fine-tuning Custom Embedding Models for Enterprise Search

Glean

Glean implements enterprise search and RAG systems by developing custom embedding models for each customer. They tackle the challenge of heterogeneous enterprise data by using a unified data model and fine-tuning embedding models through continued pre-training and synthetic data generation. Their approach combines traditional search techniques with semantic search, achieving a 20% improvement in search quality over 6 months through continuous learning from user feedback and company-specific language adaptation.

Foundation Model for Unified Personalization at Scale

Netflix

Netflix developed a unified foundation model based on transformer architecture to consolidate their diverse recommendation systems, which previously consisted of many specialized models for different content types, pages, and use cases. The foundation model uses autoregressive transformers to learn user representations from interaction sequences, incorporating multi-token prediction, multi-layer representation, and long context windows. By scaling from millions to billions of parameters over 2.5 years, they demonstrated that scaling laws apply to recommendation systems, achieving notable performance improvements while creating high leverage across downstream applications through centralized learning and easier fine-tuning for new use cases.

Framework for Evaluating LLM Production Use Cases

Scale Venture Partners

Barak Turovsky, drawing from his experience leading Google Translate and other AI initiatives, presents a framework for evaluating LLM use cases in production. The framework analyzes use cases based on two key dimensions: accuracy requirements and fluency needs, along with consideration of stakes involved. This helps organizations determine which applications are suitable for current LLM deployment versus those that need more development. The framework suggests creative and workplace productivity applications are better immediate fits for LLMs compared to high-stakes information/decision support use cases.

GenAI Governance in Practice: Access Control, Data Quality, and Monitoring for Production LLM Systems

Xomnia

Martin Der, a data scientist at Xomnia, presents practical approaches to GenAI governance addressing the challenge that only 5% of GenAI projects deliver immediate ROI. The talk focuses on three key pillars: access and control (enabling self-service prototyping through tools like Open WebUI while avoiding shadow AI), unstructured data quality (detecting contradictions and redundancies in knowledge bases through similarity search and LLM-based validation), and LLMOps monitoring (implementing tracing platforms like LangFuse and creating dynamic golden datasets for continuous testing). The solutions include deploying Chrome extensions for workflow integration, API gateways for centralized policy enforcement, and developing a knowledge agent called "Genie" for internal use cases across telecom, healthcare, logistics, and maritime industries.
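
The contradiction-detection step can be sketched as a two-stage check: embeddings surface highly similar chunk pairs, and an LLM judge decides whether a flagged pair actually conflicts. Both the embedder and the judge below are stubs standing in for real models.

```python
# Sketch: flag similar knowledge-base chunk pairs, then validate conflicts.
from itertools import combinations
import numpy as np

def embed(text: str) -> np.ndarray:  # stand-in embedding model
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def llm_contradicts(a: str, b: str) -> bool:
    # Stand-in for an LLM-as-judge call; here: negation flip + shared topic words.
    shared = set(a.lower().split()) & set(b.lower().split())
    return ("not" in a.lower()) != ("not" in b.lower()) and len(shared) >= 2

chunks = [
    "Refunds are not possible after 30 days.",
    "Customers may request refunds within 30 days.",
    "Support is available on weekdays.",
]
vecs = [embed(c) for c in chunks]

SIM_THRESHOLD = -1.0  # with a real embedder, something like 0.85
for (i, a), (j, b) in combinations(enumerate(chunks), 2):
    sim = float(vecs[i] @ vecs[j])
    if sim >= SIM_THRESHOLD and llm_contradicts(a, b):
        print(f"possible contradiction (sim={sim:.2f}):\n  {a}\n  {b}")
```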

GenAI-Powered Invoice Document Processing and Automation

Uber

Uber faced significant challenges processing a high volume of invoices daily from thousands of global suppliers, with diverse formats, 25+ languages, and varying templates requiring substantial manual intervention. The company developed TextSense, a GenAI-powered document processing platform that leverages OCR, computer vision, and large language models (specifically OpenAI GPT-4 after evaluating multiple options including fine-tuned Llama 2 and Flan T5) to automate invoice data extraction. The solution achieved 90% overall accuracy, reduced manual processing by 2x, cut average handling time by 70%, and delivered 25-30% cost savings compared to manual processes, while providing a scalable, configuration-driven platform adaptable to diverse document types.

Generating 3D Shoppable Product Visualizations with Veo Video Generation Model

Google

Google developed a three-generation evolution of AI-powered systems to transform 2D product images into interactive 3D visualizations for online shopping, culminating in a solution based on their Veo video generation model. The challenge was to replicate the tactile, hands-on experience of in-store shopping in digital environments while making the technology scalable and cost-effective for retailers. The latest approach uses Veo's diffusion-based architecture, fine-tuned on millions of synthetic 3D assets, to generate realistic 360-degree product spins from as few as one to three product images. This system now powers interactive 3D visualizations across multiple product categories on Google Shopping, significantly improving the online shopping experience by enabling customers to virtually inspect products from multiple angles.

Generating Production-Ready MCP Servers from OpenAPI Specifications

SpeakEasy

SpeakEasy tackled the challenge of enabling AI agents to interact with existing APIs by developing a tool that automatically generates Model Context Protocol (MCP) servers from OpenAPI documents. The company identified critical issues when generating over 50 production MCP servers for customers, including tool explosion (too many exposed operations), verbose descriptions consuming excessive tokens, complex data formats confusing LLMs, and inadequate access controls. Their solution involved a three-layer optimization approach: pruning OpenAPI documents with custom extensions, building intelligence into the generator to handle complex formats and streaming responses, and providing customization files for precise tool control. The result is production-ready MCP servers that balance LLM context window constraints with functional completeness, using techniques like scope-based access control, automatic data transformation, and optimized descriptions.
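
A simplified sketch of the pruning layer: keep an allowlist of OpenAPI operations and truncate verbose descriptions before tool generation, so the exposed surface fits an LLM's context window. The spec, allowlist, and limits below are illustrative, not SpeakEasy's actual generator.

```python
# Sketch: prune an OpenAPI document to an allowlist and trim descriptions.
import copy

ALLOWED_OPERATIONS = {"listPets", "getPet"}          # curated tool allowlist
MAX_DESCRIPTION_CHARS = 200

def prune_openapi(spec: dict) -> dict:
    pruned = copy.deepcopy(spec)
    for path, methods in list(pruned.get("paths", {}).items()):
        for verb, op in list(methods.items()):
            if op.get("operationId") not in ALLOWED_OPERATIONS:
                del methods[verb]                     # drop unexposed operation
            elif "description" in op:
                op["description"] = op["description"][:MAX_DESCRIPTION_CHARS]
        if not methods:
            del pruned["paths"][path]
    return pruned

spec = {
    "openapi": "3.1.0",
    "paths": {
        "/pets": {
            "get": {"operationId": "listPets", "description": "List pets. " * 100},
            "post": {"operationId": "createPet", "description": "Create a pet."},
        }
    },
}
print(prune_openapi(spec))
```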

Generative AI Implementation in Banking Customer Service and Knowledge Management

Various

Multiple banks, including Discover Financial Services, Scotia Bank, and others, share their experiences implementing generative AI in production. The case study focuses particularly on Discover's implementation of gen AI for customer service, where they achieved a 70% reduction in agent search time by using RAG and summarization for procedure documentation. The implementation included careful consideration of risk management, regulatory compliance, and human-in-the-loop validation, with technical writers and agents providing continuous feedback for model improvement.

Generative AI-Powered Knowledge Sharing System for Travel Expertise

Hotelplan Suisse

Hotelplan Suisse implemented a generative AI solution to address the challenge of sharing travel expertise across their 500+ travel experts. The system integrates multiple data sources and uses semantic search to provide instant, expert-level travel recommendations to sales staff. The solution reduced response time from hours to minutes and includes features like chat history management, automated testing, and content generation capabilities for marketing materials.

Global News Organization's AI-Powered Content Production and Verification System

Reuters

Reuters has implemented a comprehensive AI strategy to enhance its global news operations, focusing on reducing manual work, augmenting content production, and transforming news delivery. The organization developed three key tools: a press release fact extraction system, an AI-integrated CMS called Leon, and a content packaging tool called LAMP. They've also launched the Reuters AI Suite for clients, offering transcription and translation capabilities while maintaining strict ethical guidelines around AI-generated imagery and maintaining journalistic integrity.

Google Photos Magic Editor: Transitioning from On-Device ML to Cloud-Based Generative AI for Image Editing

Google

Google Photos evolved from using on-device machine learning models for basic image editing features like background blur and object removal to implementing cloud-based generative AI for their Magic Editor feature. The team transitioned from small, specialized models (10MB) running locally on devices to large-scale generative models hosted in the cloud to enable more sophisticated image editing capabilities like scene reimagination, object relocation, and advanced inpainting. This shift required significant changes in infrastructure, capacity planning, evaluation methodologies, and user experience design while maintaining focus on grounded, memory-preserving edits rather than fantastical image generation.

Hardening AI Agents for E-commerce at Scale: Multi-Company Perspectives on RL Alignment and Reliability

Prosus / Microsoft / Inworld AI / IUD

This panel discussion features experts from Microsoft, Google Cloud, Inworld AI, and Brazilian e-commerce company IUD (Prosus partner) discussing the challenges of deploying reliable AI agents for e-commerce at scale. The panelists share production experiences ranging from Google Cloud's support ticket routing agent that improved policy adherence from 45% to 90% using DPO adapters, to Microsoft's shift away from prompt engineering toward post-training methods for all Copilot models, to Inworld AI's voice agent architecture optimization through cascading models, and IUD's struggles with personalization balance in their multi-channel shopping agent. Key challenges identified include model localization for UI elements, cost efficiency, real-time voice adaptation, and finding the right balance between automation and user control in commerce experiences.

Healthcare Conversational AI and Multi-Model Cost Management in Production

Amberflo / Interactly.ai

A panel discussion featuring Interactly.ai's development of conversational AI for healthcare appointment management, and Amberflo's approach to usage tracking and cost management for LLM applications. The case study explores how Interactly.ai handles the challenges of deploying LLMs in healthcare settings with privacy and latency constraints, while Amberflo addresses the complexities of monitoring and billing for multi-model LLM applications in production.

Human-AI Synergy in Pharmaceutical Research and Document Processing

Merantix

Merantix has implemented AI systems that focus on human-AI collaboration across multiple domains, particularly in pharmaceutical research and document processing. Their approach emphasizes progressive automation where AI systems learn from human input, gradually taking over more tasks while maintaining high accuracy. In pharmaceutical applications, they developed a system for analyzing rodent behavior videos, while in document processing, they created solutions for legal and compliance cases where error tolerance is minimal. The systems demonstrate a shift from using AI as mere tools to creating collaborative AI-human workflows that maintain high accuracy while improving efficiency.

Hybrid LLM-Optimization System for Trip Planning with Real-World Constraints

Google

Google Research developed a hybrid system for trip planning that combines LLMs with optimization algorithms to address the challenge of generating practical travel itineraries. The system uses Gemini models to generate initial trip plans based on user preferences and qualitative goals, then applies a two-stage optimization algorithm that incorporates real-world constraints like opening hours, travel times, and budget considerations to produce feasible itineraries. This approach was implemented in Google's "AI trip ideas in Search" feature, demonstrating how LLMs can be effectively deployed in production while maintaining reliability through algorithmic correction of potential feasibility issues.
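
The two-stage pattern can be illustrated with a toy feasibility repair: the LLM proposes an ordered itinerary, then a deterministic pass drops (or, in a fuller version, reorders) stops that violate opening-hours constraints. All data below is invented.

```python
# Sketch: deterministic repair of an LLM-proposed itinerary against
# opening-hours constraints.
OPENING_HOURS = {              # place -> (open, close), 24h clock
    "museum": (10, 17),
    "market": (8, 13),
    "viewpoint": (0, 24),
}

llm_proposal = ["market", "museum", "viewpoint"]  # stand-in LLM output
VISIT_HOURS = 2

def repair(plan, start_hour=9):
    feasible, t = [], start_hour
    for place in plan:
        open_h, close_h = OPENING_HOURS[place]
        arrival = max(t, open_h)               # wait for opening if early
        if arrival + VISIT_HOURS <= close_h:   # must finish before closing
            feasible.append((place, arrival))
            t = arrival + VISIT_HOURS
        # else: the infeasible stop is dropped (a fuller version would reorder)
    return feasible

print(repair(llm_proposal))  # [('market', 9), ('museum', 11), ('viewpoint', 13)]
```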

Hybrid ML and LLM Approach for Automated Question Quality Feedback

Stack Overflow

Stack Overflow developed Question Assistant to provide automated feedback on question quality for new askers, addressing the repetitive nature of human reviewer comments in their Staging Ground platform. Initial attempts to use LLMs alone to rate question quality failed due to unreliable predictions and generic feedback. The team pivoted to a hybrid approach combining traditional logistic regression models trained on historical reviewer comments to flag quality indicators, paired with Google's Gemini LLM to generate contextual, actionable feedback. While the solution didn't significantly improve approval rates or review times, it achieved a meaningful 12% increase in question success rates (questions that remain open and receive answers or positive scores) across two A/B tests, leading to full deployment in March 2025.
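
A rough sketch of the hybrid shape: a lightweight classifier flags a quality indicator, and only flagged questions trigger an LLM prompt for tailored feedback. The training examples and prompt wording are illustrative, not Stack Overflow's actual features.

```python
# Sketch: traditional classifier gates which questions get LLM feedback.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_questions = [
    "my code doesn't work please help",  # lacks problem details
    "How do I parse ISO dates in Python 3.11? I tried datetime.fromisoformat...",
]
needs_details = [1, 0]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(train_questions, needs_details)

def feedback_prompt(question: str) -> str | None:
    if clf.predict([question])[0] == 1:
        return (
            "The question below lacks problem details. Write one short, "
            f"actionable suggestion for the asker.\n\nQuestion: {question}"
        )
    return None  # nothing flagged, no LLM call needed

print(feedback_prompt("it crashes help"))
```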

Implementing Generative AI in Manufacturing: A Multi-Use Case Study

Accenture

Accenture's Industry X division conducted extensive experiments with generative AI in manufacturing settings throughout 2023. They developed and validated nine key use cases including operations twins, virtual mentors, test case generation, and technical documentation automation. The implementations showed significant efficiency gains (40-50% effort reduction in some cases) while maintaining a human-in-the-loop approach. The study emphasized the importance of using domain-specific data, avoiding generic knowledge management solutions, and implementing multi-agent orchestrated solutions rather than standalone models.

Implementing LLMs for Patient Education and Healthcare Communication

National Healthcare Group

National Healthcare Group addressed the challenge of inconsistent and time-consuming patient education by implementing LLM-powered chatbots integrated into their existing healthcare apps and messaging platforms. The solution provides 24/7 multilingual patient education, focusing on conditions like eczema and medical test preparation, while ensuring privacy and accuracy. The implementation emphasizes integration with existing platforms rather than creating new standalone solutions, combined with careful monitoring and refinement of responses.

Implementing MCP Gateway for Large-Scale LLM Integration Infrastructure

Anthropic

Anthropic faced the challenge of managing an explosion of LLM-powered services and integrations across their organization, leading to duplicated functionality and integration chaos. They solved this by implementing a standardized MCP (Model Context Protocol) gateway that provides a single point of entry for all LLM integrations, handling authentication, credential management, and routing to both internal and external services. This approach reduced engineering overhead, improved security by centralizing credential management, and created a "pit of success" where doing the right thing became the easiest thing to do for their engineering teams.
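
A minimal sketch of the gateway idea: one entry point resolves a tool route and injects centrally managed credentials, so calling teams never handle secrets directly. The route names, environment variables, and handler below are invented.

```python
# Sketch: a single gateway that routes tool calls and injects credentials.
import os
from typing import Callable

class Gateway:
    def __init__(self):
        self._routes: dict[str, tuple[Callable[..., str], str]] = {}

    def register(self, name: str, handler: Callable[..., str], secret_env: str):
        self._routes[name] = (handler, secret_env)

    def call(self, name: str, **kwargs) -> str:
        handler, secret_env = self._routes[name]      # single routing point
        token = os.environ.get(secret_env, "")        # central credential lookup
        return handler(token=token, **kwargs)         # audit logging would go here

def search_docs(token: str, query: str) -> str:
    return f"[docs results for {query!r}]"            # stand-in internal service

gw = Gateway()
gw.register("internal.docs.search", search_docs, secret_env="DOCS_API_TOKEN")
print(gw.call("internal.docs.search", query="rate limits"))
```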

Implementing MCP Remote Server for CRM Agent Integration

HubSpot

HubSpot built a remote Model Context Protocol (MCP) server to enable AI agents like ChatGPT to interact with their CRM data. The challenge was to provide seamless, secure access to CRM objects (contacts, companies, deals) for ChatGPT's 500 million weekly users, most of whom aren't developers. In less than four weeks, HubSpot's team extended the Java MCP SDK to create a stateless, HTTP-based microservice that integrated with their existing REST APIs and RPC system, implementing OAuth 2.0 for authentication and user permission scoping. The solution made HubSpot the first CRM with an OpenAI connector, enabling read-only queries that allow customers to analyze CRM data through natural language interactions while maintaining enterprise-grade security and scale.

Improving LLM Food Tracking Accuracy through Systematic Evaluation and Few-Shot Learning

Taralli

This case study covers Taralli's food tracking application, which initially used a naive approach with GPT-4o-mini for calorie and nutrient estimation, resulting in significant accuracy issues. Through the implementation of systematic evaluation methods, creation of a golden dataset, and optimization using DSPy's BootstrapFewShotWithRandomSearch technique, they improved accuracy from 17% to 76% while maintaining reasonable response times with Gemini 2.5 Flash.
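
Assuming DSPy's standard API, the optimization step might look like the following sketch; the signature, metric threshold, model id, and tiny trainset are placeholders for Taralli's golden dataset, and running it requires a configured model API key.

```python
# Sketch: compiling a calorie-estimation program with DSPy's
# BootstrapFewShotWithRandomSearch optimizer.
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

dspy.configure(lm=dspy.LM("gemini/gemini-2.5-flash"))  # placeholder model id

class EstimateCalories(dspy.Signature):
    """Estimate total calories for a free-text food description."""
    description: str = dspy.InputField()
    calories: int = dspy.OutputField()

program = dspy.Predict(EstimateCalories)

def within_20_percent(example, pred, trace=None):
    try:
        return abs(int(pred.calories) - example.calories) <= 0.2 * example.calories
    except (TypeError, ValueError):
        return False

trainset = [
    dspy.Example(description="two scrambled eggs", calories=180).with_inputs("description"),
    dspy.Example(description="medium banana", calories=105).with_inputs("description"),
    # ...a real golden dataset would hold dozens to hundreds of examples
]

optimizer = BootstrapFewShotWithRandomSearch(
    metric=within_20_percent,
    max_bootstrapped_demos=4,     # few-shot demos mined from the trainset
    num_candidate_programs=8,     # random-search budget over demo subsets
)
compiled = optimizer.compile(program, trainset=trainset)
print(compiled(description="bowl of oatmeal with honey").calories)
```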

Improving Multilingual Search with Few-Shot LLM Translations

Delivery Hero

Delivery Hero operates across 68 countries and faced significant challenges with multilingual search due to dialectal variations, transliterations, spelling errors, and multiple languages within single markets. Traditional machine translation systems struggled with user intent and contextual nuances, leading to poor search results. The company implemented a solution using Large Language Models (LLMs), specifically Gemini, with few-shot learning to provide context-aware translations that handle regional dialects, correct spelling mistakes, and understand transliterations. By combining LLM-generated translations with Elastic Search and Vector Search in a hybrid approach, they achieved over 90% translation accuracy for restaurant queries and demonstrated positive improvements in user engagement through A/B testing, with the solution being rolled out to their Talabat and Hungerstation brands.
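
A sketch of the few-shot translation prompt: regional examples covering dialect, transliteration, and typos are prepended so the model normalizes the query before retrieval. The examples are invented; the normalized output would then feed both lexical (Elasticsearch) and vector retrieval.

```python
# Sketch: build a few-shot prompt for context-aware search-query translation.
FEW_SHOTS = [
    ("شاورما فراخ", "chicken shawarma"),           # Arabic dialect
    ("burgr", "burger"),                            # spelling error
    ("shish tawooq", "shish tawook chicken"),       # transliteration variant
]

def build_prompt(query: str) -> str:
    lines = [
        "Translate the food-search query to English for a search engine.",
        "Normalize dialect, transliteration, and typos.",
        "",
    ]
    for src, tgt in FEW_SHOTS:
        lines.append(f"Query: {src}\nEnglish: {tgt}\n")
    lines.append(f"Query: {query}\nEnglish:")
    return "\n".join(lines)

print(build_prompt("فراخ مشوية"))
# The translated query would feed a hybrid retriever that fuses
# Elasticsearch (lexical) and vector-search scores into one ranked list.
```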

Incremental LLM Adoption Strategy in Email Processing API Platform

Nylas

Nylas, an email/calendar/contacts API platform provider, implemented a systematic three-month strategy to integrate LLMs into their production systems. They started with development workflow automation using multi-agent systems, enhanced their annotation processes with LLMs, and finally integrated LLMs as a fallback mechanism in their core email processing product. This measured approach resulted in 90% reduction in bug tickets, 20x cost savings in annotation, and successful deployment of their own LLM infrastructure when usage reached cost-effective thresholds.

Infrastructure Challenges and Solutions for Agentic AI Systems in Production

Meta / Google / Monte Carlo / Microsoft

A panel discussion featuring experts from Meta, Google, Monte Carlo, and Microsoft examining the fundamental infrastructure challenges that arise when deploying autonomous AI agents in production environments. The discussion covers how agentic workloads differ from traditional software systems, requiring new approaches to networking, load balancing, caching, security, and observability, while highlighting specific challenges like non-deterministic behavior, massive search spaces, and the need for comprehensive evaluation frameworks to ensure reliable and secure AI agent operations at scale.

Infrastructure for AI Agents: Panel Discussion on Production Challenges and Solutions

Various

This panel discussion brings together infrastructure experts from Groq, NVIDIA, Lambda, and AMD to discuss the unique challenges of deploying AI agents in production. The panelists explore how agentic AI differs from traditional AI workloads, requiring significantly higher token generation, lower latency, and more diverse infrastructure spanning edge to cloud. They discuss the evolution from training-focused to inference-focused infrastructure, emphasizing the need for efficiency at scale, specialized hardware optimization, and the importance of smaller distilled models over large monolithic models. The discussion highlights critical operational challenges including power delivery, thermal management, and the need for full-stack engineering approaches to debug and optimize agentic systems in production environments.

Infrastructure Noise in Agentic Coding Evaluations

Anthropic

Anthropic discovered that infrastructure configuration alone can produce differences in agentic coding benchmark scores that exceed the typical margins between top models on leaderboards. Through systematic experiments running Terminal-Bench 2.0 across six resource configurations on Google Kubernetes Engine, they found a 6 percentage point gap between the most- and least-resourced setups. The research revealed that while moderate resource headroom (up to 3x specifications) primarily improves infrastructure stability by preventing spurious failures, more generous allocations actively help agents solve problems they couldn't solve before. These findings challenge the notion that small leaderboard differences represent pure model capability measurements and led to recommendations for specifying both guaranteed allocations and hard kill thresholds, calibrating resource bands empirically, and treating resource configuration as a first-class experimental variable in LLMOps practices.

Integrating Gemini for Natural Language Analytics in IoT Fleet Management

Cox 2M

Cox 2M, facing challenges with a lean analytics team and slow insight generation (taking up to a week per request), partnered with Thoughtspot and Google Cloud to implement Gemini-powered natural language analytics. The solution reduced time to insights by 88% while enabling non-technical users to directly query complex IoT and fleet management data using natural language. The implementation includes automated insight generation, change analysis, and natural language processing capabilities.

Integrating Symbolic Reasoning with LLMs for AI-Native Telecom Infrastructure

Ericsson

Ericsson's System Comprehension Lab is exploring the integration of symbolic reasoning capabilities into telecom-oriented large language models to address critical limitations in current LLM architectures for telecommunications infrastructure management. The problem centers on LLMs' inability to provide deterministic, explainable reasoning required for telecom network optimization, security, and anomaly detection—domains where hallucinations, lack of logical consistency, and black-box behavior are unacceptable. The proposed solution involves hybrid neural-symbolic AI architectures that combine the pattern recognition strengths of transformer-based LLMs with rule-based reasoning engines, connected through techniques like symbolic chain-of-thought prompting, program-aided reasoning, and external solver integration. This approach aims to enable AI-native wireless systems for 6G infrastructure that can perform cross-layer optimization, real-time decision-making, and intent-driven network management while maintaining the explainability and logical rigor demanded by production telecom environments.

Internal AI Orchestration and Automation Across Multiple Departments

Zapier

Zapier, a workflow automation platform company, faced the challenge of managing repetitive operational tasks across multiple departments while maintaining productivity and focus on strategic work. The company implemented a comprehensive AI and automation strategy using their own platform combined with LLM capabilities (primarily ChatGPT/OpenAI) to automate workflows across customer success, sales, HR, technical support, content creation, engineering, accounting, and revenue operations. The results demonstrate significant time savings through automated meeting transcriptions and summaries, AI-powered sentiment analysis of surveys, automated content generation and translation, chatbot-based internal support systems, and intelligent ticket routing and categorization, enabling teams to focus on higher-value strategic activities while maintaining operational efficiency.

Iterative Prompt Optimization and Model Selection for Nutritional Calorie Estimation

Taralli

Taralli, a calorie tracking application, demonstrates systematic LLM improvement through rigorous evaluation and prompt optimization. The developer addressed the challenge of accurate nutritional estimation by creating a 107-example evaluation dataset, testing multiple prompt optimization techniques (vanilla, few-shot bootstrapping, MIPROv2, and GEPA) across several models (Gemini 2.5 Flash, Gemini 3 Flash, and DeepSeek v3.2). Through this methodical approach, they achieved a 15% accuracy improvement by switching from Gemini 2.5 Flash to Gemini 3 Flash while using a few-shot learning approach with 16 examples, reaching 60% accuracy within a 10% calorie prediction threshold. The system was deployed with fallback model configurations and extended to support fully offline on-device inference for iOS.
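
The fallback configuration mentioned above can be sketched as a simple preference-ordered chain that falls through on errors; the model names and client call are placeholders for whichever SDKs are actually in use.

```python
# Sketch: try models in preference order, falling through on failures.
MODEL_CHAIN = ["gemini-3-flash", "gemini-2.5-flash", "deepseek-v3.2"]

class ModelError(Exception):
    pass

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real SDK call; raise ModelError on timeouts/5xx errors.
    if model == "gemini-3-flash":
        raise ModelError("simulated outage")
    return f"[{model}] ~320 kcal"

def estimate_with_fallback(prompt: str) -> str:
    last_err = None
    for model in MODEL_CHAIN:
        try:
            return call_model(model, prompt)
        except ModelError as err:
            last_err = err                      # log and try the next model
    raise RuntimeError("all models failed") from last_err

print(estimate_with_fallback("grilled chicken wrap, medium"))
```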

Kubernetes as a Platform for LLM Operations: Practical Experiences and Trade-offs

Various

A panel discussion between experienced Kubernetes and ML practitioners exploring the challenges and opportunities of running LLMs on Kubernetes. The discussion covers key aspects including GPU management, cost optimization, training vs inference workloads, and architectural considerations. The panelists share insights from real-world implementations while highlighting both benefits (like workload orchestration and vendor agnosticism) and challenges (such as container sizes and startup times) of using Kubernetes for LLM operations.

Large Bank LLMOps Implementation: Lessons from Deutsche Bank and Others

Various

A discussion between banking technology leaders about their implementation of generative AI, focusing on practical applications, regulatory challenges, and strategic considerations. Deutsche Bank's CTO and other banking executives share their experiences in implementing gen AI across document processing, risk modeling, research analysis, and compliance use cases, while emphasizing the importance of responsible deployment and regulatory compliance.

Large Recommender Models: Adapting Gemini for YouTube Video Recommendations

Google / YouTube

YouTube developed Large Recommender Models (LRM) by adapting Google's Gemini LLM for video recommendations, addressing the challenge of serving personalized content to billions of users. The solution involved creating semantic IDs to tokenize videos, continuous pre-training to teach the model both English and YouTube-specific video language, and implementing generative retrieval systems. While the approach delivered significant improvements in recommendation quality, particularly for challenging cases like new users and fresh content, the team faced substantial serving cost challenges that required 95%+ cost reductions and offline inference strategies to make production deployment viable at YouTube's scale.
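
To make the semantic ID idea concrete, a toy residual-quantization sketch follows; it illustrates the general technique rather than YouTube's implementation, and the codebook shapes are assumptions:

```python
import numpy as np

# Illustrative only: a continuous video embedding is quantized into a short
# sequence of discrete tokens, one per codebook level, so an LLM can read and
# generate recommendations as token sequences.

def semantic_id(embedding: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Greedy residual quantization: one discrete token per codebook level."""
    residual = embedding.astype(float).copy()
    tokens = []
    for codebook in codebooks:                     # codebook shape: (num_codes, dim)
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)                         # nearest code at this level
        residual -= codebook[idx]                  # quantize what remains
    return tokens                                  # e.g. [137, 41, 902]
```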

Large-Scale AI Assistant Deployment with Safety-First Evaluation Approach

Discord

Discord implemented Clyde AI, a chatbot assistant deployed to over 200 million users, focusing heavily on safety, security, and evaluation practices. The team developed a comprehensive evaluation framework using simple, deterministic tests and metrics, implemented through their open-source tool promptfoo. They faced unique challenges in preventing harmful content and jailbreaks, leading to innovative solutions in red teaming and risk assessment, while maintaining a balance between casual user interaction and safety constraints.

Large-Scale AI Red Teaming Competition Platform for Production Model Security

HackAPrompt, LearnPrompting

Sander Schulhoff of HackAPrompt and LearnPrompting presents a comprehensive case study on developing the first AI red teaming competition platform and educational resources for prompt engineering in production environments. The case study covers the creation of LearnPrompting, an open-source educational platform that has trained millions of users worldwide on prompt engineering techniques, and HackAPrompt, which ran the first prompt injection competition, collecting 600,000 prompts used by all major AI companies to benchmark and improve their models. The work demonstrates practical challenges in securing LLMs in production, including the development of systematic prompt engineering methodologies, automated evaluation systems, and the discovery that traditional security defenses are ineffective against prompt injection attacks.

Large-Scale Deployment of On-Device and Server Foundation Models for Consumer AI Features

Apple

Apple developed and deployed a comprehensive foundation model infrastructure consisting of a 3-billion parameter on-device model and a mixture-of-experts server model to power Apple Intelligence features across iOS, iPadOS, and macOS. The implementation addresses the challenge of delivering generative AI capabilities at consumer scale while maintaining privacy, efficiency, and quality across 15 languages. The solution involved novel architectural innovations including shared KV caches, parallel track mixture-of-experts design, and extensive optimization techniques including quantization and compression, resulting in production deployment across millions of devices with measurable performance improvements in text and vision tasks.

Lessons Learned from Deploying 30+ GenAI Agents in Production

Quic

Quic shares their experience deploying over 30 AI agents across various industries, focusing on customer experience and e-commerce applications. They developed a comprehensive approach to LLMOps that includes careful planning, persona development, RAG implementation, API integration, and robust testing and monitoring systems. The solution achieved 60% resolution of tier-one support issues with higher quality than human agents, while maintaining human involvement for complex cases.

Lessons Learned from Production AI Agent Deployments

Google / Vertex AI

A comprehensive overview of lessons learned from deploying AI agents in production at Google's Vertex AI division. The presentation covers three key areas: meta-prompting techniques for optimizing agent prompts, implementing multi-layered safety and guard rails, and the critical importance of evaluation frameworks. These insights come from real-world experience delivering hundreds of models into production with various developers, customers, and partners.

LLM Testing Framework Using LLMs as Quality Assurance Agents

Various

Alaska Airlines and Bitra developed QARL (Quality Assurance Response Liaison), an innovative testing framework that uses LLMs to evaluate other LLMs in production. The system conducts automated adversarial testing of customer-facing chatbots by simulating various user personas and conversation scenarios. This approach helps identify potential risks and unwanted behaviors before deployment, while providing scalable testing capabilities through containerized architecture on Google Cloud Platform.

LLM-Driven Developer Experience and Code Migrations at Scale

Uber

Uber's Developer Platform team explored three major initiatives using LLMs in production: a custom IDE coding assistant (which was later abandoned in favor of GitHub Copilot), an AI-powered test generation system called AutoCover, and an automated Java-to-Kotlin code migration system. The team combined deterministic approaches with LLMs to achieve significant developer productivity gains while maintaining code quality and safety. They found that while pure LLM approaches could be risky, hybrid approaches combining traditional software engineering practices with AI showed promising results.

LLM-Powered Customer Service Agent Copilot for E-commerce Support

Wayfair

Wayfair developed Wilma, an LLM-based copilot system to assist customer service agents in responding to customer inquiries about product issues. The system uses models like Gemini and GPT to draft contextual messages that agents can review and edit before sending. Through an iterative evolution from a single monolithic prompt to over 40 specialized prompt templates and multiple coordinated LLM calls, Wilma helps agents respond 12% faster while improving policy adherence by 2-5% depending on issue type. The system pulls real-time customer, order, and product data from Wayfair's systems to generate appropriate responses, with particular sophistication in handling complex resolution negotiation scenarios through a multi-LLM routing and analysis framework.
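
The routing pattern can be sketched as a classify-then-draft pair of LLM calls; the template names, prompt wording, and llm callable below are invented for illustration:

```python
# Invented templates standing in for Wilma's 40+ specialized prompts; live
# customer/order/product data would populate the placeholders.
PROMPT_TEMPLATES = {
    "damaged_item": "The customer reports damage to {product} on order {order}. Draft a reply...",
    "late_delivery": "Order {order} is delayed. Draft an empathetic status update...",
    "return_request": "The customer wants to return {product}. Explain the applicable policy...",
}

def draft_reply(llm, inquiry: str, context: dict) -> str:
    route = llm(
        f"Classify this inquiry as one of {sorted(PROMPT_TEMPLATES)}.\n"
        f"Inquiry: {inquiry}\nAnswer with the label only."
    ).strip()
    template = PROMPT_TEMPLATES.get(route, PROMPT_TEMPLATES["return_request"])
    return llm(template.format(**context))   # the human agent reviews and edits before sending
```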

LLM-Powered GraphQL Mock Data Generation for Developer Productivity

Airbnb

Airbnb developed an innovative solution to address the persistent challenge of creating and maintaining realistic GraphQL mock data for testing and prototyping. Engineers traditionally spent significant time manually writing and updating mock data, which would often drift out of sync with evolving queries. Airbnb introduced the @generateMock directive, which combines GraphQL schema validation, product context (including design mockups), and LLMs (specifically Gemini 2.5 Pro) to automatically generate type-safe, realistic mock data. The solution integrates seamlessly into their existing code generation workflow (Niobe CLI), keeping engineers in their local development loops. A companion @respondWithMock directive enables client engineers to prototype features before server implementations are complete. Since deployment, Airbnb engineers have generated and merged over 700 mocks across iOS, Android, and Web platforms, significantly reducing manual effort and accelerating development cycles.

LLM-Powered Style Compatibility Labeling Pipeline for E-Commerce Catalog Curation

Wayfair

Wayfair addressed the challenge of identifying stylistic compatibility among millions of products in their catalog by building an LLM-powered labeling pipeline on Google Cloud. Traditional recommendation systems relied on popularity signals and manual annotation, which was accurate but slow and costly. By leveraging Gemini 2.5 Pro with carefully engineered prompts that incorporate interior design principles and few-shot examples, they automated the binary classification task of determining whether product pairs are stylistically compatible. This approach improved annotation accuracy by 11% compared to initial generic prompts and enables scalable, consistent style-aware curation that will be used to evaluate and ultimately improve recommendation algorithms, with plans for future integration into production search and personalization systems.
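
A minimal sketch of the pairwise labeling call, assuming a generic text-completion interface; the few-shot examples and YES/NO parsing are illustrative, not Wayfair's engineered prompt:

```python
# Illustrative few-shot framing for the binary compatibility task.
FEW_SHOT = """You are an interior design expert. Decide whether two products share
a design language (era, silhouette, materials, finish family).

Example: mid-century walnut coffee table + ornate Victorian settee -> NO
Example: matte-black sconce + industrial pipe shelving -> YES
"""

def compatible(llm, product_a: str, product_b: str) -> bool:
    prompt = f"{FEW_SHOT}\nProduct A: {product_a}\nProduct B: {product_b}\nAnswer YES or NO:"
    return llm(prompt).strip().upper().startswith("YES")
```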

Managing Memory and Scaling Issues in Production AI Agent Systems

Gradient Labs

Gradient Labs experienced a series of interconnected production incidents involving their AI agent deployed on Google Cloud Run, starting with memory usage alerts that initially appeared to be memory leaks. The team discovered the root cause was Temporal workflow cache sizing issues causing container crashes, which they resolved by tuning cache parameters. However, this fix inadvertently caused auto-scaling problems that throttled their system's ability to execute activities, leading to increased latency. The incidents highlight the complex interdependencies in production AI systems and the need for careful optimization across all infrastructure layers.

MCP Protocol Development and Agent AI Foundation Launch

Anthropic / OpenAI / Goose

This podcast transcript covers the one-year journey of the Model Context Protocol (MCP) from its initial launch by Anthropic through to its donation to the newly formed Agent AI Foundation. The discussion explores how MCP evolved from a local-only protocol to support remote servers, authentication, and long-running tasks, addressing the fundamental challenge of connecting AI agents to external tools and data sources in production environments. The case study highlights extensive production usage of MCP both within Anthropic's internal systems and across major technology companies including OpenAI, Microsoft, and Google, demonstrating widespread adoption with millions of requests at scale. The formation of the Agent AI Foundation with founding members including Anthropic, OpenAI, and Block represents a significant industry collaboration to standardize agentic system protocols and ensure neutral governance of critical AI infrastructure.

Mercury: Agentic AI Platform for LLM-Powered Recommendation Systems

eBay

eBay developed Mercury, an internal agentic framework designed to scale LLM-powered recommendation experiences across its massive marketplace of over two billion active listings. The platform addresses the challenge of transforming vast amounts of unstructured data into personalized product recommendations by integrating Retrieval-Augmented Generation (RAG) with a custom Listing Matching Engine that bridges the gap between LLM-generated text outputs and eBay's dynamic inventory. Mercury enables rapid development through reusable, plug-and-play components following object-oriented design principles, while its near-real-time distributed queue-based execution platform handles cost and latency requirements at industrial scale. The system combines multiple retrieval mechanisms, semantic search using embedding models, anomaly detection, and personalized ranking to deliver contextually relevant shopping experiences to hundreds of millions of users.

Mission-Critical LLM Inference Platform Architecture

Baseten

Baseten has built a production-grade LLM inference platform focusing on three key pillars: model-level performance optimization, horizontal scaling across regions and clouds, and enabling complex multi-model workflows. The platform supports various frameworks including SGLang and TensorRT-LLM, and has been successfully deployed by foundation model companies and enterprises requiring strict latency, compliance, and reliability requirements. A key differentiator is their ability to handle mission-critical inference workloads with sub-400ms latency for complex use cases like AI phone calls.

MLOps Evolution and LLM Integration at a Major Bank

Barclays

Discussion of MLOps practices and the evolution towards LLM integration at Barclays, focusing on the transition from traditional ML to GenAI workflows while maintaining production stability. The case study highlights the importance of balancing innovation with regulatory requirements in financial services, emphasizing ROI-driven development and the creation of reusable infrastructure components.

MLOps Platform for Airline Operations with LLM Integration

LATAM Airlines

LATAM Airlines developed Cosmos, a vendor-agnostic MLOps framework that enables both traditional ML and LLM deployments across their business operations. The framework reduced model deployment time from 3-4 months to less than a week, supporting use cases from fuel efficiency optimization to personalized travel recommendations. The platform demonstrates how a traditional airline can transform into a data-driven organization through effective MLOps practices and careful integration of AI technologies.

Model Context Protocol (MCP): Building Universal Connectivity for LLMs in Production

Anthropic

Anthropic developed and open-sourced the Model Context Protocol (MCP) to address the challenge of providing external context and tool connectivity to large language models in production environments. The protocol emerged from recognizing that teams were repeatedly reimplementing the same capabilities across different contexts (coding editors, web interfaces, and various services) where Claude needed to interact with external systems. By creating a universal standard protocol and open-sourcing it, Anthropic enabled developers to build integrations once and deploy them everywhere, while fostering an ecosystem that became what they describe as the fastest-growing open source protocol in history. The protocol has matured from requiring local server deployments to supporting remote hosted servers with a central registry, reducing friction for both developers and end users while enabling sophisticated production use cases across enterprise integrations and personal automation.

Modernizing DevOps with Generative AI: Challenges and Best Practices in Production

Various (Bundesliga, Harness, Trice)

A panel of experts from various organizations discusses the current state and challenges of integrating generative AI into DevOps workflows and production environments. The discussion covers how companies are balancing productivity gains with security concerns, the importance of having proper testing and evaluation frameworks, and strategies for successful adoption of AI tools in production DevOps processes while maintaining code quality and security.

Multi-Agent Architecture for Automated Advertising Media Planning

Spotify

Spotify faced a structural problem where multiple advertising buying channels (Direct, Self-Serve, Programmatic) relied on consolidated backend services but implemented fragmented, channel-specific workflow logic, creating duplicated decision-making and technical debt. To address this, they built "Ads AI," a multi-agent system using Google's Agent Development Kit (ADK) and Vertex AI that transforms media planning from a manual 15-30 minute process requiring 20+ form fields into a conversational interface that generates optimized, data-driven media plans in 5-10 seconds using 1-3 natural language messages. The system decomposes media planning into specialized agents (RouterAgent, GoalResolverAgent, AudienceResolverAgent, BudgetAgent, ScheduleAgent, and MediaPlannerAgent) that execute in parallel, leverage historical campaign performance data via function calling tools, and produce recommendations based on cost optimization, delivery rates, and budget matching heuristics.
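
Setting the ADK specifics aside, the parallel fan-out can be sketched with plain asyncio; the resolver stubs and the async llm callable below are hypothetical stand-ins for the specialized agents named above:

```python
import asyncio

async def resolve_goal(brief: str) -> str:
    return "maximize conversions"          # stand-in for GoalResolverAgent

async def resolve_audience(brief: str) -> str:
    return "US podcast listeners, 25-34"   # stand-in for AudienceResolverAgent

async def resolve_budget(brief: str) -> str:
    return "$50k over 4 weeks"             # stand-in for BudgetAgent

async def build_media_plan(llm, brief: str) -> str:
    # resolvers run concurrently, then a planner call combines their outputs
    goal, audience, budget = await asyncio.gather(
        resolve_goal(brief), resolve_audience(brief), resolve_budget(brief)
    )
    return await llm(
        f"Draft a media plan.\nGoal: {goal}\nAudience: {audience}\nBudget: {budget}"
    )
```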

Multi-Agent DBT Development Workflow for Data Engineering Consulting

Mammoth Growth

Mammoth Growth, a boutique data consultancy specializing in marketing and customer data, developed a multi-agent AI system to automate DBT development workflows in response to data teams struggling to deliver analytics at the speed of business. The solution employs a team of specialized AI agents (orchestrator, analyst, architect, and analytics engineer) that leverage the DBT Model Context Protocol (MCP) to autonomously write, document, and test production-grade DBT code from detailed specifications. The system enabled the delivery of a complete enterprise-grade data lineage with 15 data models and two gold-layer models in just 3 weeks for a pilot client, compared to an estimated 10 weeks using traditional manual development approaches, while maintaining code quality standards through human-led requirements gathering and mandatory code review before production deployment.

Multi-Agent LLM System for Business Process Automation

Cognizant

Cognizant developed Neuro AI, a multi-agent LLM-based system that enables business users to create and deploy AI-powered decision-making workflows without requiring deep technical expertise. The platform allows agents to communicate with each other to handle complex business processes, from intranet search to process automation, with the ability to deploy either in the cloud or on-premises. The system includes features for opportunity identification, use case scoping, synthetic data generation, and automated workflow creation, all while maintaining explainability and human oversight.

Multi-Agent LLM Systems: Implementation Patterns and Production Case Studies

Nimble Gravity, Hiflylabs

A research study conducted by Nimble Gravity and Hiflylabs examining GenAI adoption patterns across industries, revealing that approximately 28-30% of GenAI projects successfully transition from assessment to production. The study explores various multi-agent LLM architectures and their implementation in production, including orchestrator-based, agent-to-agent, and shared message pool patterns, demonstrating practical applications like automated customer service systems that achieved significant cost savings.

Multi-Company Panel Discussion on Enterprise AI and Agentic AI Deployment Challenges

Glean / Deloitte / Docusign

This panel discussion at AWS re:Invent brings together practitioners from Glean, Deloitte, and DocuSign to discuss the practical realities of deploying AI and agentic AI systems in enterprise environments. The panelists explore challenges around organizational complexity, data silos, governance, agent creation and sharing, value measurement, and the tension between autonomous capabilities and human oversight. Key themes include the need for cross-functional collaboration, the importance of security integration from day one, the difficulty of measuring AI-driven productivity gains, and the evolution from individual AI experimentation to governed enterprise-wide agent deployment. The discussion emphasizes that successful AI transformation requires reimagining workflows rather than simply bolting AI onto legacy systems, and that business value should drive technical decisions rather than focusing solely on which LLM model to use.

Multi-Company Panel Discussion on Production LLM Frameworks and Scaling Challenges

Various (Thinking Machines, Yutori, EvolutionaryScale, Perplexity, Axiom)

This panel discussion features experts from multiple AI companies discussing the current state and future of agentic frameworks, reinforcement learning applications, and production LLM deployment challenges. The panelists from Thinking Machines, Perplexity, EvolutionaryScale, and Axiom share insights on framework proliferation, the role of RL in post-training, domain-specific applications in mathematics and biology, and infrastructure bottlenecks when scaling models to hundreds of GPUs, highlighting the gap between research capabilities and production deployment tools.

Multi-Industry AI Deployment Strategies with Diverse Hardware and Sovereign AI Considerations

AMD / Somite AI / Upstage / Rambler AI

This panel discussion at AWS re:Invent features three companies deploying AI models in production across different industries: Somite AI using machine learning for computational biology and cellular control, Upstage developing sovereign AI with proprietary LLMs and OCR for document extraction in enterprises, and Rambler AI building vision language models for industrial task verification. All three leverage AMD GPU infrastructure (MI300 series) for training and inference, emphasizing the importance of hardware choice, open ecosystems, seamless deployment, and cost-effective scaling. The discussion highlights how smaller, domain-specific models can achieve enterprise ROI where massive frontier models failed, and explores emerging areas like physical AI, world models, and data collection for robotics.

Multi-Label Red Flag Detection System for Fraud Prevention

Feedzai

Feedzai developed ScamAlert, a generative AI-based system that moves beyond traditional binary scam classification to identify specific red flags in suspected fraud attempts. The system addresses the limitations of binary classifiers that only output risk scores without explanation by using multimodal LLMs to analyze screenshots of suspected scams (emails, text messages, listings) and identify observable warning signs like suspicious links, urgency tactics, or unusual communication channels. The team created a comprehensive benchmarking framework to evaluate multiple commercial multimodal models across four dimensions: red flag detection accuracy (precision/recall/F1), instruction adherence, cost, and latency. Their results showed significant performance variations across models, with GPT-5, Gemini 3 Pro, and Gemini 2.5 Pro leading in accuracy, though with notable tradeoffs in cost and latency, while also revealing instruction-following issues in some models that generated hallucinated red flags not in the predefined taxonomy.
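
The taxonomy-constrained scoring can be sketched as set arithmetic over predicted and gold red flags; the taxonomy entries below are invented examples, not Feedzai's actual labels:

```python
# Hypothetical taxonomy; anything a model predicts outside it counts as a
# hallucinated red flag, separate from ordinary precision/recall errors.
TAXONOMY = {"suspicious_link", "urgency_tactic", "unusual_channel", "payment_pressure"}

def score(predicted: set[str], gold: set[str]) -> dict:
    hallucinated = predicted - TAXONOMY        # flags the model invented
    valid = predicted & TAXONOMY
    tp = len(valid & gold)
    precision = tp / len(valid) if valid else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "hallucinated": len(hallucinated)}
```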

Multi-Tenant MCP Server Authentication with Redis Session Management

BrainGrid

BrainGrid faced the challenge of transforming their Model Context Protocol (MCP) server from a local development tool into a production-ready, multi-tenant service that could be deployed to customers. The core problem was that serverless platforms like Cloud Run and Vercel don't maintain session state, causing users to re-authenticate repeatedly as instances scaled to zero or requests hit different instances. BrainGrid solved this by implementing a Redis-based session store with AES-256-GCM encryption, OAuth integration via WorkOS, and a fast-path/slow-path authentication pattern that caches validated JWT sessions. The solution reduced authentication overhead from 50-100ms per request to near-instantaneous for cached sessions, eliminated re-authentication fatigue, and enabled the MCP server to scale from single-user to multi-tenant deployment while maintaining security and performance.
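
A minimal sketch of the fast-path/slow-path pattern, assuming redis-py and the cryptography package; the key handling, Redis key scheme, and TTL are illustrative assumptions rather than BrainGrid's implementation:

```python
import json
import os

import redis
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

r = redis.Redis()
key = bytes.fromhex(os.environ["SESSION_KEY_HEX"])  # 32 bytes -> AES-256

def cache_session(session_id: str, claims: dict, ttl: int = 3600) -> None:
    """Encrypt validated JWT claims and cache them so later requests skip OAuth."""
    nonce = os.urandom(12)                                   # standard GCM nonce size
    blob = nonce + AESGCM(key).encrypt(nonce, json.dumps(claims).encode(), None)
    r.setex(f"mcp:session:{session_id}", ttl, blob)

def load_session(session_id: str) -> dict | None:
    blob = r.get(f"mcp:session:{session_id}")
    if blob is None:
        return None                                          # slow path: full OAuth/JWT validation
    return json.loads(AESGCM(key).decrypt(blob[:12], blob[12:], None))  # fast path
```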

Multilingual Content Navigation and Localization System

Intercom

YouTube, a Google company, implements a comprehensive multilingual navigation and localization system for its global platform. The source text appears to be in Dutch, demonstrating the platform's localization capabilities, though insufficient details are provided about the specific LLMOps implementation.

Multilingual Text Editing via Instruction Tuning

Grammarly

Grammarly's Strategic Research team developed mEdIT, a multilingual extension of their CoEdIT text editing model, to support intelligent writing assistance across seven languages and three editing tasks (grammatical error correction, text simplification, and paraphrasing). The problem addressed was that foundational LLMs produce low-quality outputs for text editing tasks, and prior specialized models only supported either multiple tasks in one language or single tasks across multiple languages. By fine-tuning multilingual LLMs (including mT5, mT0, BLOOMZ, PolyLM, and Bactrian-X) on over 200,000 carefully curated instruction-output pairs across Arabic, Chinese, English, German, Japanese, Korean, and Spanish, mEdIT achieved strong performance across tasks and languages, even when instructions were given in a different language than the text being edited. The models demonstrated generalization to unseen languages, with causal language models performing best, and received high ratings from human evaluators, though the work has not yet been integrated into Grammarly's production systems.

Multimodal AI Vector Search for Advanced Video Understanding

Twelve Labs

Twelve Labs developed an integration with Databricks Mosaic AI to enable advanced video understanding capabilities through multimodal embeddings. The solution addresses challenges in processing large-scale video datasets and providing accurate multimodal content representation. By combining Twelve Labs' Embed API for generating contextual vector representations with Databricks Mosaic AI Vector Search's scalable infrastructure, developers can implement sophisticated video search, recommendation, and analysis systems with reduced development time and resource needs.

Native Image Generation with Multimodal Context in Gemini 2.5 Flash

Google DeepMind

Google DeepMind released an updated native image generation capability in Gemini 2.5 Flash that represents a significant quality leap over previous versions. The model addresses key production challenges including consistent character rendering across multiple angles, pixel-perfect editing that preserves scene context, and improved text rendering within images. Through interleaved generation, the model can maintain conversation context across multiple editing turns, enabling iterative creative workflows. The team tackled evaluation challenges by combining human preference data with specific technical metrics like text rendering quality, while incorporating real user feedback from social media to create comprehensive benchmarks that drive model improvements.

Natural Language Analytics with Snowflake Cortex for Self-Service BI

Gitlab

GitLab implemented conversational analytics using Snowflake Cortex to enable non-technical business users to query structured data using natural language, eliminating the traditional dependency on data analysts and reducing analytics backlog. The solution evolved from a basic proof-of-concept with 60% accuracy to a production system achieving 85-95% accuracy for simple queries and 75% for complex queries, utilizing semantic models, prompt engineering, verified query feedback loops, and role-based access controls. The implementation reduced analytics requests by approximately 50% for some teams, decreased time-to-insight from weeks to seconds, and democratized data access while maintaining enterprise-grade security through Snowflake's native governance features.
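
The verified-query feedback loop can be sketched as analyst-approved question-to-SQL pairs injected as few-shot examples into later text-to-SQL prompts; the storage and prompt shape below are illustrative assumptions:

```python
verified: list[tuple[str, str]] = []        # (question, analyst-approved SQL)

def record_verified(question: str, sql: str) -> None:
    """Called when a human confirms a generated query answered the question correctly."""
    verified.append((question, sql))

def text_to_sql(llm, question: str, semantic_model: str) -> str:
    """Ground the model in the semantic layer plus the most recent verified pairs."""
    examples = "\n\n".join(f"Q: {q}\nSQL: {s}" for q, s in verified[-5:])
    return llm(
        f"Semantic model:\n{semantic_model}\n\n"
        f"Verified examples:\n{examples}\n\n"
        f"Q: {question}\nSQL:"
    )
```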

Natural Language Interface to Business Intelligence Using RAG

Volvo

Volvo implemented a Retrieval Augmented Generation (RAG) system that allows non-technical users to query business intelligence data through a Slack interface using natural language. The system translates natural language questions into SQL queries for BigQuery, executes them, and returns the results, effectively automating what was previously manual work done by data analysts. The system leverages DBT metadata and schema information to provide accurate responses while maintaining control over data access.
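
A minimal sketch of the flow, assuming a generic llm callable and the google-cloud-bigquery client; the schema-context injection stands in for the DBT metadata described above:

```python
from google.cloud import bigquery

client = bigquery.Client()

def answer(llm, question: str, schema_context: str) -> list[dict]:
    """Natural language -> SQL -> executed BigQuery result rows."""
    sql = llm(
        f"Given these BigQuery tables and columns:\n{schema_context}\n\n"
        f"Write a single SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    return [dict(row) for row in client.query(sql).result()]
```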

Network Operations Transformation with GenAI and AIOps

Vodafone

Vodafone implemented a comprehensive AI and GenAI strategy to transform their network operations, focusing on improving customer experience through better network management. They migrated from legacy OSS systems to a cloud-based infrastructure on Google Cloud Platform, integrating over 2 petabytes of network data with commercial and IT data. The initiative includes AI-powered network investment planning, automated incident management, and device analytics, resulting in significant operational efficiency improvements and a planned 50% reduction in OSS tools.

On-Device Grammar Correction with Sequence-to-Sequence Models

Google

Google Research developed an on-device grammar correction system for Gboard on Pixel 6 that detects and suggests corrections for grammatical errors as users type. The solution addresses the challenge of implementing neural grammar correction within the constraints of mobile devices (limited memory, computational power, and latency requirements) while preserving user privacy by keeping all processing local. The team built a 20MB hybrid Transformer-LSTM model using hard distillation from a cloud-based system, achieving inference on 60 characters in under 22ms on the Pixel 6 CPU, enabling real-time grammar correction for both complete sentences and partial sentence prefixes across English text in nearly any app using Gboard.
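
Hard distillation, as described here, amounts to using the cloud teacher's corrected outputs as training labels for the small on-device student; the teacher and student interfaces below are hypothetical placeholders:

```python
def build_distillation_set(teacher, sentences: list[str]) -> list[tuple[str, str]]:
    """Teacher outputs become hard labels: (original sentence, corrected sentence)."""
    return [(s, teacher.correct(s)) for s in sentences]

def train_student(student, pairs: list[tuple[str, str]], epochs: int = 3) -> None:
    """Ordinary seq2seq training of the compact model against the teacher's labels."""
    for _ in range(epochs):
        for source, target in pairs:
            student.step(source, target)   # one cross-entropy update per pair
```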

Open-Source Protein Structure Prediction and Generative Design Platform for Drug Discovery

Boltz

Boltz, founded by Gabriele Corso and Jeremy Wohlwend, developed an open-source suite of AI models (Boltz-1, Boltz-2, and BoltzGen) for structural biology and protein design, democratizing access to capabilities previously held by proprietary systems like AlphaFold 3. The company addresses the challenge of predicting complex molecular interactions (protein-ligand, protein-protein) and designing novel therapeutic proteins by combining generative diffusion models with specialized equivariant architectures. Their approach achieved validated nanomolar binders for two-thirds of nine previously unseen protein targets, demonstrating genuine generalization beyond training data. The newly launched Boltz Lab platform provides a production-ready infrastructure with optimized GPU kernels running 10x faster than open-source versions, offering agents for protein and small molecule design with collaborative interfaces for medicinal chemists and researchers.

Optimizing Engineering Design with Conditional GANs

Rolls-Royce

Rolls-Royce collaborated with Databricks to enhance their design space exploration capabilities using conditional Generative Adversarial Networks (cGANs). The project aimed to leverage legacy simulation data to identify and assess innovative design concepts without requiring traditional geometry modeling and simulation processes. By implementing cGANs on the Databricks platform, they successfully developed a system that could handle multi-objective constraints and optimize design processes while maintaining compliance with aerospace industry requirements.

Optimizing LLM Server Startup Times for Preemptable GPU Infrastructure

Replit

Replit faced challenges with running LLM inference on expensive GPU infrastructure and implemented a solution using preemptable cloud GPUs to reduce costs by two-thirds. The key challenge was reducing server startup time from 18 minutes to under 2 minutes to handle preemption events, which they achieved through container optimization, GKE image streaming, and improved model loading processes.

Optimizing RAG-based Search Results for Production: A Journey from POC to Production

Statista

Statista, a global data platform, developed and optimized a RAG-based AI search system to enhance their platform's search capabilities. Working with Urial Labs and Talent Formation, they transformed a basic prototype into a production-ready system that improved search quality by 140%, reduced costs by 65%, and decreased latency by 10%. The resulting Research AI product has seen growing adoption among paying customers and demonstrates superior performance compared to general-purpose LLMs for domain-specific queries.

Optimizing Security Incident Response with LLMs at Google

Google

Google implemented LLMs to streamline their security incident response workflow, particularly focusing on incident summarization and executive communications. They used structured prompts and careful input processing to generate high-quality summaries while ensuring data privacy and security. The implementation resulted in a 51% reduction in time spent on incident summaries and 53% reduction in executive communication drafting time, while maintaining or improving quality compared to human-written content.

Panel Discussion on LLMOps Challenges: Model Selection, Ethics, and Production Deployment

Google / Databricks

A panel discussion featuring leaders from various AI companies discussing the challenges and solutions in deploying LLMs in production. Key topics included model selection criteria, cost optimization, ethical considerations, and architectural decisions. The discussion highlighted practical experiences from companies like Interact.ai's healthcare deployment, Inflection AI's emotionally intelligent models, and insights from Google and Databricks on responsible AI deployment and tooling.

Panel Discussion: Scaling Generative AI in Enterprise - Challenges and Best Practices

Various

A panel discussion featuring leaders from Google Cloud AI, Symbol AI, Chain ML, and Deloitte discussing the adoption, scaling, and implementation challenges of generative AI across different industries. The panel explores key considerations around model selection, evaluation frameworks, infrastructure requirements, and organizational readiness while highlighting practical approaches to successful GenAI deployment in production.

Parallel Asynchronous AI Coding Agents for Development Workflows

Google

Google Labs introduced Jules, an asynchronous coding agent designed to execute development tasks in parallel in the background while developers focus on higher-value work. The product addresses the challenge of serial development workflows by enabling developers to spin up multiple cloud-based agents simultaneously to handle tasks like SDK updates, testing, accessibility audits, and feature development. Launched two weeks prior to the presentation, Jules had already generated 40,000 public commits. The demonstration showcased how a developer could parallelize work on a conference schedule website by simultaneously running multiple test framework implementations, adding features like calendar integration and AI summaries, while conducting accessibility and security audits—all managed through a VM-based cloud infrastructure powered by Gemini 2.5 Pro.

Pivoting from GPU Infrastructure to Building an AI-Powered Development Environment

Windsurf

Windsurf began as a GPU virtualization company but pivoted in 2022 when they recognized the transformative potential of large language models. They developed an AI-powered development environment that evolved from a VS Code extension to a full-fledged IDE, incorporating advanced code understanding and generation capabilities. The product now serves hundreds of thousands of daily active users, including major enterprises, and has achieved significant success in automating software development tasks while maintaining high precision through sophisticated evaluation systems.

Post-Training and Production LLM Systems at Scale

OpenAI

This case study explores OpenAI's approach to post-training and deploying large language models in production environments, featuring insights from a post-training researcher working on reasoning models. The discussion covers the operational complexities of reinforcement learning from human feedback at scale, the evolution from non-thinking to thinking models, and production challenges including model routing, context window optimization, token efficiency improvements, and interruptability features. Key developments include the shopping model release, improvements from GPT-4.1 to GPT-5.1, and the operational realities of managing complex RL training runs with multiple grading setups and infrastructure components that require constant monitoring and debugging.

Practical Lessons Learned from Building and Deploying GenAI Applications

Bolbeck

A comprehensive overview of lessons learned from building GenAI applications over 1.5 years, focusing on the complexities and challenges of deploying LLMs in production. The presentation covers key aspects of LLMOps including model selection, hosting options, ensuring response accuracy, cost considerations, and the importance of observability in AI applications. Special attention is given to the emerging role of AI agents and the critical balance between model capability and operational costs.

Production AI Agents for Accounting Automation: Engineering Process Daemons at Scale

Digits

Digits, an AI-native accounting platform, shares their experience running AI agents in production for over 2 years, addressing real-world challenges in deploying LLM-based systems. The team reframes "agents" as "process daemons" to set appropriate expectations and details their implementation across three use cases: vendor data enrichment, client onboarding, and complex query handling. Their solution emphasizes building lightweight custom infrastructure over dependency-heavy frameworks, reusing existing APIs as agent tools, implementing comprehensive observability with OpenTelemetry, and establishing robust guardrails. The approach has enabled reliable automation while maintaining transparency, security, and performance through careful engineering rather than relying on framework abstractions.

Production AI Systems for News Personalization and Journalistic Workflows

Bonnier News

Bonnier News, a major Swedish media publisher with over 200 brands including Expressen and local newspapers, has deployed AI and machine learning systems in production to solve content personalization and newsroom automation challenges. The company's data science team, led by product manager Hans Yell (PhD in computational linguistics) and head of architecture Magnus Engster, has built white-label personalization engines using embedding-based recommendation systems that outperform manual content curation while scaling across multiple brands. They leverage vector similarity and user reading patterns rather than traditional metadata, achieving significant engagement lifts. Additionally, they're developing LLM-powered tools for journalists including headline generation, news aggregation summaries, and trigger questions for articles. Through a WASP-funded PhD collaboration, they're working on domain-adapted Swedish language models via continued pre-training of Llama models with Bonnier's extensive text corpus, focusing on capturing brand tone and improving journalistic workflows while maintaining data sovereignty.

Production LLM Systems at Scale - Lessons from Financial Services, Legal Tech, and ML Infrastructure

Nubank, Harvey AI, Galileo and Convirza

A panel discussion featuring leaders from Nubank, Harvey AI, Galileo, and Convirza discussing their experiences implementing LLMs in production. The discussion covered key challenges and solutions around model evaluation, cost optimization, latency requirements, and the transition from large proprietary models to smaller fine-tuned models. Participants shared insights on modularizing LLM applications, implementing human feedback loops, and balancing the tradeoffs between model size, cost, and performance in production environments.

Production Monitoring and Issue Discovery for AI Agents

Raindrop

Raindrop's CTO Ben presents a comprehensive framework for building reliable AI agents in production, addressing the challenge that traditional offline evaluations cannot capture the full complexity of real-world user behavior. The core problem is that AI agents fail in subtle ways without concrete errors, making issues difficult to detect and fix. Raindrop's solution centers on a "discover, track, and fix" loop that combines explicit signals like thumbs up/down with implicit signals detected semantically in conversations, such as user frustration, task failures, and agent forgetfulness. By clustering these signals with user intents and tracking them over time, teams can identify the most impactful issues and systematically improve their agents. The approach emphasizes experimentation and production monitoring over purely offline testing, drawing parallels to how traditional software engineering shifted from extensive QA to tools like Sentry for error monitoring.

Production RAG Stack Development Through 37 Iterations for Financial Services

jonfernandes

Independent AI engineer Jonathan Fernandez shares his experience developing a production-ready RAG (Retrieval Augmented Generation) stack through 37 failed iterations, focusing on building solutions for financial institutions. The case study demonstrates the evolution from a naive RAG implementation to a sophisticated system incorporating query processing, reranking, and monitoring components. The final architecture uses LlamaIndex for orchestration, Qdrant for vector storage, open-source embedding models, and Docker containerization for on-premises deployment, achieving significantly improved response quality for document-based question answering.

Production Vector Search and Retrieval System Optimization at Scale

Superlinked

Superlinked, a company focused on vector search infrastructure, shares production insights from deploying information retrieval systems for e-commerce and enterprise knowledge management with indexes up to 2 terabytes. The presentation addresses challenges in relevance, latency, and cost optimization when deploying vector search systems at scale. Key solutions include avoiding vector pooling/averaging, implementing late interaction models, fine-tuning embeddings for domain-specific needs, combining sparse and dense representations, leveraging graph embeddings, and using template-based query generation instead of unconstrained text-to-SQL. Results demonstrate 5%+ precision improvements through targeted fine-tuning, significant latency reductions through proper database selection and query optimization, and improved relevance through multi-encoder architectures that combine text, graph, and metadata signals.
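
One of these ideas, blending sparse and dense relevance instead of relying on a single pooled vector, can be sketched in a few lines; the alpha weight and pre-normalized score scales are illustrative assumptions:

```python
def hybrid_score(
    dense_scores: dict[str, float],
    sparse_scores: dict[str, float],
    alpha: float = 0.7,
) -> list[tuple[str, float]]:
    """Blend embedding (dense) and keyword (sparse) relevance per document."""
    ids = set(dense_scores) | set(sparse_scores)
    blended = {
        doc: alpha * dense_scores.get(doc, 0.0) + (1 - alpha) * sparse_scores.get(doc, 0.0)
        for doc in ids
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)
```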

Production-Ready LLM Integration Using Retrieval-Augmented Generation and Custom ReAct Implementation

Buzzfeed

BuzzFeed Tech tackled the challenges of integrating LLMs into production by addressing dataset recency limitations and context window constraints. They evolved from using vanilla ChatGPT with crafted prompts to implementing a sophisticated retrieval-augmented generation system. After exploring self-hosted models and LangChain, they developed a custom "native ReAct" implementation combined with an enhanced Nearest Neighbor Search Architecture using Pinecone, resulting in a more controlled, cost-efficient, and production-ready LLM system.
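
A compact, generic ReAct loop in the spirit of that approach (not BuzzFeed's code) looks roughly like this; the Action/Observation formatting and the tool-call syntax are invented conventions:

```python
def react(llm, tools: dict, question: str, max_steps: int = 5) -> str:
    """Interleave model 'Thoughts' with tool 'Observations' until a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        if "Action:" in step:
            # invented convention: "Action: tool_name | argument"
            name, _, arg = step.split("Action:")[1].partition("|")
            observation = tools[name.strip()](arg.strip())
            transcript += f"Observation: {observation}\n"
    return "No answer within the step budget."
```

In the production system described above, one such tool would wrap the Pinecone-backed nearest-neighbor search to retrieve recent, relevant context.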

Production-Scale Document Parsing with Vision-Language Models and Specialized OCR

Reducto

Reducto has built a production document parsing system that processes over 1 billion documents by combining specialized vision-language models, traditional OCR, and layout detection models in a hybrid pipeline. The system addresses critical challenges in document parsing including hallucinations from frontier models, dense tables, handwritten forms, and complex charts. Their approach uses a divide-and-conquer strategy where different models are routed to different document regions based on complexity, achieving higher accuracy than AWS Textract, Microsoft Azure Document Intelligence, and Google Cloud OCR on their internal benchmarks. The company has expanded beyond parsing to offer extraction with pixel-level citations and an edit endpoint for automated form filling.

Rebuilding a Production Chatbot with Direct API Access and Multi-Agent Architecture

Langchain

LangChain rebuilt their public documentation chatbot after discovering their support engineers preferred using their own internal workflow over the existing tool. The original chatbot used traditional vector embedding retrieval, which suffered from fragmented context, constant reindexing, and vague citations. The solution involved building two distinct architectures: a fast CreateAgent for simple documentation queries delivering sub-15-second responses, and a Deep Agent with specialized subgraphs for complex queries requiring codebase analysis. The new approach replaced vector embeddings with direct API access to structured content (Mintlify for docs, Pylon for knowledge base, and ripgrep for codebase search), enabling the agent to search iteratively like a human. Results included dramatically faster response times, precise citations with line numbers, elimination of reindexing overhead, and internal adoption by support engineers for complex troubleshooting.

Rebuilding an AI SDR Agent with Multi-Agent Architecture for Enterprise Sales Automation

11x

11x rebuilt their AI Sales Development Representative (SDR) product Alice from scratch in just 3 months, transitioning from a basic campaign creation tool to a sophisticated multi-agent system capable of autonomous lead sourcing, research, and email personalization. The team experimented with three different agent architectures - React, workflow-based, and multi-agent systems - ultimately settling on a hierarchical multi-agent approach with specialized sub-agents for different tasks. The rebuilt system now processes millions of leads and messages with a 2% reply rate comparable to human SDRs, demonstrating the evolution from simple AI tools to true digital workers in production sales environments.

Responsible AI Implementation for Healthcare Form Automation

WellSky

WellSky, serving over 2,000 hospitals and handling 100 million forms annually, partnered with Google Cloud to address clinical documentation burden and clinician burnout. They developed an AI-powered solution focusing on form automation, implementing a comprehensive responsible AI framework with emphasis on evidence citation, governance, and technical foundations. The project aimed to reduce "pajama time" - where 75% of nurses complete documentation after hours - while ensuring patient safety through careful AI deployment.

Retrieval Augmented LLMs for Real-time CRM Account Linking

Schneider Electric

Schneider Electric partnered with AWS Machine Learning Solutions Lab to automate their CRM account linking process using Retrieval Augmented Generation (RAG) with Flan-T5-XXL model. The solution combines LangChain, Google Search API, and SEC-10K data to identify and maintain up-to-date parent-subsidiary relationships between customer accounts, improving accuracy from 55% to 71% through domain-specific prompt engineering.

Revenue Intelligence Platform with Ambient AI Agents

Tabs

Tabs, a vertical AI company in the finance space, has built a revenue intelligence platform for B2B companies that uses ambient AI agents to automate financial workflows. The company extracts information from sales contracts to create a "commercial graph" and deploys AI agents that work autonomously in the background to handle billing, collections, and reporting tasks. Their approach moves beyond traditional guided AI experiences toward fully ambient agents that monitor communications and trigger actions automatically, with the goal of creating "beautiful operational software that no one ever has to go into."

Scaling a High-Traffic LLM Chat Application to 30,000 Messages Per Second

Character.ai

Character.ai scaled their open-domain conversational AI platform from 300 to over 30,000 generations per second within 18 months, becoming the third most-used generative AI application globally. They tackled unique engineering challenges around data volume, cost optimization, and connection management while maintaining performance. Their solution involved custom model architectures, efficient GPU caching strategies, and innovative prompt management tools, all while balancing performance, latency, and cost considerations at scale.

Scaling Agentic AI Systems for Real Estate Due Diligence: Managing Prompt Tax at Production Scale

Orbital

Orbital, a real estate technology company, developed an agentic AI system called Orbital Co-pilot to automate legal due diligence for property transactions. The system processes hundreds of pages of legal documents to extract key information traditionally done manually by lawyers. Over 18 months, they scaled from zero to processing 20 billion tokens monthly and achieved multiple seven figures in annual recurring revenue. The presentation focuses on their concept of "prompt tax" - the hidden costs and complexities of continuously upgrading AI models in production, including prompt migration, regression risks, and the operational challenges of shipping at the AI frontier.

Scaling AI Infrastructure: From Training to Inference at Meta

Meta

Meta shares their journey in scaling AI infrastructure to support massive LLM training and inference operations. The company faced challenges in scaling from 256 GPUs to over 100,000 GPUs in just two years, with plans to reach over a million GPUs by year-end. They developed solutions for distributed training, efficient inference, and infrastructure optimization, including new approaches to data center design, power management, and GPU resource utilization. Key innovations include the development of a virtual machine service for secure code execution, improvements in distributed inference, and novel approaches to reducing model hallucinations through RAG.

Scaling AI Product Development with Rigorous Evaluation and Observability

Notion

Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as their evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.
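
A generic LLM-as-a-judge scorer of the kind described works roughly as follows; the rubric framing and 1-to-5 scale are illustrative, not Notion's scoring functions:

```python
def judge(llm, question: str, answer: str, rubric: str) -> float:
    """Grade an answer against a rubric with a judge model; return a 0-1 score."""
    verdict = llm(
        f"Rubric:\n{rubric}\n\nQuestion: {question}\nAnswer: {answer}\n"
        "Grade the answer from 1 (poor) to 5 (excellent). Reply with the number only."
    )
    return (float(verdict.strip()) - 1) / 4   # map the 1-5 grade onto 0.0-1.0
```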

Scaling AI-Assisted Coding Infrastructure: From Auto-Complete to Global Deployment

Cursor

Cursor, an AI-assisted coding platform, scaled their infrastructure from handling basic code completion to processing 100 million model calls per day across a global deployment. They faced and overcame significant challenges in database management, model inference scaling, and indexing systems. The case study details their journey through major incidents, including a database crisis that led to a complete infrastructure refactor, and their innovative solutions for handling high-scale AI model inference across multiple providers while maintaining service reliability.

Scaling an MCP Server for Error Monitoring to 60 Million Monthly Requests

Sentry

Sentry, an error monitoring platform, built a Model Context Protocol (MCP) server to improve the workflow where developers would copy error details from Sentry's UI and paste them into AI coding assistants like Cursor. The MCP server provides direct integration with 10-15 tools, including retrieving issue details and triggering automated fix attempts through Sentry's AI agent. The implementation scaled from 30 million to 60 million requests per month, with over 5,000 organizations using it. The company learned critical lessons about treating MCP servers as production services, implementing comprehensive observability, managing context pollution, and taking responsibility for agent behavior through careful prompt engineering and tool description design.
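
An illustrative MCP tool in this style, written against the open-source MCP Python SDK's FastMCP helper; the tool name, docstring, and payload are invented, and this is not Sentry's implementation:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("error-monitor")

@mcp.tool()
def get_issue_details(issue_id: str) -> str:
    """Return the stack trace and metadata for an issue, ready for an agent's context."""
    # a real server would call the monitoring platform's API here
    return f"Issue {issue_id}: NullReferenceError in checkout/cart.py:42 ..."

if __name__ == "__main__":
    mcp.run()
```

As the case study stresses, the tool's docstring doubles as the description the agent reads, so it is effectively prompt engineering and deserves the same care.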

Scaling and Operating Large Language Models at the Frontier

Anthropic

This case study examines Anthropic's journey in scaling and operating large language models, focusing on their transition from GPT-3 era training to current state-of-the-art systems like Claude. The company successfully tackled challenges in distributed computing, model safety, and operational reliability while growing 10x in revenue. Key innovations include their approach to constitutional AI, advanced evaluation frameworks, and sophisticated MLOps practices that enable running massive training operations with hundreds of team members.

Scaling Document Processing with LLMs and Human Review

Vendr / Extend

Vendr partnered with Extend to extract structured data from SaaS order forms and contracts using LLMs. They implemented a hybrid approach combining LLM processing with human review to achieve high accuracy in entity recognition and data extraction. The system successfully processed over 100,000 documents, using techniques such as document embeddings for similarity clustering, targeted human review, and robust entity mapping. This allowed Vendr to unlock valuable pricing insights for their customers while maintaining high data quality standards.

Scaling Email Content Extraction Using LLMs in Production

Yahoo

Yahoo Mail faced challenges with their existing ML-based email content extraction system, hitting a coverage ceiling of 80% for major senders while struggling with long-tail senders and slow time-to-market for model updates. They implemented a new solution using Google Cloud's Vertex AI and LLMs, achieving 94% coverage for standard domains and 99% for tail domains, with 51% increase in extraction richness and 16% reduction in tracking API errors. The implementation required careful consideration of hybrid infrastructure, cost management, and privacy compliance while processing billions of daily messages.

Scaling Financial Research and Analysis with Multi-Model LLM Architecture

Rogo

Rogo developed an enterprise-grade AI finance platform that leverages multiple OpenAI models to automate and enhance financial research and analysis for investment banks and private equity firms. Through a layered model architecture combining GPT-4 and other models, along with fine-tuning and integration with financial datasets, they created a system that saves analysts over 10 hours per week on tasks like meeting prep and market research, while serving over 5,000 bankers across major financial institutions.

Scaling Local News Coverage with AI-Powered Newsletter Generation

Patch

Patch transformed its local news coverage by implementing AI-powered newsletter generation, enabling them to expand from 1,100 to 30,000 communities while maintaining quality and trust. The system combines curated local data sources, weather information, event calendars, and social media content, processed through AI to create relevant, community-specific newsletters. This approach resulted in over 400,000 new subscribers and a 93.6% satisfaction rating, while keeping costs manageable and maintaining editorial standards.

Scaling Voice AI with GPU-Accelerated Infrastructure

ElevenLabs

ElevenLabs developed a high-performance voice AI platform for voice cloning and multilingual speech synthesis, leveraging Google Cloud's GKE and NVIDIA GPUs for scalable deployment. They implemented GPU optimization strategies including multi-instance GPUs and time-sharing to improve utilization and reduce costs, while successfully serving 600 hours of generated audio for every hour of real time across 29 languages.

Secure Authentication for AI Agents using Model Context Protocol

Arcade

Arcade identified a critical security gap in the Model Context Protocol (MCP) where AI agents needed secure access to third-party APIs like Gmail but lacked proper OAuth 2.0 authentication mechanisms. They developed two solutions: first introducing user interaction capabilities (PR #475), then extending MCP's elicitation framework with URL mode (PR #887) to enable secure OAuth flows while maintaining proper security boundaries between trusted servers and untrusted clients. This work addresses fundamental production deployment challenges for AI agents that need authenticated access to real-world systems.

Semantic Data Processing at Scale with AI-Powered Query Optimization

DocETL

Shreya Shankar presents DocETL, an open-source system for semantic data processing that addresses the challenges of running LLM-powered operators at scale over unstructured data. The system tackles two major problems: how to make semantic operator pipelines scalable and cost-effective through novel query optimization techniques, and how to make them steerable through specialized user interfaces. DocETL introduces rewrite directives that decompose complex tasks and data to improve accuracy and reduce costs, achieving up to 86% cost reduction while maintaining target accuracy. The companion tool Doc Wrangler provides an interactive interface for iteratively authoring and debugging these pipelines. Real-world applications include public defenders analyzing court transcripts for racial bias and medical analysts extracting information from doctor-patient conversations, demonstrating significant accuracy improvements (2x in some cases) compared to baseline approaches.
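
As a rough illustration of the rewrite-directive idea, here is a sketch of decomposing one complex map over a long document into per-chunk maps plus a reduce; `call_llm` is a hypothetical stand-in for any chat-completion client, and the chunking is deliberately naive.

```python
# Conceptual sketch of a DocETL-style rewrite: one complex map becomes
# per-chunk maps plus a reduce. Not DocETL's actual API.
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client."""
    raise NotImplementedError

def chunk(text: str, size: int = 2000) -> list[str]:
    # Deliberately naive character chunking for illustration.
    return [text[i:i + size] for i in range(0, len(text), size)]

def decomposed_extract(document: str, instruction: str) -> str:
    # Map: run the now-simpler instruction over each chunk independently.
    partials = [call_llm(f"{instruction}\n\nExcerpt:\n{c}") for c in chunk(document)]
    # Reduce: merge and deduplicate the per-chunk outputs in a final call.
    merged = "\n".join(partials)
    return call_llm(f"Merge and deduplicate these findings:\n{merged}")
```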

Semi-Supervised Fine-Tuning of Compact Vision-Language Models for Product Attribute Extraction

Flipkart

Flipkart faced the challenge of accurately extracting product attributes (like color, pattern, and material) from millions of product listings at scale. Manual labeling was expensive and error-prone, while using large Vision Language Model APIs was cost-prohibitive. The company developed a semi-supervised approach using compact VLMs (2-3 billion parameters) that combines Parameter-Efficient Fine-Tuning (PEFT) with Direct Preference Optimization (DPO) to leverage unlabeled data. The method starts with a small labeled dataset, generates multiple reasoning chains for unlabeled products using self-consistency, and then fine-tunes the model using DPO to favor preferred outputs. Results showed accuracy improvements from 75.1% to 85.7% on the Qwen2.5-VL-3B-Instruct model across twelve e-commerce verticals, demonstrating that compact models can effectively learn from unlabeled data to achieve production-grade performance.
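
A minimal sketch of the self-consistency-to-DPO step described above: sample several reasoning chains per unlabeled product, majority-vote the answer, and pair consistent chains (preferred) against inconsistent ones (rejected). `sample_vlm` is a hypothetical stand-in for sampling from the compact VLM.

```python
# Sketch of building DPO preference pairs from self-consistency over
# unlabeled products; `sample_vlm` is hypothetical, not Flipkart's code.
from collections import Counter

def sample_vlm(image, prompt: str) -> tuple[str, str]:
    """Hypothetical: returns one (reasoning_chain, predicted_value) sample."""
    raise NotImplementedError

def build_preference_pairs(image, prompt: str, k: int = 8):
    samples = [sample_vlm(image, prompt) for _ in range(k)]
    # Majority vote over the k sampled answers acts as a pseudo-label.
    majority, _ = Counter(ans for _, ans in samples).most_common(1)[0]
    chosen = [chain for chain, ans in samples if ans == majority]
    rejected = [chain for chain, ans in samples if ans != majority]
    # Each (chosen, rejected) pair is one DPO example teaching the model to
    # prefer self-consistent reasoning; unanimous products yield no pairs.
    return [(c, r) for c in chosen for r in rejected]
```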

Six Principles for Building Production AI Agents

App.build

App.build shared six empirical principles learned from building production AI agents that help overcome common challenges in agentic system development. The principles focus on investing in system prompts with clear instructions, splitting context to manage costs and attention, designing straightforward tools with limited parameters, implementing feedback loops with actor-critic patterns, using LLMs for error analysis, and recognizing that frustrating agent behavior often indicates system design issues rather than model limitations. These guidelines emerged from practical experience in developing software engineering agents and emphasize systematic approaches to building reliable, recoverable agents that fail gracefully.
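
The feedback-loop principle can be sketched as a simple actor-critic retry loop; `actor` and `critic` are hypothetical LLM- or test-backed callables, and the loop shape is an illustration rather than App.build's implementation.

```python
# Minimal actor-critic loop sketch; both callables are hypothetical.
def actor(task: str, feedback: str | None = None) -> str:
    """Hypothetical: produces an attempt, optionally revised from feedback."""
    raise NotImplementedError

def critic(task: str, attempt: str) -> tuple[bool, str]:
    """Hypothetical: returns (passed, feedback) from tests, linters, or a judge."""
    raise NotImplementedError

def run_with_feedback(task: str, max_rounds: int = 3) -> str:
    feedback = None
    for _ in range(max_rounds):
        attempt = actor(task, feedback)
        passed, feedback = critic(task, attempt)
        if passed:
            return attempt
    return attempt  # fail gracefully with the last attempt
```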

Smart Business Analyst: Automating Competitor Analysis in Medical Device Procurement

Philips

Philips' procurement team developed an advanced LLM-powered system called "Smart Business Analyst" to automate competitor analysis in the medical device industry. The system addresses the challenge of gathering and analyzing competitor data across multiple dimensions, including features, pricing, and supplier relationships. Unlike general-purpose LLMs like ChatGPT, this solution provides precise numerical comparisons and leverages multiple data sources to deliver accurate, industry-specific insights, reducing the time required for competitive analysis from hours to seconds.

Source-Grounded LLM Assistant with Multi-Modal Output Capabilities

Google / NotebookLM

Google's NotebookLM tackles the challenge of making large language models more focused and personalized by introducing source grounding - allowing users to upload their own documents to create a specialized AI assistant. The system combines Gemini 1.5 Pro with sophisticated audio generation to create human-like podcast-style conversations about user content, complete with natural speech patterns and disfluencies. The solution includes built-in safety features, privacy protections through transient context windows, and content watermarking, while enabling users to generate insights from personal documents without contributing to model training data.

State of Production Machine Learning and LLMOps in 2024

Zalando

A comprehensive overview of the current state and challenges of production machine learning and LLMOps, covering key areas including motivations, industry trends, technological developments, and organizational changes. The presentation highlights the evolution from model-centric to data-centric approaches, the importance of metadata management, and the growing focus on security and monitoring in ML systems.

Structured LLM Conversations for Language Learning Video Calls

Duolingo

Duolingo implemented an AI-powered video call feature called "Video Call with Lily" that enables language learners to practice speaking with an AI character. The system uses carefully structured prompts, conversational blueprints, and dynamic evaluations to ensure appropriate difficulty levels and natural interactions. The implementation includes memory management to maintain conversation context across sessions and separate processing steps to prevent LLM overload, resulting in a personalized and effective language learning experience.
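
As an illustration of the memory-plus-separate-steps pattern, here is a sketch in which the chat turn and the memory update are distinct LLM calls so neither prompt grows unbounded; `call_llm` and the prompt wording are hypothetical, not Duolingo's actual prompts.

```python
# Sketch of separated chat and memory-update steps; prompts are illustrative.
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client."""
    raise NotImplementedError

def chat_turn(user_msg: str, memory: str, level: str) -> str:
    # The live turn only sees a compact memory summary, not full history.
    return call_llm(
        f"You are Lily, a friendly language tutor. Learner level: {level}.\n"
        f"What you remember about this learner: {memory}\n"
        f"Learner says: {user_msg}\nReply briefly at their level."
    )

def update_memory(memory: str, transcript: str) -> str:
    # Runs after the session as a separate step, keeping chat prompts small.
    return call_llm(
        f"Update these learner notes:\n{memory}\n\nNew session transcript:\n{transcript}"
    )
```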

Swarm-Coding with Multiple Background Agents for Large-Scale Code Maintenance

Faire

Faire implemented "swarm-coding" using GitHub Copilot's background agents to automate tedious engineering tasks like cleaning up expired feature flags and migrating test infrastructure. By coordinating multiple autonomous AI agents working in parallel, they enabled non-engineers to land simple code changes and freed up engineering teams to focus on innovation rather than maintenance work. Within the first month of deployment, 18% of the engineering team adopted the approach, merging over 500 Copilot pull requests with an average time savings of 39.6 minutes per PR and a 25% increase in overall PR volume among users. The company enhanced the background agents through custom instructions, MCP (Model Context Protocol) servers, and programmatic task assignment to create specialized agent profiles for common workflows.

Synthetic Consumer Survey Generation Using LLMs with Semantic Similarity Response Mapping

Colgate

PyMC Labs partnered with Colgate to address the limitations of traditional consumer surveys for product testing by developing a novel synthetic consumer methodology using large language models. The challenge was that standard approaches of asking LLMs to provide numerical ratings (1-5) resulted in biased, middle-of-the-road responses that didn't reflect real consumer behavior. The solution involved allowing LLMs to provide natural text responses which were then mapped to quantitative scales using embedding similarity to reference responses. This approach achieved 90% of the maximum achievable correlation with real survey data, accurately reproduced demographic effects including age and income patterns, eliminated positivity bias present in human surveys, and provided richer qualitative feedback while being faster and cheaper than traditional surveys.
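
The core mapping technique can be sketched directly: embed the free-text response, score it against reference responses anchored to each scale point, and take a similarity-weighted expectation. The anchor texts, model, and softmax temperature below are illustrative assumptions, not PyMC Labs' published choices.

```python
# Sketch of mapping a free-text response onto a 1-5 purchase-intent scale
# via embedding similarity; anchors and temperature are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

REFERENCES = {
    1: "I would definitely not buy this product.",
    2: "I probably would not buy this.",
    3: "I might or might not buy this.",
    4: "I would probably buy this.",
    5: "I would definitely buy this product.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
ref_vecs = model.encode(list(REFERENCES.values()), normalize_embeddings=True)

def map_to_scale(response: str) -> float:
    v = model.encode([response], normalize_embeddings=True)[0]
    sims = ref_vecs @ v                  # cosine similarity to each anchor
    weights = np.exp(sims * 10)          # sharpen, then normalize to weights
    weights /= weights.sum()
    return float(weights @ np.array(list(REFERENCES)))  # expected scale value

print(map_to_scale("Honestly this sounds great, I'd pick it up next shop."))
```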

Systematic Analysis of Prompt Templates in Production LLM Applications

Uber, Microsoft

The research analyzes real-world prompt templates from open-source LLM-powered applications to understand their structure, composition, and effectiveness. Through analysis of over 2,000 prompt templates from production applications like those from Uber and Microsoft, the study identifies key components, patterns, and best practices for template design. The findings reveal that well-structured templates with specific patterns can significantly improve LLMs' instruction-following abilities, potentially enabling weaker models to achieve performance comparable to more advanced ones.
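
As an illustration of the component-based template structure the study describes (role, directive, context, example, output constraints), here is a minimal sketch; the field set and wording are assumptions for illustration, not templates from the analyzed applications.

```python
# Illustrative structured prompt template; the component fields are an
# assumption, not a template drawn from the study's corpus.
TEMPLATE = """\
You are a {role}.

Task: {directive}

Context:
{context}

Example:
{example}

Respond only with valid JSON matching: {schema}
"""

prompt = TEMPLATE.format(
    role="support-ticket triage assistant",
    directive="Classify the ticket's urgency and product area.",
    context="Ticket: 'Checkout page returns a 500 error for all users.'",
    example='{"urgency": "high", "area": "payments"}',
    schema='{"urgency": "low|medium|high", "area": "<string>"}',
)
print(prompt)
```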

Test-Driven Vibe Development: Integrating Quality Engineering with AI Code Generation

ASOS

ASOS, a major e-commerce retailer, developed Test-Driven Vibe Development (TDVD), a novel methodology that combines test-first quality engineering practices with LLM-driven code generation to address the quality and reliability challenges of "vibe coding." The company applied this approach to build an internal stock discrepancy reporting system, using AI agents to generate both tests and code in a structured workflow that prioritizes acceptance test-driven development (ATDD), behavior-driven development (BDD), and test-driven development (TDD). With a team of effectively 2.5 people working part-time, they delivered a full-stack MVP (backend API, Azure Functions, React frontend) in 4 weeks—representing a 7-10x acceleration compared to traditional development estimates—while maintaining quality through continuous validation against predefined test requirements and catching hallucinations early in the development cycle.
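
The TDVD rhythm can be sketched in miniature: tests are authored and reviewed first, and the agent-generated implementation is accepted only once they pass. Both the function and the report shape below are hypothetical examples, not ASOS code.

```python
# Test-first sketch of the TDVD workflow; function and fields are hypothetical.

def stock_discrepancy(expected: int, counted: int, tolerance: int = 0) -> dict:
    """Agent-generated implementation, merged only once the tests pass."""
    delta = counted - expected
    return {"delta": delta, "flagged": abs(delta) > tolerance}

# Tests written (and human-reviewed) before generation, pytest-style:
def test_flags_shortfall():
    assert stock_discrepancy(120, 112) == {"delta": -8, "flagged": True}

def test_ignores_deltas_within_tolerance():
    assert stock_discrepancy(120, 119, tolerance=2) == {"delta": -1, "flagged": False}
```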

Text-to-SQL System with Structured RAG and Comprehensive Evaluation

ICE / NYSE

ICE/NYSE developed a text-to-SQL application using structured RAG to enable business users to query financial data without needing SQL knowledge. The system leverages Databricks' Mosaic AI stack including Unity Catalog, Vector Search, Foundation Model APIs, and Model Serving. They implemented comprehensive evaluation methods using both syntactic and execution matching, achieving 77% syntactic accuracy and 96% execution match across approximately 50 queries. The system includes continuous improvement through feedback loops and few-shot learning from incorrect queries.
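
Execution matching is straightforward to sketch: run the gold and predicted SQL against the same database and compare result sets order-insensitively, so syntactically different but equivalent queries still count as matches. The schema and queries here are illustrative, not ICE/NYSE's.

```python
# Sketch of execution-match evaluation for text-to-SQL; schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trades (symbol TEXT, qty INT, price REAL);
    INSERT INTO trades VALUES ('IBM', 100, 182.5), ('AAPL', 50, 229.1);
""")

def execution_match(gold_sql: str, pred_sql: str) -> bool:
    try:
        gold = set(map(tuple, conn.execute(gold_sql).fetchall()))
        pred = set(map(tuple, conn.execute(pred_sql).fetchall()))
    except sqlite3.Error:
        return False  # predicted SQL failed to execute at all
    return gold == pred  # order-insensitive comparison of result sets

# Syntactically different, execution-equivalent queries still match:
print(execution_match(
    "SELECT symbol FROM trades WHERE qty >= 100",
    "SELECT symbol FROM trades WHERE NOT qty < 100",
))  # True
```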

Thinking Machines' Tinker: Low-Level Fine-Tuning API for Production LLM Training

Thinking Machines

Thinking Machines, a new AI company co-founded by former OpenAI researcher John Schulman, has developed Tinker, a low-level fine-tuning API designed to enable sophisticated post-training of language models without requiring teams to manage GPU infrastructure or distributed-systems complexity. The product aims to abstract away infrastructure concerns while providing low-level primitives for expressing nearly all post-training algorithms, allowing researchers and companies to build custom models without developing their own training infrastructure. The company plans to release its own models and expand Tinker's capabilities to include multimodal functionality and larger-scale training jobs, while making the platform more accessible to non-experts through higher-level tooling.

Usability Challenges in Commercial AI Agent Systems: A Study of Industry Aspirations vs. User Realities

Carnegie Mellon

This research study addresses the gap between how AI agents are marketed by the technology industry and how end-users actually experience them in practice. Researchers from Carnegie Mellon conducted a systematic review of 102 commercial AI agent products to understand industry positioning, identifying three core use case categories: orchestration (automating GUI tasks), creation (generating structured documents), and insight (providing analysis and recommendations). They then conducted a usability study with 31 participants attempting representative tasks using popular commercial agents (Operator and Manus), revealing five critical usability barriers: misalignment between agent capabilities and user mental models, premature trust assumptions, inflexible collaboration styles, overwhelming communication overhead, and lack of meta-cognitive abilities. While users generally succeeded at assigned tasks and were impressed with the technology, these barriers significantly impacted the user experience and highlighted the disconnect between marketed capabilities and practical usability.

Using LLMs to Scale Insurance Operations at a Small Company

Anzen

Anzen, a small insurance company with under 20 people, leveraged LLMs to compete with larger insurers by automating their underwriting process. They implemented a document classification system using BERT and AWS Textract for information extraction, achieving 95% accuracy in document classification. They also developed a compliance document review system using sentence embeddings and question-answering models to provide immediate feedback on legal documents like offer letters.
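
The embedding-based compliance check can be sketched as clause-level similarity against known-problematic reference clauses; the model, threshold, and reference text are illustrative assumptions rather than Anzen's production system.

```python
# Sketch of embedding-based compliance screening; threshold and reference
# clauses are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

RISKY_REFERENCES = [
    "Employee waives all rights to overtime compensation.",
    "This offer may be rescinded at any time without notice or cause.",
]
risky_vecs = model.encode(RISKY_REFERENCES, normalize_embeddings=True)

def flag_clauses(clauses: list[str], threshold: float = 0.6):
    vecs = model.encode(clauses, normalize_embeddings=True)
    sims = vecs @ risky_vecs.T  # cosine similarity, clauses x references
    # Flag any clause whose best match to a risky reference exceeds threshold.
    return [(c, float(s.max())) for c, s in zip(clauses, sims) if s.max() > threshold]

letter = [
    "Your starting salary will be $85,000 per year.",
    "You agree to waive any claim to overtime pay.",
]
print(flag_clauses(letter))
```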

Video Content Summarization and Metadata Enrichment for Streaming Platform

Paramount+

Paramount+ partnered with Google Cloud Consulting to develop two key AI use cases: video summarization and metadata extraction for their streaming platform containing over 50,000 videos. The project used Gen AI jumpstarts to prototype solutions, implementing prompt chaining, embedding generation, and fine-tuning approaches. The system was designed to enhance content discoverability and personalization while reducing manual labor and third-party costs. The implementation included a three-component architecture handling transcription creation, content generation, and personalization integration.
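
As a rough illustration of the prompt-chaining pattern, here is a sketch in which one call condenses the transcript and later calls generate metadata from the condensed form only; `call_llm` and the prompts are hypothetical, not Paramount+'s pipeline.

```python
# Sketch of prompt chaining for video metadata; prompts are illustrative.
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client."""
    raise NotImplementedError

def summarize_video(transcript: str) -> dict:
    # Step 1: condense the raw transcript into key plot beats.
    beats = call_llm(f"List the key plot beats in this transcript:\n{transcript}")
    # Step 2: generate viewer-facing metadata from the condensed beats only,
    # keeping each downstream prompt short and focused.
    synopsis = call_llm(f"Write a two-sentence, spoiler-free synopsis:\n{beats}")
    tags = call_llm(f"Suggest 5 genre/mood tags as a comma-separated list:\n{beats}")
    return {"synopsis": synopsis, "tags": tags.split(", ")}
```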

Video Super-Resolution at Scale for Ads and Generative AI Content

Meta

Meta's Media Foundation team deployed AI-powered video super-resolution (VSR) models at massive scale to enhance video quality across their ecosystem, processing over 1 billion daily video uploads. The problem addressed was the prevalence of low-quality videos from poor camera quality, cross-platform uploads, and legacy content that degraded user experience. The solution involved deploying multiple VSR models—both CPU-based (using Intel's RVSR SDK) and GPU-based—to upscale and enhance video quality for ads and generative AI features like Meta Restyle. Through extensive subjective evaluation with thousands of human raters, Meta identified effective quality metrics (VMAF-UQ), determined which videos would benefit most from VSR, and successfully deployed the technology while managing GPU resource constraints and ensuring quality improvements aligned with user preferences.

Voice AI Agent Development and Production Challenges

Various (Canonical, Prosus, DeepMind)

Panel discussion with experts from various companies exploring the challenges and solutions in deploying voice AI agents in production. The discussion covers key aspects of voice AI development including real-time response handling, emotional intelligence, cultural adaptation, and user retention. Experts shared experiences from e-commerce, healthcare, and tech sectors, highlighting the importance of proper testing, prompt engineering, and understanding user interaction patterns for successful voice AI deployments.