ZenML LLMOps Database

Tag: google_gcp (304 entries)


A Practical Blueprint for Evaluating Conversational AI at Scale

Dropbox

Dropbox shares their comprehensive approach to building and evaluating Dropbox Dash, their conversational AI product. Ad-hoc testing had led to unpredictable regressions: a change to any part of the LLM pipeline (intent classification, retrieval, ranking, prompt construction, or inference) could cause previously correct answers to fail. They responded with an evaluation-first methodology that treats every experimental change like production code, requiring rigorous testing before merging. Their solution involved curating diverse datasets (both public and internal), defining actionable metrics using LLM-as-judge approaches that outperformed traditional metrics like BLEU and ROUGE, adopting the Braintrust evaluation platform, and automating evaluation throughout the development-to-production pipeline. The result is a robust system with layered gates that catch regressions early, continuous live-traffic scoring for production monitoring, and a feedback loop for continuous improvement that significantly improved reliability and deployment safety.
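
A minimal sketch of the LLM-as-judge gate described above, assuming a generic OpenAI-compatible client and an invented rubric and threshold (Dropbox runs this through the Braintrust platform; none of the names below are theirs):

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with a single integer score from 1 (wrong) to 5 (fully correct)."""

def judge(question: str, reference: str, candidate: str) -> int:
    # Deterministic judging: temperature 0, single-integer rubric.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

def regression_gate(cases: list[dict], threshold: float = 4.0) -> bool:
    """Block the merge if the mean judge score drops below the threshold."""
    scores = [judge(c["q"], c["ref"], c["out"]) for c in cases]
    return sum(scores) / len(scores) >= threshold
```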

Abstractive Conversation Summarization for Google Chat Spaces

Google

Google deployed an abstractive summarization system to automatically generate conversation summaries in Google Chat Spaces to address information overload from unread messages, particularly in hybrid work environments. The solution leveraged the Pegasus transformer model fine-tuned on a custom ForumSum dataset of forum conversations, then distilled into a hybrid transformer-encoder/RNN-decoder architecture for lower latency. The system surfaces summaries through cards when users enter Spaces with unread messages, with quality controls including heuristics for triggering, detection of low-quality summaries, and ephemeral caching of pre-generated summaries to reduce latency, ultimately delivering production value to premium Google Workspace business customers.

Agent Registry and Dynamic Prompt Management for AI Feature Development

GitLab

GitLab faced challenges with delivering prompt improvements for their AI-powered issue description generation feature, particularly for self-managed customers who don't update frequently. They developed an Agent Registry system within their AI Gateway that abstracts provider models, prompts, and parameters, allowing for rapid prompt updates and model switching without requiring monolith changes or new releases. This system enables faster iteration on AI features and seamless provider switching while maintaining a clean separation of concerns.

Agent-Based Workflow Automation in Spreadsheets for Non-Technical Users

Otto

Otto, founded by Suli Omar, addresses the challenge of making AI agents accessible to non-technical users by embedding agent workflows directly into spreadsheet interfaces. The company transforms unstructured data processing tasks into spreadsheet-based workflows where each cell acts as an autonomous agent capable of executing tasks, waiting for dependencies, and outputting structured results. By leveraging the familiar spreadsheet UX instead of traditional chatbot interfaces, Otto enables finance teams, accountants, and other business users to harness agent capabilities without requiring technical expertise. The solution involves sophisticated model selection across three tiers (workhorse, middle-tier, and heavy reasoning models) to optimize cost and performance, continuous evaluation through customer usage patterns, and iterative model testing to maintain service quality as new LLM capabilities emerge.

Agent-First AI Development Platform with Multi-Surface Orchestration

Google Deepmind

Google DeepMind launched Antigravity, an agent-first AI development platform powered by Gemini 3 Pro and designed to handle increasingly complex, long-running software development tasks. The platform addresses the challenge of managing AI agents operating across multiple surfaces (editor, browser, and agent manager) by introducing "artifacts": dynamic representations that help organize agent outputs and enable asynchronous feedback. The solution emerged from close collaboration between product and research teams at DeepMind, creating a feedback loop where internal dogfooding identified model gaps and drove improvements. The initial launch experienced capacity constraints due to high demand, but users who accessed the product reported significant workflow improvements from the multi-surface agent orchestration approach.

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

Agentic RAG Implementation for Retail Personalization and Customer Support

MongoDB

MongoDB and Dataworkz partnered to implement an agentic RAG (Retrieval Augmented Generation) solution for retail and e-commerce applications. The solution combines MongoDB Atlas's vector search capabilities with Dataworkz's RAG builder to create a scalable system that integrates operational data with unstructured information. This enables personalized customer experiences through intelligent chatbots, dynamic product recommendations, and enhanced search functionality, while maintaining context-awareness and real-time data access.
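
The retrieval step of such an agentic RAG flow can be sketched against MongoDB Atlas directly; the connection string, index, collection, field names, and embedding model below are assumptions, not details from the case study:

```python
from pymongo import MongoClient
from openai import OpenAI

openai_client = OpenAI()
coll = MongoClient("mongodb+srv://...")["retail"]["products"]  # your Atlas URI

def retrieve(question: str, k: int = 5) -> list[dict]:
    # Embed the question, then run an Atlas $vectorSearch aggregation.
    qvec = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    return list(coll.aggregate([{
        "$vectorSearch": {
            "index": "product_embeddings",  # Atlas vector search index name
            "path": "embedding",
            "queryVector": qvec,
            "numCandidates": 100,
            "limit": k,
        }
    }]))

docs = retrieve("waterproof hiking boots under $150")
```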

AI Agent Development and Evaluation Platform for Insurance Underwriting

Snorkel

Snorkel developed a comprehensive benchmark dataset and evaluation framework for AI agents in commercial insurance underwriting, working with Chartered Property and Casualty Underwriters (CPCUs) to create realistic scenarios for small business insurance applications. The system leverages LangGraph and Model Context Protocol to build ReAct agents capable of multi-tool reasoning, database querying, and user interaction. Evaluation across multiple frontier models revealed significant challenges in tool use accuracy (36% error rate), hallucination issues where models introduced domain knowledge not present in guidelines, and substantial variance in performance across different underwriting tasks, with accuracy ranging from single digits to 80% depending on the model and task complexity.

AI Agent for Real Estate Legal Document Analysis and Lease Reporting

Orbital

Orbital Witness developed Orbital Copilot, an AI agent specifically designed for real estate legal work, to address the time-intensive nature of legal due diligence and lease reporting. The solution evolved from classical machine learning models through LLM-based approaches to a sophisticated agentic architecture that combines planning, memory, and tool use capabilities. The system analyzes hundreds of pages across multiple legal documents, answers complex queries by following information trails across documents, and provides transparent reasoning with source citations. Deployed with prestigious law firms including BCLP, Clifford Chance, and others, Orbital Copilot demonstrated up to 70% time savings on lease reporting tasks, translating to significant cost reductions for complex property analyses that typically require 2-10+ hours of lawyer time.

AI Agents and Intelligent Observability for DevOps Modernization

HRS Group / Netflix / Harness

This panel discussion brings together engineering leaders from HRS Group, Netflix, and Harness to explore how AI is transforming DevOps and SRE practices. The panelists address the challenge of teams spending excessive time on reactive monitoring, alert triage, and incident response, often wading through thousands of logs and ambiguous signals. The solution involves integrating AI agents and generative models into CI/CD pipelines, observability workflows, and incident management to enable predictive analysis, intelligent rollouts, automated summarization, and faster root cause analysis. Results include dramatically reduced mean time to resolution (from hours to minutes), elimination of low-level toil, improved context-aware decision making, and the ability to move from reactive monitoring to proactive, machine-speed remediation while maintaining human accountability for critical business decisions.

AI Agents for Interpretability Research: Experimenter Agents in Production

Goodfire

Goodfire, an AI interpretability research company, deployed AI agents extensively for conducting experiments in their research workflow over several months. They distinguish between "developer agents" (for software development) and "experimenter agents" (for research and discovery), identifying key architectural differences needed for the latter. Their solution, code-named Scribe, leverages Jupyter notebooks with interactive, stateful access via MCP (Model Context Protocol), enabling agents to iteratively run experiments across domains like genomics, vision transformers, and diffusion models. Results showed agents successfully discovering features in genomics models, performing circuit analysis, and executing complex interpretability experiments, though validation, context engineering, and preventing reward hacking remain significant challenges that require human oversight and critic systems.

AI Agents in Production: Multi-Enterprise Implementation Strategies

Canva / KPMG / Autodesk / Lightspeed

This comprehensive case study examines how multiple enterprises (Autodesk, KPMG, Canva, and Lightspeed) are deploying AI agents in production to transform their go-to-market operations. The companies faced challenges around scaling AI from proof-of-concept to production, managing agent quality and accuracy, and driving adoption across diverse teams. Using the Relevance AI platform, these organizations built multi-agent systems for use cases including personalized marketing automation, customer outreach, account research, data enrichment, and sales enablement. Results include significant time savings (tasks taking hours reduced to minutes), improved pipeline generation, increased engagement rates, faster customer onboarding, and the successful scaling of AI agents across multiple departments while maintaining data security and compliance standards.

AI Managed Services and Agent Operations at Enterprise Scale

PricewaterhouseCoopers

PricewaterhouseCoopers (PwC) addresses the challenge of deploying and maintaining AI systems in production through their managed services practice focused on data analytics and AI. The organization has developed frameworks for deploying AI agents in enterprise environments, particularly in healthcare and back-office operations, using their Agent OS framework built on Python. Their approach emphasizes process standardization, human-in-the-loop validation, continuous model tuning, and comprehensive measurement through evaluations to ensure sustainable AI operations at scale. Results include successful deployments in healthcare pre-authorization processes and the establishment of specialized AI managed services teams comprising MLOps engineers and data scientists who continuously optimize production models.

AI Strategy and LLM Application Development in Swedish Public Sector

Swedish Tax Authority

The Swedish Tax Authority (Skatteverket) has been on a multi-decade digitalization journey, progressively incorporating AI and large language models into production systems to automate and enhance tax services. The organization has developed various NLP applications including text categorization, transcription, OCR pipelines, and question-answering systems using RAG architectures. They have tested both open-source models (Llama 3.1, Mixtral 8x7B, Cohere) and commercial solutions (GPT-3.5), finding that open-source models perform comparably on simpler queries while commercial models excel at complex questions. The Authority operates within a regulated environment requiring on-premise deployment for sensitive data, adopting Agile/SAFe methodologies and building reusable AI infrastructure components that can serve multiple business domains across different public sector silos.

AI-Assisted Database Debugging Platform at Scale

Databricks

Databricks built an agentic AI platform to help engineers debug thousands of OLTP database instances across hundreds of regions on AWS, Azure, and GCP. The platform addresses the problem of fragmented tooling and dispersed expertise by unifying metrics, logs, and operational workflows into a single intelligent interface with a chat assistant. The solution reduced debugging time by up to 90%, enabled new engineers to start investigations in under 5 minutes, and has achieved company-wide adoption, fundamentally changing how engineers interact with their infrastructure.

AI-Driven Collateral Allocation Optimization in Fintech

Mercado Libre

Mercado Pago, the fintech arm of Mercado Libre, faced the challenge of optimizing collateral allocation across billions of dollars in credit lines secured from major banks, requiring daily selection from millions of loans with complex contractual constraints. The company developed Enigma, a solution leveraging linear programming via Google OR-Tools combined with a custom grouping heuristic to handle scalability challenges. While the article primarily focuses on traditional optimization techniques rather than LLMs, it hints at future AI agent exploration for enhanced analytics, strategic constraint proposals, and automated translation of contractual conditions into mathematical constraints, representing a potential future evolution toward LLM integration in financial operations.
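
The core of the approach, selecting loans subject to contractual constraints via mathematical optimization, can be illustrated with a toy OR-Tools model; the balances, collateral target, and objective below are invented for illustration:

```python
from ortools.linear_solver import pywraplp

loans = [120.0, 80.0, 200.0, 150.0, 60.0]  # loan balances (hypothetical)
target = 400.0                              # collateral required by the bank

solver = pywraplp.Solver.CreateSolver("SCIP")
x = [solver.BoolVar(f"x{i}") for i in range(len(loans))]  # pledge loan i?

# Pledge at least the required collateral...
solver.Add(sum(x[i] * loans[i] for i in range(len(loans))) >= target)
# ...while minimizing over-collateralization (pledging as little as possible).
solver.Minimize(sum(x[i] * loans[i] for i in range(len(loans))))

if solver.Solve() == pywraplp.Solver.OPTIMAL:
    picked = [i for i in range(len(loans)) if x[i].solution_value() > 0.5]
    print("pledged loans:", picked)
```

Enigma's grouping heuristic exists because the real problem has millions of loans and many constraints; the sketch above only shows the shape of the formulation.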

AI-Driven Documentation Generation for dbt Data Models

Loblaw Digital

Loblaw Digital addressed the challenge of maintaining comprehensive documentation for over 3,000 dbt data models across their analytics engineering infrastructure. Manual documentation proved labor-intensive and often led to incomplete or outdated documentation that confused business users. The team implemented an LLM-based solution using the open-source dbt-documentor tool integrated with Google Cloud's Vertex AI platform, which automatically generates descriptions for models and their columns by ingesting dbt's manifest.json files without accessing actual data. This automation significantly improved documentation coverage and productivity while maintaining data security, enabling analysts to better understand model purposes and dependencies through the dbt documentation website.
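
A hedged sketch of the metadata-only generation step: read a model's entry from dbt's manifest.json and ask a Vertex AI model for a description. The project, region, model name, and manifest node id are assumptions:

```python
import json

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")
model = GenerativeModel("gemini-1.5-flash")

# manifest.json contains model metadata and compiled SQL, but no table data.
manifest = json.load(open("target/manifest.json"))
node = manifest["nodes"]["model.analytics.orders"]  # hypothetical model id

prompt = (
    "Write a one-paragraph business-facing description of this dbt model.\n"
    f"Name: {node['name']}\n"
    f"SQL:\n{node.get('raw_code', '')}\n"
    f"Columns: {list(node.get('columns', {}))}"
)
print(model.generate_content(prompt).text)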

AI-Powered Automated Issue Resolution Achieving State-of-the-Art Performance on SWE-bench

Trae

Trae developed an AI engineering system that achieved 70.6% accuracy on the SWE-bench Verified benchmark, setting a new state-of-the-art record for automated software issue resolution. The solution combines multiple large language models (Claude 3.7, Gemini 2.5 Pro, and OpenAI o4-mini) in a sophisticated multi-stage pipeline featuring generation, filtering, and voting mechanisms. The system uses specialized agents including a Coder agent for patch generation, a Tester agent for regression testing, and a Selector agent that employs both syntax-based voting and multi-selection voting to identify the best solution from multiple candidate patches.

AI-Powered Background Coding Agents for Large-Scale Software Maintenance

Spotify

Spotify faced the challenge of scaling complex code migrations and maintenance tasks across thousands of repositories, where their existing Fleet Management system handled simple transformations well but required specialized expertise for complex changes. They integrated AI coding agents into their Fleet Management platform, allowing engineers to define fleet-wide code changes using natural language prompts instead of writing complex AST manipulation scripts. Since February 2025, this approach has generated over 1,500 merged pull requests handling complex tasks like language modernization, breaking API changes, and UI component migrations, achieving 60-90% time savings compared to manual implementation while expanding to ad hoc background coding tasks accessible via Slack and GitHub.

AI-Powered Call Center Agents for Healthcare Operations

HeyRevia

HeyRevia has developed an AI call center solution specifically for healthcare operations, where over 30% of operations run on phone calls. Their system uses AI agents to handle complex healthcare-related calls, including insurance verifications, prior authorizations, and claims processing. The solution incorporates real-time audio processing, context understanding, and sophisticated planning capabilities to achieve performance that reportedly exceeds human operators while maintaining compliance with healthcare regulations.

AI-Powered Chatbot Automation with Hybrid NLU and LLM Approach

Scotiabank

Scotiabank developed a hybrid chatbot system combining traditional NLU with modern LLM capabilities to handle customer service inquiries. They created an innovative "AI for AI" approach using three ML models (nicknamed Luigi, Eva, and Peach) to automate the review and improvement of chatbot responses, resulting in 80% time savings in the review process. The system includes LLM-powered conversation summarization to help human agents quickly understand customer contexts, marking the bank's first production use of generative AI features.

AI-Powered Contact Center Copilot: From Research to Enterprise-Scale Production

Cresta / OpenAI

Cresta, founded in 2017 by Stanford PhD students with OpenAI research experience, developed an AI copilot system for contact center agents that provides real-time suggestions during customer conversations. The company tackled the challenge of transforming academic NLP and reinforcement learning research into production-grade enterprise software by building domain-specific models fine-tuned on customer conversation data. Starting with Intuit as their first customer through an unconventional internship arrangement, they demonstrated measurable ROI through A/B testing, showing improved conversion rates and agent productivity. The solution evolved from custom LSTM and transformer models to leveraging pre-trained foundation models like GPT-3/4 with fine-tuning, ultimately serving Fortune 500 customers across telecommunications, airlines, and banking with demonstrated value including a pilot generating $100 million in incremental revenue.

AI-Powered Content Understanding and Ad Targeting Platform

Dotdash

Dotdash Meredith, a major digital publisher, developed an AI-powered system called Decipher that understands user intent from content consumption to deliver more relevant advertising. Through a strategic partnership with OpenAI, they enhanced their content understanding capabilities and expanded their targeting platform across the premium web. The system outperforms traditional cookie-based targeting while maintaining user privacy, proving that high-quality content combined with AI can drive better business outcomes.

AI-Powered Customer Interest Generation for Personalized E-commerce Recommendations

Wayfair

Wayfair developed a GenAI-powered system to generate nuanced, free-form customer interests that go beyond traditional behavioral models and fixed taxonomies. Using Google's Gemini LLM, the system processes customer search queries, product views, cart additions, and purchase history to infer deep insights about preferences, functional needs, and lifestyle values. These LLM-generated interests power personalized product carousels on the homepage and product detail pages, driving measurable engagement and revenue gains while enabling more transparent and adaptable personalization at scale across millions of customers.

AI-Powered Customer Service and Call Center Transformation with Multi-Agent Systems

Fastweb / Vodafone

Fastweb / Vodafone, a major European telecommunications provider serving 9.5 million customers in Italy, transformed their customer service operations by building two AI agent systems to address the limitations of traditional customer support. They developed Super TOBi, a customer-facing agentic chatbot system, and Super Agent, an internal tool that empowers call center consultants with real-time diagnostics and guidance. Built on LangGraph and LangChain with Neo4j knowledge graphs and monitored through LangSmith, the solution achieved a 90% correctness rate, 82% resolution rate, 5.2/7 Customer Effort Score for Super TOBi, and over 86% One-Call Resolution rate for Super Agent, delivering faster response times and higher customer satisfaction while reducing agent workload.

AI-Powered Data Copilot for Autonomous Analysis in IDEs

BlaBlaCar

BlaBlaCar developed an AI-powered Data Copilot to address the inefficient workflow between Software Engineers and Data Analysts, where engineers lacked data warehouse access and analysts were overwhelmed with repetitive queries. The solution embeds an LLM-powered assistant directly in VS Code that connects to BigQuery, provides contextual business logic from curated queries, generates SQL and Python code with unit tests, and enables engineers to perform their own analyses with data health checks as guardrails. The tool leverages a "zero-infrastructure" RAG approach using VS Code's native capabilities and GitHub Copilot, treating analyses as code artifacts in pull requests that analysts review, resulting in faster question resolution (from weeks to minutes) and freeing analysts to focus on high-value modeling work.

AI-Powered Food Image Generation System at Scale

Delivery Hero

Delivery Hero built a comprehensive AI-powered image generation system to address the problem that 86% of food products lacked images, which significantly impacted conversion rates. The solution involved implementing both text-to-image generation and image inpainting workflows using Stable Diffusion models, with extensive optimization for cost efficiency and quality assurance. The system successfully generated over 100,000 production images, achieved 6-8% conversion rate improvements, and reduced costs to under $0.003 per image through infrastructure optimization and model fine-tuning.
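
The inpainting half of the workflow might look like the following diffusers sketch; the checkpoint, prompt, and mask handling are assumptions rather than Delivery Hero's actual configuration:

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# The mask marks the region to regenerate (white = inpaint, black = keep).
image = Image.open("dish.png").convert("RGB").resize((512, 512))
mask = Image.open("plate_mask.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a fresh margherita pizza on a white plate, studio food photography",
    image=image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]
result.save("generated_dish.png")
```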

AI-Powered Home Loan Guardian for Mortgage Refinancing

Lendi

Lendi, an Australian FinTech company, developed Guardian, an agentic AI application to transform the home loan refinancing experience. The company identified that homeowners lacked visibility into their mortgage positions and faced cumbersome refinancing processes, while brokers spent excessive time on administrative tasks. Using Amazon Bedrock's foundation models, Lendi built a multi-agent system deployed on Amazon EKS that monitors loan competitiveness, tracks equity positions in real-time, and streamlines refinancing through conversational AI. The solution was developed in 16 weeks and has already settled millions in home loans with significantly reduced refinance cycle times, enabling customers to complete refinancing in as little as 10 minutes through the Rate Radar feature.

AI-Powered Image Generation for Customizable Grocery Products

Instacart

Instacart's FoodStorm Order Management System faced the challenge of providing high-quality product images for countless customizable grocery items like deli sandwiches, cakes, and prepared foods, where professional photography for every configuration was impractical and costly. The solution involved integrating generative AI image generation capabilities through Instacart's internal Pixel service (which provides access to Google Imagen and other models) directly into FoodStorm's user interface, allowing grocery retailers to create product images on-demand with customizable prompts. Through multiple design iterations, the system evolved from simple one-click generation to a sophisticated interface where users can fine-tune prompts, preview multiple variations, and inspect details for quality control, ultimately enabling retailers to efficiently produce images for ingredients, toppings, promotional banners, and category thumbnails across the Instacart platform.

AI-Powered Natural Language Flight Search Implementation

Alaska Airlines

Alaska Airlines implemented a natural language destination search system powered by Google Cloud's Gemini LLM to transform their flight booking experience. The system moves beyond traditional flight search by allowing customers to describe their desired travel experience in natural language, considering multiple constraints and preferences simultaneously. The solution integrates Gemini with Alaska Airlines' existing flight data and customer information, ensuring recommendations are grounded in actual available flights and pricing.

AI-Powered Personal Health Coach Using Gemini Models

Fitbit

Fitbit developed an AI-powered personal health coach to address the fragmented and generic nature of traditional health and fitness guidance. Using Gemini models within a multi-agent framework, the system provides proactive, personalized, and adaptive coaching grounded in behavioral science and individual health metrics such as sleep and activity data. The solution employs a conversational agent for orchestration, a data science agent for numerical reasoning on physiological time series, and domain expert agents for specialized guidance. The system underwent extensive validation through the SHARP evaluation framework, involving over 1 million human annotations and 100k hours of expert evaluation across multiple health disciplines. The health coach entered public preview for eligible US-based Fitbit Premium users, providing personalized insights, goal setting, and adaptive plans to build sustainable health habits.

AI-Powered Personalized Content Recommendations for Sports and Entertainment Venue

Golden State Warriors

The Golden State Warriors implemented a recommendation engine powered by Google Cloud's Vertex AI to personalize content delivery for their fans across multiple platforms. The system integrates event data, news content, game highlights, retail inventory, and user analytics to provide tailored recommendations for both sports events and entertainment content at Chase Center. The solution enables personalized experiences for 18,000+ venue seats while operating with limited technical resources.

AI-Powered Similar Issues Detection for Project Management

Linear

Linear developed a Similar Issues matching feature to address the persistent challenge of duplicate issues and backlog management in large team workflows. The solution uses large language models to generate vector embeddings that capture the semantic meaning of issue descriptions, enabling accurate detection of related or duplicate issues across their project management platform. The feature integrates at multiple touchpoints—during issue creation, in the Triage inbox, and within support integrations like Intercom—allowing teams to identify duplicates before they enter the system. The implementation uses PostgreSQL with pgvector on Google Cloud Platform for vector storage and search, with partitioning strategies to handle tens of millions of issues at scale.
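
A minimal sketch of the lookup, assuming an embeddings API and a pgvector column; Linear's actual schema, embedding model, and partitioning strategy are not spelled out in this summary:

```python
import psycopg2
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def similar_issues(title: str, body: str, limit: int = 5):
    vec = embed(f"{title}\n{body}")
    conn = psycopg2.connect("dbname=issues")  # hypothetical database
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, title, embedding <=> %s::vector AS distance
            FROM issues          -- <=> is pgvector's cosine distance
            ORDER BY distance
            LIMIT %s
            """,
            (str(vec), limit),  # pgvector accepts '[0.1, 0.2, ...]' literals
        )
        return cur.fetchall()
```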

Architecture Patterns for Production AI Systems: Lessons from Building and Failing with Generative AI Products

Outropy

Phil Calçado shares a post-mortem analysis of Outropy, a failed AI productivity startup that served thousands of users, revealing why most AI products struggle in production. Despite having superior technology compared to competitors like Salesforce's Slack AI, Outropy failed commercially but provided valuable insights into building production AI systems. Calçado argues that successful AI products require treating agents as objects and workflows as data pipelines, applying traditional software engineering principles rather than falling into "Twitter-driven development" or purely data science approaches.

Auto-generated Document Summaries Using Abstractive Summarization

Google

Google Docs implemented automatic document summary generation to help users manage the volume of documents they receive daily. The challenge was to create concise, high-quality summaries that capture document essence while maintaining writer control over the final output. Google developed a solution based on Pegasus, a Transformer-based abstractive summarization model with custom pre-training, combined with careful data curation focusing on quality over quantity, knowledge distillation to optimize serving efficiency (distilling to a Transformer encoder + RNN decoder hybrid), and TPU-based serving infrastructure. The feature was launched for Google Workspace business customers, providing 1-2 sentence suggestions that writers can accept, edit, or ignore, helping both document creators and readers navigate content more efficiently.

Automated Inventory Counting with Multimodal LLMs in Grocery Fulfillment

Picnic

Picnic, an online grocery delivery company, implemented a multimodal LLM-based computer vision system to automate inventory counting in their automated warehouse. The manual stock counting process was time-consuming at scale, and traditional approaches like weighing scales proved unreliable due to measurement variance. The solution involved deploying camera setups to capture high-quality images of grocery totes, using Google Gemini's multimodal models with carefully crafted prompts and supply chain reference images to count products. Through fine-tuning, they achieved performance comparable to expensive pro-tier models using cost-effective flash models, deployed via a Fast API service with LiteLLM as a proxy layer for model interchangeability, and implemented continuous validation through selective manual checks.
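
A sketch of a single counting call through LiteLLM's provider-agnostic interface, which is what makes swapping flash and pro models cheap; the prompt and model alias are assumptions:

```python
import base64

import litellm

with open("tote.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = litellm.completion(
    model="gemini/gemini-1.5-flash",  # swappable behind the proxy layer
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Count the products in this grocery tote. "
                     "Answer with a single integer."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```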

Automated LLM Evaluation and Quality Monitoring in Customer Support Analytics

Echo AI

Echo AI, leveraging Log10's platform, developed a system for analyzing customer support interactions at scale using LLMs. They faced the challenge of maintaining accuracy and trust while processing high volumes of customer conversations. The solution combined Echo AI's conversation analysis capabilities with Log10's automated feedback and evaluation system, resulting in a 20-point F1 score improvement in accuracy and the ability to automatically evaluate LLM outputs across various customer-specific use cases.

Automated Product Classification and Attribute Extraction Using Vision LLMs

Shopify

Shopify tackled the challenge of automatically understanding and categorizing millions of products across their platform by implementing a multi-step Vision LLM solution. The system extracts structured product information including categories and attributes from product images and descriptions, enabling better search, tax calculation, and recommendations. Through careful fine-tuning, evaluation, and cost optimization, they scaled the solution to handle tens of millions of predictions daily while maintaining high accuracy and managing hallucinations.

Automating Community Conference Operations with AI Coding Agents

PyCon

A volunteer-run conference organization (PyData/PyConDE) with events serving up to 1,500 attendees faced significant operational overhead in managing tickets, marketing, video production, and community engagement. Over a three-month period, the team experimented with various AI coding agents (Claude, Gemini, Qwen Coder Plus, Codex) to automate tasks including LinkedIn scraping for social media content, automated video cutting using computer vision, ticket management integration, and multi-step workflow automation. The results were mixed: while AI agents proved valuable for well-documented API integration, boilerplate code generation, and specific automation tasks like screenshot capture and video processing, they struggled with multi-step procedural workflows, data normalization, and maintaining code quality without close human oversight. The team concluded that AI agents work best when kept on a "short leash" with narrow use cases, frequent commits, and human validation, delivering time savings for generalist tasks but requiring careful expectation management and not delivering the "10x productivity" improvements often claimed.

Automating Search Engine Marketing Ad Generation with Multi-Stage LLM Pipeline

Thumbtack

Thumbtack faced significant challenges with their manual Search Engine Marketing (SEM) ad creation process, where 80% of ad assets were generic templates across all ad groups, leading to suboptimal performance and requiring extensive manual effort. They developed a multi-stage LLM-powered solution that automates the generation, review, and grouping of Google Responsive Search Ads (RSAs) headlines and descriptions, incorporating specific keywords and value propositions for each ad group. The implementation was rolled out in four phases, with initial proof-of-concept showing 20% increase in traffic and 10% increase in conversions, and the final phase demonstrating statistically significant improvements in click-through rates and conversion value using Google's Drafts and Experiments feature for robust measurement.

Automating Supplier Ticket Management with LLM Agents

Wayfair

Wayfair developed Wilma, an LLM-based ticket automation system, to automate the manual triage of supplier support tickets in their SupportHub JIRA-based system. The solution uses LangGraph to orchestrate LLM calls and tool interactions for intent classification, language detection, and supplier ID lookup through a ReAct agent with BigQuery access. The system achieved better-than-human performance with 93% accuracy on question type identification (vs. 75% human accuracy), 98% on language detection, and 88% on supplier ID identification, while reducing processing time and allowing associates to focus on higher-value work.
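
A hedged sketch of the orchestration pattern named above: a LangGraph ReAct agent with one BigQuery lookup tool. The table, query, and model are invented; Wayfair's actual tools and prompts are not shown in the summary:

```python
from google.cloud import bigquery
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

bq = bigquery.Client()

@tool
def lookup_supplier_id(supplier_name: str) -> str:
    """Return the supplier ID for a supplier name, if one exists."""
    job = bq.query(
        "SELECT supplier_id FROM `support.suppliers` "
        "WHERE LOWER(name) = LOWER(@name) LIMIT 1",
        job_config=bigquery.QueryJobConfig(query_parameters=[
            bigquery.ScalarQueryParameter("name", "STRING", supplier_name)]),
    )
    for row in job.result():
        return str(row[0])
    return "no supplier found"

# The ReAct loop decides when to call the tool while triaging a ticket.
agent = create_react_agent(ChatOpenAI(model="gpt-4o-mini"),
                           tools=[lookup_supplier_id])
result = agent.invoke({"messages": [
    ("user", "Which supplier ID belongs to 'Acme Home Goods'?")]})
print(result["messages"][-1].content)
```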

Autonomous Coding Agent Evolution: From Short-Burst to Extended Runtime Operations

Replit

Replit evolved their AI coding agent from V1 (running autonomously for only a couple of minutes) to V2 (running for 10-15 minutes of productive work) through significant rearchitecting and leveraging new frontier models. The company focuses on enabling non-technical users to build complete applications without writing code, emphasizing performance and cost optimization over latency while maintaining comprehensive observability through tools like Langsmith to manage the complexity of production AI agents at scale.

Autonomous SRE Agent for Cloud Infrastructure Monitoring Using FastMCP

FuzzyLabs

FuzzyLabs developed an autonomous Site Reliability Engineering (SRE) agent using Anthropic's Model Context Protocol (MCP) with FastMCP to automate the diagnosis of production incidents in cloud-native applications. The agent integrates with Kubernetes, GitHub, and Slack to automatically detect issues, analyze logs, identify root causes in source code, and post diagnostic summaries to development teams. While the proof-of-concept successfully demonstrated end-to-end incident response automation using a custom MCP client with optimizations like tool caching and filtering, the project raises important questions about effectiveness measurement, security boundaries, and cost optimization that require further research.
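
A minimal FastMCP server in the spirit of the agent's Kubernetes integration; the single tool below, which shells out to kubectl for pod logs, is an illustrative simplification:

```python
import subprocess

from fastmcp import FastMCP

mcp = FastMCP("sre-tools")

@mcp.tool()
def get_pod_logs(pod: str, namespace: str = "default", tail: int = 100) -> str:
    """Return the last `tail` log lines for a Kubernetes pod."""
    out = subprocess.run(
        ["kubectl", "logs", pod, "-n", namespace, f"--tail={tail}"],
        capture_output=True, text=True, timeout=30,
    )
    return out.stdout or out.stderr

if __name__ == "__main__":
    mcp.run()  # serves the tool over MCP (stdio transport by default)
```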

Background Coding Agents for Large-Scale Software Maintenance and Migrations

Spotify

Spotify faced challenges in scaling complex code transformations across thousands of repositories despite having a successful Fleet Management system that automated simple, repetitive maintenance tasks. The company integrated AI coding agents into their existing Fleet Management infrastructure, allowing engineers to define fleet-wide code changes using natural language prompts instead of writing complex transformation scripts. Since February 2025, this approach has generated over 1,500 merged pull requests handling complex tasks like language modernization, breaking-change upgrades, and UI component migrations, achieving 60-90% time savings compared to manual approaches while expanding the system's use to ad-hoc development tasks through IDE and chat integrations.

Benchmarking AI Agents for Software Bug Detection and Maintenance Tasks

Bismuth

Bismuth, a startup focused on software agents, developed SM-100, a comprehensive benchmark to evaluate AI agents' capabilities in software maintenance tasks, particularly bug detection and fixing. The benchmark revealed significant limitations in existing popular agents, with most achieving only 7% accuracy in finding complex bugs and exhibiting high false positive rates (90%+). While agents perform well on feature development benchmarks like SWE-bench, they struggle with real-world maintenance tasks that require deep system understanding, cross-file reasoning, and holistic code evaluation. Bismuth's own agent achieved better performance (10 out of 100 bugs found vs. 7 for the next best), demonstrating that targeted improvements in model architecture, prompting strategies, and navigation techniques can enhance bug detection capabilities in production software maintenance scenarios.

Blueprint for Scalable and Reliable Enterprise LLM Systems

Various

A panel discussion featuring leaders from Bank of America, NVIDIA, Microsoft, and IBM discussing best practices for deploying and scaling LLM systems in enterprise environments. The discussion covers key aspects of LLMOps including business alignment, production deployment, data management, monitoring, and responsible AI considerations. The panelists share insights on the evolution from traditional ML deployments to LLM systems, highlighting unique challenges around testing, governance, and the increasing importance of retrieval and agent-based architectures.

BM25 vs Vector Search for Large-Scale Code Repository Search

GitHub

GitHub faces the challenge of providing efficient search across 100+ billion documents while maintaining low latency and supporting diverse search use cases. They chose BM25 over vector search due to its computational efficiency, zero-shot capabilities, and ability to handle diverse query types. The solution involves careful optimization of search infrastructure, including strategic data routing and field-specific indexing approaches, resulting in a system that effectively serves GitHub's massive scale while keeping costs manageable.
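
Part of BM25's appeal at this scale is that scoring reduces to cheap term statistics with no embedding inference at query time. A toy illustration with the rank_bm25 package (the corpus is invented):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "def parse_config(path): ...",
    "class ConfigParser reads ini files",
    "HTTP client with retry and backoff",
]
# BM25 only needs tokenized documents; no model inference is involved.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "config parser".split()
print(bm25.get_scores(query))              # per-document BM25 scores
print(bm25.get_top_n(query, corpus, n=2))  # best-matching documents
```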

Building a Collaborative Multi-Agent AI Ecosystem for Enterprise Knowledge Access

DoorDash

DoorDash developed an internal agentic AI platform to address the challenge of fragmented knowledge spread across experimentation platforms, metrics hubs, dashboards, wikis, and team communications. The solution evolved from deterministic workflows through single agents to hierarchical deep agents and exploratory agent swarms, built on foundational capabilities including hybrid vector search with RRF-based re-ranking, schema-aware SQL generation with pre-cached examples, multi-stage zero-data query validation, and LLM-as-judge evaluation frameworks. The platform integrates with Slack and Cursor to meet users in their existing workflows, enabling business teams and developers to access complex data and insights without context-switching, democratizing data access across the organization while maintaining rigorous guardrails and provenance tracking.
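
Reciprocal Rank Fusion itself is compact enough to show inline: each ranked list contributes 1/(k + rank) per document, and documents are re-ranked by the sum. A sketch, with k = 60 as the conventional default rather than a DoorDash-confirmed value:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings into one by summed reciprocal-rank scores."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a keyword ranking with a vector-similarity ranking:
fused = rrf_fuse([
    ["doc_a", "doc_b", "doc_c"],  # BM25 / keyword results
    ["doc_c", "doc_a", "doc_d"],  # embedding-similarity results
])
print(fused)  # documents appearing high in both lists rise to the top
```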

Building a Comprehensive LLM Platform for Healthcare Applications

IncludedHealth

IncludedHealth built Wordsmith, a comprehensive platform for GenAI applications in healthcare, starting in early 2023. The platform includes a proxy service for multi-provider LLM access, model serving capabilities, training and evaluation libraries, and prompt engineering tools. This enabled multiple production applications including automated documentation, coverage checking, and clinical documentation, while maintaining security and compliance in a regulated healthcare environment.

Building a Customer Support AI Assistant: From PoC to Production

Elastic

Elastic's Field Engineering team developed a generative AI solution to improve customer support operations by automating case summaries and drafting initial replies. Starting with a proof of concept using Google Cloud's Vertex AI, they achieved a 15.67% positive response rate, leading them to identify the need for better input refinement and knowledge integration. This resulted in a decision to develop a unified chat interface with RAG architecture leveraging Elasticsearch for improved accuracy and response relevance.

Building a Global Product Catalogue with Multimodal LLMs at Scale

Shopify

Shopify addressed the challenge of fragmented product data across millions of merchants by building a Global Catalogue using multimodal LLMs to standardize and enrich billions of product listings. The system processes over 10 million product updates daily through a four-layer architecture involving product data foundation, understanding, matching, and reconciliation. By fine-tuning open-source vision language models and implementing selective field extraction, they achieve 40 million LLM inferences daily with 500ms median latency while reducing GPU usage by 40%. The solution enables improved search, recommendations, and conversational commerce experiences across Shopify's ecosystem.

Building a Hyper-Personalized Food Ordering Agent for E-commerce at Scale

iFood

iFood, Brazil's largest food delivery platform with 160 million monthly orders and 55 million users, built ISO, an AI agent designed to address the paradox of choice users face when ordering food. The agent uses hyper-personalization based on user behavior, interprets complex natural language intents, and autonomously takes actions like applying coupons, managing carts, and processing payments. Deployed on both the iFood app and WhatsApp, ISO handles millions of users while maintaining sub-10 second P95 latency through aggressive prompt optimization, context window management, and intelligent tool routing. The team achieved this by moving from a 30-second to a 10-second P95 latency through techniques including asynchronous processing, English-only prompts to avoid tokenization penalties, and deflating bloated system prompts by improving tool naming conventions.

Building a Knowledge as a Service Platform with LLMs and Developer Community Data

Stack Overflow

Stack Overflow addresses the challenges of LLM brain drain, answer quality, and trust by transforming their extensive developer Q&A platform into a Knowledge as a Service offering. They've developed API partnerships with major AI companies like Google, OpenAI, and GitHub, integrating their 40 billion tokens of curated technical content to improve LLM accuracy by up to 20%. Their approach combines AI capabilities with human expertise while maintaining social responsibility and proper attribution.

Building a Multi-Agent Healthcare Analytics Assistant with LLM-Powered Natural Language Queries

Komodo Health

Komodo Health, a company with a large database of anonymized American patient medical events, developed an AI assistant over two years to answer complex healthcare analytics queries through natural language. The system evolved from a simple chaining architecture with fine-tuned models to a sophisticated multi-agent system using a supervisor pattern, where an intelligent agent-based supervisor routes queries to either deterministic workflows or sub-agents as needed. The architecture prioritizes trust by ensuring raw database outputs are presented directly to users rather than LLM-generated content, with LLMs primarily handling natural language to structured query conversion and explanations. The production system balances autonomous AI capabilities with control, avoiding the cost and latency issues of pure agentic approaches while maintaining flexibility for unexpected user queries.

Building a Multi-Agent LLM Platform for Customer Service Automation

Deutsche Telekom

Deutsche Telekom developed a comprehensive multi-agent LLM platform to automate customer service across multiple European countries and channels. They built their own agent computing platform called LMOS to manage agent lifecycles, routing, and deployment, moving away from traditional chatbot approaches. The platform successfully handled over 1 million customer queries with an 89% acceptable answer rate and showed 38% better performance compared to vendor solutions in A/B testing.

Building a Multi-Agent Research System for Complex Information Tasks

Anthropic

Anthropic developed a production multi-agent system for their Claude Research feature that uses multiple specialized AI agents working in parallel to conduct complex research tasks across web and enterprise sources. The system employs an orchestrator-worker architecture where a lead agent coordinates and delegates to specialized subagents that operate simultaneously, achieving 90.2% performance improvement over single-agent systems on internal evaluations. The implementation required sophisticated prompt engineering, robust evaluation frameworks, and careful production engineering to handle the stateful, non-deterministic nature of multi-agent interactions at scale.

Building a Multi-Model AI Platform and Agent Marketplace

Quora

Quora built Poe as a unified platform providing consumer access to multiple large language models and AI agents through a single interface and subscription. Starting with experiments using GPT-3 for answer generation on Quora, the company recognized the paradigm shift toward chat-based AI interactions and developed Poe to serve as a "web browser for AI" - enabling users to access diverse models, create custom agents through prompting or server integrations, and monetize AI applications. The platform has achieved significant scale with creators earning millions annually while supporting various modalities including text, image, and voice models.

Building a Multi-Model LLM API Marketplace and Infrastructure Platform

OpenRouter

OpenRouter was founded in early 2023 to address the fragmented landscape of large language models by creating a unified API marketplace that aggregates over 400 models from 60+ providers. The company identified that the LLM inference market would not be winner-take-all, and built infrastructure to normalize different model APIs, provide intelligent routing, caching, and uptime guarantees. Their platform enables developers to switch between models with near-zero switching costs while providing better prices, uptime, and choice compared to using individual model providers directly.

Building a Multi-Model LLM Marketplace and Routing Platform

OpenRouter

OpenRouter was founded in 2023 to address the challenge of choosing between rapidly proliferating language models by creating a unified API marketplace that aggregates over 400 models from 60+ providers. The platform solves the problem of model selection, provider heterogeneity, and high switching costs by providing normalized access, intelligent routing, caching, and real-time performance monitoring. Results include 10-100% month-over-month growth, sub-30ms latency, improved uptime through provider aggregation, and evidence that the AI inference market is becoming multi-model rather than winner-take-all.
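
Because OpenRouter exposes an OpenAI-compatible endpoint, switching models is mostly a matter of changing the model string; the model ids below are examples from the public catalog and may change:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter API key
)

# Same client code, different providers: the normalization is the product.
for model in ["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "One-line summary of RAG?"}],
    )
    print(model, "->", resp.choices[0].message.content)
```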

Building a Multi-Provider GenAI Gateway for Enterprise-Scale LLM Access

Grab

Grab developed an AI Gateway to provide centralized, secure access to multiple GenAI providers (including OpenAI, Azure, AWS Bedrock, and Google VertexAI) for their internal developers. The gateway handles authentication, cost management, auditing, and rate limiting while providing a unified API interface. Since its launch in 2023, it has enabled over 300 unique use cases across the organization, from real-time audio analysis to content moderation, while maintaining security and cost efficiency through centralized management.

Building a Next-Generation AI-Powered Code Editor

Cursor

Cursor, founded by MIT graduates, developed an AI-powered code editor that goes beyond simple code completion to reimagine how developers interact with AI while coding. By focusing on innovative features like instructed edits and codebase indexing, along with developing custom models for specific tasks, they achieved rapid growth to $100M in revenue. Their success demonstrates how combining frontier LLMs with custom-trained models and careful UX design can transform developer productivity.

Building a Production RAG-based Customer Support Assistant with Elasticsearch

Elastic

Elastic's Field Engineering team developed a customer support chatbot using RAG instead of fine-tuning, leveraging Elasticsearch for document storage and retrieval. They created a knowledge library of over 300,000 documents from technical support articles, product documentation, and blogs, enriched with AI-generated summaries and embeddings using ELSER. The system uses hybrid search combining semantic and BM25 approaches to provide relevant context to the LLM, resulting in more accurate and trustworthy responses.
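
The hybrid query can be sketched as one bool query combining an ELSER text_expansion clause (semantic) with a BM25 match clause (keyword); the index, field names, and model id below are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def hybrid_search(question: str, size: int = 5):
    return es.search(
        index="support-knowledge",
        size=size,
        query={
            "bool": {
                "should": [
                    {   # semantic relevance via ELSER sparse embeddings
                        "text_expansion": {
                            "ml.tokens": {
                                "model_id": ".elser_model_2",
                                "model_text": question,
                            }
                        }
                    },
                    {   # classic BM25 keyword relevance
                        "match": {"body": question}
                    },
                ]
            }
        },
    )

hits = hybrid_search("How do I resize an Elasticsearch cluster?")
```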

Building a Production RAG-Based Slackbot for Developer Support

Vespa

Vespa developed an intelligent Slackbot to handle increasing support queries in their community Slack channel. The solution combines RAG (Retrieval-Augmented Generation) with Vespa's search capabilities and OpenAI, leveraging both past conversations and documentation. The bot features user consent management, feedback mechanisms, and automated user anonymization, while continuously learning from new interactions to improve response quality.

Building a Production-Ready AI Phone Call Assistant with Multi-Modal Processing

RealChar

RealChar is developing an AI assistant that can handle customer service phone calls on behalf of users, addressing the frustration of long wait times and tedious interactions. The system uses a complex architecture combining traditional ML and generative AI, running multiple models in parallel through an event bus system, with fallback mechanisms for reliability. The solution draws inspiration from self-driving car systems, implementing real-time processing of multiple input streams and maintaining millisecond-level observability.

Building a Scalable LLM Gateway for E-commerce Recommendations

Mercado Libre

Mercado Libre developed a centralized LLM gateway to handle large-scale generative AI deployments across their organization. The gateway manages multiple LLM providers, handles security, monitoring, and billing, while supporting 50,000+ employees. A key implementation was a product recommendation system that uses LLMs to generate personalized recommendations based on user interactions, supporting multiple languages across Latin America.

Building a Search Engine for AI Agents: Infrastructure, Product Development, and Production Deployment

Exa.ai

Exa.ai has built the first search engine specifically designed for AI agents rather than human users, addressing the fundamental problem that existing search engines like Google are optimized for consumer clicks and keyword-based queries rather than semantic understanding and agent workflows. The company trained its own models, built its own index, and invested heavily in compute infrastructure (including purchasing their own GPU cluster) to enable meaning-based search that returns raw, primary data sources rather than listicles or summaries. Their solution includes both an API for developers building AI applications and an agentic search tool called Websets that can find and enrich complex, multi-criteria queries. The results include serving hundreds of millions of queries across use cases like sales intelligence, recruiting, market research, and research paper discovery, with 95% inbound growth and expanding from 7 to 28+ employees within a year.

Building a Systematic SNAP Benefits LLM Evaluation Framework

Propel

Propel is developing a comprehensive evaluation framework for testing how well different LLMs handle SNAP (food stamps) benefit-related queries. The project aims to assess model accuracy, safety, and appropriateness in handling complex policy questions while balancing strict accuracy with practical user needs. They've built a testing infrastructure including a Slackbot called Hydra for comparing multiple LLM outputs, and plan to release their evaluation framework publicly to help improve AI models' performance on SNAP-related tasks.

Building AI-Assisted Development at Scale with Cursor

Cursor

This case study explores how Cursor's solutions team has observed enterprise companies successfully deploying AI-assisted coding in production environments. The problem addressed is helping developers leverage LLMs effectively for coding tasks while avoiding common pitfalls like context window bloat, over-reliance on AI, and hallucinations. The solution involves teaching developers to break down problems into appropriately-sized tasks, maintain clean context windows, use semantic search for brownfield codebases, and build deterministic harnesses around non-deterministic LLM outputs. Results include significant productivity gains when developers learn proper prompt engineering, context management, and maintain responsibility for AI-generated code, with specific improvements like bench scores jumping from 45% to 65% through harness optimization.

Building AI-Native Platforms: Agentic Systems, Infrastructure Evolution, and Production LLM Deployment

Delphi / Seam AI / APIsec

This panel discussion features three AI-native companies—Delphi (personal AI profiles), Seam AI (sales/marketing automation agents), and APIsec (API security testing)—discussing their journeys building production LLM systems over three years. The companies address infrastructure evolution from single-shot prompting to fully agentic systems, the shift toward serverless and scalable architectures, managing costs at scale (including burning through a trillion OpenAI tokens), balancing deterministic workflows with model autonomy, and measuring ROI through outcome-based metrics rather than traditional productivity gains. Key technical themes include moving away from opinionated architectures to let models reason autonomously, implementing state machines for high-confidence decisions, using tools like Pydantic AI and Logfire for instrumentation, and leveraging Pinecone for vector search at scale.

Building Alfred: Production-Ready Agentic Orchestration Layer for E-commerce

Loblaws

Loblaws Digital, the technology arm of one of Canada's largest retail companies, developed Alfred—a production-ready orchestration layer for running agentic AI workflows across their e-commerce, pharmacy, and loyalty platforms. The system addresses the challenge of moving agent prototypes into production at enterprise scale by providing a reusable template-based architecture built on LangGraph, FastAPI, and Google Cloud Platform components. Alfred enables teams across the organization to quickly deploy conversational commerce applications and agentic workflows (such as recipe-based shopping) while handling critical enterprise requirements including security, privacy, PII masking, observability, and integration with 50+ platform APIs through their Model Context Protocol (MCP) ecosystem.

Building an Agentic DevOps Copilot for Infrastructure Automation

Qovery

Qovery developed an agentic DevOps copilot to automate infrastructure tasks and eliminate repetitive DevOps work. The solution evolved through four phases: from basic intent-to-tool mapping, to a dynamic agentic system that plans tool sequences, then adding resilience and recovery mechanisms, and finally incorporating conversation memory. The copilot now handles complex multi-step workflows like deployments, infrastructure optimization, and configuration management, currently using Claude Sonnet 3.7 with plans for self-hosted models and improved performance.

Building an Agentic Enterprise with AI Agents in Production

Salesforce

Salesforce transformed itself into what it calls an "agentic enterprise" by deploying AI agents (branded as Agentforce) across sales, service, and marketing operations to address capacity constraints where demand exceeded headcount. The company deployed agents that autonomously handled over 2 million customer service conversations, followed up with previously untouched leads (75% of total leads), and provided 24/7 multilingual support. Key results included over $100 million in annualized cost savings from the service agent implementation, increased lead engagement leading to new revenue opportunities, and the ability to scale operations without proportional headcount increases. The initiative required significant iteration, data unification through their Data 360 platform, continuous testing and tuning of agent performance, cross-functional collaboration breaking down traditional departmental silos, and process redesigns to enable human-AI collaboration.

Building an AI API Gateway for Streamlined GenAI Service Development

DeliveryHero

DeliveryHero's Woowa Brothers division developed an AI API Gateway to address the challenges of managing multiple GenAI providers and streamlining development processes. The gateway serves as a central infrastructure component to handle credential management, prompt management, and system stability while supporting various GenAI services like AWS Bedrock, Azure OpenAI, and GCP Imagen. The initiative was driven by extensive user interviews and aims to democratize AI usage across the organization while maintaining security and efficiency.

Building an AI Co-pilot for Product Strategy with LLM Integration Patterns

Thoughtworks

Thoughtworks built Boba, an experimental AI co-pilot for product strategy and ideation, to explore effective patterns for LLM-powered applications beyond simple chat interfaces. The team developed and documented key patterns including templated prompts, structured responses, real-time progress streaming, context management, and external knowledge integration. The case study provides detailed implementation insights for building sophisticated LLM applications with better user experiences.

Building an AI Legal Assistant: From Early Testing to Production Deployment

Casetext

Casetext transformed their legal research platform into an AI-powered legal assistant called CoCounsel using GPT-4, leading to a $650M acquisition by Thomson Reuters. The company shifted their entire 120-person team to focus on building this AI assistant after early access to GPT-4 showed promising results. Through rigorous testing, prompt engineering, and a test-driven development approach, they created a reliable AI system that could perform complex legal tasks like document review and research that previously took lawyers days to complete. The product achieved rapid market acceptance and true product-market fit within months of launch.

Building an AI-Generated Movie Quiz Game with RAG and Real-Time Multiplayer

Datastax

Datastax developed UnReel, a multiplayer movie trivia game that combines AI-generated questions with real-time gaming. The system uses RAG to generate movie-related questions and fake movie quotes, implemented through Langflow, with data storage in Astra DB and real-time multiplayer functionality via PartyKit. The project demonstrates practical challenges in production AI deployment, particularly in fine-tuning LLM outputs for believable content generation and managing distributed system state.

Building an AI-Native Code Editor in a Competitive Market

Cursor

Cursor, an AI-powered code editor startup, entered an extremely competitive market dominated by Microsoft's GitHub Copilot and well-funded competitors like Poolside, Augment, and Magic.dev. Despite initial skepticism from advisors about competing against Microsoft's vast resources and distribution, Cursor succeeded by focusing on the right short-term product decisions—specifically deep IDE integration through forking VS Code and delivering immediate value through "Cursor Tab" code completion. The company differentiated itself through rapid iteration, concentrated talent, bottom-up adoption among developers, and eventually building their own fast agent models. Cursor demonstrated that startups can compete against tech giants by moving quickly, dog-fooding their own product, and correctly identifying what developers need in the near term rather than betting solely on long-term agent capabilities.

Building an AI-Powered Browser Extension for Product Documentation with RAG and Chain-of-Thought

Reforge

Reforge developed a browser extension to help product professionals draft and improve documents like PRDs by integrating expert knowledge directly into their workflow. The team evolved from simple RAG (retrieval-augmented generation) to a sophisticated Chain-of-Thought approach that classifies document types, generates tailored suggestions, and filters content based on context. Operating with a lean team of 2-3 people, they built the extension through rapid prototyping and iterative development, integrating into popular tools like Google Docs, Notion, and Confluence. The extension uses OpenAI models with Pinecone for vector storage, emphasizing privacy by not storing user data, and leverages innovative testing approaches like analyzing course recommendation distributions and reference counts to optimize model performance without accessing user content.

Building an AI-Powered Help Desk with RAG and Model Evaluation

Vimeo

Vimeo developed a prototype AI help desk chat system that leverages RAG (Retrieval Augmented Generation) to provide accurate customer support responses using their existing Zendesk help center content. The system uses vector embeddings to store and retrieve relevant help articles, integrates with various LLM providers through Langchain, and includes comprehensive testing of different models (Google Vertex AI Chat Bison, GPT-3.5, GPT-4) for performance and cost optimization. The prototype demonstrates successful integration of modern LLMOps practices including prompt engineering, model evaluation, and production-ready architecture considerations.

Building an Enterprise AI Productivity Platform: From Slack Bot to Integrated AI Workforce

Toqan

Prosus developed Toqan, an internal AI productivity platform that evolved from a simple Slack bot to a comprehensive enterprise AI system serving 30,000+ employees across 100+ portfolio companies. The platform addresses the challenge of enterprise AI adoption by providing access to multiple LLMs through conversational interfaces, APIs, and system integrations, while measuring success through user engagement metrics like daily active users and "super users" who ask 5+ questions per day. The solution demonstrates how large organizations can systematically deploy AI tools across diverse business functions while maintaining security and enabling bottom-up adoption through hands-on training and cultural change management.

Building an Internal AI-Powered Customer Reference Discovery Platform

Databricks

Databricks faced a significant challenge in helping sales and marketing teams discover and utilize their vast collection of over 2,400 customer stories scattered across multiple platforms including YouTube, LinkedIn, internal documents, and their website. The tribal knowledge problem meant that finding the right customer reference at the right time was difficult, leading to overused references, missed opportunities, and inefficient manual searching. To solve this, they built Reffy—a full-stack agentic application using RAG (Retrieval-Augmented Generation), Vector Search, AI Functions, and Lakebase on the Databricks platform. Since its launch in December 2025, over 1,800 employees have executed more than 7,500 queries, resulting in faster campaign execution, more relevant storytelling, and democratized access to customer proof points that were previously siloed in tribal knowledge.

Building an Internal ChatGPT for Enterprise: From Failed Support Bot to Company-Wide AI Tool

Grab

Grab's ML Platform team was overwhelmed with support inquiries in Slack channels, prompting an engineer to experiment with building an LLM-powered chatbot for platform documentation. After the initial attempt failed due to token limitations and poor embedding search results, the project pivoted to creating GrabGPT—an internal ChatGPT-like tool for all employees. Deployed over a weekend with Google authentication and leveraging Grab's existing model-serving infrastructure (Catwalk), GrabGPT rapidly grew from 300 users on day one to becoming nearly universally adopted across the company, with over 3,000 users and 600 daily active users within three months. The success was attributed to data security controls, global accessibility (especially in regions where ChatGPT is blocked), model-agnostic architecture supporting multiple LLM providers, and full auditability for governance.

Building an Internal ChatGPT-like Tool for Enterprise-wide AI Access

Grab

Grab's ML Platform team faced overwhelming support channel inquiries that consumed engineering time with repetitive questions. An engineer initially attempted to build a RAG-based chatbot for platform documentation but encountered context window limitations with GPT-3.5-turbo and scalability issues. Pivoting from this failed experiment, the engineer built GrabGPT, an internal ChatGPT-like tool accessible to all employees, deployed over a weekend using existing frameworks and Grab's model-serving platform. The tool rapidly scaled to nearly company-wide adoption, with over 3,000 users within three months and 600 daily active users, providing secure, auditable, and globally accessible LLM capabilities across multiple providers' models, including OpenAI's GPT models, Anthropic's Claude, and Google's Gemini.

Building and Automating Comprehensive LLM Evaluation Framework for SNAP Benefits

Propel

Propel developed a sophisticated evaluation framework for testing and benchmarking LLM performance in handling SNAP (food stamp) benefit inquiries. The company created two distinct evaluation approaches: one for benchmarking current base models on SNAP topics, and another for product development. They implemented automated testing using Promptfoo and developed innovative ways to evaluate model responses, including using AI models as judges for assessing response quality and accessibility.

Building and Debugging Web Automation Agents with LangChain Ecosystem

Airtop

Airtop developed a web automation platform that enables AI agents to interact with websites through natural language commands. They leveraged the LangChain ecosystem (LangChain, LangSmith, and LangGraph) to build flexible agent architectures, integrate multiple LLM models, and implement robust debugging and testing processes. The platform successfully enables structured information extraction and real-time website interactions while maintaining reliability and scalability.

Building and Deploying Enterprise-Grade LLMs: Lessons from Mistral

Mistral

Mistral, a European AI company, evolved from developing academic LLMs to building and deploying enterprise-grade language models. They started with the successful launch of Mistral-7B in September 2023, which became one of the top 10 most downloaded models on Hugging Face. The company focuses not just on model development but on providing comprehensive solutions for enterprise deployment, including custom fine-tuning, on-premise deployment infrastructure, and efficient inference optimization. Their approach demonstrates the challenges and solutions in bringing LLMs from research to production at scale.

Building and Evaluating a Financial Earnings Call Summarization System

Aiera

Aiera, an investor intelligence platform, developed a system for automated summarization of earnings call transcripts. They created a custom dataset from their extensive collection of earnings call transcriptions, using Claude 3 Opus to extract targeted insights. The project involved comparing different evaluation metrics including ROUGE and BERTScore, ultimately finding Claude 3.5 Sonnet performed best for their specific use case. Their evaluation process revealed important insights about the trade-offs between different scoring methodologies and the challenges of evaluating generative AI outputs in production.

Building and Evaluating Production AI Agents: From Function Calling to Complex Multi-Agent Systems

Google Deepmind

This case study explores the evolution of LLM-based systems in production through discussions with Ravin Kumar from Google DeepMind about building products like NotebookLM and Project Mariner and about working with the Gemini and Gemma model families. The conversation covers the rapid progression from simple function calling to complex agentic systems capable of multi-step reasoning, the critical importance of evaluation harnesses as competitive advantages, and practical considerations around context engineering, tool orchestration, and model selection. Key insights include how model improvements are causing teams to repeatedly rebuild agent architectures, the importance of shipping products quickly to learn from real users, and strategies for evaluating increasingly complex multi-modal agentic systems across different scales from edge devices to cloud-based deployments.

Building and Managing Taxonomies for Effective AI Systems

Adobe

Adobe's Information Architect Jessica Talisman discusses how to build and maintain taxonomies for AI and search systems. The case study explores the challenges and best practices in creating taxonomies that bridge the gap between human understanding and machine processing, covering everything from metadata extraction to ontology development. The approach emphasizes the importance of human curation in AI systems and demonstrates how well-structured taxonomies can significantly improve search relevance, content categorization, and business operations.

Building and Scaling Conversational Voice AI Agents for Enterprise Go-to-Market

Thoughtly / Gladia

Thoughtly, a voice AI platform founded in late 2023, provides conversational AI agents for enterprise sales and customer support operations. The company orchestrates speech-to-text, large language models, and text-to-speech systems to handle millions of voice calls with sub-second latency requirements. By optimizing every layer of their stack—from telephony providers to LLM inference—and implementing sophisticated caching, conditional navigation, and evaluation frameworks, Thoughtly delivers 3x conversion rates over traditional methods and 15x ROI for customers. The platform serves enterprises with HIPAA and SOC 2 compliance while handling both inbound customer support and outbound lead activation at massive scale across multiple languages and regions.

Building and Scaling Internal Data Agents and AI-Powered Frontend Development Tools

Vercel

Vercel developed two significant production AI applications: DZ, an internal text-to-SQL data agent that enables employees to query Snowflake using natural language in Slack, and V0, a public-facing AI tool for generating full-stack web applications. The company initially built DZ as a traditional tool-based agent but completely rebuilt it as a coding-style agent with simplified architecture (just two tools: bash and SQL execution), dramatically improving performance by leveraging models' native coding capabilities. V0 evolved from a 2023 prototype targeting frontend engineers into a comprehensive full-stack development tool as models improved, finding strong product-market fit with tech-adjacent users and enabling significant internal productivity gains. Both products demonstrate Vercel's philosophy that building custom agents is straightforward and preferable to buying off-the-shelf solutions, with the company successfully deploying these AI systems at scale while maintaining reliability and supporting their core infrastructure business.

Building and Testing a Production LLM-Powered Quiz Application

Google

A case study of transforming a traditional trivia quiz application into an LLM-powered system using Google's Vertex AI platform. The team evolved from using static quiz data to leveraging PaLM and later Gemini models for dynamic quiz generation, addressing challenges in prompt engineering, validation, and testing. They achieved significant improvements in quiz accuracy from 70% with Gemini Pro to 91% with Gemini Ultra, while implementing robust validation methods using LLMs themselves to evaluate quiz quality.

Building Claude Code: Scaling AI-Powered Development from Terminal Prototype to Production

Anthropic

Anthropic's Boris Cherny, creator of Claude Code, describes the journey from an accidental terminal prototype in September 2024 to a production coding tool used by 70% of startups and responsible for 4% of all public commits globally. Starting as a simple API testing tool, Claude Code evolved through continuous user feedback and rapid iteration, with the entire codebase rewritten every few months to adapt to improving model capabilities. The tool achieved remarkable productivity gains at Anthropic itself, with engineers seeing 70% productivity increases per capita despite the team doubling in size, and total productivity improvements of 150% since launch. The development philosophy centered on building for future model capabilities rather than current ones, anticipating improvements 6 months ahead, and minimizing scaffolding that would become obsolete with each new model release.

Building Deep Research: A Production AI Research Assistant Agent

Google Deepmind

Google Deepmind developed Deep Research, a feature that acts as an AI research assistant using Gemini to help users learn about any topic in depth. The system takes a query, browses the web for about 5 minutes, and outputs a comprehensive research report that users can review and ask follow-up questions about. The system uses iterative planning, transparent research processes, and a sophisticated orchestration backend to manage long-running autonomous research tasks.

Building Economic Infrastructure for AI with Foundation Models and Agentic Commerce

Stripe

Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.

Building Enterprise-Grade GenAI Platform with Multi-Cloud Architecture

Coinbase

Coinbase developed CB-GPT, an enterprise GenAI platform, to address the challenges of deploying LLMs at scale across their organization. Initially focused on optimizing cost versus accuracy, they discovered that enterprise-grade LLM deployment requires solving for latency, availability, trust and safety, and adaptability to the rapidly evolving LLM landscape. Their solution was a multi-cloud, multi-LLM platform that provides unified access to models across AWS Bedrock, GCP VertexAI, and Azure, with built-in RAG capabilities, guardrails, semantic caching, and both API and no-code interfaces. The platform now serves dozens of internal use cases and powers customer-facing applications including a conversational chatbot launched in June 2024 serving all US consumers.

Building Enterprise-Ready AI Development Infrastructure from Day One

Windsurf

Codeium's journey in building their AI-powered development tools showcases how investing early in enterprise-ready infrastructure, including containerization, security, and comprehensive deployment options, enabled them to scale from individual developers to large enterprise customers. Their "go slow to go fast" approach in building proprietary infrastructure for code completion, retrieval, and agent-based development culminated in Windsurf IDE, demonstrating how thoughtful early architectural decisions can create a more robust foundation for AI tools in production.

Building Evaluation Frameworks for AI Product Managers: A Workshop on Production LLM Testing

Arize

This workshop, presented by Aman, an AI product manager at Arize, addresses the challenge of shipping reliable AI applications in production by establishing evaluation frameworks specifically designed for product managers. The problem identified is that LLMs inherently hallucinate and are non-deterministic, making traditional software testing approaches insufficient. The solution involves implementing "LLM as a judge" evaluation systems, building comprehensive datasets, running experiments with prompt variations, and establishing human-in-the-loop validation workflows. The approach demonstrates how product managers can move from "vibe coding" to "thrive coding" by using data-driven evaluation methods, prompt playgrounds, and continuous monitoring. Results show that systematic evaluation can catch issues like mismatched tone, missing features, and hallucinations before production deployment, though the workshop candidly acknowledges that evaluations themselves require validation and iteration.
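
As a rough illustration of the "LLM as a judge" pattern the workshop centers on, the sketch below grades an answer with a second model call. The judge model, rubric wording, and pass/fail protocol are assumptions for illustration, not Arize's implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Does the answer address the question without inventing facts?
Reply with exactly one word: pass or fail."""

def judge(question: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("pass")
```

As the workshop notes, a judge like this must itself be validated against human-labeled examples before it gates anything.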

Building Foundation Models for Computer Use Agents

Tzafon

Tzafon, a research lab focused on training foundation models for computer use agents, tackled the challenge of enabling LLMs to autonomously interact with computers through visual understanding and action execution. The company identified fundamental limitations in existing models' ability to ground visual information and coordinate actions, leading them to develop custom infrastructure (Waypoint) for data generation at scale, fine-tune vision encoders on screenshot data, and ultimately pre-train models from scratch with specialized computer interaction capabilities. While initial approaches using supervised fine-tuning and reinforcement learning on successful trajectories showed limited generalization, their focus on solving the grounding problem through improved vision-language integration and domain-specific pre-training has positioned them to release models and desktop applications for autonomous computer use, though performance on benchmarks like OS World remains a challenge across the industry.

Building Gemini Deep Research: An Agentic Research Assistant with Custom-Tuned Models

Google Deepmind

Google DeepMind developed Gemini Deep Research, an AI-powered research assistant that autonomously browses the web for 5-10 minutes to generate comprehensive research reports with citations. The product addresses the challenge of users wanting to go from "zero to 50" on new topics quickly, automating what would typically require opening dozens of browser tabs and hours of manual research. The team solved key technical challenges around agentic planning, transparent UX design with editable research plans, asynchronous orchestration, and post-training custom models (initially Gemini 1.5 Pro, moving toward 2.0 Flash) to reliably perform iterative web search and synthesis. The product launched in December 2024 and has been widely praised as potentially the most useful public-facing AI agent to date, with users reporting it can compress hours or days of research work into minutes.

Building Internal LLM Tools with Security and Privacy Focus

Wealthsimple

Wealthsimple developed an internal LLM Gateway and suite of generative AI tools to enable secure and privacy-preserving use of LLMs across their organization. The gateway includes features like PII redaction, multi-model support, and conversation checkpointing. They achieved significant adoption with over 50% of employees using the tools, primarily for programming support, content generation, and information retrieval. The platform also enabled operational improvements like automated customer support ticket triaging using self-hosted models.

Building ISO: A Hyperpersonalized AI Food Ordering Agent for Millions of Users

iFood

iFood, Brazil's largest food delivery company, built ISO, an AI-powered food ordering agent to address the decision paralysis users face when choosing what to eat from overwhelming options. The agent operates both within the iFood app and on WhatsApp, providing hyperpersonalized recommendations based on user behavior, handling complex intents beyond simple search, and autonomously taking actions like applying coupons, managing carts, and facilitating payments. Through careful context management, latency optimization (reducing P95 from 30 to 10 seconds), and sophisticated evaluation frameworks, the team deployed ISO to millions of users in Brazil, demonstrating significant improvements in user experience through proactive engagement and intelligent personalization.

Building Low-Latency Voice AI Agents for Home Services

Elyos AI

Elyos AI built end-to-end voice AI agents for home services companies (plumbers, electricians, HVAC installers) to handle customer calls, emails, and messages 24/7. The company faced challenges achieving human-like conversation latency (targeting sub-400ms response times) while maintaining reliability and accuracy for complex workflows including appointment booking, payment processing, and emergency dispatch. Through careful orchestration, they optimized speech-to-text, LLM, and text-to-speech components, implemented just-in-time context engineering, state machine-based workflows, and parallel monitoring streams to achieve consistent performance with approximately 85% call automation (15% requiring human involvement).

Building Modular and Scalable RAG Systems with Hybrid Batch/Incremental Processing

Bell

Bell developed a sophisticated hybrid RAG (Retrieval Augmented Generation) system combining batch and incremental processing to handle both static and dynamic knowledge bases. The solution addresses challenges in managing constantly changing documentation while maintaining system performance. They created a modular architecture using Apache Beam, Cloud Composer (Airflow), and GCP services, allowing for both scheduled batch updates and real-time document processing. The system has been successfully deployed for multiple use cases including HR policy queries and dynamic Confluence documentation management.
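
For readers unfamiliar with the batch half of such a hybrid pipeline, here is a minimal Apache Beam sketch of scheduled re-embedding of a static corpus; the bucket path, chunker, and embedding stub are placeholders rather than Bell's actual code.

```python
import apache_beam as beam

def chunk(doc: str) -> list[str]:
    # placeholder chunker: fixed-size character windows
    return [doc[i:i + 1000] for i in range(0, len(doc), 1000)]

def embed(text: str) -> tuple[str, list[float]]:
    # placeholder embedding call; swap in a real model client
    return (text, [0.0] * 768)

with beam.Pipeline() as p:
    (p
     | "ReadDocs" >> beam.io.ReadFromText("gs://kb-bucket/docs/*.txt")  # assumed path
     | "Chunk" >> beam.FlatMap(chunk)
     | "Embed" >> beam.Map(embed)
     | "Sink" >> beam.Map(print))  # stand-in for a vector-store writer
```

The incremental path would reuse the same transforms but trigger per document event (e.g., a Confluence page update) instead of per scheduled run.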

Building Multi-Agent Systems with MCP and Pydantic AI for Document Processing

Deepsense

Deepsense AI built a multi-agent system for a customer who operates a document processing platform that handles various file types and data sources at scale. The problem was to create both an MCP (Model Context Protocol) server for the platform's internal capabilities and a demonstration multi-agent system that could structure data on demand from documents. Using Pydantic AI as the core agent framework and Anthropic's Claude models, the team developed a solution where users specify goals for document processing, and the system automatically extracts structured information into tables. The implementation involved creating custom MCP servers, integrating with Databricks MCP, and applying 10 key lessons learned around tool design, token optimization, model selection, observability, testing, and security. The result was a modular, scalable system that demonstrates practical patterns for building production-ready agentic applications.

Building Open-Source RL Environments from Real-World Coding Tasks for Model Training

Cline

Cline's head of AI presents their experience operating a model-agnostic AI coding agent platform, arguing that the industry has over-invested in "clever scaffolding" like RAG and tool-calling frameworks when frontier models can succeed with simpler approaches. The real bottleneck to progress, they contend, isn't prompt engineering or agent architecture but rather the quality of benchmarks and RL environments used to train models. Cline developed an automated "RL environments factory" system that transforms real-world coding tasks captured from actual user interactions into standardized, containerized training environments. They announce Cline Bench, an open-source benchmark derived from genuine software development work, inviting the community to contribute by simply working on open-source projects with Cline and opting into the initiative, thereby creating a shared substrate for improving frontier models.

Building Personalized Financial and Gardening Experiences with LLMs

Bud Financial / Scotts Miracle-Gro

This case study explores how Bud Financial and Scotts Miracle-Gro leverage Google Cloud's AI capabilities to create personalized customer experiences. Bud Financial developed a conversational AI solution for personalized banking interactions, while Scotts Miracle-Gro implemented an AI assistant called MyScotty for gardening advice and product recommendations. Both companies utilize various Google Cloud services including Vertex AI, GKE, and AI Search to deliver contextual, regulated, and accurate responses to their customers.

Building Production AI Agents for E-commerce and Food Delivery at Scale

Prosus

This case study explores how Prosus builds and deploys AI agents across e-commerce and food delivery businesses serving two billion customers globally. The discussion covers critical lessons learned from deploying conversational agents in production, with a particular focus on context engineering as the most important factor for success—more so than model selection or prompt engineering alone. The team found that successful production deployments require hybrid approaches combining semantic and keyword search, generative UI experiences that mix chat with dynamic visual components, and sophisticated evaluation frameworks. They emphasize that technology has advanced faster than user adoption, leading to failures when pure chatbot interfaces were tested, and success only came through careful UI/UX design, contextual interventions, and extensive testing with both synthetic and real user data.

Building Production AI Agents for Enterprise HR, IT, and Finance Platform

Rippling

Rippling, an enterprise platform providing HR, payroll, IT, and finance solutions, has evolved its AI strategy from simple content summarization to building complex production agents that assist administrators and employees across their entire platform. Led by Anker, their head of AI, the company has developed agents that handle payroll troubleshooting, sales briefing automation, interview transcript summarization, and talent performance calibration. They've transitioned from deterministic workflow-based approaches to more flexible deep agent paradigms, leveraging LangChain and LangSmith for development and tracing. The company maintains a dual focus: embedding AI capabilities within their product for customers running businesses on their platform, and deploying AI internally to increase productivity across all teams. Early results show promise in handling complex, context-dependent queries that traditional rule-based systems couldn't address.

Building Production AI Agents with Advanced Testing, Voice Architecture, and Multi-Model Orchestration

Sierra

Sierra, an AI agent platform company, discusses their comprehensive approach to deploying LLMs in production for customer service automation across voice and chat channels. The company addresses fundamental challenges in productionizing AI agents including non-deterministic behavior, latency requirements, and quality assurance through novel solutions like simulation-based testing that runs thousands of parallel test scenarios, speculative execution for voice latency optimization, and constellation-based multi-model orchestration where 10-20 different models handle various aspects of each conversation. Their outcome-based pricing model aligns incentives with customer success, while their hybrid no-code/code platform enables both business and technical teams to collaboratively build, test, and deploy agents. The platform serves large enterprise customers across multiple industries, with agents handling millions of customer interactions in production environments.

Building Production AI Coding Assistants and Agents at Scale

Sourcegraph

Sourcegraph's CTO discusses the evolution from their code search engine to building Cody, an enterprise AI coding assistant, and AMP, a coding agent released in 2024. The company serves hundreds of Fortune 500 companies and government agencies, deploying LLM-powered tools that achieve 30-60% developer productivity gains. Their approach emphasizes multi-model architectures, rapid iteration without traditional code review processes, and building application scaffolds around frontier models to generate training data for next-generation systems. The discussion explores the transition from chat-based LLM applications (requiring sophisticated RAG systems) to agentic architectures (using simple tool-calling loops), the challenges of scaling in enterprise environments, and philosophical debates about whether pure model scaling will lead to AGI or whether alternating between application development and model training is necessary for continued progress.

Building Production AI Products: A Framework for Continuous Calibration and Development

OpenAI / Various

AI practitioners Aishwarya Raanti and Kiti Bottom, who have collectively supported over 50 AI product deployments across major tech companies and enterprises, present their framework for successfully building AI products in production. They identify that building AI products differs fundamentally from traditional software due to non-determinism on both input and output sides, and the agency-control tradeoff inherent in autonomous systems. Their solution involves a phased approach called Continuous Calibration Continuous Development (CCCD), which recommends starting with high human control and low AI agency, then gradually increasing autonomy as trust is built through behavior calibration. This iterative methodology, combined with a balanced approach to evaluation metrics and production monitoring, has helped companies avoid common pitfalls like premature full automation, inadequate reliability, and user trust erosion.

Building Production LLM Applications with DSPy Framework

AlixPartners

A technical consultant presents a comprehensive workshop on using DSPy, a declarative framework for building modular LLM-powered applications in production. The presenter demonstrates how DSPy enables rapid iteration on LLM applications by treating LLMs as first-class citizens in Python programs, with built-in support for structured outputs, type guarantees, tool calling, and automatic prompt optimization. Through multiple real-world use cases including document classification, contract analysis, time entry correction, and multi-modal processing, the workshop shows how DSPy's core primitives—signatures, modules, tools, adapters, optimizers, and metrics—allow teams to build production-ready systems that are transferable across models, optimizable without fine-tuning, and maintainable at scale.
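
To make the signature/module primitives concrete, here is a short DSPy sketch of a document-classification program in the spirit of the workshop's first use case; the field names and model string are assumptions.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # assumed model

class ClassifyDoc(dspy.Signature):
    """Classify a document into one of the given categories."""
    document: str = dspy.InputField()
    categories: list[str] = dspy.InputField()
    label: str = dspy.OutputField(desc="exactly one of the categories")

# a module compiles the signature into prompts with structured outputs
classify = dspy.Predict(ClassifyDoc)
result = classify(document="Invoice #4021, net 30 days ...",
                  categories=["invoice", "contract", "memo"])
print(result.label)
```

Because the program is declared rather than hand-prompted, swapping models or attaching an optimizer leaves this code unchanged, which is the transferability argument the workshop makes.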

Building Production Multi-Agent Research Systems with Claude

Anthropic

Anthropic developed a production-grade multi-agent research system for their Claude Research feature that uses multiple LLM agents working in parallel to explore complex topics across web, Google Workspace, and integrated data sources. The system employs an orchestrator-worker pattern where a lead agent coordinates specialized subagents that search and filter information simultaneously, addressing challenges in agent coordination, evaluation, and reliability. Internal evaluations showed the multi-agent approach with Claude Opus 4 and Sonnet 4 outperformed single-agent Claude Opus 4 by 90.2% on research tasks, with token usage explaining 80% of performance variance, though the architecture consumes approximately 15× more tokens than standard chat interactions, requiring careful consideration of economic viability and deployment strategies.
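
The orchestrator-worker pattern can be sketched in a few lines of asyncio: a lead agent fans subtasks out to parallel subagents and synthesizes their findings. The subtask decomposition and placeholder model call below are illustrative, not Anthropic's code.

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    # placeholder: a real subagent would call Claude with a focused
    # search-and-filter prompt and its own tools
    await asyncio.sleep(0.1)
    return f"findings for: {subtask}"

async def lead_agent(topic: str) -> str:
    # the lead agent decomposes the topic and fans workers out in parallel
    subtasks = [f"{topic}: background", f"{topic}: recent work", f"{topic}: critiques"]
    findings = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    return "\n".join(findings)  # a real lead agent would synthesize, not join

print(asyncio.run(lead_agent("agent evaluation")))
```

The token-economics caveat in the case study follows directly from this shape: every parallel worker carries its own context, so cost scales with fan-out.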

Building Production-Grade Agentic AI Analytics: Lessons from Real-World Deployment

Tellius

Tellius shares hard-won lessons from building their agentic analytics platform that transforms natural language questions into trustworthy SQL-based insights. The core problem addressed is that chat-based analytics requires far more than simple text-to-SQL conversion—it demands deterministic planning, governed semantic layers, ambiguity management, multi-step consistency, transparency, performance engineering, and comprehensive observability. Their solution architecture separates language understanding from execution through typed plan artifacts that validate against schemas and policies before execution, implements clarification workflows for ambiguous queries, maintains plan/result fingerprinting for consistency, provides inline transparency with preambles and lineage, enforces latency budgets across execution hops, and treats feedback as governed policy changes. The result is a production system that achieves determinism, explainability, and sub-second interactive performance while avoiding the common pitfalls that cause 95% of AI pilot failures.

Building Production-Grade AI Agents with Guardrails, Context Management, and Security

Portia / Riff / Okta

This panel discussion features founders from Portia AI and Riff (formerly Databutton) discussing the challenges of moving AI agents from proof-of-concept to production. The speakers address critical production concerns including guardrails for agent reliability, context engineering strategies, security and access control challenges, human-in-the-loop patterns, and identity management. They share real-world customer examples ranging from custom furniture makers to enterprise CRM enrichment, emphasizing that while approximately 40% of companies experimenting with AI have agents in production, the journey requires careful attention to trust, security, and supportability. Key solutions include conditional example-based prompting, sandboxed execution environments, role-based access controls, and keeping context windows smaller for better precision rather than utilizing maximum context lengths.

Building Production-Ready AI Agent Systems: Multi-Agent Orchestration and LLMOps at Scale

Galileo / Crew AI

This podcast discussion between Galileo and Crew AI leadership explores the challenges and solutions for deploying AI agents in production environments at enterprise scale. The conversation covers the technical complexities of multi-agent systems, the need for robust evaluation and observability frameworks, and the emergence of new LLMOps practices specifically designed for non-deterministic agent workflows. Key topics include authentication protocols, custom evaluation metrics, governance frameworks for regulated industries, and the democratization of agent development through no-code platforms.

Building Production-Ready AI Agents for Internal Workflow Automation

Vercel

Vercel, a web hosting and deployment platform, addressed the challenge of identifying and implementing successful AI agent projects across their organization by focusing on employee pain points—specifically repetitive, boring tasks that humans disliked. The company deployed three internal production agents: a lead processing agent that automated sales qualification and research (saving hundreds of days of manual work), an anti-abuse agent that accelerated content moderation decisions by 59%, and a data analyst agent that automated SQL query generation for business intelligence. Their methodology centered on asking employees "What do you hate most about your job?" to identify tasks that were repetitive enough for current AI models to handle reliably while still delivering high business impact.

Building Production-Ready Customer Support AI Agents: Challenges and Solutions

Gradient Labs

Gradient Labs shares their experience building and deploying AI agents for customer support automation in production. While prototyping with LLMs is relatively straightforward, deploying agents to production introduces complex challenges around state management, knowledge integration, tool usage, and handling race conditions. The company developed a state machine-based architecture with durable execution engines to manage these challenges, successfully handling hundreds of conversations per day with high customer satisfaction.
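
A toy version of the state-machine idea: make the conversation's legal transitions explicit so the agent cannot wander into undefined states. The states and transition table here are invented for illustration, not Gradient Labs' actual design.

```python
from enum import Enum, auto

class State(Enum):
    AWAITING_USER = auto()
    RETRIEVING_KNOWLEDGE = auto()
    DRAFTING_REPLY = auto()
    ESCALATED = auto()
    RESOLVED = auto()

# which moves the agent is allowed to make from each state
TRANSITIONS = {
    State.AWAITING_USER: {State.RETRIEVING_KNOWLEDGE, State.ESCALATED},
    State.RETRIEVING_KNOWLEDGE: {State.DRAFTING_REPLY, State.ESCALATED},
    State.DRAFTING_REPLY: {State.AWAITING_USER, State.RESOLVED, State.ESCALATED},
}

def transition(current: State, nxt: State) -> State:
    # illegal moves fail loudly instead of silently corrupting the conversation
    if nxt not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

A durable execution engine then persists the current state, which is what makes race conditions (e.g., a customer replying mid-draft) tractable.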

Building Reliable AI Agent Systems with Effect TypeScript Framework

14.ai

14.ai, an AI-native customer support platform, uses Effect, a TypeScript framework, to manage the complexity of building reliable LLM-powered agent systems that interact directly with end users. The company built a comprehensive architecture using Effect across their entire stack to handle unreliable APIs, non-deterministic model outputs, and complex workflows through strong type guarantees, dependency injection, retry mechanisms, and structured error handling. Their approach enables reliable agent orchestration with fallback strategies between LLM providers, real-time streaming capabilities, and comprehensive testing through dependency injection, resulting in more predictable and resilient AI systems.

Building Reliable AI Agents Through Production Monitoring and Intent Discovery

Raindrop

Raindrop, a monitoring platform for AI products, addresses the challenge of building reliable AI agents in production where traditional offline evaluations fail to capture real-world usage patterns. The company developed a "Sentry for AI products" approach that emphasizes experimentation, production monitoring, and discovering user intents through clustering and signal detection. Their solution combines explicit signals (like thumbs up/down, regenerations) and implicit signals (detecting refusals, task failures, user frustration) to identify issues that don't manifest as traditional software errors. The platform trains custom models to detect issues across production data at scale, enabling teams to discover unknown problems, track their impact on users, and fix them systematically without breaking existing functionality.

Building Reliable AI DevOps Agents: Engineering Practices for Nondeterministic LLM Output

Trunk

Trunk developed an AI DevOps agent to handle root cause analysis (RCA) for test failures in CI pipelines, facing challenges with nondeterministic LLM outputs. They applied traditional software engineering principles adapted for LLMs, including starting with narrow use cases, switching between models (Claude to Gemini) for better tool calling, implementing comprehensive testing with mocked LLM responses, and establishing feedback loops through internal usage and user feedback collection. The approach resulted in a more reliable agent that performs well on specific tasks like analyzing test failures and posting summaries to GitHub PRs.
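
The mocked-response testing approach can look as simple as the hypothetical pytest sketch below, where the agent's post-processing logic is exercised against canned model output so the test stays deterministic; the function names are illustrative.

```python
from unittest.mock import MagicMock

def summarize_failure(llm, log: str) -> str:
    # the agent keeps only the headline of the model's root-cause analysis
    raw = llm.complete(f"Root-cause this CI failure:\n{log}")
    return raw.strip().splitlines()[0]

def test_summarize_failure_keeps_headline():
    llm = MagicMock()
    llm.complete.return_value = "Flaky network test\n...details..."
    assert summarize_failure(llm, "ConnectionError in test_api") == "Flaky network test"
```

This isolates the deterministic scaffolding from the nondeterministic model, which is exactly the separation the case study argues for.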

Building Resilient Multi-Provider AI Agent Infrastructure for Financial Services

Gradient Labs

Gradient Labs built an AI agent that handles customer interactions for financial services companies, requiring high reliability in production. The company architected a sophisticated failover system that spans multiple LLM providers (OpenAI, Anthropic, Google) and hosting platforms (native APIs, Azure, AWS, GCP), enabling both traffic distribution across rate limits and automatic failover during errors, rate limiting, or latency spikes. They use Temporal for durable execution to checkpoint progress across long-running agentic workflows, and have implemented both provider-level and model-level failover strategies with tailored prompts for backup models, ensuring continuous operation even during catastrophic provider outages.
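
Stripped of the Temporal checkpointing and rate-limit-aware routing, the provider-level failover logic reduces to something like the following sketch; the client stubs, error handling, and per-model prompt templates are assumptions, not Gradient Labs' code.

```python
def call_openai(prompt: str) -> str:
    raise NotImplementedError  # placeholder provider client

def call_anthropic(prompt: str) -> str:
    raise NotImplementedError  # placeholder provider client

def call_google(prompt: str) -> str:
    raise NotImplementedError  # placeholder provider client

# each backup provider gets a prompt template tailored to its model
PROVIDERS = [
    ("openai", call_openai, "You are a support agent.\n{task}"),
    ("anthropic", call_anthropic, "Act as a support agent.\n\n{task}"),
    ("google", call_google, "Support agent instructions: {task}"),
]

def complete_with_failover(task: str) -> str:
    last_err = None
    for name, call, template in PROVIDERS:
        try:
            return call(template.format(task=task))
        except Exception as err:  # rate limits, timeouts, outages
            last_err = err
    raise RuntimeError("all providers failed") from last_err
```

The tailored templates matter because, as the case study notes, a prompt tuned for one model family rarely transfers verbatim to a backup model.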

Building Robust Enterprise Search with LLMs and Traditional IR

Glean

Glean tackles enterprise search by combining traditional information retrieval techniques with modern LLMs and embeddings. Rather than relying solely on AI techniques, they emphasize the importance of rigorous ranking algorithms, personalization, and hybrid approaches that combine classical IR with vector search. The company has achieved unicorn status and serves major enterprises by focusing on holistic search solutions that include personalization, feed recommendations, and cross-application integrations.

Building Stateful AI Agents with In-Context Learning and Memory Management

Letta

Letta addresses the fundamental limitation of current LLM-based agents: their inability to learn and retain information over time, leading to degraded performance as context accumulates. The platform enables developers to build stateful agents that learn by updating their context windows rather than model parameters, making learning interpretable and model-agnostic. The solution includes a developer platform with memory management tools, context window controls, and APIs for creating production agents that improve over time. Real-world deployments include a support agent that has been learning from Discord interactions for a month and recommendation agents for Bilt Rewards, demonstrating that agents with persistent memory can achieve performance comparable to fine-tuned models while remaining flexible and debuggable.

Building Synthetic Filesystems for AI Agent Navigation Across Enterprise Data Sources

Dust.tt

Dust.tt observed that their AI agents were attempting to navigate company data using filesystem-like syntax, prompting them to build synthetic filesystems that map disparate data sources (Notion, Slack, Google Drive, GitHub) into Unix-inspired navigable structures. They implemented five filesystem commands (list, find, cat, search, locate_in_tree) that allow agents to both structurally explore and semantically search across organizational data, transforming agents from search engines into knowledge workers capable of complex multi-step information tasks.
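
A minimal sketch of such a command surface, using the five verbs named in the case study; the dispatch mechanism and stubbed implementations are illustrative.

```python
from typing import Callable

def ls(path: str) -> list[str]: ...               # children of a node
def find(path: str, name: str) -> list[str]: ...  # match by name under a subtree
def cat(path: str) -> str: ...                    # read a document's content
def search(query: str) -> list[str]: ...          # semantic search across sources
def locate_in_tree(path: str) -> list[str]: ...   # where a node sits in the tree

COMMANDS: dict[str, Callable] = {
    "list": ls, "find": find, "cat": cat,
    "search": search, "locate_in_tree": locate_in_tree,
}

def run_tool(name: str, **kwargs):
    # the agent picks one verb per step; unknown verbs are rejected
    if name not in COMMANDS:
        raise ValueError(f"unknown command: {name}")
    return COMMANDS[name](**kwargs)
```

The key design point is that structural navigation (list, find, locate_in_tree) and semantic retrieval (search) share one namespace, so the agent can interleave both in a single multi-step task.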

Building Unified API Infrastructure for AI Integration at Scale

Merge

Merge, a unified API provider founded in 2020, helps companies offer native integrations across multiple platforms (HR, accounting, CRM, file storage, etc.) through a single API. As AI and LLMs emerged, Merge adapted by launching Agent Handler, an MCP-based product that enables live API calls for agentic workflows while maintaining their core synced data product for RAG-based use cases. The company serves major LLM providers including Mistral and Perplexity, enabling them to access customer data securely for both retrieval-augmented generation and real-time agent actions. Internally, Merge has adopted AI tools across engineering, support, recruiting, and operations, leading to increased output and efficiency while maintaining their core infrastructure focus on reliability and enterprise-grade security.

Building Voice-Enabled AI Assistants with Real-Time Processing

Bee

A detailed exploration of building real-time voice-enabled AI assistants, featuring multiple approaches from different companies and developers. The case study covers how to achieve low-latency voice processing, transcription, and LLM integration for interactive AI assistants. Solutions demonstrated include both commercial services like Deepgram and open-source implementations, with a focus on achieving sub-second latency, high accuracy, and cost-effective deployment.

Challenges in Building Enterprise Chatbots with LLMs: A Banking Case Study

Invento Robotics

A bank's attempt to implement a customer support chatbot using GPT-4 and RAG reveals the complexities and challenges of deploying LLMs in production. What was initially estimated as a three-month project struggled to deliver after a year, highlighting key challenges in domain knowledge management, retrieval effectiveness, conversation flow design, state management, latency, and regulatory compliance.

Comprehensive LLM Evaluation Framework for Production AI Code Assistants

Github

Github describes their robust evaluation framework for testing and deploying new LLM models in their Copilot product. The team runs over 4,000 offline tests, including automated code quality assessments and chat capability evaluations, before deploying any model changes to production. They use a combination of automated metrics, LLM-based evaluation, and manual testing to assess model performance, quality, and safety across multiple programming languages and frameworks.

Context Engineering and Agent Development at Scale: Building Open Deep Research

LangChain

Lance Martin from LangChain discusses the emerging discipline of "context engineering" through his experience building Open Deep Research, a deep research agent that evolved over a year to become the best-performing open-source solution on Deep Research Bench. The conversation explores how managing context in production agent systems—particularly across dozens to hundreds of tool calls—presents challenges distinct from simple prompt engineering, requiring techniques like context offloading, summarization, pruning, and multi-agent isolation. Martin's iterative development journey illustrates the "bitter lesson" for AI engineering: structured workflows that work well with current models can become bottlenecks as models improve, requiring engineers to continuously remove structure and embrace more general approaches to capture exponential model improvements.

Context Engineering for AI-Assisted Employee Onboarding

Etsy

Etsy explored using prompt engineering as an alternative to fine-tuning for AI-assisted employee onboarding, focusing on Travel & Entertainment policy questions and community forum support. They implemented a RAG-style approach using embeddings-based search to augment prompts with relevant Etsy-specific documents. The system achieved 86% accuracy on T&E policy questions and 72% on community forum queries, with various prompt engineering techniques like chain-of-thought reasoning and source citation helping to mitigate hallucinations and improve reliability.
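
The core of this RAG-style augmentation fits in a few lines: embed the question, retrieve the closest policy passages, and prepend them to the prompt with a citation instruction. The embedding stub and prompt wording below are assumptions, not Etsy's code.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    return np.zeros(768)  # placeholder: swap in a real embedding model

def top_k(query: str, docs: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    scores = [float(np.dot(q, embed(d))) for d in docs]  # cosine if normalized
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [d for _, d in ranked[:k]]

def build_prompt(question: str, docs: list[str]) -> str:
    context = "\n\n".join(top_k(question, docs))
    return ("Answer using only the policy excerpts below and cite your source.\n\n"
            f"{context}\n\nQuestion: {question}")
```

The "cite your source" instruction mirrors the source-citation technique the team used to suppress hallucinations.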

Context Engineering for Production AI Agents at Scale

Manus

Manus, a general AI agent platform, addresses the challenge of context explosion in long-running autonomous agents that can accumulate hundreds of tool calls during typical tasks. The company developed a comprehensive context engineering framework encompassing five key dimensions: context offloading (to file systems and sandbox environments), context reduction (through compaction and summarization), context retrieval (using file-based search tools), context isolation (via multi-agent architectures), and context caching (for KV cache optimization). This approach has been refined through five major refactors since launch in March, with the system supporting typical tasks requiring around 50 tool calls while maintaining model performance and managing token costs effectively through their layered action space architecture.

Context Engineering for Production AI Assistants at Scale

Shopify

Shopify developed Sidekick, an AI assistant serving millions of merchants on their commerce platform. The challenge was managing context windows effectively while maintaining performance, latency, and cost efficiency for an agentic system operating at massive scale. Their solution involved sophisticated "context engineering" techniques including aggressive token management (removing processed tool messages, trimming old conversation turns), a three-tier memory system (explicit user preferences, implicit user profiles, and episodic memory via RAG), and just-in-time instruction injection that collocates instructions with tool outputs. These techniques reportedly improved instruction adherence by 5-10% while reducing jailbreak likelihood and maintaining acceptable latency despite the system managing over 20 tools and handling complex multi-step agentic workflows.
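
As a toy example of the token-management side of this context engineering, the sketch below drops already-processed tool messages and trims old turns to a fixed budget; the message schema and budget are assumptions, not Sidekick's actual logic.

```python
def trim_context(messages: list[dict], max_turns: int = 20) -> list[dict]:
    # drop tool results that have already been folded into assistant replies
    kept = [m for m in messages if m["role"] != "tool"]
    # keep the system prompt, then only the most recent conversation turns
    system = [m for m in kept if m["role"] == "system"]
    rest = [m for m in kept if m["role"] != "system"]
    return system + rest[-max_turns:]
```

The memory tiers described above compensate for what trimming discards: durable preferences and episodic memory are re-injected on demand rather than carried in every turn.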

Context Engineering Platform for Multi-Domain RAG and Agentic Systems

Contextual

Contextual has developed an end-to-end context engineering platform designed to address the challenges of building production-ready RAG and agentic systems across multiple domains including e-commerce, code generation, and device testing. The platform combines multimodal ingestion, hierarchical document processing, hybrid search with reranking, and dynamic agents to enable effective reasoning over large document collections. In a recent context engineering hackathon, Contextual's dynamic agent achieved competitive results on a retail dataset of nearly 100,000 documents, demonstrating the value of constrained sub-agents, turn limits, and intelligent tool selection including MCP server management.

Context Rot: Evaluating LLM Performance Degradation with Increasing Input Tokens

ChromaDB

ChromaDB's technical report examines how large language models (LLMs) experience performance degradation as input context length increases, challenging the assumption that models process context uniformly. Through evaluation of 18 state-of-the-art models including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 across controlled experiments, the research reveals that model reliability decreases significantly with longer inputs, even on simple tasks like retrieval and text replication. The study demonstrates that factors like needle-question similarity, presence of distractors, haystack structure, and semantic relationships all impact performance non-uniformly as context length grows, suggesting that current long-context benchmarks may not adequately reflect real-world performance challenges.
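
A stripped-down harness for this kind of experiment places a "needle" fact at a controlled depth inside filler text and checks whether the model surfaces it; the sketch below illustrates the shape of such a trial, not the report's actual methodology.

```python
def ask_model(prompt: str) -> str:
    return ""  # placeholder: wire up a real LLM client here

def needle_trial(needle: str, question: str, filler: str, depth: float) -> bool:
    words = filler.split()
    cut = int(len(words) * depth)  # depth in [0, 1]: where the needle is buried
    haystack = " ".join(words[:cut] + [needle] + words[cut:])
    answer = ask_model(f"{haystack}\n\nQuestion: {question}")
    return needle.lower() in answer.lower()  # crude retrieval check
```

Sweeping depth and haystack length while adding distractors is what exposes the non-uniform degradation the report documents.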

Context-Seeking Conversational AI for Health Information Navigation

Google

Google Research developed a "Wayfinding AI" prototype based on Gemini to address the challenge of people struggling to find relevant, personalized health information online. Through formative user research with 33 participants and iterative design, they created an AI agent that proactively asks clarifying questions to understand user goals and context before providing answers. In a randomized study with 130 participants, the Wayfinding AI was significantly preferred over a baseline Gemini model across multiple dimensions including helpfulness, relevance, goal understanding, and tailoring, demonstrating that a context-seeking, conversational approach creates more empowering health information experiences than traditional question-answering systems.

Conversational AI Data Agent for Financial Analytics

Uber

Uber developed Finch, a conversational AI agent integrated into Slack, to address the inefficiencies of traditional financial data retrieval processes where analysts had to manually navigate multiple platforms, write complex SQL queries, or wait for data science team responses. The solution leverages generative AI, RAG, and self-querying agents to transform natural language queries into structured data retrieval, enabling real-time financial insights while maintaining enterprise-grade security through role-based access controls. The system reportedly reduces query response times from hours or days to seconds, though the text lacks quantified performance metrics or third-party validation of claimed benefits.

Cost Reduction Through Fine-tuning: Healthcare Chatbot and E-commerce Product Classification

Airtrain

Two case studies demonstrate significant cost reduction through LLM fine-tuning. A healthcare company reduced costs and improved privacy by fine-tuning Mistral-7B to match GPT-3.5's performance for patient intake, while an e-commerce unicorn improved product categorization accuracy from 47% to 94% using a fine-tuned model, reducing costs by 94% compared to using GPT-4.

Cost-Effective LLM Transaction Categorization for Business Banking

ANNA

ANNA, a UK business banking provider, implemented LLMs to automate transaction categorization for tax and accounting purposes across diverse business types. They achieved this by combining traditional ML with LLMs, particularly focusing on context-aware categorization that understands business-specific nuances. Through strategic optimizations including offline predictions, improved context utilization, and prompt caching, they reduced their LLM costs by 75% while maintaining high accuracy in their AI accountant system.
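
On the prompt-caching point specifically, the usual trick is to keep the long, unchanging instructions as a stable prefix so the provider can reuse its cached computation, appending only a short per-transaction suffix. The sketch below is illustrative; the categories and wording are invented.

```python
STATIC_PREFIX = (
    "You are an AI accountant. Categorize UK business transactions for tax "
    "and accounting purposes using the category list below.\n"
    "Categories: travel, software, salaries, office_costs, other\n"
    "Rules: (long, unchanging guidance goes here so the prefix is cacheable)\n"
)

def categorization_prompt(merchant: str, amount: float, business_type: str) -> str:
    # only this short suffix varies per transaction, maximizing cache hits
    return (f"{STATIC_PREFIX}\nBusiness type: {business_type}\n"
            f"Transaction: {merchant} £{amount:.2f}\nCategory:")
```

Combined with batching predictions offline, a stable prefix is a plausible route to the kind of cost reduction ANNA reports.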

Customer Service Transformation with AI-Based Email Automation and Chatbot Implementation

Sixt

Sixt, a mobility service provider with over €4 billion in revenue, transformed their customer service operations using generative AI to handle the complexity of multiple product lines across 100+ countries. The company implemented "Project AIR" (AI-based Replies) to automate email classification, generate response proposals, and deploy chatbots across multiple channels. Within five months of ideation, they moved from proof-of-concept to production, achieving over 90% classification accuracy using Amazon Bedrock with Anthropic Claude models (up from 70% with out-of-the-box solutions), while reducing classification costs by 70%. The solution now handles customer inquiries in multiple languages, integrates with backend reservation systems, and has expanded from email automation to messaging and chatbot services deployed across all corporate countries by Q1 2025.

Data and AI Governance Integration in Enterprise GenAI Adoption

Various

A panel discussion featuring leaders from Mercado Libre, ATB Financial, the retailer Loblaw, and Collibra discussing how they are implementing data and AI governance in the age of generative AI. The organizations are leveraging Google Cloud's Dataplex and other tools to enable comprehensive data governance, while also exploring GenAI applications for automating governance tasks, improving data discovery, and enhancing data quality management. Their approaches range from careful regulated adoption in finance to rapid e-commerce implementation, all emphasizing the critical connection between solid data governance and successful AI deployment.

Deploying AI Agents for Scalable Immigration Automation

Navismart AI

Navismart AI developed a multi-agent AI system to automate complex immigration processes that traditionally required extensive human expertise. The platform addresses challenges including complex sequential workflows, varying regulatory compliance across different countries, and the need for human oversight in high-stakes decisions. Built on a modular microservices architecture with specialized agents handling tasks like document verification, form filling, and compliance checks, the system uses Kubernetes for orchestration and scaling. The solution integrates REST APIs for inter-agent communication, implements end-to-end encryption for security, and maintains human-in-the-loop capabilities for critical decisions. The team started with US immigration processes due to their complexity and is expanding to other countries and domains like education.

Deploying AI Coding Agents in Highly Regulated Environments with Secure Infrastructure

ONA

ONA addresses the challenge faced by companies in highly regulated sectors (finance, government) that need to leverage AI coding assistants while maintaining strict data security and compliance requirements. The problem stems from the fact that many organizations initially ban AI tools like ChatGPT due to data leakage concerns, but employees use them anyway (with surveys showing 45% admit using banned AI tools and 58% sending sensitive data to public AI services). ONA's solution is a software engineering agent platform that runs entirely within the customer's own virtual private cloud (VPC), using isolated disposable development environments (virtual machines with dev containers), providing admin controls and audit logs, and ensuring all data remains within the customer's network with client-side encryption. The platform enables secure AI-assisted development with direct connections to customers' Git providers and LLM services without ONA accessing any code or sensitive data.

Deploying Generative AI at Scale Across 5,000 Developers

Liberty IT

Liberty IT, the technology division of Fortune 100 insurance company Liberty Mutual, embarked on a large-scale deployment of generative AI tools across their global workforce of over 5,000 developers and 50,000+ employees. The initiative involved rolling out custom GenAI platforms including Liberty GPT (an internal ChatGPT variant) to 70% of employees and GitHub Copilot to over 90% of IT staff within the first year. The company faced challenges including rapid technology evolution, model availability constraints, cost management, RAG implementation complexity, and achieving true adoption beyond basic usage. Through building a centralized AI platform with governance controls, implementing comprehensive learning programs across six streams, supporting 28 different models optimized for various use cases, and developing custom dashboards for cost tracking and observability, Liberty IT successfully navigated these challenges while maintaining enterprise security and compliance requirements.

Domain-Specific Agentic AI for Personalized Korean Skincare Recommendations

Glowe / Weaviate

Glowe, developed by Weaviate, addresses the challenge of finding effective skincare product combinations by building a domain-specific AI agent that understands Korean skincare science. The solution leverages dual embedding strategies with TF-IDF weighting to capture product effects from 94,500 user reviews, uses Weaviate's vector database for similarity search, and employs Gemini 2.5 Flash for routine generation. The system includes an agentic chat interface powered by Elysia that provides real-time personalized guidance, resulting in scientifically-grounded skincare recommendations based on actual user experiences rather than marketing claims.
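
To make the dual-embedding idea concrete, here is a minimal sketch that TF-IDF-weights a small effect vocabulary over review text and ranks products by cosine similarity to a desired effect profile; the vocabulary, reviews, and product names are invented for illustration and are not Glowe's actual pipeline.

```python
# Minimal sketch: TF-IDF-weight a small effect vocabulary over review text,
# then rank products by cosine similarity to a desired effect profile.
# Vocabulary, reviews, and product names are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

EFFECT_TERMS = ["hydrating", "brightening", "soothing", "exfoliating"]

product_reviews = {
    "snail-mucin-essence": "very hydrating and soothing, calmed my redness",
    "vitamin-c-serum": "clear brightening effect after two weeks of use",
    "aha-toner": "gently exfoliating, skin felt smoother and brighter",
}

vectorizer = TfidfVectorizer(vocabulary=EFFECT_TERMS)
names = list(product_reviews)
effect_matrix = vectorizer.fit_transform(product_reviews.values())  # products x effects

# A user's desired effect profile, expressed over the same vocabulary.
query = vectorizer.transform(["hydrating and soothing routine for sensitive skin"])

scores = cosine_similarity(query, effect_matrix).ravel()
for name, score in sorted(zip(names, scores), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```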

Domain-Specific LLMs for Drug Discovery Biomarker Identification

BenchSci

BenchSci developed an AI platform for drug discovery that combines domain-specific LLMs with extensive scientific data processing to assist scientists in understanding disease biology. They implemented a RAG architecture that integrates their structured biomedical knowledge base with Google's Med-PaLM model to identify biomarkers in preclinical research, resulting in a reported 40% increase in productivity and reduction in processing time from months to days.

Dutch YouTube Interface Localization and Content Management

Tastewise

This appears to be the Dutch footer section of YouTube's interface, showcasing the platform's localization and content management system. However, without more context about specific LLMOps implementation details, we can only infer that YouTube likely employs language models for content translation, moderation, and user interface localization.

Emotionally Aware AI Tutoring Agents with Multimodal Affect Detection

GlowingStar

GlowingStar Inc. develops emotionally aware AI tutoring agents that detect and respond to learner emotional states in real-time to provide personalized learning experiences. The system addresses the gap in current AI agents that focus solely on cognitive processing without emotional attunement, which is critical for effective learning and engagement. By incorporating multimodal affect detection (analyzing tone of voice, facial expressions, interaction patterns, latency, and silence) into an expanded agent architecture, the platform aims to deliver world-class personalized education while navigating significant challenges around emotional data privacy, cross-cultural generalization, and ethical deployment in sensitive educational contexts.

End-to-End LLM Observability for RAG-Powered AI Assistant

Splunk

Splunk built an AI Assistant leveraging Retrieval-Augmented Generation (RAG) to answer FAQs using curated public content from .conf24 materials. The system was developed in a hackathon-style sprint using their internal CIRCUIT platform. To operationalize this LLM-powered application at scale, Splunk integrated comprehensive observability across the entire RAG pipeline—from prompt handling and document retrieval to LLM generation and output evaluation. By instrumenting structured logs, creating unified dashboards in Splunk Observability Cloud, and establishing proactive alerts for quality degradation, hallucinations, and cost overruns, they achieved full visibility into response quality, latency, source document reliability, and operational health. This approach enabled rapid iteration, reduced mean time to resolution for quality issues, and established reproducible governance practices for production LLM deployments.
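
A hedged sketch of the per-stage structured logging described above: each RAG stage emits one JSON event with a shared trace id, latency, and quality fields that a log backend such as Splunk could index and alert on. The field names and pipeline stubs are assumptions.

```python
# Sketch: structured, per-stage logging for a RAG pipeline so a log backend
# can index stage, status, latency, and quality fields per trace.
import json
import logging
import time
import uuid
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("rag")

@contextmanager
def stage(trace_id: str, name: str, **fields):
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        log.info(json.dumps({
            "trace_id": trace_id,
            "stage": name,
            "status": status,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
            **fields,
        }))

trace_id = str(uuid.uuid4())
with stage(trace_id, "retrieval", k=5):
    docs = ["chunk about .conf24 schedules"]          # stand-in retrieval
with stage(trace_id, "generation", model="stub-llm"):
    answer = f"Based on {len(docs)} documents: ..."   # stand-in generation
with stage(trace_id, "evaluation", judge_score=0.92):
    pass  # an LLM-as-judge score would be attached here for alerting
```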

Engineering Principles and Practices for Production LLM Systems

Langchain

This case study captures insights from Lance Martin, ML engineer at Langchain, discussing the evolution from traditional ML to LLM-based systems and the emerging engineering discipline of building production GenAI applications. The discussion covers key challenges including the shift from model training to model orchestration, the need to continuously rearchitect systems as foundation models rapidly improve, and the critical importance of context engineering to manage token usage and prevent context degradation. Solutions explored include workflow versus agent architectures, the three-part context engineering playbook (reduce, offload, isolate), and evaluation strategies that emphasize user feedback and tracing over static benchmarks. Results demonstrate that teams like Manus have rearchitected their systems five times since March 2025, and that simpler approaches with proper observability often outperform complex architectures, with the understanding that today's solutions must be rebuilt as models improve.

Enhanced Agentic RAG for On-Call Engineering Support

Uber

Uber developed Genie, an internal on-call copilot that uses an enhanced agentic RAG (EAg-RAG) architecture to provide real-time support for engineering security and privacy queries through Slack. The system addressed significant accuracy issues in traditional RAG approaches by implementing LLM-powered agents for query optimization, source identification, and context refinement, along with enriched document processing that improved table extraction and metadata enhancement. The enhanced system achieved a 27% relative improvement in acceptable answers and a 60% relative reduction in incorrect advice, enabling deployment across critical security and privacy channels while reducing the support load on subject matter experts and on-call engineers.

Enhanced Agentic-RAG for Internal On-Call Support Copilot

Uber

Uber developed Genie, an internal on-call copilot powered by LLMs, to provide real-time support for engineering queries in Slack. When initial testing revealed significant accuracy issues with responses in the engineering security and privacy domain, the team transitioned from traditional RAG to an Enhanced Agentic RAG (EAg-RAG) architecture. This involved enriched document processing with custom Google Docs loaders and LLM-powered content formatting, plus pre- and post-processing agents for query optimization, source identification, and context refinement. The improvements resulted in a 27% relative increase in acceptable answers and a 60% relative reduction in incorrect advice, enabling deployment across critical security and privacy channels while reducing the support load on subject matter experts.

Enhancing E-commerce Search with Vector Embeddings and Generative AI

Mercado Libre / Grupo Boticario

Mercado Libre, Latin America's largest e-commerce platform, addressed the challenge of handling complex search queries by implementing vector embeddings and Google's Vector Search database. Their traditional word-matching search system struggled with contextual queries, leading to irrelevant results. The new system significantly improved search quality for complex queries, which constitute about half of all search traffic, resulting in increased click-through and conversion rates.

Enterprise Agent Orchestration Platform for Secure LLM Deployment

Airia

This case study explores how Airia developed an orchestration platform to help organizations deploy AI agents in production environments. The problem addressed is the significant complexity and security challenges that prevent businesses from moving beyond prototype AI agents to production-ready systems. The solution involves a comprehensive platform that provides agent building capabilities, security guardrails, evaluation frameworks, red teaming, and authentication controls. Results include successful deployments across multiple industries including hospitality (customer profiling across hotel chains), HR, legal (contract analysis), marketing (personalized content generation), and operations (real-time incident response through automated data aggregation), with customers reporting significant efficiency gains while maintaining enterprise security standards.

Enterprise Agentic AI Deployment: Panel Discussion on Production Realities and Technical Bottlenecks

Various

This panel discussion features leaders from Writer, You.com, Glean, and Google discussing the current state of deploying agentic AI systems in enterprise environments. The panelists address the gap between prototype development (which can now take 90 seconds) and production-ready systems that Fortune 500 companies can rely on. They identify key technical bottlenecks including data quality and governance issues, information retrieval challenges, function calling limitations, security vulnerabilities, and the difficulty of verifying agent actions. The consensus is that while every large enterprise has built some AI agents adding business value, they are far from having 50% of enterprise work handled by AI, with action agents for larger enterprises likely requiring several more years for major adoption.

Enterprise AI Agent Development: Lessons from Production Deployments

IBM, The Zig, Augmented AI Labs

This panel discussion features three companies - IBM, The Zig, and Augmented AI Labs - sharing their experiences building and deploying AI agents in enterprise environments. The panelists discuss the challenges of scaling AI agents, including cost management, accuracy requirements, human-in-the-loop implementations, and the gap between prototype demonstrations and production realities. They emphasize the importance of conservative approaches, proper evaluation frameworks, and the need for human oversight in high-stakes environments, while exploring emerging standards like agent communication protocols and the evolving landscape of enterprise AI adoption.

Enterprise AI Platform Deployment for Multi-Company Productivity Enhancement

Payfit, Alan

This case study presents the deployment of Dust.tt's AI platform across multiple companies including Payfit and Alan, focusing on enterprise-wide productivity improvements through LLM-powered assistants. The companies implemented a comprehensive AI strategy involving both top-down leadership support and bottom-up adoption, creating custom assistants for various workflows including sales processes, customer support, performance reviews, and content generation. The implementation achieved significant productivity gains of approximately 20% across teams, with some specific use cases reaching 50% improvements, while addressing challenges around security, model selection, and user adoption through structured rollout processes and continuous iteration.

Enterprise AI Platform Integration for Secure Production Deployment

Rubrik

Predibase, a fine-tuning and model serving platform, announced its acquisition by Rubrik, a data security and governance company, with the goal of combining Predibase's generative AI capabilities with Rubrik's secure data infrastructure. The integration aims to address the critical challenge that over 50% of AI pilots never reach production due to issues with security, model quality, latency, and cost. By combining Predibase's post-training and inference capabilities with Rubrik's data security posture management, the merged platform seeks to provide an end-to-end solution that enables enterprises to deploy generative AI applications securely and efficiently at scale.

Enterprise Challenges and Opportunities in Large-Scale LLM Deployment

Barclays

A senior leader in industry discusses the key challenges and opportunities in deploying LLMs at enterprise scale, highlighting the differences between traditional MLOps and LLMOps. The presentation covers critical aspects including cost management, infrastructure needs, team structures, and organizational adaptation required for successful LLM deployment, while emphasizing the importance of leveraging existing MLOps practices rather than completely reinventing the wheel.

Enterprise Document Data Extraction Using Agentic AI Workflows

Box

Box, an enterprise content platform serving over 115,000 customers including two-thirds of the Fortune 500, transformed their document data extraction capabilities by evolving from simple single-shot LLM prompting to sophisticated agentic AI workflows. Initially successful with basic document extraction using off-the-shelf models like GPT, Box encountered significant challenges when customers demanded extraction from complex 300-page documents with hundreds of fields, multilingual content, and poor OCR quality. The company implemented an agentic architecture using directed graphs that orchestrate multiple AI models, tools for validation and cross-checking, and iterative refinement processes. This approach dramatically improved accuracy and reliability while maintaining the flexibility to handle diverse document types and complex extraction requirements across their enterprise customer base.

Enterprise Infrastructure Challenges for Agentic AI Systems in Production

Various (Meta / Google / Monte Carlo / Azure)

A panel discussion featuring engineers from Meta, Google, Monte Carlo, and Microsoft Azure explores the fundamental infrastructure challenges that arise when deploying autonomous AI agents in production environments. The discussion reveals that agentic workloads differ dramatically from traditional software systems, requiring complete reimagining of reliability, security, networking, and observability approaches. Key challenges include non-deterministic behavior leading to incidents like chatbots selling cars for $1, massive scaling requirements as agents work continuously, and the need for new health checking mechanisms, semantic caching, and comprehensive evaluation frameworks to manage systems where 95% of outcomes are unknown unknowns.

Enterprise LLM Deployment with Multi-Cloud Data Platform Integration

Databricks

This presentation by Databricks' Product Management lead addresses the challenges large enterprises face when deploying LLMs into production, particularly around data governance, evaluation, and operational control. The talk centers on two primary case studies: FactSet's transformation of their query language translation system (improving from 59% to 85% accuracy while reducing latency from 15 to 6 seconds), and Databricks' internal use of Claude for automating analyst questionnaire responses. The solution involves decomposing complex prompts into multi-step agentic workflows, implementing granular governance controls across data and model access, and establishing rigorous evaluation frameworks to achieve production-grade reliability in high-risk enterprise environments.

Enterprise Neural Machine Translation at Scale

DeepL

DeepL, a translation company founded in 2017, has built a successful enterprise-focused business using neural machine translation models to tackle the language barrier problem at scale. The company handles hundreds of thousands of customers by developing specialized neural translation models that balance accuracy and fluency, training them on curated parallel and monolingual corpora while leveraging context injection rather than per-customer fine-tuning for scalability. By building their own GPU infrastructure early on and developing custom frameworks for inference optimization, DeepL maintains a competitive edge over general-purpose LLMs and established players like Google Translate, demonstrating strong product-market fit in high-stakes enterprise use cases where translation quality directly impacts legal compliance, customer experience, and business operations.

Enterprise-Scale AI-First Translation Platform with Agentic Workflows

Smartling

Smartling operates an enterprise-scale AI-first agentic translation delivery platform serving major corporations like Disney and IBM. The company addresses challenges around automation, centralization, compliance, brand consistency, and handling diverse content types across global markets. Their solution employs multi-step agentic workflows where different model functions validate each other's outputs, combining neural machine translation with large language models, RAG for accessing validated linguistic assets, sophisticated prompting, and automated post-editing for hyper-localization. The platform demonstrates measurable improvements in throughput (from 2,000 to 6,000-7,000 words per day), cost reduction (4-10x cheaper than human translation), and quality approaching 70% human parity for certain language pairs and content types, while maintaining enterprise requirements for repeatability, compliance, and brand voice consistency.

Enterprise-Scale Data Product AI Agent for Multi-Domain Knowledge Discovery

Bosch

Bosch, a global manufacturing and technology company with over 400,000 employees across 60+ countries, faced the challenge of accessing and understanding its vast distributed data ecosystem spanning automotive, consumer goods, power tools, and industrial equipment divisions. The company developed DPAI (Data Product AI Agent), an enterprise AI platform that enables natural language interaction with Bosch's data by combining a data mesh architecture, a centralized data marketplace, and generative AI capabilities. The solution integrates semantic understanding through ontologies, data catalogs, and Bosch-specific context to provide accurate, business-relevant answers across divisions. While still in development with an estimated one to two years until full completion, the platform demonstrates how large enterprises can overcome data fragmentation and contextual complexity to make organizational knowledge accessible through conversational AI.

Enterprise-Scale Deployment of AI Ambient Scribes Across Multiple Healthcare Systems

Memorial Sloan Kettering / McLeod Health / UCLA

This panel discussion features three major healthcare systems—McLeod Health, Memorial Sloan Kettering Cancer Center, and UCLA Health—discussing their experiences deploying generative AI-powered ambient clinical documentation (AI scribes) at scale. The organizations faced challenges in vendor evaluation, clinician adoption, and demonstrating ROI while addressing physician burnout and documentation burden. Through rigorous evaluation processes including randomized controlled trials, head-to-head vendor comparisons, and structured pilots, these systems successfully deployed AI scribes to hundreds to thousands of physicians. Results included significant reductions in burnout (20% at UCLA), improved patient satisfaction scores (5-6% increases at McLeod), time savings of 1.5-2 hours per day, and positive financial ROI through improved coding and RVU capture. Key learnings emphasized the importance of robust training, encounter-based pricing models, workflow integration, and managing expectations that AI scribes are not a universal solution for all specialties and clinicians.

Enterprise-Scale GenAI and Agentic AI Deployment in B2B Supply Chain Operations

Wesco

Wesco, a B2B supply chain and industrial distribution company, presents a comprehensive case study on deploying enterprise-grade AI applications at scale, moving from POC to production. The company faced challenges in transitioning from traditional predictive analytics to cognitive intelligence using generative AI and agentic systems. Their solution involved building a composable AI platform with proper governance, MLOps/LLMOps pipelines, and multi-agent architectures for use cases ranging from document processing and knowledge retrieval to fraud detection and inventory management. Results include deployment of 50+ use cases, significant improvements in employee productivity through "everyday AI" applications, and quantifiable ROI through transformational AI initiatives in supply chain optimization, with emphasis on proper observability, compliance, and change management to drive adoption.

Enterprise-Scale LLM Deployment with Licensed Content for Business Intelligence

Factiva

Factiva, a Dow Jones business intelligence platform, implemented a secure, enterprise-scale LLM solution for their content aggregation service. They developed "Smart Summaries" that allows natural language querying across their vast licensed content database of nearly 3 billion articles. The implementation required securing explicit GenAI licensing agreements from thousands of publishers, ensuring proper attribution and royalty tracking, and deploying a secure cloud infrastructure using Google's Gemini model. The solution successfully launched in November 2023 with 4,000 publishers, growing to nearly 5,000 publishers by early 2024.

Enterprise-Scale LLM Platform with Multi-Model Support and Copilot Customization

Telus

Telus developed Fuel X, an enterprise-scale LLM platform that provides centralized management of multiple AI models and services. The platform enables creation of customized copilots for different use cases, with over 30,000 custom copilots built and 35,000 active users. Key features include flexible model switching, enterprise security, RAG capabilities, and integration with workplace tools like Slack and Google Chat. Results show significant impact, including 46% self-resolution rate for internal support queries and 21% reduction in agent interactions.

Enterprise-Wide Generative AI Implementation for Marketing Content Generation and Translation

Bosch

Bosch, a global industrial and consumer goods company, implemented a centralized generative AI platform called "Gen playground" to address their complex marketing content needs across 3,500+ websites and numerous social media channels. The solution enables their 430,000+ associates to create text content, generate images, and perform translations without relying on external agencies, significantly reducing costs and turnaround time from 6-12 weeks to near-immediate results while maintaining brand consistency and quality standards.

Enterprise-Wide Virtual Assistant for Employee Knowledge Access

BNY Mellon

BNY Mellon implemented an LLM-based virtual assistant to help their 50,000 employees efficiently access internal information and policies across the organization. Starting with small pilot deployments in specific departments, they scaled the solution enterprise-wide using Google's Vertex AI platform, while addressing challenges in document processing, chunking strategies, and context-awareness for location-specific policies.

Evaluation Patterns for Deep Agents in Production

Langchain

LangChain built and deployed four production applications powered by "Deep Agents" - stateful, long-running AI agents capable of complex tasks including coding, email assistance, and agent building. The challenge was developing comprehensive evaluation strategies for these agents that went beyond traditional LLM evaluation approaches. Their solution involved five key patterns: bespoke test logic for each datapoint with custom assertions, single-step evaluations for validating specific decision points, full agent turn testing for end-to-end behavior, multi-turn conversations with conditional logic to simulate realistic interactions, and proper environment setup with clean, reproducible test conditions. Using LangSmith's Pytest and Vitest integrations, they implemented flexible evaluation frameworks that could assess agent trajectories, final responses, and state artifacts while maintaining fast, debuggable test suites through techniques like API mocking and containerized environments.
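
The "bespoke test logic for each datapoint" pattern can be sketched with plain pytest: every case carries its own assertion function rather than one generic metric. The agent stub and cases below are invented; LangSmith's pytest integration would wrap the same structure with tracing.

```python
# Sketch: each eval datapoint carries its own check function, so assertions
# can inspect answers, state artifacts, or side effects per case.
import pytest

def run_agent(prompt: str) -> dict:
    # Stand-in for a full agent turn; returns final answer plus state.
    return {"answer": f"echo: {prompt}", "files_written": []}

CASES = [
    ("say hello", lambda out: "hello" in out["answer"]),
    ("do not write files", lambda out: out["files_written"] == []),
]

@pytest.mark.parametrize("prompt,check", CASES, ids=[c[0] for c in CASES])
def test_agent_case(prompt, check):
    out = run_agent(prompt)
    assert check(out), f"bespoke assertion failed for: {prompt!r}"
```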

Evolution from Task-Specific Models to Multi-Agent Orchestration Platform

AI21

AI21 Labs evolved their production AI systems from task-specific models (2022-2023) to RAG-as-a-Service, and ultimately to Maestro, a multi-agent orchestration platform. The company identified that while general-purpose LLMs demonstrated impressive capabilities, they weren't optimized for specific business use cases that enterprises actually needed, such as contextual question answering and summarization. AI21 developed smaller language models fine-tuned for specific tasks, wrapped them with pre- and post-processing operations (including hallucination filters), and eventually built a comprehensive RAG system when customers struggled to identify relevant context from large document corpora. The Maestro platform emerged to handle complex multi-hop queries by automatically breaking them into subtasks, parallelizing execution, and orchestrating multiple agents and tools, achieving dramatically improved quality with full traceability for enterprise requirements.

Evolution of AI Agents: From Manual Workflows to End-to-End Training

OpenAI

OpenAI's journey in developing agentic products showcases the evolution from manually designed workflows with LLMs to end-to-end trained agents. The company has developed three main agentic products - Deep Research, Operator, and Codex CLI - each addressing different use cases from web research to code generation. These agents demonstrate how end-to-end training with reinforcement learning enables better error recovery and more natural interaction compared to traditional manually designed workflows.

Evolution of AI Systems and LLMOps from Research to Production: Infrastructure Challenges and Application Design

NVIDIA / Lepton

This lecture transcript from Yangqing Jia, VP at NVIDIA and founder of Lepton AI (acquired by NVIDIA), explores the evolution of AI system design from an engineer's perspective. The talk covers the progression from research frameworks (Caffe, TensorFlow, PyTorch) to production AI infrastructure, examining how LLM applications are built and deployed at scale. Jia discusses the emergence of "neocloud" infrastructure designed specifically for AI workloads, the challenges of GPU cluster management, and practical considerations for building consumer and enterprise LLM applications. Key insights include the trade-offs between open-source and closed-source models, the importance of RAG and agentic AI patterns, infrastructure design differences between conventional cloud and AI-specific platforms, and the practical challenges of operating LLMs in production, including supply chain management for GPUs and cost optimization strategies.

Evolution of an Internal AI Platform from No-Code LLM Apps to Agentic Systems

Grab

Grab developed SpellVault, an internal no-code AI platform that evolved from a simple RAG-based LLM app builder into a sophisticated agentic system supporting thousands of apps across the organization. Initially designed to democratize AI access for non-technical users through knowledge integrations and plugins, the platform progressively incorporated advanced capabilities including workflow orchestration, ReAct agent execution, unified tool frameworks, and Model Context Protocol (MCP) compatibility. This evolution enabled SpellVault to transform from supporting static question-answering apps into powering dynamic AI agents capable of reasoning, acting, and interacting with internal and external systems, while maintaining its core mission of accessibility and ease of use.

Evolution of Code Evaluation Benchmarks: From Single-Line Completion to Full Codebase Translation

Cursor

This research presentation details four years of work developing evaluation methodologies for coding LLMs across varying time horizons, from second-level code completions to hour-long codebase translations. The speaker addresses critical challenges in evaluating production coding AI systems including data contamination, insufficient test suites, and difficulty calibration. Key solutions include LiveCodeBench's dynamic evaluation approach with periodically updated problem sets, automated test generation using LLM-driven approaches, and novel reward hacking detection systems for complex optimization tasks. The work demonstrates how evaluation infrastructure must evolve alongside model capabilities, incorporating intermediate grading signals, latency-aware metrics, and LLM-as-judge approaches to detect non-idiomatic coding patterns that pass traditional tests but fail real-world quality standards.

Evolution of Industrial AI: From Traditional ML to Multi-Agent Systems

Hitachi

Hitachi's journey in implementing AI across industrial applications showcases the evolution from traditional machine learning to advanced generative AI solutions. The case study highlights how they transformed from focused applications in maintenance, repair, and operations to a more comprehensive approach integrating LLMs, focusing particularly on reliability, small data scenarios, and domain expertise. Key implementations include repair recommendation systems for fleet management and fault tree extraction from manuals, demonstrating the practical challenges and solutions in industrial AI deployment.

Financial Transaction Categorization at Scale Using LLMs and Custom Embeddings

Mercado Libre

Mercado Libre (MELI) faced the challenge of categorizing millions of financial transactions across Latin America in multiple languages and formats as Open Finance unlocked access to customer financial data. Starting with a brittle regex-based system in 2021 that achieved only 60% accuracy and was difficult to maintain, they evolved through three generations: first implementing GPT-3.5 Turbo in 2023 to achieve 80% accuracy with 75% cost reduction, then transitioning to GPT-4o-mini in 2024, and finally developing custom BERT-based semantic embeddings trained on regional financial text to reach 90% accuracy with an additional 30% cost reduction. This evolution enabled them to scale from processing tens of millions of transactions per quarter to tens of millions per week, while enabling near real-time categorization that powers personalized financial insights across their ecosystem.
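
A minimal sketch of the embedding-based categorization stage: transaction descriptors are embedded and matched to per-category centroids. The embed() function below is a stand-in for a domain-tuned BERT encoder, and the category labels and strings are invented.

```python
# Sketch: nearest-centroid categorization over transaction embeddings.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a fine-tuned BERT encoder trained on regional financial text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

labeled = {
    "groceries": ["MERCADO PAGO SUPERMERCADO", "CARREFOUR SP"],
    "transport": ["UBER TRIP", "99 APP VIAGEM"],
}
centroids = {
    cat: np.mean([embed(t) for t in texts], axis=0)
    for cat, texts in labeled.items()
}

def categorize(descriptor: str) -> str:
    q = embed(descriptor)
    return max(centroids, key=lambda c: float(q @ centroids[c]))

print(categorize("UBER BR VIAGEM 123"))
```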

Fine-Tuning and Quantizing LLMs for Dynamic Attribute Extraction

Mercari

Mercari tackled the challenge of extracting dynamic attributes from user-generated marketplace listings by fine-tuning a 2B parameter LLM using QLoRA. The team successfully created a model that outperformed GPT-3.5-turbo while being 95% smaller and 14 times more cost-effective. The implementation included careful dataset preparation, parameter efficient fine-tuning, and post-training quantization using llama.cpp, resulting in a production-ready model with better control over hallucinations.
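
For orientation, a hedged QLoRA setup with Hugging Face transformers and peft looks roughly like the following; the base model id and hyperparameters are placeholders, not Mercari's reported configuration, and running it requires a CUDA GPU with bitsandbytes installed.

```python
# Sketch: 4-bit quantized base weights with small trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",                       # placeholder ~2B base model
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # a small fraction of the 2B params

# Training would proceed with a standard SFT loop; after merging adapters,
# llama.cpp can quantize the result (e.g., to GGUF) for cheap serving.
```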

Fine-tuning Custom Embedding Models for Enterprise Search

Glean

Glean implements enterprise search and RAG systems by developing custom embedding models for each customer. They tackle the challenge of heterogeneous enterprise data by using a unified data model and fine-tuning embedding models through continued pre-training and synthetic data generation. Their approach combines traditional search techniques with semantic search, achieving a 20% improvement in search quality over 6 months through continuous learning from user feedback and company-specific language adaptation.

Foundation Model for Unified Personalization at Scale

Netflix

Netflix developed a unified foundation model based on transformer architecture to consolidate their diverse recommendation systems, which previously consisted of many specialized models for different content types, pages, and use cases. The foundation model uses autoregressive transformers to learn user representations from interaction sequences, incorporating multi-token prediction, multi-layer representation, and long context windows. By scaling from millions to billions of parameters over 2.5 years, they demonstrated that scaling laws apply to recommendation systems, achieving notable performance improvements while creating high leverage across downstream applications through centralized learning and easier fine-tuning for new use cases.

Framework for Evaluating LLM Production Use Cases

Scale Venture Partners

Barak Turovsky, drawing from his experience leading Google Translate and other AI initiatives, presents a framework for evaluating LLM use cases in production. The framework analyzes use cases based on two key dimensions: accuracy requirements and fluency needs, along with consideration of stakes involved. This helps organizations determine which applications are suitable for current LLM deployment versus those that need more development. The framework suggests creative and workplace productivity applications are better immediate fits for LLMs compared to high-stakes information/decision support use cases.

GenAI Governance in Practice: Access Control, Data Quality, and Monitoring for Production LLM Systems

Xomnia

Martin Der, a data scientist at Xomnia, presents practical approaches to GenAI governance addressing the challenge that only 5% of GenAI projects deliver immediate ROI. The talk focuses on three key pillars: access and control (enabling self-service prototyping through tools like Open WebUI while avoiding shadow AI), unstructured data quality (detecting contradictions and redundancies in knowledge bases through similarity search and LLM-based validation), and LLMOps monitoring (implementing tracing platforms like LangFuse and creating dynamic golden datasets for continuous testing). The solutions include deploying Chrome extensions for workflow integration, API gateways for centralized policy enforcement, and developing a knowledge agent called "Genie" for internal use cases across telecom, healthcare, logistics, and maritime industries.
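
The contradiction-detection step can be sketched as a two-stage check: embeddings surface highly similar chunk pairs, and an LLM judge decides whether a flagged pair actually conflicts. Both the embedder and the judge below are stubs standing in for real models.

```python
# Sketch: flag similar knowledge-base chunk pairs, then validate conflicts.
from itertools import combinations
import numpy as np

def embed(text: str) -> np.ndarray:  # stand-in embedding model
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def llm_contradicts(a: str, b: str) -> bool:
    # Stand-in for an LLM-as-judge call; here: negation flip + shared topic words.
    shared = set(a.lower().split()) & set(b.lower().split())
    return ("not" in a.lower()) != ("not" in b.lower()) and len(shared) >= 2

chunks = [
    "Refunds are not possible after 30 days.",
    "Customers may request refunds within 30 days.",
    "Support is available on weekdays.",
]
vecs = [embed(c) for c in chunks]

SIM_THRESHOLD = -1.0  # with a real embedder, something like 0.85
for (i, a), (j, b) in combinations(enumerate(chunks), 2):
    sim = float(vecs[i] @ vecs[j])
    if sim >= SIM_THRESHOLD and llm_contradicts(a, b):
        print(f"possible contradiction (sim={sim:.2f}):\n  {a}\n  {b}")
```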

GenAI-Powered Invoice Document Processing and Automation

Uber

Uber faced significant challenges processing a high volume of invoices daily from thousands of global suppliers, with diverse formats, 25+ languages, and varying templates requiring substantial manual intervention. The company developed TextSense, a GenAI-powered document processing platform that leverages OCR, computer vision, and large language models (specifically OpenAI GPT-4 after evaluating multiple options including fine-tuned Llama 2 and Flan T5) to automate invoice data extraction. The solution achieved 90% overall accuracy, reduced manual processing by 2x, cut average handling time by 70%, and delivered 25-30% cost savings compared to manual processes, while providing a scalable, configuration-driven platform adaptable to diverse document types.

Generating 3D Shoppable Product Visualizations with Veo Video Generation Model

Google

Google developed a three-generation evolution of AI-powered systems to transform 2D product images into interactive 3D visualizations for online shopping, culminating in a solution based on their Veo video generation model. The challenge was to replicate the tactile, hands-on experience of in-store shopping in digital environments while making the technology scalable and cost-effective for retailers. The latest approach uses Veo's diffusion-based architecture, fine-tuned on millions of synthetic 3D assets, to generate realistic 360-degree product spins from as few as one to three product images. This system now powers interactive 3D visualizations across multiple product categories on Google Shopping, significantly improving the online shopping experience by enabling customers to virtually inspect products from multiple angles.

Generating Production-Ready MCP Servers from OpenAPI Specifications

SpeakEasy

SpeakEasy tackled the challenge of enabling AI agents to interact with existing APIs by developing a tool that automatically generates Model Context Protocol (MCP) servers from OpenAPI documents. The company identified critical issues when generating over 50 production MCP servers for customers, including tool explosion (too many exposed operations), verbose descriptions consuming excessive tokens, complex data formats confusing LLMs, and inadequate access controls. Their solution involved a three-layer optimization approach: pruning OpenAPI documents with custom extensions, building intelligence into the generator to handle complex formats and streaming responses, and providing customization files for precise tool control. The result is production-ready MCP servers that balance LLM context window constraints with functional completeness, using techniques like scope-based access control, automatic data transformation, and optimized descriptions.
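
A simplified sketch of the pruning layer: keep an allowlist of OpenAPI operations and truncate verbose descriptions before tool generation, so the exposed surface fits an LLM's context window. The spec, allowlist, and limits below are illustrative, not SpeakEasy's actual generator.

```python
# Sketch: prune an OpenAPI document to an allowlist and trim descriptions.
import copy

ALLOWED_OPERATIONS = {"listPets", "getPet"}          # curated tool allowlist
MAX_DESCRIPTION_CHARS = 200

def prune_openapi(spec: dict) -> dict:
    pruned = copy.deepcopy(spec)
    for path, methods in list(pruned.get("paths", {}).items()):
        for verb, op in list(methods.items()):
            if op.get("operationId") not in ALLOWED_OPERATIONS:
                del methods[verb]                     # drop unexposed operation
            elif "description" in op:
                op["description"] = op["description"][:MAX_DESCRIPTION_CHARS]
        if not methods:
            del pruned["paths"][path]
    return pruned

spec = {
    "openapi": "3.1.0",
    "paths": {
        "/pets": {
            "get": {"operationId": "listPets", "description": "List pets. " * 100},
            "post": {"operationId": "createPet", "description": "Create a pet."},
        }
    },
}
print(prune_openapi(spec))
```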

Generative AI Implementation in Banking Customer Service and Knowledge Management

Various

Multiple banks, including Discover Financial Services, Scotia Bank, and others, share their experiences implementing generative AI in production. The case study focuses particularly on Discover's implementation of gen AI for customer service, where they achieved a 70% reduction in agent search time by using RAG and summarization for procedure documentation. The implementation included careful consideration of risk management, regulatory compliance, and human-in-the-loop validation, with technical writers and agents providing continuous feedback for model improvement.

Generative AI-Powered Knowledge Sharing System for Travel Expertise

Hotelplan Suisse

Hotelplan Suisse implemented a generative AI solution to address the challenge of sharing travel expertise across their 500+ travel experts. The system integrates multiple data sources and uses semantic search to provide instant, expert-level travel recommendations to sales staff. The solution reduced response time from hours to minutes and includes features like chat history management, automated testing, and content generation capabilities for marketing materials.

Global News Organization's AI-Powered Content Production and Verification System

Reuters

Reuters has implemented a comprehensive AI strategy to enhance its global news operations, focusing on reducing manual work, augmenting content production, and transforming news delivery. The organization developed three key tools: a press release fact extraction system, an AI-integrated CMS called Leon, and a content packaging tool called LAMP. They've also launched the Reuters AI Suite for clients, offering transcription and translation capabilities while maintaining strict ethical guidelines around AI-generated imagery and maintaining journalistic integrity.

Google Photos Magic Editor: Transitioning from On-Device ML to Cloud-Based Generative AI for Image Editing

Google

Google Photos evolved from using on-device machine learning models for basic image editing features like background blur and object removal to implementing cloud-based generative AI for their Magic Editor feature. The team transitioned from small, specialized models (10MB) running locally on devices to large-scale generative models hosted in the cloud to enable more sophisticated image editing capabilities like scene reimagination, object relocation, and advanced inpainting. This shift required significant changes in infrastructure, capacity planning, evaluation methodologies, and user experience design while maintaining focus on grounded, memory-preserving edits rather than fantastical image generation.

Hardening AI Agents for E-commerce at Scale: Multi-Company Perspectives on RL Alignment and Reliability

Prosus / Microsoft / Inworld AI / IUD

This panel discussion features experts from Microsoft, Google Cloud, Inworld AI, and Brazilian e-commerce company IUD (Prosus partner) discussing the challenges of deploying reliable AI agents for e-commerce at scale. The panelists share production experiences ranging from Google Cloud's support ticket routing agent that improved policy adherence from 45% to 90% using DPO adapters, to Microsoft's shift away from prompt engineering toward post-training methods for all Copilot models, to Inworld AI's voice agent architecture optimization through cascading models, and IUD's struggles with personalization balance in their multi-channel shopping agent. Key challenges identified include model localization for UI elements, cost efficiency, real-time voice adaptation, and finding the right balance between automation and user control in commerce experiences.

Healthcare Conversational AI and Multi-Model Cost Management in Production

Amberflo / Interactly.ai

A panel discussion featuring Interactly.ai's development of conversational AI for healthcare appointment management, and Amberflo's approach to usage tracking and cost management for LLM applications. The case study explores how Interactly.ai handles the challenges of deploying LLMs in healthcare settings with privacy and latency constraints, while Amberflo addresses the complexities of monitoring and billing for multi-model LLM applications in production.

Human-AI Synergy in Pharmaceutical Research and Document Processing

Merantix

Merantix has implemented AI systems that focus on human-AI collaboration across multiple domains, particularly in pharmaceutical research and document processing. Their approach emphasizes progressive automation where AI systems learn from human input, gradually taking over more tasks while maintaining high accuracy. In pharmaceutical applications, they developed a system for analyzing rodent behavior videos, while in document processing, they created solutions for legal and compliance cases where error tolerance is minimal. The systems demonstrate a shift from using AI as mere tools to creating collaborative AI-human workflows that maintain high accuracy while improving efficiency.

Hybrid LLM-Optimization System for Trip Planning with Real-World Constraints

Google

Google Research developed a hybrid system for trip planning that combines LLMs with optimization algorithms to address the challenge of generating practical travel itineraries. The system uses Gemini models to generate initial trip plans based on user preferences and qualitative goals, then applies a two-stage optimization algorithm that incorporates real-world constraints like opening hours, travel times, and budget considerations to produce feasible itineraries. This approach was implemented in Google's "AI trip ideas in Search" feature, demonstrating how LLMs can be effectively deployed in production while maintaining reliability through algorithmic correction of potential feasibility issues.
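
The two-stage pattern can be illustrated with a toy feasibility repair: the LLM proposes an ordered itinerary, then a deterministic pass drops (or, in a fuller version, reorders) stops that violate opening-hours constraints. All data below is invented.

```python
# Sketch: deterministic repair of an LLM-proposed itinerary against
# opening-hours constraints.
OPENING_HOURS = {              # place -> (open, close), 24h clock
    "museum": (10, 17),
    "market": (8, 13),
    "viewpoint": (0, 24),
}

llm_proposal = ["market", "museum", "viewpoint"]  # stand-in LLM output
VISIT_HOURS = 2

def repair(plan, start_hour=9):
    feasible, t = [], start_hour
    for place in plan:
        open_h, close_h = OPENING_HOURS[place]
        arrival = max(t, open_h)               # wait for opening if early
        if arrival + VISIT_HOURS <= close_h:   # must finish before closing
            feasible.append((place, arrival))
            t = arrival + VISIT_HOURS
        # else: the infeasible stop is dropped (a fuller version would reorder)
    return feasible

print(repair(llm_proposal))  # [('market', 9), ('museum', 11), ('viewpoint', 13)]
```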

Hybrid ML and LLM Approach for Automated Question Quality Feedback

Stack Overflow

Stack Overflow developed Question Assistant to provide automated feedback on question quality for new askers, addressing the repetitive nature of human reviewer comments in their Staging Ground platform. Initial attempts to use LLMs alone to rate question quality failed due to unreliable predictions and generic feedback. The team pivoted to a hybrid approach combining traditional logistic regression models trained on historical reviewer comments to flag quality indicators, paired with Google's Gemini LLM to generate contextual, actionable feedback. While the solution didn't significantly improve approval rates or review times, it achieved a meaningful 12% increase in question success rates (questions that remain open and receive answers or positive scores) across two A/B tests, leading to full deployment in March 2025.
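
A rough sketch of the hybrid shape: a lightweight classifier flags a quality indicator, and only flagged questions trigger an LLM prompt for tailored feedback. The training examples and prompt wording are illustrative, not Stack Overflow's actual features.

```python
# Sketch: traditional classifier gates which questions get LLM feedback.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_questions = [
    "my code doesn't work please help",  # lacks problem details
    "How do I parse ISO dates in Python 3.11? I tried datetime.fromisoformat...",
]
needs_details = [1, 0]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(train_questions, needs_details)

def feedback_prompt(question: str) -> str | None:
    if clf.predict([question])[0] == 1:
        return (
            "The question below lacks problem details. Write one short, "
            f"actionable suggestion for the asker.\n\nQuestion: {question}"
        )
    return None  # nothing flagged, no LLM call needed

print(feedback_prompt("it crashes help"))
```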

Implementing Generative AI in Manufacturing: A Multi-Use Case Study

Accenture

Accenture's Industry X division conducted extensive experiments with generative AI in manufacturing settings throughout 2023. They developed and validated nine key use cases including operations twins, virtual mentors, test case generation, and technical documentation automation. The implementations showed significant efficiency gains (40-50% effort reduction in some cases) while maintaining a human-in-the-loop approach. The study emphasized the importance of using domain-specific data, avoiding generic knowledge management solutions, and implementing multi-agent orchestrated solutions rather than standalone models.

Implementing LLMs for Patient Education and Healthcare Communication

National Healthcare Group

National Healthcare Group addressed the challenge of inconsistent and time-consuming patient education by implementing LLM-powered chatbots integrated into their existing healthcare apps and messaging platforms. The solution provides 24/7 multilingual patient education, focusing on conditions like eczema and medical test preparation, while ensuring privacy and accuracy. The implementation emphasizes integration with existing platforms rather than creating new standalone solutions, combined with careful monitoring and refinement of responses.

Implementing MCP Gateway for Large-Scale LLM Integration Infrastructure

Anthropic

Anthropic faced the challenge of managing an explosion of LLM-powered services and integrations across their organization, leading to duplicated functionality and integration chaos. They solved this by implementing a standardized MCP (Model Context Protocol) gateway that provides a single point of entry for all LLM integrations, handling authentication, credential management, and routing to both internal and external services. This approach reduced engineering overhead, improved security by centralizing credential management, and created a "pit of success" where doing the right thing became the easiest thing to do for their engineering teams.
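
A minimal sketch of the gateway idea: one entry point resolves a tool route and injects centrally managed credentials, so calling teams never handle secrets directly. The route names, environment variables, and handler below are invented.

```python
# Sketch: a single gateway that routes tool calls and injects credentials.
import os
from typing import Callable

class Gateway:
    def __init__(self):
        self._routes: dict[str, tuple[Callable[..., str], str]] = {}

    def register(self, name: str, handler: Callable[..., str], secret_env: str):
        self._routes[name] = (handler, secret_env)

    def call(self, name: str, **kwargs) -> str:
        handler, secret_env = self._routes[name]      # single routing point
        token = os.environ.get(secret_env, "")        # central credential lookup
        return handler(token=token, **kwargs)         # audit logging would go here

def search_docs(token: str, query: str) -> str:
    return f"[docs results for {query!r}]"            # stand-in internal service

gw = Gateway()
gw.register("internal.docs.search", search_docs, secret_env="DOCS_API_TOKEN")
print(gw.call("internal.docs.search", query="rate limits"))
```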

Implementing MCP Remote Server for CRM Agent Integration

HubSpot

HubSpot built a remote Model Context Protocol (MCP) server to enable AI agents like ChatGPT to interact with their CRM data. The challenge was to provide seamless, secure access to CRM objects (contacts, companies, deals) for ChatGPT's 500 million weekly users, most of whom aren't developers. In less than four weeks, HubSpot's team extended the Java MCP SDK to create a stateless, HTTP-based microservice that integrated with their existing REST APIs and RPC system, implementing OAuth 2.0 for authentication and user permission scoping. The solution made HubSpot the first CRM with an OpenAI connector, enabling read-only queries that allow customers to analyze CRM data through natural language interactions while maintaining enterprise-grade security and scale.

Improving LLM Food Tracking Accuracy through Systematic Evaluation and Few-Shot Learning

Taralli

This case study covers Taralli's food tracking application, which initially used a naive approach with GPT-4o-mini for calorie and nutrient estimation, resulting in significant accuracy issues. Through the implementation of systematic evaluation methods, creation of a golden dataset, and optimization using DSPy's BootstrapFewShotWithRandomSearch technique, they improved accuracy from 17% to 76% while maintaining reasonable response times with Gemini 2.5 Flash.
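
Assuming DSPy's standard API, the optimization step might look like the following sketch; the signature, metric threshold, model id, and tiny trainset are placeholders for Taralli's golden dataset, and running it requires a configured model API key.

```python
# Sketch: compiling a calorie-estimation program with DSPy's
# BootstrapFewShotWithRandomSearch optimizer.
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

dspy.configure(lm=dspy.LM("gemini/gemini-2.5-flash"))  # placeholder model id

class EstimateCalories(dspy.Signature):
    """Estimate total calories for a free-text food description."""
    description: str = dspy.InputField()
    calories: int = dspy.OutputField()

program = dspy.Predict(EstimateCalories)

def within_20_percent(example, pred, trace=None):
    try:
        return abs(int(pred.calories) - example.calories) <= 0.2 * example.calories
    except (TypeError, ValueError):
        return False

trainset = [
    dspy.Example(description="two scrambled eggs", calories=180).with_inputs("description"),
    dspy.Example(description="medium banana", calories=105).with_inputs("description"),
    # ...a real golden dataset would hold dozens to hundreds of examples
]

optimizer = BootstrapFewShotWithRandomSearch(
    metric=within_20_percent,
    max_bootstrapped_demos=4,     # few-shot demos mined from the trainset
    num_candidate_programs=8,     # random-search budget over demo subsets
)
compiled = optimizer.compile(program, trainset=trainset)
print(compiled(description="bowl of oatmeal with honey").calories)
```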

Improving Multilingual Search with Few-Shot LLM Translations

Delivery Hero

Delivery Hero operates across 68 countries and faced significant challenges with multilingual search due to dialectal variations, transliterations, spelling errors, and multiple languages within single markets. Traditional machine translation systems struggled with user intent and contextual nuances, leading to poor search results. The company implemented a solution using Large Language Models (LLMs), specifically Gemini, with few-shot learning to provide context-aware translations that handle regional dialects, correct spelling mistakes, and understand transliterations. By combining LLM-generated translations with Elastic Search and Vector Search in a hybrid approach, they achieved over 90% translation accuracy for restaurant queries and demonstrated positive improvements in user engagement through A/B testing, with the solution being rolled out to their Talabat and Hungerstation brands.
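
A sketch of the few-shot translation prompt: regional examples covering dialect, transliteration, and typos are prepended so the model normalizes the query before retrieval. The examples are invented; the normalized output would then feed both lexical (Elasticsearch) and vector retrieval.

```python
# Sketch: build a few-shot prompt for context-aware search-query translation.
FEW_SHOTS = [
    ("شاورما فراخ", "chicken shawarma"),           # Arabic dialect
    ("burgr", "burger"),                            # spelling error
    ("shish tawooq", "shish tawook chicken"),       # transliteration variant
]

def build_prompt(query: str) -> str:
    lines = [
        "Translate the food-search query to English for a search engine.",
        "Normalize dialect, transliteration, and typos.",
        "",
    ]
    for src, tgt in FEW_SHOTS:
        lines.append(f"Query: {src}\nEnglish: {tgt}\n")
    lines.append(f"Query: {query}\nEnglish:")
    return "\n".join(lines)

print(build_prompt("فراخ مشوية"))
# The translated query would feed a hybrid retriever that fuses
# Elasticsearch (lexical) and vector-search scores into one ranked list.
```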

Incremental LLM Adoption Strategy in Email Processing API Platform

Nylas

Nylas, an email/calendar/contacts API platform provider, implemented a systematic three-month strategy to integrate LLMs into their production systems. They started with development workflow automation using multi-agent systems, enhanced their annotation processes with LLMs, and finally integrated LLMs as a fallback mechanism in their core email processing product. This measured approach resulted in 90% reduction in bug tickets, 20x cost savings in annotation, and successful deployment of their own LLM infrastructure when usage reached cost-effective thresholds.

Infrastructure Challenges and Solutions for Agentic AI Systems in Production

Meta / Google / Monte Carlo / Microsoft

A panel discussion featuring experts from Meta, Google, Monte Carlo, and Microsoft examining the fundamental infrastructure challenges that arise when deploying autonomous AI agents in production environments. The discussion covers how agentic workloads differ from traditional software systems, requiring new approaches to networking, load balancing, caching, security, and observability, while highlighting specific challenges like non-deterministic behavior, massive search spaces, and the need for comprehensive evaluation frameworks to ensure reliable and secure AI agent operations at scale.

Infrastructure for AI Agents: Panel Discussion on Production Challenges and Solutions

Various

This panel discussion brings together infrastructure experts from Groq, NVIDIA, Lambda, and AMD to discuss the unique challenges of deploying AI agents in production. The panelists explore how agentic AI differs from traditional AI workloads, requiring significantly higher token generation, lower latency, and more diverse infrastructure spanning edge to cloud. They discuss the evolution from training-focused to inference-focused infrastructure, emphasizing the need for efficiency at scale, specialized hardware optimization, and the importance of smaller distilled models over large monolithic models. The discussion highlights critical operational challenges including power delivery, thermal management, and the need for full-stack engineering approaches to debug and optimize agentic systems in production environments.

Infrastructure Noise in Agentic Coding Evaluations

Anthropic

Anthropic discovered that infrastructure configuration alone can produce differences in agentic coding benchmark scores that exceed the typical margins between top models on leaderboards. Through systematic experiments running Terminal-Bench 2.0 across six resource configurations on Google Kubernetes Engine, they found a 6 percentage point gap between the most- and least-resourced setups. The research revealed that while moderate resource headroom (up to 3x specifications) primarily improves infrastructure stability by preventing spurious failures, more generous allocations actively help agents solve problems they couldn't solve before. These findings challenge the notion that small leaderboard differences represent pure model capability measurements and led to recommendations for specifying both guaranteed allocations and hard kill thresholds, calibrating resource bands empirically, and treating resource configuration as a first-class experimental variable in LLMOps practices.

Integrating Gemini for Natural Language Analytics in IoT Fleet Management

Cox 2M

Cox 2M, facing challenges with a lean analytics team and slow insight generation (taking up to a week per request), partnered with Thoughtspot and Google Cloud to implement Gemini-powered natural language analytics. The solution reduced time to insights by 88% while enabling non-technical users to directly query complex IoT and fleet management data using natural language. The implementation includes automated insight generation, change analysis, and natural language processing capabilities.

Integrating Symbolic Reasoning with LLMs for AI-Native Telecom Infrastructure

Ericsson

Ericsson's System Comprehension Lab is exploring the integration of symbolic reasoning capabilities into telecom-oriented large language models to address critical limitations in current LLM architectures for telecommunications infrastructure management. The problem centers on LLMs' inability to provide deterministic, explainable reasoning required for telecom network optimization, security, and anomaly detection—domains where hallucinations, lack of logical consistency, and black-box behavior are unacceptable. The proposed solution involves hybrid neural-symbolic AI architectures that combine the pattern recognition strengths of transformer-based LLMs with rule-based reasoning engines, connected through techniques like symbolic chain-of-thought prompting, program-aided reasoning, and external solver integration. This approach aims to enable AI-native wireless systems for 6G infrastructure that can perform cross-layer optimization, real-time decision-making, and intent-driven network management while maintaining the explainability and logical rigor demanded by production telecom environments.

Internal AI Orchestration and Automation Across Multiple Departments

Zapier

Zapier, a workflow automation platform company, faced the challenge of managing repetitive operational tasks across multiple departments while maintaining productivity and focus on strategic work. The company implemented a comprehensive AI and automation strategy using their own platform combined with LLM capabilities (primarily ChatGPT/OpenAI) to automate workflows across customer success, sales, HR, technical support, content creation, engineering, accounting, and revenue operations. The results demonstrate significant time savings through automated meeting transcriptions and summaries, AI-powered sentiment analysis of surveys, automated content generation and translation, chatbot-based internal support systems, and intelligent ticket routing and categorization, enabling teams to focus on higher-value strategic activities while maintaining operational efficiency.

Iterative Prompt Optimization and Model Selection for Nutritional Calorie Estimation

Taralli

Taralli, a calorie tracking application, demonstrates systematic LLM improvement through rigorous evaluation and prompt optimization. The developer addressed the challenge of accurate nutritional estimation by creating a 107-example evaluation dataset, testing multiple prompt optimization techniques (vanilla, few-shot bootstrapping, MIPROv2, and GEPA) across several models (Gemini 2.5 Flash, Gemini 3 Flash, and DeepSeek v3.2). Through this methodical approach, they achieved a 15% accuracy improvement by switching from Gemini 2.5 Flash to Gemini 3 Flash while using a few-shot learning approach with 16 examples, reaching 60% accuracy within a 10% calorie prediction threshold. The system was deployed with fallback model configurations and extended to support fully offline on-device inference for iOS.
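
The fallback configuration mentioned above can be sketched as a simple preference-ordered chain that falls through on errors; the model names and client call are placeholders for whichever SDKs are actually in use.

```python
# Sketch: try models in preference order, falling through on failures.
MODEL_CHAIN = ["gemini-3-flash", "gemini-2.5-flash", "deepseek-v3.2"]

class ModelError(Exception):
    pass

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real SDK call; raise ModelError on timeouts/5xx errors.
    if model == "gemini-3-flash":
        raise ModelError("simulated outage")
    return f"[{model}] ~320 kcal"

def estimate_with_fallback(prompt: str) -> str:
    last_err = None
    for model in MODEL_CHAIN:
        try:
            return call_model(model, prompt)
        except ModelError as err:
            last_err = err                      # log and try the next model
    raise RuntimeError("all models failed") from last_err

print(estimate_with_fallback("grilled chicken wrap, medium"))
```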

Kubernetes as a Platform for LLM Operations: Practical Experiences and Trade-offs

Various

A panel discussion between experienced Kubernetes and ML practitioners exploring the challenges and opportunities of running LLMs on Kubernetes. The discussion covers key aspects including GPU management, cost optimization, training vs inference workloads, and architectural considerations. The panelists share insights from real-world implementations while highlighting both benefits (like workload orchestration and vendor agnosticism) and challenges (such as container sizes and startup times) of using Kubernetes for LLM operations.

Large Bank LLMOps Implementation: Lessons from Deutsche Bank and Others

Various

A discussion between banking technology leaders about their implementation of generative AI, focusing on practical applications, regulatory challenges, and strategic considerations. Deutsche Bank's CTO and other banking executives share their experiences in implementing gen AI across document processing, risk modeling, research analysis, and compliance use cases, while emphasizing the importance of responsible deployment and regulatory compliance.

Large Recommender Models: Adapting Gemini for YouTube Video Recommendations

Google / YouTube

YouTube developed Large Recommender Models (LRM) by adapting Google's Gemini LLM for video recommendations, addressing the challenge of serving personalized content to billions of users. The solution involved creating semantic IDs to tokenize videos, continuous pre-training to teach the model both English and YouTube-specific video language, and implementing generative retrieval systems. While the approach delivered significant improvements in recommendation quality, particularly for challenging cases like new users and fresh content, the team faced substantial serving cost challenges that required 95%+ cost reductions and offline inference strategies to make production deployment viable at YouTube's scale.
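
To make the semantic ID idea concrete, a toy residual-quantization sketch follows; it illustrates the general technique rather than YouTube's implementation, and the codebook shapes are assumptions:

```python
import numpy as np

# Illustrative only: a continuous video embedding is quantized into a short
# sequence of discrete tokens, one per codebook level, so an LLM can read and
# generate recommendations as token sequences.

def semantic_id(embedding: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Greedy residual quantization: one discrete token per codebook level."""
    residual = embedding.astype(float).copy()
    tokens = []
    for codebook in codebooks:                     # codebook shape: (num_codes, dim)
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)                         # nearest code at this level
        residual -= codebook[idx]                  # quantize what remains
    return tokens                                  # e.g. [137, 41, 902]
```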

Large-Scale AI Assistant Deployment with Safety-First Evaluation Approach

Discord

Discord implemented Clyde AI, a chatbot assistant deployed to over 200 million users, focusing heavily on safety, security, and evaluation practices. The team developed a comprehensive evaluation framework using simple, deterministic tests and metrics, implemented through their open-source tool promptfoo. They faced unique challenges in preventing harmful content and jailbreaks, leading to innovative solutions in red teaming and risk assessment, while maintaining a balance between casual user interaction and safety constraints.

Large-Scale AI Red Teaming Competition Platform for Production Model Security

HackAPrompt, LearnPrompting

Sander Schulhoff of HackAPrompt and LearnPrompting presents a comprehensive case study on developing the first AI red teaming competition platform and educational resources for prompt engineering in production environments. The case study covers the creation of LearnPrompting, an open-source educational platform that has trained millions of users worldwide on prompt engineering techniques, and HackAPrompt, which ran the first prompt injection competition, collecting 600,000 prompts used by all major AI companies to benchmark and improve their models. The work demonstrates practical challenges in securing LLMs in production, including the development of systematic prompt engineering methodologies, automated evaluation systems, and the discovery that traditional security defenses are ineffective against prompt injection attacks.

Large-Scale Deployment of On-Device and Server Foundation Models for Consumer AI Features

Apple

Apple developed and deployed a comprehensive foundation model infrastructure consisting of a 3-billion parameter on-device model and a mixture-of-experts server model to power Apple Intelligence features across iOS, iPadOS, and macOS. The implementation addresses the challenge of delivering generative AI capabilities at consumer scale while maintaining privacy, efficiency, and quality across 15 languages. The solution involved novel architectural innovations including shared KV caches, parallel track mixture-of-experts design, and extensive optimization techniques including quantization and compression, resulting in production deployment across millions of devices with measurable performance improvements in text and vision tasks.

Lessons Learned from Deploying 30+ GenAI Agents in Production

Quic

Quic shares their experience deploying over 30 AI agents across various industries, focusing on customer experience and e-commerce applications. They developed a comprehensive approach to LLMOps that includes careful planning, persona development, RAG implementation, API integration, and robust testing and monitoring systems. The solution achieved 60% resolution of tier-one support issues with higher quality than human agents, while maintaining human involvement for complex cases.

Lessons Learned from Production AI Agent Deployments

Google / Vertex AI

A comprehensive overview of lessons learned from deploying AI agents in production at Google's Vertex AI division. The presentation covers three key areas: meta-prompting techniques for optimizing agent prompts, implementing multi-layered safety and guard rails, and the critical importance of evaluation frameworks. These insights come from real-world experience delivering hundreds of models into production with various developers, customers, and partners.

LLM Testing Framework Using LLMs as Quality Assurance Agents

Various

Alaska Airlines and Bitra developed QARL (Quality Assurance Response Liaison), an innovative testing framework that uses LLMs to evaluate other LLMs in production. The system conducts automated adversarial testing of customer-facing chatbots by simulating various user personas and conversation scenarios. This approach helps identify potential risks and unwanted behaviors before deployment, while providing scalable testing capabilities through containerized architecture on Google Cloud Platform.

LLM-Driven Developer Experience and Code Migrations at Scale

Uber

Uber's Developer Platform team explored three major initiatives using LLMs in production: a custom IDE coding assistant (which was later abandoned in favor of GitHub Copilot), an AI-powered test generation system called AutoCover, and an automated Java-to-Kotlin code migration system. The team combined deterministic approaches with LLMs to achieve significant developer productivity gains while maintaining code quality and safety. They found that while pure LLM approaches could be risky, hybrid approaches combining traditional software engineering practices with AI showed promising results.

LLM-Powered Customer Service Agent Copilot for E-commerce Support

Wayfair

Wayfair developed Wilma, an LLM-based copilot system to assist customer service agents in responding to customer inquiries about product issues. The system uses models like Gemini and GPT to draft contextual messages that agents can review and edit before sending. Through an iterative evolution from a single monolithic prompt to over 40 specialized prompt templates and multiple coordinated LLM calls, Wilma helps agents respond 12% faster while improving policy adherence by 2-5% depending on issue type. The system pulls real-time customer, order, and product data from Wayfair's systems to generate appropriate responses, with particular sophistication in handling complex resolution negotiation scenarios through a multi-LLM routing and analysis framework.
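
The routing pattern can be sketched as a classify-then-draft pair of LLM calls; the template names, prompt wording, and llm callable below are invented for illustration:

```python
# Invented templates standing in for Wilma's 40+ specialized prompts; live
# customer/order/product data would populate the placeholders.
PROMPT_TEMPLATES = {
    "damaged_item": "The customer reports damage to {product} on order {order}. Draft a reply...",
    "late_delivery": "Order {order} is delayed. Draft an empathetic status update...",
    "return_request": "The customer wants to return {product}. Explain the applicable policy...",
}

def draft_reply(llm, inquiry: str, context: dict) -> str:
    route = llm(
        f"Classify this inquiry as one of {sorted(PROMPT_TEMPLATES)}.\n"
        f"Inquiry: {inquiry}\nAnswer with the label only."
    ).strip()
    template = PROMPT_TEMPLATES.get(route, PROMPT_TEMPLATES["return_request"])
    return llm(template.format(**context))   # the human agent reviews and edits before sending
```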

LLM-Powered GraphQL Mock Data Generation for Developer Productivity

Airbnb

Airbnb developed an innovative solution to address the persistent challenge of creating and maintaining realistic GraphQL mock data for testing and prototyping. Engineers traditionally spent significant time manually writing and updating mock data, which would often drift out of sync with evolving queries. Airbnb introduced the @generateMock directive, which combines GraphQL schema validation, product context (including design mockups), and LLMs (specifically Gemini 2.5 Pro) to automatically generate type-safe, realistic mock data. The solution integrates seamlessly into their existing code generation workflow (Niobe CLI), keeping engineers in their local development loops. A companion @respondWithMock directive enables client engineers to prototype features before server implementations are complete. Since deployment, Airbnb engineers have generated and merged over 700 mocks across iOS, Android, and Web platforms, significantly reducing manual effort and accelerating development cycles.

LLM-Powered Style Compatibility Labeling Pipeline for E-Commerce Catalog Curation

Wayfair

Wayfair addressed the challenge of identifying stylistic compatibility among millions of products in their catalog by building an LLM-powered labeling pipeline on Google Cloud. Traditional recommendation systems relied on popularity signals and manual annotation, which was accurate but slow and costly. By leveraging Gemini 2.5 Pro with carefully engineered prompts that incorporate interior design principles and few-shot examples, they automated the binary classification task of determining whether product pairs are stylistically compatible. This approach improved annotation accuracy by 11% compared to initial generic prompts and enables scalable, consistent style-aware curation that will be used to evaluate and ultimately improve recommendation algorithms, with plans for future integration into production search and personalization systems.
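
A minimal sketch of the pairwise labeling call, assuming a generic text-completion interface; the few-shot examples and YES/NO parsing are illustrative, not Wayfair's engineered prompt:

```python
# Illustrative few-shot framing for the binary compatibility task.
FEW_SHOT = """You are an interior design expert. Decide whether two products share
a design language (era, silhouette, materials, finish family).

Example: mid-century walnut coffee table + ornate Victorian settee -> NO
Example: matte-black sconce + industrial pipe shelving -> YES
"""

def compatible(llm, product_a: str, product_b: str) -> bool:
    prompt = f"{FEW_SHOT}\nProduct A: {product_a}\nProduct B: {product_b}\nAnswer YES or NO:"
    return llm(prompt).strip().upper().startswith("YES")
```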

Managing Memory and Scaling Issues in Production AI Agent Systems

Gradient Labs

Gradient Labs experienced a series of interconnected production incidents involving their AI agent deployed on Google Cloud Run, starting with memory usage alerts that initially appeared to be memory leaks. The team discovered the root cause was Temporal workflow cache sizing issues causing container crashes, which they resolved by tuning cache parameters. However, this fix inadvertently caused auto-scaling problems that throttled their system's ability to execute activities, leading to increased latency. The incidents highlight the complex interdependencies in production AI systems and the need for careful optimization across all infrastructure layers.

MCP Protocol Development and Agent AI Foundation Launch

Anthropic / OpenAI / Goose

This podcast transcript covers the one-year journey of the Model Context Protocol (MCP) from its initial launch by Anthropic through to its donation to the newly formed Agent AI Foundation. The discussion explores how MCP evolved from a local-only protocol to support remote servers, authentication, and long-running tasks, addressing the fundamental challenge of connecting AI agents to external tools and data sources in production environments. The case study highlights extensive production usage of MCP both within Anthropic's internal systems and across major technology companies including OpenAI, Microsoft, and Google, demonstrating widespread adoption with millions of requests at scale. The formation of the Agent AI Foundation with founding members including Anthropic, OpenAI, and Block represents a significant industry collaboration to standardize agentic system protocols and ensure neutral governance of critical AI infrastructure.

Mercury: Agentic AI Platform for LLM-Powered Recommendation Systems

eBay

eBay developed Mercury, an internal agentic framework designed to scale LLM-powered recommendation experiences across its massive marketplace of over two billion active listings. The platform addresses the challenge of transforming vast amounts of unstructured data into personalized product recommendations by integrating Retrieval-Augmented Generation (RAG) with a custom Listing Matching Engine that bridges the gap between LLM-generated text outputs and eBay's dynamic inventory. Mercury enables rapid development through reusable, plug-and-play components following object-oriented design principles, while its near-real-time distributed queue-based execution platform handles cost and latency requirements at industrial scale. The system combines multiple retrieval mechanisms, semantic search using embedding models, anomaly detection, and personalized ranking to deliver contextually relevant shopping experiences to hundreds of millions of users.

Mission-Critical LLM Inference Platform Architecture

Baseten

Baseten has built a production-grade LLM inference platform focusing on three key pillars: model-level performance optimization, horizontal scaling across regions and clouds, and enabling complex multi-model workflows. The platform supports various frameworks including SGLang and TensorRT-LLM, and has been successfully deployed by foundation model companies and enterprises requiring strict latency, compliance, and reliability requirements. A key differentiator is their ability to handle mission-critical inference workloads with sub-400ms latency for complex use cases like AI phone calls.

MLOps Evolution and LLM Integration at a Major Bank

Barclays

Discussion of MLOps practices and the evolution towards LLM integration at Barclays, focusing on the transition from traditional ML to GenAI workflows while maintaining production stability. The case study highlights the importance of balancing innovation with regulatory requirements in financial services, emphasizing ROI-driven development and the creation of reusable infrastructure components.

MLOps Platform for Airline Operations with LLM Integration

LATAM Airlines

LATAM Airlines developed Cosmos, a vendor-agnostic MLOps framework that enables both traditional ML and LLM deployments across their business operations. The framework reduced model deployment time from 3-4 months to less than a week, supporting use cases from fuel efficiency optimization to personalized travel recommendations. The platform demonstrates how a traditional airline can transform into a data-driven organization through effective MLOps practices and careful integration of AI technologies.

Model Context Protocol (MCP): Building Universal Connectivity for LLMs in Production

Anthropic

Anthropic developed and open-sourced the Model Context Protocol (MCP) to address the challenge of providing external context and tool connectivity to large language models in production environments. The protocol emerged from recognizing that teams were repeatedly reimplementing the same capabilities across different contexts (coding editors, web interfaces, and various services) where Claude needed to interact with external systems. By creating a universal standard protocol and open-sourcing it, Anthropic enabled developers to build integrations once and deploy them everywhere, while fostering an ecosystem that became what they describe as the fastest-growing open source protocol in history. The protocol has matured from requiring local server deployments to supporting remote hosted servers with a central registry, reducing friction for both developers and end users while enabling sophisticated production use cases across enterprise integrations and personal automation.

Modernizing DevOps with Generative AI: Challenges and Best Practices in Production

Various (Bundesliga, Harness, Trice)

A panel of experts from various organizations discusses the current state and challenges of integrating generative AI into DevOps workflows and production environments. The discussion covers how companies are balancing productivity gains with security concerns, the importance of having proper testing and evaluation frameworks, and strategies for successful adoption of AI tools in production DevOps processes while maintaining code quality and security.

Multi-Agent Architecture for Automated Advertising Media Planning

Spotify

Spotify faced a structural problem where multiple advertising buying channels (Direct, Self-Serve, Programmatic) relied on consolidated backend services but implemented fragmented, channel-specific workflow logic, creating duplicated decision-making and technical debt. To address this, they built "Ads AI," a multi-agent system using Google's Agent Development Kit (ADK) and Vertex AI that transforms media planning from a manual 15-30 minute process requiring 20+ form fields into a conversational interface that generates optimized, data-driven media plans in 5-10 seconds using 1-3 natural language messages. The system decomposes media planning into specialized agents (RouterAgent, GoalResolverAgent, AudienceResolverAgent, BudgetAgent, ScheduleAgent, and MediaPlannerAgent) that execute in parallel, leverage historical campaign performance data via function calling tools, and produce recommendations based on cost optimization, delivery rates, and budget matching heuristics.
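
Setting the ADK specifics aside, the parallel fan-out can be sketched with plain asyncio; the resolver stubs and the async llm callable below are hypothetical stand-ins for the specialized agents named above:

```python
import asyncio

async def resolve_goal(brief: str) -> str:
    return "maximize conversions"          # stand-in for GoalResolverAgent

async def resolve_audience(brief: str) -> str:
    return "US podcast listeners, 25-34"   # stand-in for AudienceResolverAgent

async def resolve_budget(brief: str) -> str:
    return "$50k over 4 weeks"             # stand-in for BudgetAgent

async def build_media_plan(llm, brief: str) -> str:
    # resolvers run concurrently, then a planner call combines their outputs
    goal, audience, budget = await asyncio.gather(
        resolve_goal(brief), resolve_audience(brief), resolve_budget(brief)
    )
    return await llm(
        f"Draft a media plan.\nGoal: {goal}\nAudience: {audience}\nBudget: {budget}"
    )
```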

Multi-Agent DBT Development Workflow for Data Engineering Consulting

Mammoth Growth

Mammoth Growth, a boutique data consultancy specializing in marketing and customer data, developed a multi-agent AI system to automate DBT development workflows in response to data teams struggling to deliver analytics at the speed of business. The solution employs a team of specialized AI agents (orchestrator, analyst, architect, and analytics engineer) that leverage the DBT Model Context Protocol (MCP) to autonomously write, document, and test production-grade DBT code from detailed specifications. The system enabled the delivery of a complete enterprise-grade data lineage with 15 data models and two gold-layer models in just 3 weeks for a pilot client, compared to an estimated 10 weeks using traditional manual development approaches, while maintaining code quality standards through human-led requirements gathering and mandatory code review before production deployment.

Multi-Agent LLM System for Business Process Automation

Cognizant

Cognizant developed Neuro AI, a multi-agent LLM-based system that enables business users to create and deploy AI-powered decision-making workflows without requiring deep technical expertise. The platform allows agents to communicate with each other to handle complex business processes, from intranet search to process automation, with the ability to deploy either in the cloud or on-premises. The system includes features for opportunity identification, use case scoping, synthetic data generation, and automated workflow creation, all while maintaining explainability and human oversight.

Multi-Agent LLM Systems: Implementation Patterns and Production Case Studies

Nimble Gravity, Hiflylabs

A research study conducted by Nimble Gravity and Hiflylabs examining GenAI adoption patterns across industries, revealing that approximately 28-30% of GenAI projects successfully transition from assessment to production. The study explores various multi-agent LLM architectures and their implementation in production, including orchestrator-based, agent-to-agent, and shared message pool patterns, demonstrating practical applications like automated customer service systems that achieved significant cost savings.

Multi-Company Panel Discussion on Enterprise AI and Agentic AI Deployment Challenges

Glean / Deloitte / Docusign

This panel discussion at AWS re:Invent brings together practitioners from Glean, Deloitte, and DocuSign to discuss the practical realities of deploying AI and agentic AI systems in enterprise environments. The panelists explore challenges around organizational complexity, data silos, governance, agent creation and sharing, value measurement, and the tension between autonomous capabilities and human oversight. Key themes include the need for cross-functional collaboration, the importance of security integration from day one, the difficulty of measuring AI-driven productivity gains, and the evolution from individual AI experimentation to governed enterprise-wide agent deployment. The discussion emphasizes that successful AI transformation requires reimagining workflows rather than simply bolting AI onto legacy systems, and that business value should drive technical decisions rather than focusing solely on which LLM model to use.

Multi-Company Panel Discussion on Production LLM Frameworks and Scaling Challenges

Various (Thinking Machines, Yutori, EvolutionaryScale, Perplexity, Axiom)

This panel discussion features experts from multiple AI companies discussing the current state and future of agentic frameworks, reinforcement learning applications, and production LLM deployment challenges. The panelists from Thinking Machines, Perplexity, EvolutionaryScale, and Axiom share insights on framework proliferation, the role of RL in post-training, domain-specific applications in mathematics and biology, and infrastructure bottlenecks when scaling models to hundreds of GPUs, highlighting the gap between research capabilities and production deployment tools.

Multi-Industry AI Deployment Strategies with Diverse Hardware and Sovereign AI Considerations

AMD / Somite AI / Upstage / Rambler AI

This panel discussion at AWS re:Invent features three companies deploying AI models in production across different industries: Somite AI using machine learning for computational biology and cellular control, Upstage developing sovereign AI with proprietary LLMs and OCR for document extraction in enterprises, and Rambler AI building vision language models for industrial task verification. All three leverage AMD GPU infrastructure (MI300 series) for training and inference, emphasizing the importance of hardware choice, open ecosystems, seamless deployment, and cost-effective scaling. The discussion highlights how smaller, domain-specific models can achieve enterprise ROI where massive frontier models failed, and explores emerging areas like physical AI, world models, and data collection for robotics.

Multi-Label Red Flag Detection System for Fraud Prevention

Feedzai

Feedzai developed ScamAlert, a generative AI-based system that moves beyond traditional binary scam classification to identify specific red flags in suspected fraud attempts. The system addresses the limitations of binary classifiers that only output risk scores without explanation by using multimodal LLMs to analyze screenshots of suspected scams (emails, text messages, listings) and identify observable warning signs like suspicious links, urgency tactics, or unusual communication channels. The team created a comprehensive benchmarking framework to evaluate multiple commercial multimodal models across four dimensions: red flag detection accuracy (precision/recall/F1), instruction adherence, cost, and latency. Their results showed significant performance variations across models, with GPT-5, Gemini 3 Pro, and Gemini 2.5 Pro leading in accuracy, though with notable tradeoffs in cost and latency, while also revealing instruction-following issues in some models that generated hallucinated red flags not in the predefined taxonomy.
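
The taxonomy-constrained scoring can be sketched as set arithmetic over predicted and gold red flags; the taxonomy entries below are invented examples, not Feedzai's actual labels:

```python
# Hypothetical taxonomy; anything a model predicts outside it counts as a
# hallucinated red flag, separate from ordinary precision/recall errors.
TAXONOMY = {"suspicious_link", "urgency_tactic", "unusual_channel", "payment_pressure"}

def score(predicted: set[str], gold: set[str]) -> dict:
    hallucinated = predicted - TAXONOMY        # flags the model invented
    valid = predicted & TAXONOMY
    tp = len(valid & gold)
    precision = tp / len(valid) if valid else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "hallucinated": len(hallucinated)}
```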

Multi-Tenant MCP Server Authentication with Redis Session Management

BrainGrid

BrainGrid faced the challenge of transforming their Model Context Protocol (MCP) server from a local development tool into a production-ready, multi-tenant service that could be deployed to customers. The core problem was that serverless platforms like Cloud Run and Vercel don't maintain session state, causing users to re-authenticate repeatedly as instances scaled to zero or requests hit different instances. BrainGrid solved this by implementing a Redis-based session store with AES-256-GCM encryption, OAuth integration via WorkOS, and a fast-path/slow-path authentication pattern that caches validated JWT sessions. The solution reduced authentication overhead from 50-100ms per request to near-instantaneous for cached sessions, eliminated re-authentication fatigue, and enabled the MCP server to scale from single-user to multi-tenant deployment while maintaining security and performance.
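
A minimal sketch of the fast-path/slow-path pattern, assuming redis-py and the cryptography package; the key handling, Redis key scheme, and TTL are illustrative assumptions rather than BrainGrid's implementation:

```python
import json
import os

import redis
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

r = redis.Redis()
key = bytes.fromhex(os.environ["SESSION_KEY_HEX"])  # 32 bytes -> AES-256

def cache_session(session_id: str, claims: dict, ttl: int = 3600) -> None:
    """Encrypt validated JWT claims and cache them so later requests skip OAuth."""
    nonce = os.urandom(12)                                   # standard GCM nonce size
    blob = nonce + AESGCM(key).encrypt(nonce, json.dumps(claims).encode(), None)
    r.setex(f"mcp:session:{session_id}", ttl, blob)

def load_session(session_id: str) -> dict | None:
    blob = r.get(f"mcp:session:{session_id}")
    if blob is None:
        return None                                          # slow path: full OAuth/JWT validation
    return json.loads(AESGCM(key).decrypt(blob[:12], blob[12:], None))  # fast path
```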

Multilingual Content Navigation and Localization System

Intercom

YouTube, a Google company, implements a comprehensive multilingual navigation and localization system for its global platform. The source text appears to be in Dutch, demonstrating the platform's localization capabilities, though insufficient details are provided about the specific LLMOps implementation.

Multilingual Text Editing via Instruction Tuning

Grammarly

Grammarly's Strategic Research team developed mEdIT, a multilingual extension of their CoEdIT text editing model, to support intelligent writing assistance across seven languages and three editing tasks (grammatical error correction, text simplification, and paraphrasing). The problem addressed was that foundational LLMs produce low-quality outputs for text editing tasks, and prior specialized models only supported either multiple tasks in one language or single tasks across multiple languages. By fine-tuning multilingual LLMs (including mT5, mT0, BLOOMZ, PolyLM, and Bactrian-X) on over 200,000 carefully curated instruction-output pairs across Arabic, Chinese, English, German, Japanese, Korean, and Spanish, mEdIT achieved strong performance across tasks and languages, even when instructions were given in a different language than the text being edited. The models demonstrated generalization to unseen languages, with causal language models performing best, and received high ratings from human evaluators, though the work has not yet been integrated into Grammarly's production systems.

Multimodal AI Vector Search for Advanced Video Understanding

Twelve Labs

Twelve Labs developed an integration with Databricks Mosaic AI to enable advanced video understanding capabilities through multimodal embeddings. The solution addresses challenges in processing large-scale video datasets and providing accurate multimodal content representation. By combining Twelve Labs' Embed API for generating contextual vector representations with Databricks Mosaic AI Vector Search's scalable infrastructure, developers can implement sophisticated video search, recommendation, and analysis systems with reduced development time and resource needs.

Native Image Generation with Multimodal Context in Gemini 2.5 Flash

Google DeepMind

Google DeepMind released an updated native image generation capability in Gemini 2.5 Flash that represents a significant quality leap over previous versions. The model addresses key production challenges including consistent character rendering across multiple angles, pixel-perfect editing that preserves scene context, and improved text rendering within images. Through interleaved generation, the model can maintain conversation context across multiple editing turns, enabling iterative creative workflows. The team tackled evaluation challenges by combining human preference data with specific technical metrics like text rendering quality, while incorporating real user feedback from social media to create comprehensive benchmarks that drive model improvements.

Natural Language Analytics with Snowflake Cortex for Self-Service BI

Gitlab

GitLab implemented conversational analytics using Snowflake Cortex to enable non-technical business users to query structured data using natural language, eliminating the traditional dependency on data analysts and reducing analytics backlog. The solution evolved from a basic proof-of-concept with 60% accuracy to a production system achieving 85-95% accuracy for simple queries and 75% for complex queries, utilizing semantic models, prompt engineering, verified query feedback loops, and role-based access controls. The implementation reduced analytics requests by approximately 50% for some teams, decreased time-to-insight from weeks to seconds, and democratized data access while maintaining enterprise-grade security through Snowflake's native governance features.
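
The verified-query feedback loop can be sketched as analyst-approved question-to-SQL pairs injected as few-shot examples into later text-to-SQL prompts; the storage and prompt shape below are illustrative assumptions:

```python
verified: list[tuple[str, str]] = []        # (question, analyst-approved SQL)

def record_verified(question: str, sql: str) -> None:
    """Called when a human confirms a generated query answered the question correctly."""
    verified.append((question, sql))

def text_to_sql(llm, question: str, semantic_model: str) -> str:
    """Ground the model in the semantic layer plus the most recent verified pairs."""
    examples = "\n\n".join(f"Q: {q}\nSQL: {s}" for q, s in verified[-5:])
    return llm(
        f"Semantic model:\n{semantic_model}\n\n"
        f"Verified examples:\n{examples}\n\n"
        f"Q: {question}\nSQL:"
    )
```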

Natural Language Interface to Business Intelligence Using RAG

Volvo

Volvo implemented a Retrieval Augmented Generation (RAG) system that allows non-technical users to query business intelligence data through a Slack interface using natural language. The system translates natural language questions into SQL queries for BigQuery, executes them, and returns the results, effectively automating what was previously manual work done by data analysts. The system leverages DBT metadata and schema information to provide accurate responses while maintaining control over data access.
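
A minimal sketch of the flow, assuming a generic llm callable and the google-cloud-bigquery client; the schema-context injection stands in for the DBT metadata described above:

```python
from google.cloud import bigquery

client = bigquery.Client()

def answer(llm, question: str, schema_context: str) -> list[dict]:
    """Natural language -> SQL -> executed BigQuery result rows."""
    sql = llm(
        f"Given these BigQuery tables and columns:\n{schema_context}\n\n"
        f"Write a single SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    return [dict(row) for row in client.query(sql).result()]
```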

Network Operations Transformation with GenAI and AIOps

Vodafone

Vodafone implemented a comprehensive AI and GenAI strategy to transform their network operations, focusing on improving customer experience through better network management. They migrated from legacy OSS systems to a cloud-based infrastructure on Google Cloud Platform, integrating over 2 petabytes of network data with commercial and IT data. The initiative includes AI-powered network investment planning, automated incident management, and device analytics, resulting in significant operational efficiency improvements and a planned 50% reduction in OSS tools.

On-Device Grammar Correction with Sequence-to-Sequence Models

Google

Google Research developed an on-device grammar correction system for Gboard on Pixel 6 that detects and suggests corrections for grammatical errors as users type. The solution addresses the challenge of implementing neural grammar correction within the constraints of mobile devices (limited memory, computational power, and latency requirements) while preserving user privacy by keeping all processing local. The team built a 20MB hybrid Transformer-LSTM model using hard distillation from a cloud-based system, achieving inference on 60 characters in under 22ms on the Pixel 6 CPU, enabling real-time grammar correction for both complete sentences and partial sentence prefixes across English text in nearly any app using Gboard.
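
Hard distillation, as described here, amounts to using the cloud teacher's corrected outputs as training labels for the small on-device student; the teacher and student interfaces below are hypothetical placeholders:

```python
def build_distillation_set(teacher, sentences: list[str]) -> list[tuple[str, str]]:
    """Teacher outputs become hard labels: (original sentence, corrected sentence)."""
    return [(s, teacher.correct(s)) for s in sentences]

def train_student(student, pairs: list[tuple[str, str]], epochs: int = 3) -> None:
    """Ordinary seq2seq training of the compact model against the teacher's labels."""
    for _ in range(epochs):
        for source, target in pairs:
            student.step(source, target)   # one cross-entropy update per pair
```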

Open-Source Protein Structure Prediction and Generative Design Platform for Drug Discovery

Boltz

Boltz, founded by Gabriele Corso and Jeremy Wohlwend, developed an open-source suite of AI models (Boltz-1, Boltz-2, and BoltzGen) for structural biology and protein design, democratizing access to capabilities previously held by proprietary systems like AlphaFold 3. The company addresses the challenge of predicting complex molecular interactions (protein-ligand, protein-protein) and designing novel therapeutic proteins by combining generative diffusion models with specialized equivariant architectures. Their approach achieved validated nanomolar binders for two-thirds of nine previously unseen protein targets, demonstrating genuine generalization beyond training data. The newly launched Boltz Lab platform provides a production-ready infrastructure with optimized GPU kernels running 10x faster than open-source versions, offering agents for protein and small molecule design with collaborative interfaces for medicinal chemists and researchers.

Optimizing Engineering Design with Conditional GANs

Rolls-Royce

Rolls-Royce collaborated with Databricks to enhance their design space exploration capabilities using conditional Generative Adversarial Networks (cGANs). The project aimed to leverage legacy simulation data to identify and assess innovative design concepts without requiring traditional geometry modeling and simulation processes. By implementing cGANs on the Databricks platform, they successfully developed a system that could handle multi-objective constraints and optimize design processes while maintaining compliance with aerospace industry requirements.

Optimizing LLM Server Startup Times for Preemptable GPU Infrastructure

Replit

Replit faced challenges with running LLM inference on expensive GPU infrastructure and implemented a solution using preemptable cloud GPUs to reduce costs by two-thirds. The key challenge was reducing server startup time from 18 minutes to under 2 minutes to handle preemption events, which they achieved through container optimization, GKE image streaming, and improved model loading processes.

Optimizing RAG-based Search Results for Production: A Journey from POC to Production

Statista

Statista, a global data platform, developed and optimized a RAG-based AI search system to enhance their platform's search capabilities. Working with Urial Labs and Talent Formation, they transformed a basic prototype into a production-ready system that improved search quality by 140%, reduced costs by 65%, and decreased latency by 10%. The resulting Research AI product has seen growing adoption among paying customers and demonstrates superior performance compared to general-purpose LLMs for domain-specific queries.

Optimizing Security Incident Response with LLMs at Google

Google

Google implemented LLMs to streamline their security incident response workflow, particularly focusing on incident summarization and executive communications. They used structured prompts and careful input processing to generate high-quality summaries while ensuring data privacy and security. The implementation resulted in a 51% reduction in time spent on incident summaries and 53% reduction in executive communication drafting time, while maintaining or improving quality compared to human-written content.

Panel Discussion on LLMOps Challenges: Model Selection, Ethics, and Production Deployment

Google / Databricks

A panel discussion featuring leaders from various AI companies discussing the challenges and solutions in deploying LLMs in production. Key topics included model selection criteria, cost optimization, ethical considerations, and architectural decisions. The discussion highlighted practical experiences from companies like Interact.ai's healthcare deployment, Inflection AI's emotionally intelligent models, and insights from Google and Databricks on responsible AI deployment and tooling.

Panel Discussion: Scaling Generative AI in Enterprise - Challenges and Best Practices

Various

A panel discussion featuring leaders from Google Cloud AI, Symbol AI, Chain ML, and Deloitte discussing the adoption, scaling, and implementation challenges of generative AI across different industries. The panel explores key considerations around model selection, evaluation frameworks, infrastructure requirements, and organizational readiness while highlighting practical approaches to successful GenAI deployment in production.

Parallel Asynchronous AI Coding Agents for Development Workflows

Google

Google Labs introduced Jules, an asynchronous coding agent designed to execute development tasks in parallel in the background while developers focus on higher-value work. The product addresses the challenge of serial development workflows by enabling developers to spin up multiple cloud-based agents simultaneously to handle tasks like SDK updates, testing, accessibility audits, and feature development. Launched two weeks prior to the presentation, Jules had already generated 40,000 public commits. The demonstration showcased how a developer could parallelize work on a conference schedule website by simultaneously running multiple test framework implementations, adding features like calendar integration and AI summaries, while conducting accessibility and security audits—all managed through a VM-based cloud infrastructure powered by Gemini 2.5 Pro.

Pivoting from GPU Infrastructure to Building an AI-Powered Development Environment

Windsurf

Windsurf began as a GPU virtualization company but pivoted in 2022 when they recognized the transformative potential of large language models. They developed an AI-powered development environment that evolved from a VS Code extension to a full-fledged IDE, incorporating advanced code understanding and generation capabilities. The product now serves hundreds of thousands of daily active users, including major enterprises, and has achieved significant success in automating software development tasks while maintaining high precision through sophisticated evaluation systems.

Post-Training and Production LLM Systems at Scale

OpenAI

This case study explores OpenAI's approach to post-training and deploying large language models in production environments, featuring insights from a post-training researcher working on reasoning models. The discussion covers the operational complexities of reinforcement learning from human feedback at scale, the evolution from non-thinking to thinking models, and production challenges including model routing, context window optimization, token efficiency improvements, and interruptability features. Key developments include the shopping model release, improvements from GPT-4.1 to GPT-5.1, and the operational realities of managing complex RL training runs with multiple grading setups and infrastructure components that require constant monitoring and debugging.

Practical Lessons Learned from Building and Deploying GenAI Applications

Bolbeck

A comprehensive overview of lessons learned from building GenAI applications over 1.5 years, focusing on the complexities and challenges of deploying LLMs in production. The presentation covers key aspects of LLMOps including model selection, hosting options, ensuring response accuracy, cost considerations, and the importance of observability in AI applications. Special attention is given to the emerging role of AI agents and the critical balance between model capability and operational costs.

Production AI Agents for Accounting Automation: Engineering Process Daemons at Scale

Digits

Digits, an AI-native accounting platform, shares their experience running AI agents in production for over 2 years, addressing real-world challenges in deploying LLM-based systems. The team reframes "agents" as "process daemons" to set appropriate expectations and details their implementation across three use cases: vendor data enrichment, client onboarding, and complex query handling. Their solution emphasizes building lightweight custom infrastructure over dependency-heavy frameworks, reusing existing APIs as agent tools, implementing comprehensive observability with OpenTelemetry, and establishing robust guardrails. The approach has enabled reliable automation while maintaining transparency, security, and performance through careful engineering rather than relying on framework abstractions.

Production AI Systems for News Personalization and Journalistic Workflows

Bonnier News

Bonnier News, a major Swedish media publisher with over 200 brands including Expressen and local newspapers, has deployed AI and machine learning systems in production to solve content personalization and newsroom automation challenges. The company's data science team, led by product manager Hans Yell (PhD in computational linguistics) and head of architecture Magnus Engster, has built white-label personalization engines using embedding-based recommendation systems that outperform manual content curation while scaling across multiple brands. They leverage vector similarity and user reading patterns rather than traditional metadata, achieving significant engagement lifts. Additionally, they're developing LLM-powered tools for journalists including headline generation, news aggregation summaries, and trigger questions for articles. Through a WASP-funded PhD collaboration, they're working on domain-adapted Swedish language models via continued pre-training of Llama models with Bonnier's extensive text corpus, focusing on capturing brand tone and improving journalistic workflows while maintaining data sovereignty.

Production LLM Systems at Scale - Lessons from Financial Services, Legal Tech, and ML Infrastructure

Nubank, Harvey AI, Galileo and Convirza

A panel discussion featuring leaders from Nubank, Harvey AI, Galileo, and Convirza discussing their experiences implementing LLMs in production. The discussion covered key challenges and solutions around model evaluation, cost optimization, latency requirements, and the transition from large proprietary models to smaller fine-tuned models. Participants shared insights on modularizing LLM applications, implementing human feedback loops, and balancing the tradeoffs between model size, cost, and performance in production environments.

Production Monitoring and Issue Discovery for AI Agents

Raindrop

Raindrop's CTO Ben presents a comprehensive framework for building reliable AI agents in production, addressing the challenge that traditional offline evaluations cannot capture the full complexity of real-world user behavior. The core problem is that AI agents fail in subtle ways without concrete errors, making issues difficult to detect and fix. Raindrop's solution centers on a "discover, track, and fix" loop that combines explicit signals like thumbs up/down with implicit signals detected semantically in conversations, such as user frustration, task failures, and agent forgetfulness. By clustering these signals with user intents and tracking them over time, teams can identify the most impactful issues and systematically improve their agents. The approach emphasizes experimentation and production monitoring over purely offline testing, drawing parallels to how traditional software engineering shifted from extensive QA to tools like Sentry for error monitoring.

Production RAG Stack Development Through 37 Iterations for Financial Services

jonfernandes

Independent AI engineer Jonathan Fernandez shares his experience developing a production-ready RAG (Retrieval Augmented Generation) stack through 37 failed iterations, focusing on building solutions for financial institutions. The case study demonstrates the evolution from a naive RAG implementation to a sophisticated system incorporating query processing, reranking, and monitoring components. The final architecture uses LlamaIndex for orchestration, Qdrant for vector storage, open-source embedding models, and Docker containerization for on-premises deployment, achieving significantly improved response quality for document-based question answering.

Production Vector Search and Retrieval System Optimization at Scale

Superlinked

Superlinked, a company focused on vector search infrastructure, shares production insights from deploying information retrieval systems for e-commerce and enterprise knowledge management with indexes up to 2 terabytes. The presentation addresses challenges in relevance, latency, and cost optimization when deploying vector search systems at scale. Key solutions include avoiding vector pooling/averaging, implementing late interaction models, fine-tuning embeddings for domain-specific needs, combining sparse and dense representations, leveraging graph embeddings, and using template-based query generation instead of unconstrained text-to-SQL. Results demonstrate 5%+ precision improvements through targeted fine-tuning, significant latency reductions through proper database selection and query optimization, and improved relevance through multi-encoder architectures that combine text, graph, and metadata signals.
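
One of these ideas, blending sparse and dense relevance instead of relying on a single pooled vector, can be sketched in a few lines; the alpha weight and pre-normalized score scales are illustrative assumptions:

```python
def hybrid_score(
    dense_scores: dict[str, float],
    sparse_scores: dict[str, float],
    alpha: float = 0.7,
) -> list[tuple[str, float]]:
    """Blend embedding (dense) and keyword (sparse) relevance per document."""
    ids = set(dense_scores) | set(sparse_scores)
    blended = {
        doc: alpha * dense_scores.get(doc, 0.0) + (1 - alpha) * sparse_scores.get(doc, 0.0)
        for doc in ids
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)
```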

Production-Ready LLM Integration Using Retrieval-Augmented Generation and Custom ReAct Implementation

Buzzfeed

BuzzFeed Tech tackled the challenges of integrating LLMs into production by addressing dataset recency limitations and context window constraints. They evolved from using vanilla ChatGPT with crafted prompts to implementing a sophisticated retrieval-augmented generation system. After exploring self-hosted models and LangChain, they developed a custom "native ReAct" implementation combined with an enhanced Nearest Neighbor Search Architecture using Pinecone, resulting in a more controlled, cost-efficient, and production-ready LLM system.
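
A compact, generic ReAct loop in the spirit of that approach (not BuzzFeed's code) looks roughly like this; the Action/Observation formatting and the tool-call syntax are invented conventions:

```python
def react(llm, tools: dict, question: str, max_steps: int = 5) -> str:
    """Interleave model 'Thoughts' with tool 'Observations' until a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        if "Action:" in step:
            # invented convention: "Action: tool_name | argument"
            name, _, arg = step.split("Action:")[1].partition("|")
            observation = tools[name.strip()](arg.strip())
            transcript += f"Observation: {observation}\n"
    return "No answer within the step budget."
```

In the production system described above, one such tool would wrap the Pinecone-backed nearest-neighbor search to retrieve recent, relevant context.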

Production-Scale Document Parsing with Vision-Language Models and Specialized OCR

Reducto

Reducto has built a production document parsing system that processes over 1 billion documents by combining specialized vision-language models, traditional OCR, and layout detection models in a hybrid pipeline. The system addresses critical challenges in document parsing including hallucinations from frontier models, dense tables, handwritten forms, and complex charts. Their approach uses a divide-and-conquer strategy where different models are routed to different document regions based on complexity, achieving higher accuracy than AWS Textract, Microsoft Azure Document Intelligence, and Google Cloud OCR on their internal benchmarks. The company has expanded beyond parsing to offer extraction with pixel-level citations and an edit endpoint for automated form filling.

Rebuilding a Production Chatbot with Direct API Access and Multi-Agent Architecture

Langchain

LangChain rebuilt their public documentation chatbot after discovering their support engineers preferred using their own internal workflow over the existing tool. The original chatbot used traditional vector embedding retrieval, which suffered from fragmented context, constant reindexing, and vague citations. The solution involved building two distinct architectures: a fast CreateAgent for simple documentation queries delivering sub-15-second responses, and a Deep Agent with specialized subgraphs for complex queries requiring codebase analysis. The new approach replaced vector embeddings with direct API access to structured content (Mintlify for docs, Pylon for knowledge base, and ripgrep for codebase search), enabling the agent to search iteratively like a human. Results included dramatically faster response times, precise citations with line numbers, elimination of reindexing overhead, and internal adoption by support engineers for complex troubleshooting.

Rebuilding an AI SDR Agent with Multi-Agent Architecture for Enterprise Sales Automation

11x

11x rebuilt their AI Sales Development Representative (SDR) product Alice from scratch in just 3 months, transitioning from a basic campaign creation tool to a sophisticated multi-agent system capable of autonomous lead sourcing, research, and email personalization. The team experimented with three different agent architectures - React, workflow-based, and multi-agent systems - ultimately settling on a hierarchical multi-agent approach with specialized sub-agents for different tasks. The rebuilt system now processes millions of leads and messages with a 2% reply rate comparable to human SDRs, demonstrating the evolution from simple AI tools to true digital workers in production sales environments.

Responsible AI Implementation for Healthcare Form Automation

WellSky

WellSky, serving over 2,000 hospitals and handling 100 million forms annually, partnered with Google Cloud to address clinical documentation burden and clinician burnout. They developed an AI-powered solution focusing on form automation, implementing a comprehensive responsible AI framework with emphasis on evidence citation, governance, and technical foundations. The project aimed to reduce "pajama time" - where 75% of nurses complete documentation after hours - while ensuring patient safety through careful AI deployment.

Retrieval Augmented LLMs for Real-time CRM Account Linking

Schneider Electric

Schneider Electric partnered with AWS Machine Learning Solutions Lab to automate their CRM account linking process using Retrieval Augmented Generation (RAG) with Flan-T5-XXL model. The solution combines LangChain, Google Search API, and SEC-10K data to identify and maintain up-to-date parent-subsidiary relationships between customer accounts, improving accuracy from 55% to 71% through domain-specific prompt engineering.

Revenue Intelligence Platform with Ambient AI Agents

Tabs

Tabs, a vertical AI company in the finance space, has built a revenue intelligence platform for B2B companies that uses ambient AI agents to automate financial workflows. The company extracts information from sales contracts to create a "commercial graph" and deploys AI agents that work autonomously in the background to handle billing, collections, and reporting tasks. Their approach moves beyond traditional guided AI experiences toward fully ambient agents that monitor communications and trigger actions automatically, with the goal of creating "beautiful operational software that no one ever has to go into."

Scaling a High-Traffic LLM Chat Application to 30,000 Messages Per Second

Character.ai

Character.ai scaled their open-domain conversational AI platform from 300 to over 30,000 generations per second within 18 months, becoming the third most-used generative AI application globally. They tackled unique engineering challenges around data volume, cost optimization, and connection management while maintaining performance. Their solution involved custom model architectures, efficient GPU caching strategies, and innovative prompt management tools, all while balancing performance, latency, and cost considerations at scale.

Scaling Agentic AI Systems for Real Estate Due Diligence: Managing Prompt Tax at Production Scale

Orbital

Orbital, a real estate technology company, developed an agentic AI system called Orbital Co-pilot to automate legal due diligence for property transactions. The system processes hundreds of pages of legal documents to extract key information traditionally done manually by lawyers. Over 18 months, they scaled from zero to processing 20 billion tokens monthly and achieved multiple seven figures in annual recurring revenue. The presentation focuses on their concept of "prompt tax" - the hidden costs and complexities of continuously upgrading AI models in production, including prompt migration, regression risks, and the operational challenges of shipping at the AI frontier.

Scaling AI Infrastructure: From Training to Inference at Meta

Meta

Meta shares their journey in scaling AI infrastructure to support massive LLM training and inference operations. The company faced challenges in scaling from 256 GPUs to over 100,000 GPUs in just two years, with plans to reach over a million GPUs by year-end. They developed solutions for distributed training, efficient inference, and infrastructure optimization, including new approaches to data center design, power management, and GPU resource utilization. Key innovations include the development of a virtual machine service for secure code execution, improvements in distributed inference, and novel approaches to reducing model hallucinations through RAG.

Scaling AI Product Development with Rigorous Evaluation and Observability

Notion

Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as their evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.
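
A generic LLM-as-a-judge scorer of the kind described works roughly as follows; the rubric framing and 1-to-5 scale are illustrative, not Notion's scoring functions:

```python
def judge(llm, question: str, answer: str, rubric: str) -> float:
    """Grade an answer against a rubric with a judge model; return a 0-1 score."""
    verdict = llm(
        f"Rubric:\n{rubric}\n\nQuestion: {question}\nAnswer: {answer}\n"
        "Grade the answer from 1 (poor) to 5 (excellent). Reply with the number only."
    )
    return (float(verdict.strip()) - 1) / 4   # map the 1-5 grade onto 0.0-1.0
```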

Scaling AI-Assisted Coding Infrastructure: From Auto-Complete to Global Deployment

Cursor

Cursor, an AI-assisted coding platform, scaled their infrastructure from handling basic code completion to processing 100 million model calls per day across a global deployment. They faced and overcame significant challenges in database management, model inference scaling, and indexing systems. The case study details their journey through major incidents, including a database crisis that led to a complete infrastructure refactor, and their innovative solutions for handling high-scale AI model inference across multiple providers while maintaining service reliability.

Scaling an MCP Server for Error Monitoring to 60 Million Monthly Requests

Sentry

Sentry, an error monitoring platform, built a Model Context Protocol (MCP) server to improve the workflow where developers would copy error details from Sentry's UI and paste them into AI coding assistants like Cursor. The MCP server provides direct integration with 10-15 tools, including retrieving issue details and triggering automated fix attempts through Sentry's AI agent. The implementation scaled from 30 million to 60 million requests per month, with over 5,000 organizations using it. The company learned critical lessons about treating MCP servers as production services, implementing comprehensive observability, managing context pollution, and taking responsibility for agent behavior through careful prompt engineering and tool description design.
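
An illustrative MCP tool in this style, written against the open-source MCP Python SDK's FastMCP helper; the tool name, docstring, and payload are invented, and this is not Sentry's implementation:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("error-monitor")

@mcp.tool()
def get_issue_details(issue_id: str) -> str:
    """Return the stack trace and metadata for an issue, ready for an agent's context."""
    # a real server would call the monitoring platform's API here
    return f"Issue {issue_id}: NullReferenceError in checkout/cart.py:42 ..."

if __name__ == "__main__":
    mcp.run()
```

As the case study stresses, the tool's docstring doubles as the description the agent reads, so it is effectively prompt engineering and deserves the same care.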

Scaling and Operating Large Language Models at the Frontier

Anthropic

This case study examines Anthropic's journey in scaling and operating large language models, focusing on their transition from GPT-3 era training to current state-of-the-art systems like Claude. The company successfully tackled challenges in distributed computing, model safety, and operational reliability while growing 10x in revenue. Key innovations include their approach to constitutional AI, advanced evaluation frameworks, and sophisticated MLOps practices that enable running massive training operations with hundreds of team members.

Scaling Document Processing with LLMs and Human Review

Vendr / Extend

Vendr partnered with Extend to extract structured data from SaaS order forms and contracts using LLMs. They implemented a hybrid approach combining LLM processing with human review to achieve high accuracy in entity recognition and data extraction. The system successfully processed over 100,000 documents, using techniques such as document embeddings for similarity clustering, targeted human review, and robust entity mapping. This allowed Vendr to unlock valuable pricing insights for their customers while maintaining high data quality standards.

Scaling Email Content Extraction Using LLMs in Production

Yahoo

Yahoo Mail faced challenges with their existing ML-based email content extraction system, hitting a coverage ceiling of 80% for major senders while struggling with long-tail senders and slow time-to-market for model updates. They implemented a new solution using Google Cloud's Vertex AI and LLMs, achieving 94% coverage for standard domains and 99% for tail domains, with 51% increase in extraction richness and 16% reduction in tracking API errors. The implementation required careful consideration of hybrid infrastructure, cost management, and privacy compliance while processing billions of daily messages.

Scaling Financial Research and Analysis with Multi-Model LLM Architecture

Rogo

Rogo developed an enterprise-grade AI finance platform that leverages multiple OpenAI models to automate and enhance financial research and analysis for investment banks and private equity firms. Through a layered model architecture combining GPT-4 and other models, along with fine-tuning and integration with financial datasets, they created a system that saves analysts over 10 hours per week on tasks like meeting prep and market research, while serving over 5,000 bankers across major financial institutions.

Scaling Local News Coverage with AI-Powered Newsletter Generation

Patch

Patch transformed its local news coverage by implementing AI-powered newsletter generation, enabling them to expand from 1,100 to 30,000 communities while maintaining quality and trust. The system combines curated local data sources, weather information, event calendars, and social media content, processed through AI to create relevant, community-specific newsletters. This approach resulted in over 400,000 new subscribers and a 93.6% satisfaction rating, while keeping costs manageable and maintaining editorial standards.

Scaling Voice AI with GPU-Accelerated Infrastructure

ElevenLabs

ElevenLabs developed a high-performance voice AI platform for voice cloning and multilingual speech synthesis, leveraging Google Cloud's GKE and NVIDIA GPUs for scalable deployment. They implemented GPU optimization strategies including multi-instance GPUs and time-sharing to improve utilization and reduce costs, while successfully serving 600 hours of generated audio for every hour of real time across 29 languages.

Secure Authentication for AI Agents using Model Context Protocol

Arcade

Arcade identified a critical security gap in the Model Context Protocol (MCP) where AI agents needed secure access to third-party APIs like Gmail but lacked proper OAuth 2.0 authentication mechanisms. They developed two solutions: first introducing user interaction capabilities (PR #475), then extending MCP's elicitation framework with URL mode (PR #887) to enable secure OAuth flows while maintaining proper security boundaries between trusted servers and untrusted clients. This work addresses fundamental production deployment challenges for AI agents that need authenticated access to real-world systems.

Semantic Data Processing at Scale with AI-Powered Query Optimization

DocETL

Shreya Shankar presents DocETL, an open-source system for semantic data processing that addresses the challenges of running LLM-powered operators at scale over unstructured data. The system tackles two major problems: how to make semantic operator pipelines scalable and cost-effective through novel query optimization techniques, and how to make them steerable through specialized user interfaces. DocETL introduces rewrite directives that decompose complex tasks and data to improve accuracy and reduce costs, achieving up to 86% cost reduction while maintaining target accuracy. The companion tool Doc Wrangler provides an interactive interface for iteratively authoring and debugging these pipelines. Real-world applications include public defenders analyzing court transcripts for racial bias and medical analysts extracting information from doctor-patient conversations, demonstrating significant accuracy improvements (2x in some cases) compared to baseline approaches.
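
As a rough illustration of the rewrite-directive idea, here is a sketch of decomposing one complex map over a long document into per-chunk maps plus a reduce; `call_llm` is a hypothetical stand-in for any chat-completion client, and the chunking is deliberately naive.

```python
# Conceptual sketch of a DocETL-style rewrite: one complex map becomes
# per-chunk maps plus a reduce. Not DocETL's actual API.
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client."""
    raise NotImplementedError

def chunk(text: str, size: int = 2000) -> list[str]:
    # Deliberately naive character chunking for illustration.
    return [text[i:i + size] for i in range(0, len(text), size)]

def decomposed_extract(document: str, instruction: str) -> str:
    # Map: run the now-simpler instruction over each chunk independently.
    partials = [call_llm(f"{instruction}\n\nExcerpt:\n{c}") for c in chunk(document)]
    # Reduce: merge and deduplicate the per-chunk outputs in a final call.
    merged = "\n".join(partials)
    return call_llm(f"Merge and deduplicate these findings:\n{merged}")
```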

Semi-Supervised Fine-Tuning of Compact Vision-Language Models for Product Attribute Extraction

Flipkart

Flipkart faced the challenge of accurately extracting product attributes (like color, pattern, and material) from millions of product listings at scale. Manual labeling was expensive and error-prone, while using large Vision Language Model APIs was cost-prohibitive. The company developed a semi-supervised approach using compact VLMs (2-3 billion parameters) that combines Parameter-Efficient Fine-Tuning (PEFT) with Direct Preference Optimization (DPO) to leverage unlabeled data. The method starts with a small labeled dataset, generates multiple reasoning chains for unlabeled products using self-consistency, and then fine-tunes the model using DPO to favor preferred outputs. Results showed accuracy improvements from 75.1% to 85.7% on the Qwen2.5-VL-3B-Instruct model across twelve e-commerce verticals, demonstrating that compact models can effectively learn from unlabeled data to achieve production-grade performance.
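
A minimal sketch of the self-consistency-to-DPO step described above: sample several reasoning chains per unlabeled product, majority-vote the answer, and pair consistent chains (preferred) against inconsistent ones (rejected). `sample_vlm` is a hypothetical stand-in for sampling from the compact VLM.

```python
# Sketch of building DPO preference pairs from self-consistency over
# unlabeled products; `sample_vlm` is hypothetical, not Flipkart's code.
from collections import Counter

def sample_vlm(image, prompt: str) -> tuple[str, str]:
    """Hypothetical: returns one (reasoning_chain, predicted_value) sample."""
    raise NotImplementedError

def build_preference_pairs(image, prompt: str, k: int = 8):
    samples = [sample_vlm(image, prompt) for _ in range(k)]
    # Majority vote over the k sampled answers acts as a pseudo-label.
    majority, _ = Counter(ans for _, ans in samples).most_common(1)[0]
    chosen = [chain for chain, ans in samples if ans == majority]
    rejected = [chain for chain, ans in samples if ans != majority]
    # Each (chosen, rejected) pair is one DPO example teaching the model to
    # prefer self-consistent reasoning; unanimous products yield no pairs.
    return [(c, r) for c in chosen for r in rejected]
```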

Six Principles for Building Production AI Agents

App.build

App.build shared six empirical principles learned from building production AI agents that help overcome common challenges in agentic system development. The principles focus on investing in system prompts with clear instructions, splitting context to manage costs and attention, designing straightforward tools with limited parameters, implementing feedback loops with actor-critic patterns, using LLMs for error analysis, and recognizing that frustrating agent behavior often indicates system design issues rather than model limitations. These guidelines emerged from practical experience in developing software engineering agents and emphasize systematic approaches to building reliable, recoverable agents that fail gracefully.
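
The feedback-loop principle can be sketched as a simple actor-critic retry loop; `actor` and `critic` are hypothetical LLM- or test-backed callables, and the loop shape is an illustration rather than App.build's implementation.

```python
# Minimal actor-critic loop sketch; both callables are hypothetical.
def actor(task: str, feedback: str | None = None) -> str:
    """Hypothetical: produces an attempt, optionally revised from feedback."""
    raise NotImplementedError

def critic(task: str, attempt: str) -> tuple[bool, str]:
    """Hypothetical: returns (passed, feedback) from tests, linters, or a judge."""
    raise NotImplementedError

def run_with_feedback(task: str, max_rounds: int = 3) -> str:
    feedback = None
    for _ in range(max_rounds):
        attempt = actor(task, feedback)
        passed, feedback = critic(task, attempt)
        if passed:
            return attempt
    return attempt  # fail gracefully with the last attempt
```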

Smart Business Analyst: Automating Competitor Analysis in Medical Device Procurement

Philips

Philips' procurement team developed an advanced LLM-powered system called "Smart Business Analyst" to automate competitor analysis in the medical device industry. The system addresses the challenge of gathering and analyzing competitor data across multiple dimensions, including features, pricing, and supplier relationships. Unlike general-purpose LLMs like ChatGPT, this solution provides precise numerical comparisons and leverages multiple data sources to deliver accurate, industry-specific insights, reducing the time required for competitive analysis from hours to seconds.

Source-Grounded LLM Assistant with Multi-Modal Output Capabilities

Google / NotebookLM

Google's NotebookLM tackles the challenge of making large language models more focused and personalized by introducing source grounding - allowing users to upload their own documents to create a specialized AI assistant. The system combines Gemini 1.5 Pro with sophisticated audio generation to create human-like podcast-style conversations about user content, complete with natural speech patterns and disfluencies. The solution includes built-in safety features, privacy protections through transient context windows, and content watermarking, while enabling users to generate insights from personal documents without contributing to model training data.

State of Production Machine Learning and LLMOps in 2024

Zalando

A comprehensive overview of the current state and challenges of production machine learning and LLMOps, covering key areas including motivations, industry trends, technological developments, and organizational changes. The presentation highlights the evolution from model-centric to data-centric approaches, the importance of metadata management, and the growing focus on security and monitoring in ML systems.

Structured LLM Conversations for Language Learning Video Calls

Duolingo

Duolingo implemented an AI-powered video call feature called "Video Call with Lily" that enables language learners to practice speaking with an AI character. The system uses carefully structured prompts, conversational blueprints, and dynamic evaluations to ensure appropriate difficulty levels and natural interactions. The implementation includes memory management to maintain conversation context across sessions and separate processing steps to prevent LLM overload, resulting in a personalized and effective language learning experience.
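
As an illustration of the memory-plus-separate-steps pattern, here is a sketch in which the chat turn and the memory update are distinct LLM calls so neither prompt grows unbounded; `call_llm` and the prompt wording are hypothetical, not Duolingo's actual prompts.

```python
# Sketch of separated chat and memory-update steps; prompts are illustrative.
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client."""
    raise NotImplementedError

def chat_turn(user_msg: str, memory: str, level: str) -> str:
    # The live turn only sees a compact memory summary, not full history.
    return call_llm(
        f"You are Lily, a friendly language tutor. Learner level: {level}.\n"
        f"What you remember about this learner: {memory}\n"
        f"Learner says: {user_msg}\nReply briefly at their level."
    )

def update_memory(memory: str, transcript: str) -> str:
    # Runs after the session as a separate step, keeping chat prompts small.
    return call_llm(
        f"Update these learner notes:\n{memory}\n\nNew session transcript:\n{transcript}"
    )
```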

Swarm-Coding with Multiple Background Agents for Large-Scale Code Maintenance

Faire

Faire implemented "swarm-coding" using GitHub Copilot's background agents to automate tedious engineering tasks like cleaning up expired feature flags and migrating test infrastructure. By coordinating multiple autonomous AI agents working in parallel, they enabled non-engineers to land simple code changes and freed up engineering teams to focus on innovation rather than maintenance work. Within the first month of deployment, 18% of the engineering team adopted the approach, merging over 500 Copilot pull requests with an average time savings of 39.6 minutes per PR and a 25% increase in overall PR volume among users. The company enhanced the background agents through custom instructions, MCP (Model Context Protocol) servers, and programmatic task assignment to create specialized agent profiles for common workflows.

Synthetic Consumer Survey Generation Using LLMs with Semantic Similarity Response Mapping

Colgate

PyMC Labs partnered with Colgate to address the limitations of traditional consumer surveys for product testing by developing a novel synthetic consumer methodology using large language models. The challenge was that standard approaches of asking LLMs to provide numerical ratings (1-5) resulted in biased, middle-of-the-road responses that didn't reflect real consumer behavior. The solution involved allowing LLMs to provide natural text responses which were then mapped to quantitative scales using embedding similarity to reference responses. This approach achieved 90% of the maximum achievable correlation with real survey data, accurately reproduced demographic effects including age and income patterns, eliminated positivity bias present in human surveys, and provided richer qualitative feedback while being faster and cheaper than traditional surveys.
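
The core mapping technique can be sketched directly: embed the free-text response, score it against reference responses anchored to each scale point, and take a similarity-weighted expectation. The anchor texts, model, and softmax temperature below are illustrative assumptions, not PyMC Labs' published choices.

```python
# Sketch of mapping a free-text response onto a 1-5 purchase-intent scale
# via embedding similarity; anchors and temperature are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

REFERENCES = {
    1: "I would definitely not buy this product.",
    2: "I probably would not buy this.",
    3: "I might or might not buy this.",
    4: "I would probably buy this.",
    5: "I would definitely buy this product.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
ref_vecs = model.encode(list(REFERENCES.values()), normalize_embeddings=True)

def map_to_scale(response: str) -> float:
    v = model.encode([response], normalize_embeddings=True)[0]
    sims = ref_vecs @ v                  # cosine similarity to each anchor
    weights = np.exp(sims * 10)          # sharpen, then normalize to weights
    weights /= weights.sum()
    return float(weights @ np.array(list(REFERENCES)))  # expected scale value

print(map_to_scale("Honestly this sounds great, I'd pick it up next shop."))
```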

Systematic Analysis of Prompt Templates in Production LLM Applications

Uber, Microsoft

The research analyzes real-world prompt templates from open-source LLM-powered applications to understand their structure, composition, and effectiveness. Through analysis of over 2,000 prompt templates from production applications like those from Uber and Microsoft, the study identifies key components, patterns, and best practices for template design. The findings reveal that well-structured templates with specific patterns can significantly improve LLMs' instruction-following abilities, potentially enabling weaker models to achieve performance comparable to more advanced ones.
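
As an illustration of the component-based template structure the study describes (role, directive, context, example, output constraints), here is a minimal sketch; the field set and wording are assumptions for illustration, not templates from the analyzed applications.

```python
# Illustrative structured prompt template; the component fields are an
# assumption, not a template drawn from the study's corpus.
TEMPLATE = """\
You are a {role}.

Task: {directive}

Context:
{context}

Example:
{example}

Respond only with valid JSON matching: {schema}
"""

prompt = TEMPLATE.format(
    role="support-ticket triage assistant",
    directive="Classify the ticket's urgency and product area.",
    context="Ticket: 'Checkout page returns a 500 error for all users.'",
    example='{"urgency": "high", "area": "payments"}',
    schema='{"urgency": "low|medium|high", "area": "<string>"}',
)
print(prompt)
```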

Test-Driven Vibe Development: Integrating Quality Engineering with AI Code Generation

ASOS

ASOS, a major e-commerce retailer, developed Test-Driven Vibe Development (TDVD), a novel methodology that combines test-first quality engineering practices with LLM-driven code generation to address the quality and reliability challenges of "vibe coding." The company applied this approach to build an internal stock discrepancy reporting system, using AI agents to generate both tests and code in a structured workflow that prioritizes acceptance test-driven development (ATDD), behavior-driven development (BDD), and test-driven development (TDD). With a team of effectively 2.5 people working part-time, they delivered a full-stack MVP (backend API, Azure Functions, React frontend) in 4 weeks—representing a 7-10x acceleration compared to traditional development estimates—while maintaining quality through continuous validation against predefined test requirements and catching hallucinations early in the development cycle.
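
The TDVD rhythm can be sketched in miniature: tests are authored and reviewed first, and the agent-generated implementation is accepted only once they pass. Both the function and the report shape below are hypothetical examples, not ASOS code.

```python
# Test-first sketch of the TDVD workflow; function and fields are hypothetical.

def stock_discrepancy(expected: int, counted: int, tolerance: int = 0) -> dict:
    """Agent-generated implementation, merged only once the tests pass."""
    delta = counted - expected
    return {"delta": delta, "flagged": abs(delta) > tolerance}

# Tests written (and human-reviewed) before generation, pytest-style:
def test_flags_shortfall():
    assert stock_discrepancy(120, 112) == {"delta": -8, "flagged": True}

def test_ignores_deltas_within_tolerance():
    assert stock_discrepancy(120, 119, tolerance=2) == {"delta": -1, "flagged": False}
```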

Text-to-SQL System with Structured RAG and Comprehensive Evaluation

ICE / NYSE

ICE/NYSE developed a text-to-SQL application using structured RAG to enable business users to query financial data without needing SQL knowledge. The system leverages Databricks' Mosaic AI stack including Unity Catalog, Vector Search, Foundation Model APIs, and Model Serving. They implemented comprehensive evaluation methods using both syntactic and execution matching, achieving 77% syntactic accuracy and 96% execution match across approximately 50 queries. The system includes continuous improvement through feedback loops and few-shot learning from incorrect queries.
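
Execution matching is straightforward to sketch: run the gold and predicted SQL against the same database and compare result sets order-insensitively, so syntactically different but equivalent queries still count as matches. The schema and queries here are illustrative, not ICE/NYSE's.

```python
# Sketch of execution-match evaluation for text-to-SQL; schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trades (symbol TEXT, qty INT, price REAL);
    INSERT INTO trades VALUES ('IBM', 100, 182.5), ('AAPL', 50, 229.1);
""")

def execution_match(gold_sql: str, pred_sql: str) -> bool:
    try:
        gold = set(map(tuple, conn.execute(gold_sql).fetchall()))
        pred = set(map(tuple, conn.execute(pred_sql).fetchall()))
    except sqlite3.Error:
        return False  # predicted SQL failed to execute at all
    return gold == pred  # order-insensitive comparison of result sets

# Syntactically different, execution-equivalent queries still match:
print(execution_match(
    "SELECT symbol FROM trades WHERE qty >= 100",
    "SELECT symbol FROM trades WHERE NOT qty < 100",
))  # True
```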

Thinking Machines' Tinker: Low-Level Fine-Tuning API for Production LLM Training

Thinking Machines

Thinking Machines, a new AI company co-founded by former OpenAI researcher John Schulman, has developed Tinker, a low-level fine-tuning API designed to enable sophisticated post-training of language models without requiring teams to manage GPU infrastructure or distributed-systems complexity. The product aims to abstract away infrastructure concerns while providing low-level primitives for expressing nearly all post-training algorithms, allowing researchers and companies to build custom models without developing their own training infrastructure. The company plans to release its own models and expand Tinker's capabilities to include multimodal functionality and larger-scale training jobs, while making the platform more accessible to non-experts through higher-level tooling.

Usability Challenges in Commercial AI Agent Systems: A Study of Industry Aspirations vs. User Realities

Carnegie Mellon

This research study addresses the gap between how AI agents are marketed by the technology industry and how end-users actually experience them in practice. Researchers from Carnegie Mellon conducted a systematic review of 102 commercial AI agent products to understand industry positioning, identifying three core use case categories: orchestration (automating GUI tasks), creation (generating structured documents), and insight (providing analysis and recommendations). They then conducted a usability study with 31 participants attempting representative tasks using popular commercial agents (Operator and Manus), revealing five critical usability barriers: misalignment between agent capabilities and user mental models, premature trust assumptions, inflexible collaboration styles, overwhelming communication overhead, and lack of meta-cognitive abilities. While users generally succeeded at assigned tasks and were impressed with the technology, these barriers significantly impacted the user experience and highlighted the disconnect between marketed capabilities and practical usability.

Using LLMs to Scale Insurance Operations at a Small Company

Anzen

Anzen, a small insurance company with under 20 people, leveraged LLMs to compete with larger insurers by automating their underwriting process. They implemented a document classification system using BERT and AWS Textract for information extraction, achieving 95% accuracy in document classification. They also developed a compliance document review system using sentence embeddings and question-answering models to provide immediate feedback on legal documents like offer letters.
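
The embedding-based compliance check can be sketched as clause-level similarity against known-problematic reference clauses; the model, threshold, and reference text are illustrative assumptions rather than Anzen's production system.

```python
# Sketch of embedding-based compliance screening; threshold and reference
# clauses are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

RISKY_REFERENCES = [
    "Employee waives all rights to overtime compensation.",
    "This offer may be rescinded at any time without notice or cause.",
]
risky_vecs = model.encode(RISKY_REFERENCES, normalize_embeddings=True)

def flag_clauses(clauses: list[str], threshold: float = 0.6):
    vecs = model.encode(clauses, normalize_embeddings=True)
    sims = vecs @ risky_vecs.T  # cosine similarity, clauses x references
    # Flag any clause whose best match to a risky reference exceeds threshold.
    return [(c, float(s.max())) for c, s in zip(clauses, sims) if s.max() > threshold]

letter = [
    "Your starting salary will be $85,000 per year.",
    "You agree to waive any claim to overtime pay.",
]
print(flag_clauses(letter))
```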

Video Content Summarization and Metadata Enrichment for Streaming Platform

Paramount+

Paramount+ partnered with Google Cloud Consulting to develop two key AI use cases: video summarization and metadata extraction for their streaming platform containing over 50,000 videos. The project used Gen AI jumpstarts to prototype solutions, implementing prompt chaining, embedding generation, and fine-tuning approaches. The system was designed to enhance content discoverability and personalization while reducing manual labor and third-party costs. The implementation included a three-component architecture handling transcription creation, content generation, and personalization integration.
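
As a rough illustration of the prompt-chaining pattern, here is a sketch in which one call condenses the transcript and later calls generate metadata from the condensed form only; `call_llm` and the prompts are hypothetical, not Paramount+'s pipeline.

```python
# Sketch of prompt chaining for video metadata; prompts are illustrative.
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client."""
    raise NotImplementedError

def summarize_video(transcript: str) -> dict:
    # Step 1: condense the raw transcript into key plot beats.
    beats = call_llm(f"List the key plot beats in this transcript:\n{transcript}")
    # Step 2: generate viewer-facing metadata from the condensed beats only,
    # keeping each downstream prompt short and focused.
    synopsis = call_llm(f"Write a two-sentence, spoiler-free synopsis:\n{beats}")
    tags = call_llm(f"Suggest 5 genre/mood tags as a comma-separated list:\n{beats}")
    return {"synopsis": synopsis, "tags": tags.split(", ")}
```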

Video Super-Resolution at Scale for Ads and Generative AI Content

Meta

Meta's Media Foundation team deployed AI-powered video super-resolution (VSR) models at massive scale to enhance video quality across their ecosystem, processing over 1 billion daily video uploads. The problem addressed was the prevalence of low-quality videos from poor camera quality, cross-platform uploads, and legacy content that degraded user experience. The solution involved deploying multiple VSR models—both CPU-based (using Intel's RVSR SDK) and GPU-based—to upscale and enhance video quality for ads and generative AI features like Meta Restyle. Through extensive subjective evaluation with thousands of human raters, Meta identified effective quality metrics (VMAF-UQ), determined which videos would benefit most from VSR, and successfully deployed the technology while managing GPU resource constraints and ensuring quality improvements aligned with user preferences.

Voice AI Agent Development and Production Challenges

Various (Canonical, Prosus, DeepMind)

Panel discussion with experts from various companies exploring the challenges and solutions in deploying voice AI agents in production. The discussion covers key aspects of voice AI development including real-time response handling, emotional intelligence, cultural adaptation, and user retention. Experts shared experiences from e-commerce, healthcare, and tech sectors, highlighting the importance of proper testing, prompt engineering, and understanding user interaction patterns for successful voice AI deployments.