ZenML

Production LLM Implementation for Customer Support Response Generation

Stripe 2024

Stripe implemented a large language model system to help support agents answer customer questions more efficiently. They developed a sequential framework that combined fine-tuned models for question filtering, topic classification, and response generation. While the system achieved good accuracy in offline testing, they discovered challenges with agent adoption and the importance of monitoring online metrics. Key learnings included breaking down complex problems into manageable ML steps, prioritizing online feedback mechanisms, and maintaining high-quality training data.

Industry

Finance

Overview

This case study comes from a presentation by Sophie, a data scientist at Stripe, discussing the lessons learned from shipping the company’s first large language model application for customer support. Stripe is a major payments company serving millions of businesses globally, from large enterprises like Google and Amazon to small startups, processing payments across nearly 50 countries with customers in over 200 countries. The support operations team handles tens of thousands of text-based support cases weekly, making it an ideal domain for LLM applications.

The primary goal of the project was to help support agents solve cases more efficiently by prompting them with relevant, AI-generated responses to customer questions. Importantly, customers would always interact directly with human agents—the LLM system was designed as an agent assistance tool rather than a customer-facing automation. The success criteria were twofold: responses needed to be information-accurate (agents must trust the answers) and maintain the right “Stripe tone” (friendly but succinct, avoiding robotic language).

The Problem with Out-of-the-Box LLMs

The team’s first major lesson was that LLMs, despite the valid hype, have important and often subtle limitations when applied to context-heavy business problems. Using a concrete example, the presenter demonstrated how a common customer question—“How do I pause payouts?”—would be incorrectly answered by general-purpose models. The correct Stripe-specific solution involves updating the payout schedule to manual, but GPT-3.5 hallucinated a non-existent “pause payouts button” in the dashboard. While GPT-4 performed better and correctly identified the manual payout approach, it still went off-track with irrelevant information about disputes and included factual inaccuracies.

The underlying problem was that pre-training data was outdated, incomplete, or conflated with generic instructions from other payments companies. Few-shot prompting could improve individual answers but wouldn't scale to the hundreds of different topics Stripe customers ask about daily. This led the team to conclude that solving problems requiring deep subject matter expertise at scale meant breaking the problem down into smaller, more tractable ML steps.

The Sequential Pipeline Architecture

Rather than relying on a single LLM to handle everything, Stripe built a sequential framework consisting of multiple specialized fine-tuned models:

Trigger Classification Model: The first stage filters out user messages that aren’t suitable for automated response generation. This includes chitchat (“Hey, how are you?”) or questions lacking sufficient context (“When will this be fixed?”). This ensures the system only attempts to answer questions where it can provide genuine value.

Topic Classification Model: Messages that pass the trigger filter are then classified by topic to identify which support materials or documentation are relevant. This is a crucial step that enables the RAG-like approach to answer generation.

Answer Generation Model: Using the identified topic and relevant context, this fine-tuned model generates information-accurate responses. The retrieval-augmented approach ensures answers are grounded in actual Stripe documentation rather than relying solely on the model’s pre-trained knowledge.

Tone Adjustment Model: A final few-shot model adjusts the generated answers to meet Stripe’s communication standards before presenting them to agents.

This architecture provided several benefits. The team gained much more control over the solution and could expect more reliable, interpretable results. By adding thresholds at various stages, they reported completely mitigating hallucinations in their system. The approach also proved resource-efficient—fine-tuning GPT requires only a few hundred labels per class, allowing rapid iteration. The team relied on expert agent annotation for quality answers while handling much of the trigger and topic classifier labeling themselves.

The Online vs. Offline Evaluation Gap

Perhaps the most valuable lesson from this case study concerns the critical importance of online feedback and monitoring. During development, the team relied on standard offline evaluation practices: labeled datasets with precision and recall metrics for classification models, and expert agent reviews for generative outputs. User testing with expert agents yielded positive feedback, and offline accuracy metrics trended well, giving the team confidence they were ready to ship.

The production experiment was designed as a controlled A/B test measuring cases where agents received ML-generated response prompts versus those without. Due to volume constraints, they couldn’t segment agents into treatment and control groups—instead, they randomized by support case, meaning thousands of agents only saw ML prompts for a small portion of their daily cases.

The critical gap was that online case labeling wasn’t feasible at scale, leaving them without visibility into online accuracy trends. When they shipped, agent adoption of the ML-generated prompts was dramatically lower than expected—a shock given the positive user testing feedback.

To address this monitoring gap, the team developed a heuristic-based “match rate” metric. When agents didn’t click on the ML prompt, they compared what the agent actually sent to users against what the system had suggested. If the texts were similar enough, they could infer the prompt was correct even though it wasn’t used. This provided a crude lower bound on expected accuracy and allowed them to validate that online performance aligned with offline expectations.
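A match-rate heuristic of this kind can be sketched with a plain text-similarity ratio. The 0.8 cutoff below is an illustrative assumption, not Stripe's actual threshold:

```python
import difflib

def is_match(agent_reply: str, ml_prompt: str, cutoff: float = 0.8) -> bool:
    """True if the agent's actual reply is close enough to the unused ML prompt."""
    ratio = difflib.SequenceMatcher(
        None, agent_reply.lower(), ml_prompt.lower()
    ).ratio()
    return ratio >= cutoff

def match_rate(pairs: list[tuple[str, str]]) -> float:
    """Crude lower bound on online accuracy over (agent_reply, ml_prompt) pairs."""
    matches = sum(is_match(reply, prompt) for reply, prompt in pairs)
    return matches / len(pairs) if pairs else 0.0
```

Because agents may paraphrase a correct answer heavily, this metric undercounts matches, which is why it serves as a lower bound rather than a true accuracy measure.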

The root cause turned out not to be model quality—the heuristic metrics showed the ML responses were accurate. Instead, agents were simply too accustomed to their existing workflows and were ignoring the prompts entirely. Solving the actual business problem required substantial UX and agent training efforts beyond ML improvements.

Shadow Mode Deployment Strategy

The team evolved their deployment approach to include “shadow mode” shipping, where models are deployed to production but don’t actually take actions. Instead, predictions are logged with tags to populate dashboards and metrics without interfering with agent workflows. This allows teams to validate production performance, catch issues early, and set accurate expectations for each pipeline stage before full activation.

In subsequent projects, the team adopted a practice of randomly sampling data from shadow mode daily and spending about 20 minutes as a team reviewing and annotating to ensure online performance met accuracy targets at each stage. This incremental shipping approach—rather than waiting for one large end-to-end launch—proved invaluable for debugging and validation.
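The shadow-mode workflow described above amounts to logging tagged predictions without acting on them, then pulling a small random sample for the daily review. A minimal sketch with assumed field names (not Stripe's actual schema):

```python
import datetime
import random

def log_shadow_prediction(log: list, case_id: str, stage: str,
                          prediction: str, confidence: float) -> None:
    """Record a prediction for dashboards without taking any action."""
    log.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "case_id": case_id,
        "stage": stage,           # which pipeline stage produced this
        "prediction": prediction,
        "confidence": confidence,
        "mode": "shadow",         # tag so dashboards separate shadow traffic
    })

def daily_review_sample(log: list, k: int = 20, seed: int = 0) -> list:
    """Random sample for the team's short daily annotation session."""
    rng = random.Random(seed)
    return rng.sample(log, min(k, len(log)))
```

Tagging every record with its pipeline stage is what makes it possible to set accuracy expectations per stage before activating the full system.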

Data Quality Over Model Architecture

The third major lesson reinforces a fundamental truth in applied ML: data quality matters more than model sophistication. The presenter noted that writing code for the LLM framework took days to weeks, while iterating on training datasets took months. More time was spent in Google Sheets reviewing annotations than writing Python code.

Critically, iterating on label data quality yielded higher performance gains than upgrading to more advanced GPT engines. The ML errors they encountered related to gotchas specific to the Stripe support space rather than general language understanding gaps—so adding more or higher-quality data samples typically resolved performance issues. They ultimately didn’t need the latest GPT engines to achieve their performance targets.

For scaling, the team found that moving from generative fine-tuning to classification-based approaches offered significant advantages. Generative fine-tuning adds complexity in data collection, whereas classification enables leveraging weak supervision techniques like Snorkel ML or embeddings across documentation to label data at scale. They invested in subject matter expertise programs to collect and maintain fresh, up-to-date labels as Stripe’s product suite evolves, treating the dataset as a “living oracle” to keep ML responses accurate over time.
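Embedding-based weak labeling of the kind mentioned above can be sketched by assigning each unlabeled message the topic of its most similar documentation article. The sketch below uses toy bag-of-words vectors and made-up documentation snippets purely for illustration; a real system would use learned embeddings:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a trained encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def weak_label(message: str, doc_topics: dict[str, str]) -> str:
    """Assign the topic whose documentation is most similar to the message."""
    vec = embed(message)
    return max(doc_topics, key=lambda t: cosine(vec, embed(doc_topics[t])))

# Illustrative documentation snippets keyed by topic.
docs = {
    "payouts": "pause payouts payout schedule manual",
    "disputes": "dispute chargeback evidence respond",
}
```

Labels produced this way are noisy, which is why they serve as weak supervision for training classifiers rather than as ground truth.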

Team Structure and Resourcing

The initial team was intentionally lightweight given the experimental nature of the project. It included product managers, support operations experts (providing critical business context), a data scientist, and a software engineer. In retrospect, the presenter noted that once feasibility was validated, the team should have been expanded to include UX engineers, ML engineers for production infrastructure, and additional product support. The lesson here is that shipping a pilot can be scrappy, but maintaining ML models in production at scale requires sound infrastructure and broader cross-functional coverage.

Cost Considerations

Compute costs were initially deprioritized while the team focused on validating feasibility. Because the fine-tuning approach let them use smaller GPT engines, costs remained manageable. The team developed a strategy of using GPT as a stepping stone: collecting training data with it at scale, then benchmarking simpler in-house models for tasks that don't require LLM complexity. For example, trigger classification could potentially be handled by a BERT model given sufficient labels, since the concept isn't inherently complex.

Key Takeaways

The presentation concluded with three central lessons for practitioners building LLM-powered production systems:

1. Break context-heavy problems down into smaller, more tractable ML steps rather than relying on a single out-of-the-box LLM.

2. Prioritize online feedback and monitoring from the start: ship in shadow mode, track online metrics, and don't assume that strong offline accuracy will translate into adoption.

3. Invest in high-quality, continuously maintained training data; iterating on labels yielded larger gains than upgrading to more advanced model engines.
