ZenML

Production LLM Implementation for Customer Support Response Generation

Stripe 2024

Stripe implemented a large language model system to help support agents answer customer questions more efficiently. They developed a sequential framework that combined fine-tuned models for question filtering, topic classification, and response generation. While the system achieved good accuracy in offline testing, they discovered challenges with agent adoption and the importance of monitoring online metrics. Key learnings included breaking down complex problems into manageable ML steps, prioritizing online feedback mechanisms, and maintaining high-quality training data.

Industry

Finance

Overview

This case study comes from a presentation by Sophie, a data scientist at Stripe, discussing the lessons learned from shipping the company’s first large language model application for customer support. Stripe is a major payments company serving millions of businesses globally, from large enterprises like Google and Amazon to small startups, processing payments across nearly 50 countries with customers in over 200 countries. The support operations team handles tens of thousands of text-based support cases weekly, making it an ideal domain for LLM applications.

The primary goal of the project was to help support agents solve cases more efficiently by prompting them with relevant, AI-generated responses to customer questions. Importantly, customers would always interact directly with human agents—the LLM system was designed as an agent assistance tool rather than a customer-facing automation. The success criteria were twofold: responses needed to be information-accurate (agents must trust the answers) and maintain the right “Stripe tone” (friendly but succinct, avoiding robotic language).

The Problem with Out-of-the-Box LLMs

The team’s first major lesson was that LLMs, despite the valid hype, have important and often subtle limitations when applied to context-heavy business problems. Using a concrete example, the presenter demonstrated how a common customer question—“How do I pause payouts?”—would be incorrectly answered by general-purpose models. The correct Stripe-specific solution involves updating the payout schedule to manual, but GPT-3.5 hallucinated a non-existent “pause payouts button” in the dashboard. While GPT-4 performed better and correctly identified the manual payout approach, it still went off-track with irrelevant information about disputes and included factual inaccuracies.

The underlying problem was that pre-training data was outdated, incomplete, or conflated with generic instructions from other payments companies. Few-shot prompting could improve individual answers but wouldn't scale to the hundreds of different topics Stripe customers ask about daily. This led the team to conclude that solving problems requiring deep subject matter expertise at scale meant breaking the problem down into smaller, more tractable ML steps.

The Sequential Pipeline Architecture

Rather than relying on a single LLM to handle everything, Stripe built a sequential framework consisting of multiple specialized fine-tuned models:

Trigger Classification Model: The first stage filters out user messages that aren’t suitable for automated response generation. This includes chitchat (“Hey, how are you?”) or questions lacking sufficient context (“When will this be fixed?”). This ensures the system only attempts to answer questions where it can provide genuine value.

Topic Classification Model: Messages that pass the trigger filter are then classified by topic to identify which support materials or documentation are relevant. This is a crucial step that enables the RAG-like approach to answer generation.

Answer Generation Model: Using the identified topic and relevant context, this fine-tuned model generates information-accurate responses. The retrieval-augmented approach ensures answers are grounded in actual Stripe documentation rather than relying solely on the model’s pre-trained knowledge.

Tone Adjustment Model: A final few-shot model adjusts the generated answers to meet Stripe’s communication standards before presenting them to agents.

This architecture provided several benefits. The team gained much more control over the solution and could expect more reliable, interpretable results. By adding thresholds at various stages, they reported completely mitigating hallucinations in their system. The approach also proved resource-efficient—fine-tuning GPT requires only a few hundred labels per class, allowing rapid iteration. The team relied on expert agent annotation for quality answers while handling much of the trigger and topic classifier labeling themselves.

The Online vs. Offline Evaluation Gap

Perhaps the most valuable lesson from this case study concerns the critical importance of online feedback and monitoring. During development, the team relied on standard offline evaluation practices: labeled datasets with precision and recall metrics for classification models, and expert agent reviews for generative outputs. User testing with expert agents yielded positive feedback, and offline accuracy metrics trended well, giving the team confidence they were ready to ship.

The production experiment was designed as a controlled A/B test measuring cases where agents received ML-generated response prompts versus those without. Due to volume constraints, they couldn’t segment agents into treatment and control groups—instead, they randomized by support case, meaning thousands of agents only saw ML prompts for a small portion of their daily cases.

The critical gap was that online case labeling wasn’t feasible at scale, leaving them without visibility into online accuracy trends. When they shipped, agent adoption of the ML-generated prompts was dramatically lower than expected—a shock given the positive user testing feedback.

To address this monitoring gap, the team developed a heuristic-based “match rate” metric. When agents didn’t click on the ML prompt, they compared what the agent actually sent to users against what the system had suggested. If the texts were similar enough, they could infer the prompt was correct even though it wasn’t used. This provided a crude lower bound on expected accuracy and allowed them to validate that online performance aligned with offline expectations.
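A match-rate heuristic of this kind can be sketched with a plain text-similarity ratio. The 0.8 cutoff below is an illustrative assumption, not Stripe's actual threshold:

```python
import difflib

def is_match(agent_reply: str, ml_prompt: str, cutoff: float = 0.8) -> bool:
    """True if the agent's actual reply is close enough to the unused ML prompt."""
    ratio = difflib.SequenceMatcher(
        None, agent_reply.lower(), ml_prompt.lower()
    ).ratio()
    return ratio >= cutoff

def match_rate(pairs: list[tuple[str, str]]) -> float:
    """Crude lower bound on online accuracy over (agent_reply, ml_prompt) pairs."""
    matches = sum(is_match(reply, prompt) for reply, prompt in pairs)
    return matches / len(pairs) if pairs else 0.0
```

Because agents may paraphrase a correct answer heavily, this metric undercounts matches, which is why it serves as a lower bound rather than a true accuracy measure.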

The root cause turned out not to be model quality—the heuristic metrics showed the ML responses were accurate. Instead, agents were simply too accustomed to their existing workflows and were ignoring the prompts entirely. Solving the actual business problem required substantial UX and agent training efforts beyond ML improvements.

Shadow Mode Deployment Strategy

The team evolved their deployment approach to include “shadow mode” shipping, where models are deployed to production but don’t actually take actions. Instead, predictions are logged with tags to populate dashboards and metrics without interfering with agent workflows. This allows teams to validate production performance, catch issues early, and set accurate expectations for each pipeline stage before full activation.

In subsequent projects, the team adopted a practice of randomly sampling data from shadow mode daily and spending about 20 minutes as a team reviewing and annotating to ensure online performance met accuracy targets at each stage. This incremental shipping approach—rather than waiting for one large end-to-end launch—proved invaluable for debugging and validation.
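The shadow-mode workflow described above amounts to logging tagged predictions without acting on them, then pulling a small random sample for the daily review. A minimal sketch with assumed field names (not Stripe's actual schema):

```python
import datetime
import random

def log_shadow_prediction(log: list, case_id: str, stage: str,
                          prediction: str, confidence: float) -> None:
    """Record a prediction for dashboards without taking any action."""
    log.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "case_id": case_id,
        "stage": stage,           # which pipeline stage produced this
        "prediction": prediction,
        "confidence": confidence,
        "mode": "shadow",         # tag so dashboards separate shadow traffic
    })

def daily_review_sample(log: list, k: int = 20, seed: int = 0) -> list:
    """Random sample for the team's short daily annotation session."""
    rng = random.Random(seed)
    return rng.sample(log, min(k, len(log)))
```

Tagging every record with its pipeline stage is what makes it possible to set accuracy expectations per stage before activating the full system.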

Data Quality Over Model Architecture

The third major lesson reinforces a fundamental truth in applied ML: data quality matters more than model sophistication. The presenter noted that writing code for the LLM framework took days to weeks, while iterating on training datasets took months. More time was spent in Google Sheets reviewing annotations than writing Python code.

Critically, iterating on label data quality yielded higher performance gains than upgrading to more advanced GPT engines. The ML errors they encountered related to gotchas specific to the Stripe support space rather than general language understanding gaps—so adding more or higher-quality data samples typically resolved performance issues. They ultimately didn’t need the latest GPT engines to achieve their performance targets.

For scaling, the team found that moving from generative fine-tuning to classification-based approaches offered significant advantages. Generative fine-tuning adds complexity in data collection, whereas classification enables leveraging weak supervision techniques like Snorkel ML or embeddings across documentation to label data at scale. They invested in subject matter expertise programs to collect and maintain fresh, up-to-date labels as Stripe’s product suite evolves, treating the dataset as a “living oracle” to keep ML responses accurate over time.
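Embedding-based weak labeling of the kind mentioned above can be sketched by assigning each unlabeled message the topic of its most similar documentation article. The sketch below uses toy bag-of-words vectors and made-up documentation snippets purely for illustration; a real system would use learned embeddings:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a trained encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def weak_label(message: str, doc_topics: dict[str, str]) -> str:
    """Assign the topic whose documentation is most similar to the message."""
    vec = embed(message)
    return max(doc_topics, key=lambda t: cosine(vec, embed(doc_topics[t])))

# Illustrative documentation snippets keyed by topic.
docs = {
    "payouts": "pause payouts payout schedule manual",
    "disputes": "dispute chargeback evidence respond",
}
```

Labels produced this way are noisy, which is why they serve as weak supervision for training classifiers rather than as ground truth.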

Team Structure and Resourcing

The initial team was intentionally lightweight given the experimental nature of the project. It included product managers, support operations experts (providing critical business context), a data scientist, and a software engineer. In retrospect, the presenter noted that once feasibility was validated, the team should have been expanded to include UX engineers, ML engineers for production infrastructure, and additional product support. The lesson here is that shipping a pilot can be scrappy, but maintaining ML models in production at scale requires sound infrastructure and broader cross-functional coverage.

Cost Considerations

Compute costs were initially deprioritized while the team focused on validating feasibility. Because the fine-tuning approach let them use smaller GPT engines, costs remained manageable. The team developed a strategy of using GPT as a stepping stone: collecting training data with it at scale, then benchmarking simpler in-house models for tasks that don't require LLM complexity. For example, trigger classification could potentially be handled by a BERT model given sufficient labels, since the concept isn't inherently complex.

Key Takeaways

The presentation concluded with three central lessons for practitioners building LLM-powered production systems:

1. Break context-heavy problems down into smaller, more tractable ML steps rather than relying on a single out-of-the-box LLM.

2. Prioritize online feedback and monitoring from the start: ship in shadow mode, track online metrics, and don't assume that strong offline accuracy will translate into adoption.

3. Invest in high-quality, continuously maintained training data; iterating on labels yielded larger gains than upgrading to more advanced model engines.
