Klaviyo, a customer data platform serving 130,000 customers, launched Segments AI in November 2023 to address two key problems: inexperienced users struggling to express customer segments through traditional UI, and experienced users spending excessive time building repetitive complex segments. The solution uses OpenAI's LLMs combined with prompt chaining and few-shot learning techniques to transform natural language descriptions into structured segment definitions adhering to Klaviyo's JSON schema. The team tackled the significant challenge of validating non-deterministic LLM outputs by combining automated LLM-based evaluation with hand-designed test cases, ultimately deploying a production system that required ongoing maintenance due to the stochastic nature of generative AI outputs.
Klaviyo built and deployed Segments AI, a production LLM-powered feature that transforms natural language descriptions into structured customer segment definitions. Released in November 2023, the feature gives Klaviyo's 130,000 customers a new way to build the segments they use to understand and target their end customers. This case study provides valuable insights into the practical challenges of deploying LLM features in production, particularly around validation, prompt engineering strategies, and ongoing maintenance requirements.
The business problem was two-fold: less experienced users lacked the knowledge to translate conceptual customer segments into Klaviyo’s segment builder UI, while highly experienced users were spending hours weekly building repetitive, complex segments manually. Segments AI aimed to democratize segment creation for novices while accelerating workflows for power users.
The system architecture leverages OpenAI’s LLM API but is explicitly described as “far from just being a wrapper around ChatGPT.” The team implemented sophisticated prompt engineering strategies to ensure reliable outputs that conform to Klaviyo’s specific segment JSON schema.
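Conforming to a fixed JSON schema is what separates a usable LLM feature from a raw chat wrapper. As a minimal sketch of the idea, the validator below checks a generated segment against a hypothetical schema; the field names (`name`, `conditions`, `metric`) are invented for illustration and are not Klaviyo's actual schema.

```python
# Minimal sketch: checking LLM output against a hypothetical segment schema.
# The required fields below are illustrative, not Klaviyo's real schema.

REQUIRED_FIELDS = {"name": str, "conditions": list}

def validate_segment(segment: dict) -> list[str]:
    """Return a list of validation errors; an empty list means usable output."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in segment:
            errors.append(f"missing field: {field}")
        elif not isinstance(segment[field], expected_type):
            errors.append(f"wrong type for {field}")
    # Each condition must at least name the metric it filters on.
    for cond in segment.get("conditions", []):
        if not isinstance(cond, dict) or "metric" not in cond:
            errors.append(f"malformed condition: {cond!r}")
    return errors
```

A check like this can gate every model response before it reaches the segment builder, so malformed generations are retried or rejected rather than shown to users.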
Prompt Chaining Strategy: The core technical innovation involves extensive use of prompt chaining, where complex generation tasks are decomposed into small, atomic, and simple requests. Rather than asking the LLM to generate an entire segment definition in one pass, the system breaks the process into discrete subtasks that can be executed and validated independently. The author notes that prompt chaining typically improves generation speed and quality when outputs can be joined without loss of cohesion: ideal for generating parts of a JSON object, but problematic for prose, where sentence-by-sentence generation would create disjointed text.
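The decomposition-and-reassembly pattern can be sketched in a few lines. Here `call_llm` is a stand-in for a real OpenAI API call, and the subtask names and canned answers are invented for illustration; the point is that each prompt answers one narrow question and the fragments are joined afterward.

```python
# Sketch of prompt chaining: each subtask is a small, atomic prompt whose
# output fills one slot of the segment definition. `call_llm` is a stub
# standing in for a focused OpenAI API call; answers are canned here.

CANNED = {
    "extract_timeframe": {"unit": "day", "quantity": 30},
    "extract_metric": {"metric": "Opened Email"},
    "extract_operator": {"operator": "at least", "count": 3},
}

def call_llm(subtask: str, user_request: str) -> dict:
    # In production this would send one narrow prompt to the model.
    return CANNED[subtask]

def build_segment(user_request: str) -> dict:
    parts = {task: call_llm(task, user_request)
             for task in ("extract_timeframe", "extract_metric", "extract_operator")}
    # Join the independently generated fragments into one definition.
    return {
        "condition": {**parts["extract_metric"], **parts["extract_operator"]},
        "timeframe": parts["extract_timeframe"],
    }
```

Because each fragment is a small JSON object, the pieces join without the cohesion loss that would plague sentence-by-sentence prose generation.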
A critical challenge the team identified is "chain entropy": their term for error propagation through sequential LLM calls, akin to a "telephone game" where mistakes compound. To mitigate this risk, they designed subtasks to be as separable and parallelizable as possible, avoiding waterfall structures where early errors cascade through the entire process. The author clarifies that "prompt chaining" is somewhat of a misnomer in their implementation, since many processes run asynchronously in parallel rather than strictly sequentially.
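A hedged sketch of that parallel structure: rather than feeding each model call the previous call's output, independent subtasks fan out concurrently, so one bad answer cannot contaminate the others. The subtask names and the simulated latency are illustrative, not from the source.

```python
import asyncio

# Sketch: running separable subtasks concurrently instead of as a waterfall,
# so an error in one step cannot corrupt the inputs of the others.

async def run_subtask(name: str, request: str) -> tuple[str, str]:
    await asyncio.sleep(0.01)  # stands in for an LLM API round trip
    return name, f"result-for-{name}"

async def generate_segment(request: str) -> dict:
    tasks = [run_subtask(n, request)
             for n in ("timeframe", "metric", "operator")]
    # gather() runs all calls concurrently: total latency is roughly the
    # slowest single call, not the sum, and no call sees another's output.
    results = await asyncio.gather(*tasks)
    return dict(results)
```

The same fan-out shape also keeps latency flat as the number of subtask experts grows, which matters once a single user request spawns many model calls.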
Few-Shot Learning Implementation: Klaviyo implemented few-shot learning not through traditional model fine-tuning but by embedding training examples directly into system instructions. The team describes this as analogous to "reading ChatGPT bedtime stories": teaching the model through annotated examples that include input, ideal output, and generalizable lessons. Combined with prompt chaining, this creates specialized "chatbot agents" with niche expertise for specific subtasks. The system essentially asks "a series of small, highly specific questions to specialized chatbot agents" and assembles the results into the final segment definition. This approach allows them to create dozens of subtask experts without the overhead of explicit fine-tuning.
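A minimal sketch of what "embedding examples in system instructions" might look like: each annotated example carries the input, the ideal output, and the generalizable lesson, and they are concatenated into the system prompt for one specialized agent. The example contents are invented for illustration.

```python
# Sketch of few-shot learning via system instructions, not fine-tuning.
# Each annotated example pairs an input with an ideal output and a lesson;
# the examples themselves are invented for illustration.

EXAMPLES = [
    {
        "input": "people who bought in the last week",
        "output": '{"metric": "Placed Order", "timeframe": {"unit": "day", "quantity": 7}}',
        "lesson": "Translate relative time phrases into unit/quantity pairs.",
    },
    {
        "input": "anyone subscribed to email",
        "output": '{"consent": "email"}',
        "lesson": "Map channel subscriptions to the consent field.",
    },
]

def build_system_prompt(task_description: str) -> str:
    """Assemble the system instructions for one specialized subtask agent."""
    lines = [task_description, "", "Worked examples:"]
    for i, ex in enumerate(EXAMPLES, 1):
        lines += [
            f"Example {i}:",
            f"  Input: {ex['input']}",
            f"  Ideal output: {ex['output']}",
            f"  Lesson: {ex['lesson']}",
        ]
    return "\n".join(lines)
```

Swapping in a different `EXAMPLES` list per subtask is how one generic pattern yields dozens of niche experts without any fine-tuning runs.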
The case study provides particularly valuable insight into the thorny problem of validating LLM features with creative, non-deterministic outputs. The team struggled with fundamental questions: how do you test a feature where there’s no one-to-one mapping between input and desired output? When “dozens of jointly valid ways” exist to define a segment like “engaged users,” traditional testing approaches break down.
The author outlines three validation approaches that have emerged across the industry, each with significant tradeoffs:
LLMs evaluating LLMs: Quick and scalable but prone to hallucinations and inconsistent evaluation criteria. The reliability of using one LLM to judge another’s output remains questionable.
Hand-designed test cases: High quality but slow to create and not easily scalable as test requirements grow. Provides clear ground truth but creates maintenance burden.
Human evaluation: High quality and flexible but expensive, time-intensive, and fundamentally unscalable for large test suites.
Klaviyo ultimately adopted a hybrid approach combining LLM-based evaluation with hand-designed test cases, packaging the validation suite to run before major code changes. The author candidly acknowledges both strategies are “imperfect” but provide useful directional feedback for debugging and regression analysis. Importantly, they caution against over-investing in validation engineering: “If you find yourself spending more than a third of your time tinkering with validation, there are probably better uses of time.” The real test is user adoption and behavior in production, not validation suite performance.
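The hybrid approach can be sketched as two layers: deterministic hand-designed checks that assert ground truth where it exists, plus an LLM judge for cases with many jointly valid answers. Everything here is illustrative: the test cases, the judge stub, and the scoring are assumptions, not Klaviyo's actual suite.

```python
# Sketch of hybrid validation: hand-designed ground-truth checks run first,
# and an LLM judge (stubbed here) scores outputs where many answers are
# "jointly valid". All cases and scores below are invented for illustration.

HAND_DESIGNED_CASES = [
    # (natural-language input, substring any valid output must contain)
    ("people who opened an email", "Opened Email"),
    ("customers who placed an order", "Placed Order"),
]

def llm_judge(request: str, generated: str) -> float:
    # Stand-in for asking a second model to rate validity on a 0.0-1.0 scale.
    return 1.0 if generated else 0.0

def evaluate(generate) -> dict:
    """Run the suite against a generation function; return a summary report."""
    report = {"hard_failures": 0, "judge_scores": []}
    for request, required_field in HAND_DESIGNED_CASES:
        output = generate(request)
        if required_field not in output:      # deterministic ground-truth check
            report["hard_failures"] += 1
        report["judge_scores"].append(llm_judge(request, output))
    return report
```

A suite shaped like this gives the "directional feedback" the author describes: hard failures flag clear regressions, while judge scores track softer quality drift between releases.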
As the feature evolved and functionality expanded, the test case library grew, complicating regression analysis. The team found themselves “flitting between different options week over week,” highlighting the unsettled nature of LLM validation practices even within a single project.
The case study emphasizes that generative AI features fundamentally differ from traditional data science projects in their production lifecycle. The author states plainly: "Generative AI features are never truly finished." This reflects the realities of operating a non-deterministic system in production, where outputs, user expectations, and the underlying models all continue to shift after launch.
Rather than viewing this as purely negative, the team identifies positive spillover effects from the additional maintenance burden. Continuous work on generative AI features builds internal muscle and expertise for tackling similar problems. Production deployment provides invaluable feedback about actual user needs that can’t be captured in development. Cross-functional collaboration between data science and product teams improves through sustained engagement on customer-facing features.
The author reflects on how generative AI features require different approaches than traditional data science projects. They "need to be validated differently, involve new risks, and are often more outward facing than teams are used to." This suggests organizational adaptation is required: data science teams accustomed to internal analytics or model training pipelines must develop new capabilities around user-facing feature development, non-deterministic behavior management, and rapid iteration cycles.
The case study advocates for a learning-by-doing approach: “Teams, people, and even machine learning models learn by example.” The more time data science teams spend building generative AI tooling, the more institutional knowledge they develop. Similarly, extended user exposure to production features yields better understanding of real-world requirements versus theoretical specifications.
The case study provides refreshingly honest insight into the messy realities of production LLM deployment. Several aspects merit balanced consideration:
Strengths: The technical approaches (prompt chaining, few-shot learning via system instructions) represent pragmatic solutions that avoid over-engineering while maintaining quality. The candid discussion of validation challenges and the acknowledgment that perfect validation is neither achievable nor worth pursuing demonstrates mature thinking about production tradeoffs. The emphasis on continuous maintenance as a feature rather than bug of LLM systems sets appropriate expectations.
Open Questions: The case study doesn’t provide quantitative metrics on system performance, reliability, or user adoption. How often do generated segments require manual editing? What percentage of users successfully create segments through natural language versus falling back to traditional UI? How does the system handle ambiguous or underspecified user inputs? The cost implications of the OpenAI API dependency and prompt chaining architecture (multiple API calls per segment generation) aren’t discussed. Additionally, how the team manages prompt versioning, A/B testing of different prompt strategies, and observability into which subtask agents fail most frequently remains unexplored.
Validation Limitations: While the hybrid validation approach is pragmatic, the admission that both LLM-based and hand-designed testing are “imperfect” raises questions about production reliability guarantees. How do they catch regressions that slip through both validation layers? What monitoring exists in production to detect degraded output quality?
Architectural Tradeoffs: The prompt chaining approach trades simplicity and potential cost (multiple API calls) for quality and debuggability. The parallelization strategy mitigates latency concerns but likely increases complexity in error handling and partial failure scenarios. The reliance on OpenAI’s API creates vendor lock-in and exposes the system to external model updates that might change behavior unexpectedly.
Overall, this case study represents a valuable contribution to the emerging body of knowledge around practical LLMOps. It avoids marketing fluff while providing actionable technical details and honest assessment of challenges that remain unsolved. The emphasis on continuous iteration, organizational learning, and realistic expectations for non-deterministic systems provides a mature framework for teams embarking on similar initiatives.
This panel discussion features three AI-native companies—Delphi (personal AI profiles), Seam AI (sales/marketing automation agents), and APIsec (API security testing)—discussing their journeys building production LLM systems over three years. The companies address infrastructure evolution from single-shot prompting to fully agentic systems, the shift toward serverless and scalable architectures, managing costs at scale (including burning through a trillion OpenAI tokens), balancing deterministic workflows with model autonomy, and measuring ROI through outcome-based metrics rather than traditional productivity gains. Key technical themes include moving away from opinionated architectures to let models reason autonomously, implementing state machines for high-confidence decisions, using tools like Pydantic AI and Logfire for instrumentation, and leveraging Pinecone for vector search at scale.
OpenAI's Forward Deployed Engineering (FDE) team, led by Colin Jarvis, embeds with enterprise customers to solve high-value problems using LLMs and deliver production-grade AI applications. The team focuses on problems worth tens of millions to billions in value, working with companies across industries including finance (Morgan Stanley), manufacturing (semiconductors, automotive), telecommunications (T-Mobile, Klarna), and others. By deeply understanding customer domains, building evaluation frameworks, implementing guardrails, and iterating with users over months, the FDE team achieves 20-50% efficiency improvements and high adoption rates (98% at Morgan Stanley). The approach emphasizes solving hard, novel problems from zero-to-one, extracting learnings into reusable products and frameworks (like Swarm and Agent Kit), then scaling solutions across the market while maintaining strategic focus on product development over services revenue.
Ramp, a finance automation platform serving over 50,000 customers, built a comprehensive suite of AI agents to automate manual financial workflows including expense policy enforcement, accounting classification, and invoice processing. The company evolved from building hundreds of isolated agents to consolidating around a single agent framework with thousands of skills, unified through a conversational interface called Omnichat. Their Policy Agent product, which uses LLMs to interpret and enforce expense policies written in natural language, demonstrates significant production deployment challenges and solutions including iterative development starting with simple use cases, extensive evaluation frameworks, human-in-the-loop labeling sessions, and careful context engineering. Additionally, Ramp built an internal coding agent called Ramp Inspect that now accounts for over 50% of production PRs merged weekly, illustrating how AI infrastructure investments enable broader organizational productivity gains.