Klaviyo, a customer data platform serving 130,000 customers, launched Segments AI in November 2023 to address two key problems: inexperienced users struggling to express customer segments through traditional UI, and experienced users spending excessive time building repetitive complex segments. The solution uses OpenAI's LLMs combined with prompt chaining and few-shot learning techniques to transform natural language descriptions into structured segment definitions adhering to Klaviyo's JSON schema. The team tackled the significant challenge of validating non-deterministic LLM outputs by combining automated LLM-based evaluation with hand-designed test cases, ultimately deploying a production system that required ongoing maintenance due to the stochastic nature of generative AI outputs.
Klaviyo built and deployed Segments AI, a production LLM-powered feature that transforms natural language descriptions into structured customer segment definitions. Released in November 2023, the feature gives Klaviyo's 130,000 customers a new way to build the segments they use to understand and target their end customers. This case study provides valuable insights into the practical challenges of deploying LLM features in production, particularly around validation, prompt engineering strategies, and ongoing maintenance requirements.
The business problem was two-fold: less experienced users lacked the knowledge to translate conceptual customer segments into Klaviyo’s segment builder UI, while highly experienced users were spending hours weekly building repetitive, complex segments manually. Segments AI aimed to democratize segment creation for novices while accelerating workflows for power users.
The system architecture leverages OpenAI’s LLM API but is explicitly described as “far from just being a wrapper around ChatGPT.” The team implemented sophisticated prompt engineering strategies to ensure reliable outputs that conform to Klaviyo’s specific segment JSON schema.
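Conforming to a fixed JSON schema is what separates a usable LLM feature from a raw chat wrapper. As a minimal sketch of the idea, the validator below checks a generated segment against a hypothetical schema; the field names (`name`, `conditions`, `metric`) are invented for illustration and are not Klaviyo's actual schema.

```python
# Minimal sketch: checking LLM output against a hypothetical segment schema.
# The required fields below are illustrative, not Klaviyo's real schema.

REQUIRED_FIELDS = {"name": str, "conditions": list}

def validate_segment(segment: dict) -> list[str]:
    """Return a list of validation errors; an empty list means usable output."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in segment:
            errors.append(f"missing field: {field}")
        elif not isinstance(segment[field], expected_type):
            errors.append(f"wrong type for {field}")
    # Each condition must at least name the metric it filters on.
    for cond in segment.get("conditions", []):
        if not isinstance(cond, dict) or "metric" not in cond:
            errors.append(f"malformed condition: {cond!r}")
    return errors
```

A check like this can gate every model response before it reaches the segment builder, so malformed generations are retried or rejected rather than shown to users.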
Prompt Chaining Strategy: The core technical innovation involves extensive use of prompt chaining, where complex generation tasks are decomposed into small, atomic, and simple requests. Rather than asking the LLM to generate an entire segment definition in one pass, the system breaks the process into discrete subtasks that can be executed and validated independently. The author notes that prompt chaining typically improves generation speed and quality when outputs can be joined without loss of cohesion: ideal for generating parts of a JSON object, but problematic for prose, where sentence-by-sentence generation would create disjointed text.
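The decomposition-and-reassembly pattern can be sketched in a few lines. Here `call_llm` is a stand-in for a real OpenAI API call, and the subtask names and canned answers are invented for illustration; the point is that each prompt answers one narrow question and the fragments are joined afterward.

```python
# Sketch of prompt chaining: each subtask is a small, atomic prompt whose
# output fills one slot of the segment definition. `call_llm` is a stub
# standing in for a focused OpenAI API call; answers are canned here.

CANNED = {
    "extract_timeframe": {"unit": "day", "quantity": 30},
    "extract_metric": {"metric": "Opened Email"},
    "extract_operator": {"operator": "at least", "count": 3},
}

def call_llm(subtask: str, user_request: str) -> dict:
    # In production this would send one narrow prompt to the model.
    return CANNED[subtask]

def build_segment(user_request: str) -> dict:
    parts = {task: call_llm(task, user_request)
             for task in ("extract_timeframe", "extract_metric", "extract_operator")}
    # Join the independently generated fragments into one definition.
    return {
        "condition": {**parts["extract_metric"], **parts["extract_operator"]},
        "timeframe": parts["extract_timeframe"],
    }
```

Because each fragment is a small JSON object, the pieces join without the cohesion loss that would plague sentence-by-sentence prose generation.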
A critical challenge the team identified is "chain entropy": their term for error propagation through sequential LLM calls, akin to a "telephone game" where mistakes compound. To mitigate this risk, they designed subtasks to be as separable and parallelizable as possible, avoiding waterfall structures where early errors cascade through the entire process. The author clarifies that "prompt chaining" is somewhat of a misnomer in their implementation, since many processes run asynchronously in parallel rather than strictly sequentially.
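A hedged sketch of that parallel structure: rather than feeding each model call the previous call's output, independent subtasks fan out concurrently, so one bad answer cannot contaminate the others. The subtask names and the simulated latency are illustrative, not from the source.

```python
import asyncio

# Sketch: running separable subtasks concurrently instead of as a waterfall,
# so an error in one step cannot corrupt the inputs of the others.

async def run_subtask(name: str, request: str) -> tuple[str, str]:
    await asyncio.sleep(0.01)  # stands in for an LLM API round trip
    return name, f"result-for-{name}"

async def generate_segment(request: str) -> dict:
    tasks = [run_subtask(n, request)
             for n in ("timeframe", "metric", "operator")]
    # gather() runs all calls concurrently: total latency is roughly the
    # slowest single call, not the sum, and no call sees another's output.
    results = await asyncio.gather(*tasks)
    return dict(results)
```

The same fan-out shape also keeps latency flat as the number of subtask experts grows, which matters once a single user request spawns many model calls.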
Few-Shot Learning Implementation: Klaviyo implemented few-shot learning not through traditional model fine-tuning but by embedding training examples directly into system instructions. The team describes this as analogous to "reading ChatGPT bedtime stories": teaching the model through annotated examples that include input, ideal output, and generalizable lessons. Combined with prompt chaining, this creates specialized "chatbot agents" with niche expertise for specific subtasks. The system essentially asks "a series of small, highly specific questions to specialized chatbot agents" and assembles the results into the final segment definition. This approach allows them to create dozens of subtask experts without the overhead of explicit fine-tuning.
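A minimal sketch of what "embedding examples in system instructions" might look like: each annotated example carries the input, the ideal output, and the generalizable lesson, and they are concatenated into the system prompt for one specialized agent. The example contents are invented for illustration.

```python
# Sketch of few-shot learning via system instructions, not fine-tuning.
# Each annotated example pairs an input with an ideal output and a lesson;
# the examples themselves are invented for illustration.

EXAMPLES = [
    {
        "input": "people who bought in the last week",
        "output": '{"metric": "Placed Order", "timeframe": {"unit": "day", "quantity": 7}}',
        "lesson": "Translate relative time phrases into unit/quantity pairs.",
    },
    {
        "input": "anyone subscribed to email",
        "output": '{"consent": "email"}',
        "lesson": "Map channel subscriptions to the consent field.",
    },
]

def build_system_prompt(task_description: str) -> str:
    """Assemble the system instructions for one specialized subtask agent."""
    lines = [task_description, "", "Worked examples:"]
    for i, ex in enumerate(EXAMPLES, 1):
        lines += [
            f"Example {i}:",
            f"  Input: {ex['input']}",
            f"  Ideal output: {ex['output']}",
            f"  Lesson: {ex['lesson']}",
        ]
    return "\n".join(lines)
```

Swapping in a different `EXAMPLES` list per subtask is how one generic pattern yields dozens of niche experts without any fine-tuning runs.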
The case study provides particularly valuable insight into the thorny problem of validating LLM features with creative, non-deterministic outputs. The team struggled with fundamental questions: how do you test a feature where there’s no one-to-one mapping between input and desired output? When “dozens of jointly valid ways” exist to define a segment like “engaged users,” traditional testing approaches break down.
The author outlines three validation approaches that have emerged across the industry, each with significant tradeoffs:
LLMs evaluating LLMs: Quick and scalable but prone to hallucinations and inconsistent evaluation criteria. The reliability of using one LLM to judge another’s output remains questionable.
Hand-designed test cases: High quality but slow to create and not easily scalable as test requirements grow. Provides clear ground truth but creates maintenance burden.
Human evaluation: High quality and flexible but expensive, time-intensive, and fundamentally unscalable for large test suites.
Klaviyo ultimately adopted a hybrid approach combining LLM-based evaluation with hand-designed test cases, packaging the validation suite to run before major code changes. The author candidly acknowledges both strategies are “imperfect” but provide useful directional feedback for debugging and regression analysis. Importantly, they caution against over-investing in validation engineering: “If you find yourself spending more than a third of your time tinkering with validation, there are probably better uses of time.” The real test is user adoption and behavior in production, not validation suite performance.
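The hybrid approach can be sketched as two layers: deterministic hand-designed checks that assert ground truth where it exists, plus an LLM judge for cases with many jointly valid answers. Everything here is illustrative: the test cases, the judge stub, and the scoring are assumptions, not Klaviyo's actual suite.

```python
# Sketch of hybrid validation: hand-designed ground-truth checks run first,
# and an LLM judge (stubbed here) scores outputs where many answers are
# "jointly valid". All cases and scores below are invented for illustration.

HAND_DESIGNED_CASES = [
    # (natural-language input, substring any valid output must contain)
    ("people who opened an email", "Opened Email"),
    ("customers who placed an order", "Placed Order"),
]

def llm_judge(request: str, generated: str) -> float:
    # Stand-in for asking a second model to rate validity on a 0.0-1.0 scale.
    return 1.0 if generated else 0.0

def evaluate(generate) -> dict:
    """Run the suite against a generation function; return a summary report."""
    report = {"hard_failures": 0, "judge_scores": []}
    for request, required_field in HAND_DESIGNED_CASES:
        output = generate(request)
        if required_field not in output:      # deterministic ground-truth check
            report["hard_failures"] += 1
        report["judge_scores"].append(llm_judge(request, output))
    return report
```

A suite shaped like this gives the "directional feedback" the author describes: hard failures flag clear regressions, while judge scores track softer quality drift between releases.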
As the feature evolved and functionality expanded, the test case library grew, complicating regression analysis. The team found themselves “flitting between different options week over week,” highlighting the unsettled nature of LLM validation practices even within a single project.
The case study emphasizes that generative AI features fundamentally differ from traditional data science projects in their production lifecycle. The author states plainly: "Generative AI features are never truly finished." This reflects the realities of operating a non-deterministic system in production, where outputs, user expectations, and the underlying models all continue to shift after launch.
Rather than viewing this as purely negative, the team identifies positive spillover effects from the additional maintenance burden. Continuous work on generative AI features builds internal muscle and expertise for tackling similar problems. Production deployment provides invaluable feedback about actual user needs that can’t be captured in development. Cross-functional collaboration between data science and product teams improves through sustained engagement on customer-facing features.
The author reflects on how generative AI features require different approaches than traditional data science projects. They "need to be validated differently, involve new risks, and are often more outward facing than teams are used to." This suggests organizational adaptation is required: data science teams accustomed to internal analytics or model training pipelines must develop new capabilities around user-facing feature development, non-deterministic behavior management, and rapid iteration cycles.
The case study advocates for a learning-by-doing approach: “Teams, people, and even machine learning models learn by example.” The more time data science teams spend building generative AI tooling, the more institutional knowledge they develop. Similarly, extended user exposure to production features yields better understanding of real-world requirements versus theoretical specifications.
The case study provides refreshingly honest insight into the messy realities of production LLM deployment. Several aspects merit balanced consideration:
Strengths: The technical approaches (prompt chaining, few-shot learning via system instructions) represent pragmatic solutions that avoid over-engineering while maintaining quality. The candid discussion of validation challenges and the acknowledgment that perfect validation is neither achievable nor worth pursuing demonstrates mature thinking about production tradeoffs. The emphasis on continuous maintenance as a feature rather than bug of LLM systems sets appropriate expectations.
Open Questions: The case study doesn’t provide quantitative metrics on system performance, reliability, or user adoption. How often do generated segments require manual editing? What percentage of users successfully create segments through natural language versus falling back to traditional UI? How does the system handle ambiguous or underspecified user inputs? The cost implications of the OpenAI API dependency and prompt chaining architecture (multiple API calls per segment generation) aren’t discussed. Additionally, how the team manages prompt versioning, A/B testing of different prompt strategies, and observability into which subtask agents fail most frequently remains unexplored.
Validation Limitations: While the hybrid validation approach is pragmatic, the admission that both LLM-based and hand-designed testing are “imperfect” raises questions about production reliability guarantees. How do they catch regressions that slip through both validation layers? What monitoring exists in production to detect degraded output quality?
Architectural Tradeoffs: The prompt chaining approach trades simplicity and potential cost (multiple API calls) for quality and debuggability. The parallelization strategy mitigates latency concerns but likely increases complexity in error handling and partial failure scenarios. The reliance on OpenAI’s API creates vendor lock-in and exposes the system to external model updates that might change behavior unexpectedly.
Overall, this case study represents a valuable contribution to the emerging body of knowledge around practical LLMOps. It avoids marketing fluff while providing actionable technical details and honest assessment of challenges that remain unsolved. The emphasis on continuous iteration, organizational learning, and realistic expectations for non-deterministic systems provides a mature framework for teams embarking on similar initiatives.
This panel discussion features three AI-native companies—Delphi (personal AI profiles), Seam AI (sales/marketing automation agents), and APIsec (API security testing)—discussing their journeys building production LLM systems over three years. The companies address infrastructure evolution from single-shot prompting to fully agentic systems, the shift toward serverless and scalable architectures, managing costs at scale (including burning through a trillion OpenAI tokens), balancing deterministic workflows with model autonomy, and measuring ROI through outcome-based metrics rather than traditional productivity gains. Key technical themes include moving away from opinionated architectures to let models reason autonomously, implementing state machines for high-confidence decisions, using tools like Pydantic AI and Logfire for instrumentation, and leveraging Pinecone for vector search at scale.
OpenAI's Forward Deployed Engineering (FDE) team, led by Colin Jarvis, embeds with enterprise customers to solve high-value problems using LLMs and deliver production-grade AI applications. The team focuses on problems worth tens of millions to billions in value, working with companies across industries including finance (Morgan Stanley), manufacturing (semiconductors, automotive), telecommunications (T-Mobile, Klarna), and others. By deeply understanding customer domains, building evaluation frameworks, implementing guardrails, and iterating with users over months, the FDE team achieves 20-50% efficiency improvements and high adoption rates (98% at Morgan Stanley). The approach emphasizes solving hard, novel problems from zero-to-one, extracting learnings into reusable products and frameworks (like Swarm and Agent Kit), then scaling solutions across the market while maintaining strategic focus on product development over services revenue.
Ramp, a finance automation platform serving over 50,000 customers, built a comprehensive suite of AI agents to automate manual financial workflows including expense policy enforcement, accounting classification, and invoice processing. The company evolved from building hundreds of isolated agents to consolidating around a single agent framework with thousands of skills, unified through a conversational interface called Omnichat. Their Policy Agent product, which uses LLMs to interpret and enforce expense policies written in natural language, demonstrates significant production deployment challenges and solutions including iterative development starting with simple use cases, extensive evaluation frameworks, human-in-the-loop labeling sessions, and careful context engineering. Additionally, Ramp built an internal coding agent called Ramp Inspect that now accounts for over 50% of production PRs merged weekly, illustrating how AI infrastructure investments enable broader organizational productivity gains.