Twilio Segment developed an LLM-as-Judge evaluation framework to assess and improve their CustomerAI audiences feature, which uses LLMs to generate complex audience queries from natural language. The evaluation system achieved over 90% alignment with human evaluation for ASTs. Segment reports that customers using the feature saw a 3x improvement in median time-to-audience creation and a 95% feature retention rate when audience generation works on the first attempt. The framework includes components for generating synthetic evaluation data, comparing outputs against ground truth, and providing structured scoring mechanisms.
Twilio Segment, a customer data platform company, built an evaluation system based on the “LLM-as-Judge” approach to assess and improve their generative AI-powered audience building feature. The core product challenge was to enable marketers to create sophisticated customer audiences through simple natural language prompts rather than navigating complex user interfaces. This case study shows how to build a robust evaluation framework for a production LLM system, particularly when outputs can have multiple valid representations.
The CustomerAI audiences feature allows marketers to describe an audience in natural language (e.g., “Customers who have purchased at least 1 time”) and have the system automatically generate the corresponding query logic. According to Segment, customers using this feature experienced a 3x improvement in median time-to-audience creation and a 95% feature retention rate when the audience generation works on the first attempt—highlighting the business-critical importance of getting the LLM outputs right.
The fundamental challenge Segment faced was how to evaluate a generative AI system when there can be an unbounded set of “right answers.” Behind the Segment UI, audience definitions are compiled into Abstract Syntax Trees (ASTs): tree-like data structures, similar in shape to nested JSON objects, that represent the structure of the query. The same audience logic can be expressed in multiple semantically equivalent ways. For example, “Customers who have purchased at least 1 time” could also be correctly expressed as “Customers who have purchased more than 0 times.”
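To make the equivalence problem concrete, here is a minimal sketch in Python. The node shape (`event`, `op`, `value`) is a hypothetical simplification, since Segment's actual AST schema is not described here. It shows two count conditions that differ structurally, so an exact-match comparison rejects them, yet select the same customers for every whole-number purchase count.

```python
import operator

# Hypothetical, simplified AST nodes. Segment's real AST schema is not shown
# in the case study; these dicts only illustrate the idea.
AT_LEAST_ONE = {"event": "purchase", "op": ">=", "value": 1}
MORE_THAN_ZERO = {"event": "purchase", "op": ">", "value": 0}

OPS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def matches(node, count):
    """Return True if a customer with `count` events satisfies the node."""
    return OPS[node["op"]](count, node["value"])


def equivalent_on_counts(a, b, max_count=1000):
    """Brute-force check: do both nodes agree for every count in 0..max_count?"""
    if a["event"] != b["event"]:
        return False
    return all(matches(a, n) == matches(b, n) for n in range(max_count + 1))


exact_match = AT_LEAST_ONE == MORE_THAN_ZERO  # False: the dicts differ
semantic_match = equivalent_on_counts(AT_LEAST_ONE, MORE_THAN_ZERO)  # True
```

A brute-force check like this only works for toy conditions over a single integer. Real audience ASTs combine many conditions, which is part of why a rule-based equivalence checker is hard to build.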
Traditional evaluation approaches using exact matching or rule-based systems would fail in this context because they cannot account for semantic equivalence. This is a common challenge in LLMOps for code generation, SQL generation, and any domain where outputs have flexible structure but need to maintain semantic correctness.
Segment adopted the “LLM-as-Judge” paradigm, which uses a separate LLM (the “judge”) to evaluate, compare, and score prompt-output pairs against ground truth examples. This approach draws on recent research including JudgeLM, Prometheus, Generative Judge for Evaluating Alignment, and particularly the LLM-SQL-Solver paper which focuses on determining SQL equivalence—directly relevant to AST evaluation.
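The architecture consists of several interconnected components working together, including an LLM Question Generator Agent that produces synthetic prompts, an AST Generator Agent that produces candidate ASTs, and an LLM Judge that scores each candidate against ground truth.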
This multi-agent approach demonstrates a sophisticated understanding of evaluation requirements in production LLM systems.
One of the most interesting aspects of this case study is the synthetic evaluation data generation approach. Segment had a large dataset of ground truth ASTs from their UI, but these ASTs lacked corresponding natural language prompts since they were created through the UI rather than natural language input.
To solve this, they built an “LLM Question Generator Agent” that takes a ground truth AST and generates a plausible natural language prompt that would produce that AST. This is essentially running the generation process in reverse—extracting prompts from outputs rather than outputs from prompts. The synthetic prompts are then fed into the AST Generator Agent, and the resulting AST is compared against the original ground truth by the LLM Judge.
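The loop described above can be sketched as follows. This is an illustrative outline, not Segment's implementation: the function names, the AST shape, and the stub agents are all assumptions, and the three agents are passed in as plain callables so the sketch runs without any LLM calls.

```python
from typing import Callable, Dict, List

AST = Dict[str, object]  # hypothetical stand-in for the real AST type


def run_synthetic_eval(
    ground_truth_asts: List[AST],
    generate_question: Callable[[AST], str],
    generate_ast: Callable[[str], AST],
    judge: Callable[[str, AST, AST], int],
) -> List[dict]:
    """Run each ground truth AST through the reverse-then-forward loop."""
    results = []
    for truth in ground_truth_asts:
        prompt = generate_question(truth)        # AST -> synthetic prompt
        candidate = generate_ast(prompt)         # prompt -> generated AST
        score = judge(prompt, truth, candidate)  # compare with ground truth
        results.append(
            {"prompt": prompt, "truth": truth, "candidate": candidate, "score": score}
        )
    return results


# Stub agents (assumptions for illustration); in practice each would call an LLM.
def fake_question_generator(ast: AST) -> str:
    return f"Customers who performed {ast['event']} at least {ast['value']} time(s)"


def fake_ast_generator(prompt: str) -> AST:
    return {"event": "purchase", "op": ">=", "value": 1}


def fake_judge(prompt: str, truth: AST, candidate: AST) -> int:
    return 5 if truth == candidate else 1


results = run_synthetic_eval(
    [{"event": "purchase", "op": ">=", "value": 1}],
    fake_question_generator,
    fake_ast_generator,
    fake_judge,
)
```

Passing the agents in as callables keeps the loop independent of any particular model, which makes it straightforward to swap in a different generator or judge.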
This approach to synthetic data generation is valuable for organizations that have structured data but need to create evaluation sets for natural language interfaces. It enables the creation of large-scale evaluation datasets without requiring expensive human annotation.
Segment discovered several practical lessons about using LLMs for evaluation that are broadly applicable:
Discrete Scoring Scales: LLMs struggle with continuous scores. When asked to provide scores from 0 to 100, models tend to output only discrete values like 0 and 100. Segment addressed this by using a discrete 1-5 scale, with 1 being “very bad” and 5 being “perfect.” This made results more interpretable and reliable.
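One way to enforce a discrete scale is to validate the judge's raw output strictly. The sketch below is an assumption about how such a check could look, not Segment's code; it accepts only a bare integer from 1 to 5 and rejects anything else, including values from a 0 to 100 scale or fractional scores.

```python
import re

# Scale endpoints as described above: 1 is "very bad", 5 is "perfect".
MIN_SCORE, MAX_SCORE = 1, 5


def parse_discrete_score(raw: str) -> int:
    """Parse a judge reply that should contain only an integer from 1 to 5."""
    match = re.fullmatch(r"\s*([0-9]+)\s*", raw)
    if match is None:
        raise ValueError(f"expected a bare integer score, got {raw!r}")
    score = int(match.group(1))
    if not MIN_SCORE <= score <= MAX_SCORE:
        raise ValueError(f"score {score} is outside {MIN_SCORE}-{MAX_SCORE}")
    return score
```

Failing loudly on malformed scores, rather than clamping or rounding them, keeps bad judge outputs from silently skewing aggregate results.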
Chain of Thought (CoT) Reasoning: Implementing Chain of Thought prompting for the judge model improved alignment with human evaluators from approximately 89% to 92%. CoT allows the model to explain its reasoning, which serves dual purposes: it improves the quality of judgments and makes it easier for engineers to understand and debug the evaluation process. This transparency is crucial for building trust in automated evaluation systems.
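A Chain of Thought judge prompt might look like the sketch below. The template wording and the "Score: N" output convention are hypothetical, since the actual prompt is not published in this case study. The idea is to ask for reasoning first and a final score line last, then split the two when parsing.

```python
import json
import re

# Hypothetical template; the real judge prompt is not shown in the case study.
JUDGE_TEMPLATE = """You are grading a generated audience AST against a ground truth AST.
User prompt: {prompt}
Ground truth AST: {truth}
Generated AST: {candidate}
First explain step by step whether the two ASTs select the same customers.
Then finish with a final line of the form "Score: N", where N is an integer
from 1 (very bad) to 5 (perfect)."""


def build_judge_prompt(prompt: str, truth: dict, candidate: dict) -> str:
    """Fill the template, serializing both ASTs as JSON with stable key order."""
    return JUDGE_TEMPLATE.format(
        prompt=prompt,
        truth=json.dumps(truth, sort_keys=True),
        candidate=json.dumps(candidate, sort_keys=True),
    )


def parse_judgment(response: str):
    """Split a judge response into (reasoning, score)."""
    match = re.search(r"^Score:\s*([1-5])\s*$", response, flags=re.MULTILINE)
    if match is None:
        raise ValueError("judge response has no valid 'Score: N' line")
    reasoning = response[: match.start()].strip()
    return reasoning, int(match.group(1))


reasoning, score = parse_judgment(
    "Both ASTs require at least one purchase event.\nScore: 5"
)
```

Keeping the reasoning text alongside the score means engineers can inspect why a given AST was marked down, which is the debugging benefit described above.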
The evaluation framework enabled systematic model comparison, which is one of the primary use cases for any LLMOps evaluation system. Segment tested multiple models for the AST Generator Agent.
For the Judge model itself, Segment used OpenAI’s GPT-4. Interestingly, they found that using other strong models like Claude 3 Opus as the judge produced similar scores to GPT-4, suggesting good alignment between different frontier models when used as evaluators.
The overall LLM Judge Evaluation system achieved over 90% alignment with human evaluation for ASTs, which is a strong result that justified moving to production with this automated evaluation approach.
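The case study does not define how alignment was measured. As one plausible reading, the sketch below treats alignment as the fraction of examples where the judge's score agrees with a human's score, optionally within a tolerance. The function name and the score lists are made up for illustration.

```python
def alignment_rate(judge_scores, human_scores, tolerance=0):
    """Fraction of examples where judge and human scores differ by <= tolerance."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not judge_scores:
        raise ValueError("score lists must not be empty")
    agreements = sum(
        abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores)
    )
    return agreements / len(judge_scores)


# Made-up scores on a 1-5 scale; the two lists disagree on one of ten examples.
judge = [5, 4, 5, 2, 5, 1, 5, 3, 5, 5]
human = [5, 4, 5, 2, 5, 1, 5, 4, 5, 5]
rate = alignment_rate(judge, human)
```

Other definitions are possible, such as agreement on a pass/fail threshold or a rank correlation, and they can give noticeably different numbers for the same data.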
The evaluation framework serves multiple LLMOps purposes beyond one-time model selection.
Having baseline scores enables continuous iteration and optimization. As Segment explores adding persistent memory via RAG, adopting new models, or changing prompting strategies, they can compare new scores against baselines to quantify impact.
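A baseline comparison can be as simple as comparing mean judge scores on the same evaluation set before and after a change. The sketch below is illustrative only: the function name, the `min_gain` threshold, and the scores are assumptions, not values from the case study.

```python
from statistics import mean


def compare_to_baseline(baseline_scores, candidate_scores, min_gain=0.0):
    """Summarize how a candidate configuration scores relative to a baseline."""
    baseline_mean = mean(baseline_scores)
    candidate_mean = mean(candidate_scores)
    delta = candidate_mean - baseline_mean
    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "delta": delta,
        "improved": delta > min_gain,
    }


# Made-up judge scores (1-5) for the same five evaluation examples.
report = compare_to_baseline([4, 5, 4, 5, 4], [5, 5, 4, 5, 5])
```

With small evaluation sets, a difference in means can easily be noise, so setting `min_gain` above zero or running a significance test is a reasonable precaution before adopting a change.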
Segment emphasizes their AI principles of being Transparent, Responsible, and Accountable. They reference a “Generative Audiences Nutrition Facts Label” that documents how data is used for this feature. This is an important aspect of production LLM systems that is often overlooked in technical discussions but is critical for customer trust and regulatory compliance.
Segment outlined several planned optimizations that reflect ongoing LLMOps challenges.
While this case study provides valuable insights, there are some considerations worth noting. The 90% alignment with human evaluation sounds impressive, but the remaining 10% of cases where the automated judge disagrees with humans could represent edge cases that are disproportionately important. The case study does not detail how they handle or investigate these disagreement cases.
Additionally, using an LLM to evaluate another LLM creates potential blind spots—both models may share similar failure modes that neither would catch. Segment’s finding that different frontier models (GPT-4, Claude 3 Opus) produce similar judgments could be interpreted positively (agreement indicates correctness) or negatively (shared biases in training).
The synthetic data generation approach, while clever, may introduce distribution shift if the LLM-generated prompts don’t accurately reflect how real users phrase their requests. This could lead to optimizing for a different distribution than production traffic.
Despite these caveats, the LLM-as-Judge approach represents a practical and scalable solution to a real production challenge, and Segment’s willingness to share specific metrics (90% alignment, 4.55/5.0 scores, 89% to 92% improvement with CoT) provides useful benchmarks for others implementing similar systems.