Uber’s QueryGPT represents a compelling case study in taking an LLM-powered tool from hackathon concept to production-ready service for enterprise data access. The system addresses a significant operational challenge: enabling engineers, operations managers, and data scientists to generate SQL queries through natural language prompts rather than requiring deep knowledge of both SQL syntax and Uber’s internal data models. With approximately 1.2 million interactive queries processed monthly on their data platform and the Operations organization contributing around 36% of these queries, the potential productivity gains from automating query authoring are substantial.
The project originated during Uber’s Generative AI Hackdays in May 2023 and underwent over 20 iterations before reaching its current production state. The claimed productivity improvement is reducing query authoring time from approximately 10 minutes to 3 minutes, though these figures should be viewed as estimates rather than rigorously validated metrics. In limited release, the system reported 300 daily active users with 78% indicating time savings compared to writing queries from scratch.
The first version relied on a straightforward Retrieval-Augmented Generation (RAG) approach using few-shot prompting. The system would vectorize the user’s natural language prompt and perform k-nearest neighbor similarity search to retrieve 3 relevant tables and 7 relevant SQL samples from a small dataset of 7 tier-1 tables and 20 SQL queries. These retrieved samples were combined with custom instructions (covering Uber-specific business concepts like date handling) and sent to the LLM for query generation.
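A minimal sketch of this v1 retrieval flow, using a toy bag-of-words similarity in place of a real embedding model (all table DDL and query text below is hypothetical, not Uber's actual catalog):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" standing in for a real vector model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn(query: str, corpus: list[str], k: int) -> list[str]:
    # k-nearest-neighbor similarity search over the small sample corpus.
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

# Hypothetical v1 catalog: a handful of tier-1 table schemas and SQL samples.
tables = ["CREATE TABLE trips (trip_id BIGINT, city STRING, completed_at DATE)",
          "CREATE TABLE drivers (driver_id BIGINT, city STRING)"]
samples = ["SELECT COUNT(*) FROM trips WHERE city = 'Seattle'"]

prompt = "Find the number of trips completed yesterday in Seattle"
# Retrieve up to 3 tables and 7 SQL samples, as in the v1 design.
context = knn(prompt, tables, k=3) + knn(prompt, samples, k=7)
# The retrieved context plus custom instructions would then go to the LLM.
```

The thin lexical overlap between the question and the `CREATE TABLE` text also illustrates why this direct similarity search scaled poorly.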
This approach revealed several significant limitations as the team attempted to scale beyond the initial small dataset:
RAG Quality Issues: Direct similarity search between natural language prompts and SQL/schema samples produced poor results because the semantic spaces don’t align well. A question like “Find the number of trips completed yesterday in Seattle” has limited lexical overlap with CREATE TABLE statements or SELECT queries.
Intent Classification Gap: The system lacked an intermediate step to classify user intent and map it to relevant schemas and samples, leading to retrieval of irrelevant context.
Token Limit Constraints: Large enterprise schemas with 200+ columns could consume 40-60K tokens each. With multiple large tables, the system exceeded the token limits of available models (32K at the time).
The current production architecture introduces several key innovations that demonstrate mature LLMOps practices:
Workspace Concept: Rather than searching across all available data, the system implements “workspaces” as curated collections of SQL samples and tables organized by business domain (Ads, Mobility, Core Services, IT, Platform Engineering, etc.). This domain-specific partitioning significantly narrows the search radius for RAG and improves relevance. The system includes 11+ pre-defined “system workspaces” plus support for user-created “custom workspaces.”
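The workspace idea can be sketched as a simple registry keyed by business domain; the field names, registry structure, and example entries below are illustrative assumptions, not Uber's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Workspace:
    # A curated, domain-scoped collection of schemas and SQL samples.
    name: str
    tables: list[str] = field(default_factory=list)       # CREATE TABLE DDL
    sql_samples: list[str] = field(default_factory=list)  # vetted example queries
    is_system: bool = True  # False for user-created custom workspaces

REGISTRY = {
    "Mobility": Workspace("Mobility",
                          tables=["CREATE TABLE trips (...)"],
                          sql_samples=["SELECT COUNT(*) FROM trips"]),
    "Ads": Workspace("Ads"),
}

def retrieve_context(workspace_name: str) -> tuple[list[str], list[str]]:
    # RAG search now runs only inside the chosen domain, not platform-wide,
    # which narrows the search radius and improves relevance.
    ws = REGISTRY[workspace_name]
    return ws.tables, ws.sql_samples
```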
Multi-Agent Architecture: The system employs a chain of specialized LLM agents, each handling a focused task:
Intent Agent: Maps user prompts to business domains/workspaces before retrieval occurs. This classification step dramatically improves RAG precision by constraining the search space to relevant schemas and samples.
Table Agent: An LLM agent that selects appropriate tables and presents them to the user for confirmation or modification. This introduces a human-in-the-loop step that addresses feedback about incorrect table selection, allowing users to “ACK” or edit the table list before query generation proceeds.
Column Prune Agent: Addresses the token limit challenge by using an LLM to identify and remove irrelevant columns from large schemas before sending to the query generation step. This optimization improved not only token consumption and cost but also reduced latency due to smaller input sizes.
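The agent chain described above can be sketched as a sequence of narrow LLM calls; the prompt wording, function names, and the canned fake model below are all illustrative assumptions, not Uber's implementation:

```python
from typing import Callable

# `llm` is a stand-in for a real model call; each stage is a small,
# specialized task, mirroring the "LLMs are excellent classifiers" lesson.

def intent_agent(llm: Callable[[str], str], prompt: str, workspaces: list[str]) -> str:
    # Map the user prompt to a business domain before any retrieval happens.
    return llm(f"Classify this question into one of {workspaces}: {prompt}")

def table_agent(llm, prompt: str, tables: list[str]) -> list[str]:
    picked = llm(f"Pick relevant tables from {tables} for: {prompt}").split(",")
    # Human-in-the-loop: the user can ACK or edit this list before generation.
    return [t.strip() for t in picked]

def column_prune_agent(llm, prompt: str, schema: dict[str, list[str]]) -> dict[str, list[str]]:
    # Drop columns irrelevant to the question to cut tokens, cost, and latency.
    return {tbl: [c for c in cols
                  if c in llm(f"Which of {cols} matter for: {prompt}")]
            for tbl, cols in schema.items()}

def generate_sql(llm, prompt: str, schema: dict[str, list[str]], samples: list[str]) -> str:
    return llm(f"Write SQL for '{prompt}' using {schema}; examples: {samples}")

# Demo with a canned fake model, purely to show the data flow.
def fake_llm(p: str) -> str:
    if p.startswith("Classify"):
        return "Mobility"
    if p.startswith("Pick"):
        return "trips"
    if p.startswith("Which"):
        return "city, completed_at"
    return "SELECT COUNT(*) FROM trips WHERE city = 'Seattle'"
```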
The architecture demonstrates the principle that decomposing complex tasks into specialized agents improves accuracy compared to asking an LLM to handle a broad generalized task. The article explicitly notes this as a key learning: “LLMs are excellent classifiers” when given small units of specialized work.
The evaluation approach is particularly noteworthy from an LLMOps perspective, as it acknowledges the inherent challenges of evaluating non-deterministic systems:
Golden Dataset Curation: The team manually created a set of question-to-SQL mappings covering various datasets and business domains. This required upfront investment in manually verifying correct intent, required schemas, and golden SQL for real questions extracted from QueryGPT logs.
Multi-Signal Evaluation: Rather than scoring only the final output, the evaluation captures signals throughout the generation pipeline.
Product Flow Testing: The evaluation supports multiple testing modes including a “vanilla” mode measuring baseline performance and a “decoupled” mode that enables component-level evaluation by providing correct inputs at each stage, removing dependencies on earlier component performance.
Handling Non-Determinism: The team explicitly acknowledges that identical evaluations can produce different outcomes due to LLM non-determinism, advising against over-indexing on run-to-run changes of approximately 5%. Instead, they focus on identifying error patterns over longer time periods that can inform specific improvements.
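One way to operationalize this advice is to compare averages across several evaluation runs and only flag differences larger than the expected noise band; the 5% threshold below mirrors the figure quoted above, but the structure is an assumed sketch, not the team's actual tooling:

```python
from statistics import mean

NOISE_BAND = 0.05  # run-to-run variation the team advises not to over-index on

def is_regression(baseline_runs: list[float], candidate_runs: list[float]) -> bool:
    # Average accuracy over several eval runs rather than trusting any single
    # run, then flag only drops that exceed the non-determinism noise band,
    # i.e. error patterns rather than run-to-run jitter.
    return mean(baseline_runs) - mean(candidate_runs) > NOISE_BAND
```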
Visualization and Tracking: The system includes dashboards for tracking question-level run results over time, enabling identification of repeated shortcomings and regressions.
The case study is refreshingly honest about ongoing challenges:
Hallucinations: The system still experiences instances where the LLM generates queries with non-existent tables or columns. The team has experimented with prompt modifications, introduced chat-style iteration modes, and is exploring a “Validation” agent for recursive hallucination correction, but acknowledges this remains unsolved.
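A rough sketch of what such a Validation agent might look like: check generated SQL against a known catalog and feed specific errors back for regeneration. The catalog contents, regex-based table extraction, and retry loop are all assumptions for illustration; a real validator would parse the SQL properly:

```python
import re

# Known schema for validation (illustrative, not Uber's catalog).
CATALOG = {"trips": {"trip_id", "city", "completed_at"}}

def find_hallucinated_tables(sql: str) -> set[str]:
    # Naive extraction of table names after FROM/JOIN.
    referenced = set(re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE))
    return referenced - set(CATALOG)

def validate_and_retry(llm, prompt: str, max_rounds: int = 3) -> str:
    sql = llm(prompt)
    for _ in range(max_rounds):
        bad = find_hallucinated_tables(sql)
        if not bad:
            return sql
        # Feed the specific error back so the model can self-correct.
        sql = llm(f"{prompt}\nThese tables do not exist: {sorted(bad)}. Regenerate.")
    return sql
```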
User Prompt Quality: Real-world user prompts range from detailed and well-specified to brief five-word questions with typos that nonetheless pose broad questions requiring multi-table joins. The team found that relying solely on raw user input caused accuracy issues, leading to the development of prompt enhancement/expansion capabilities.
High Accuracy Expectations: Users expect generated queries to “just work” with high accuracy, creating a challenging bar for a generative system. The team recommends careful selection of initial user personas for products of this nature.
Evaluation Coverage: With hundreds of thousands of datasets at varying documentation levels, the evaluation set cannot fully cover all possible business questions. The team treats evaluation as an evolving artifact that grows with product usage and new bug discoveries.
Multiple Valid Answers: SQL queries often have multiple valid solutions using different tables or writing styles, complicating automated evaluation. The LLM-based similarity scoring helps identify when generated queries achieve the same intent through different approaches.
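An LLM-as-judge equivalence check along these lines might look like the following; the prompt wording and verdict parsing are assumptions for illustration, not the team's actual similarity scorer:

```python
def queries_equivalent(llm, question: str, golden_sql: str, generated_sql: str) -> bool:
    # Exact-match scoring fails when two different queries answer the same
    # question; an LLM judge scores intent-level similarity instead.
    verdict = llm(
        "Do these two SQL queries answer the same question?\n"
        f"Question: {question}\nGolden: {golden_sql}\nGenerated: {generated_sql}\n"
        "Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```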
The system uses OpenAI GPT-4 Turbo with a 128K-token context window (the 1106 model version is mentioned). Vector databases and similarity search (k-nearest neighbor) are employed for the RAG components. The transition from 32K to 128K context models helped address token limit issues, but the Column Prune Agent remained necessary due to the extreme size of some enterprise schemas.
As of publication (September 2024), QueryGPT is in limited release to Operations and Support teams, indicating a cautious rollout approach appropriate for a system that generates executable code. The 300 daily active users and the 78% of users reporting time savings suggest positive early adoption, though broader rollout presumably depends on continued accuracy improvements.
The iterative development process (20+ algorithm versions) and the human-in-the-loop design for table selection demonstrate a pragmatic approach to deploying LLM systems in production where errors have direct operational impact on data access and analysis workflows.
Roblox has implemented a comprehensive suite of generative AI features across their gaming platform, addressing challenges in content moderation, code assistance, and creative tools. Starting with safety features using transformer models for text and voice moderation, they expanded to developer tools including AI code assistance, material generation, and specialized texture creation. The company releases new AI features weekly, emphasizing rapid iteration and public testing, while maintaining a balance between automation and creator control. Their approach combines proprietary solutions with open-source contributions, demonstrating successful large-scale deployment of AI in a production gaming environment serving 70 million daily active users.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
A comprehensive overview of ML infrastructure evolution and LLMOps practices at major tech companies, focusing on DoorDash's approach to integrating LLMs alongside traditional ML systems. The discussion covers how ML infrastructure needs to adapt for LLMs, the importance of maintaining guardrails, and strategies for managing errors and hallucinations in production systems, while balancing the trade-offs between traditional ML models and LLMs in production environments.