
Enterprise-Scale Prompt Engineering Toolkit with Lifecycle Management and Production Integration

Uber 2023

Uber developed a comprehensive prompt engineering toolkit to address the challenges of managing and deploying LLMs at scale. The toolkit provides centralized prompt template management, version control, evaluation frameworks, and production deployment capabilities. It includes features for prompt creation, iteration, testing, and monitoring, along with support for both offline batch processing and online serving. The system integrates with their existing infrastructure and supports use cases like rider name validation and support ticket summarization.

Industry: Tech
Overview

Uber’s Prompt Engineering Toolkit represents a comprehensive enterprise solution for managing the full lifecycle of LLM interactions at scale. The toolkit addresses a fundamental challenge in large organizations: how to centralize and standardize the creation, management, execution, and monitoring of prompt templates across diverse teams and use cases. This case study provides valuable insights into how a major technology company approaches LLMOps infrastructure, though it should be noted that the blog post is primarily a technical overview rather than a results-focused case study with quantitative outcomes.

The core motivation behind the toolkit was the need for centralization—enabling teams to seamlessly construct prompt templates, manage them with proper governance, and execute them against various underlying LLMs. This reflects a common enterprise challenge where individual teams may otherwise develop siloed approaches to LLM integration, leading to inconsistency, duplication of effort, and difficulty maintaining quality standards.

Architecture and Technical Components

The toolkit architecture consists of several interconnected components that work together to facilitate LLM deployment, prompt evaluation, and batch inference. At the core is a Prompt Template UI/SDK that manages prompt templates and their revisions. This integrates with key APIs—specifically GetAPI for retrieving templates and Execute API for running prompts against models.
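The Get/Execute split described above can be sketched as a small in-memory registry. This is an illustrative assumption, not Uber's actual API: the names `TemplateRegistry`, `get`, and `execute` are hypothetical, but the pattern — fetch a specific template revision, hydrate it, then run it against a model — mirrors the description.

```python
from dataclasses import dataclass

# Hypothetical sketch of the Get API / Execute API pattern; names and
# signatures are illustrative, not Uber's implementation.

@dataclass
class PromptTemplate:
    name: str
    revision: int
    text: str  # contains {placeholder} slots


class TemplateRegistry:
    def __init__(self):
        self._store = {}  # (name, revision) -> PromptTemplate

    def put(self, template: PromptTemplate) -> None:
        self._store[(template.name, template.revision)] = template

    def get(self, name: str, revision: int) -> PromptTemplate:
        # Analogue of the Get API: fetch a specific template revision.
        return self._store[(name, revision)]

    def execute(self, name: str, revision: int, llm, **params) -> str:
        # Analogue of the Execute API: hydrate placeholders, call a model.
        prompt = self.get(name, revision).text.format(**params)
        return llm(prompt)


registry = TemplateRegistry()
registry.put(PromptTemplate("greet", 1, "Say hello to {user}."))
fake_llm = lambda prompt: f"LLM saw: {prompt}"
result = registry.execute("greet", 1, fake_llm, user="Ada")
```

Keeping retrieval and execution as separate operations lets callers pin a revision explicitly rather than always running whatever is latest.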

The underlying infrastructure leverages ETCD and UCS (Object Configuration Storage) for storing models and prompts. These stored artifacts feed into two critical pipelines: an Offline Generation Pipeline for batch processing and a Prompt Template Evaluation Pipeline for assessing prompt quality. The system also integrates with what Uber calls “ObjectConfig,” an internal configuration deployment system that handles the safe dissemination of deployed prompt templates to production services.

A notable architectural decision is the use of Uber’s internal “Langfx” framework, which is built on top of LangChain. This abstraction layer enables the auto-prompt builder functionality while providing a standardized interface for LLM interactions across the organization.

Prompt Engineering Lifecycle

The toolkit structures the prompt engineering process into two distinct stages: development and productionization. This separation reflects mature LLMOps thinking about the differences between experimentation and production deployment.

Development Stage

The development stage comprises three phases. First, in LLM exploration, users interact with a model catalog and the GenAI Playground to understand what models are available and test their applicability to specific use cases. The model catalog contains detailed specifications, expected use cases, cost estimations, and performance metrics for each model—information critical for making informed decisions about model selection.
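A model-catalog entry of the kind described might look like the following sketch. The fields mirror what the post lists (expected use cases, cost estimation, performance), but the schema and selection helper are assumptions for illustration.

```python
from dataclasses import dataclass

# Assumed catalog schema; field names and values are illustrative.

@dataclass
class ModelCatalogEntry:
    name: str
    expected_use_cases: list
    cost_per_1k_tokens_usd: float
    avg_latency_ms: int


catalog = [
    ModelCatalogEntry("model-small", ["classification"], 0.0005, 150),
    ModelCatalogEntry("model-large", ["summarization"], 0.0030, 900),
]


def cheapest_for(use_case: str, entries: list) -> ModelCatalogEntry:
    # Pick the lowest-cost model that lists the target use case.
    candidates = [m for m in entries if use_case in m.expected_use_cases]
    return min(candidates, key=lambda m: m.cost_per_1k_tokens_usd)


pick = cheapest_for("summarization", catalog)
```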

The prompt template iteration phase is where the core prompt engineering work happens. Users identify business needs, gather sample data, create and test prompts, and make iterative revisions. The toolkit includes an auto-prompting feature that suggests prompt creation to help users avoid starting from scratch. A prompt template catalog enables discovery and reuse of existing templates, promoting organizational knowledge sharing.

The evaluation phase focuses on testing prompt templates against extensive datasets to measure performance. The toolkit supports two evaluation mechanisms: using an LLM as the evaluator (the “LLM as Judge” paradigm) and using custom, user-defined code for assessment. The LLM-as-judge approach is noted as particularly useful for subjective quality assessments or linguistic nuances, while code-based evaluation allows for highly tailored metrics.
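The two evaluation modes can be sketched side by side. The judge prompt wording, scoring scale, and function names below are assumptions; only the overall pattern — an LLM scoring subjective quality versus user-defined code computing an exact metric — comes from the post.

```python
# Hedged sketch of the two evaluation mechanisms; prompt wording and
# the 0-1 scoring scheme are illustrative assumptions.

def llm_as_judge(llm, candidate_output: str, reference: str) -> float:
    """Ask a judge model to score output quality on a 0-1 scale."""
    judge_prompt = (
        "Rate from 0 to 1 how well the RESPONSE matches the REFERENCE.\n"
        f"REFERENCE: {reference}\nRESPONSE: {candidate_output}\nScore:"
    )
    return float(llm(judge_prompt))


def code_based_eval(candidate_output: str, reference: str) -> float:
    """User-defined metric: here, simple exact-match accuracy."""
    return 1.0 if candidate_output.strip() == reference.strip() else 0.0


# A fake judge returning a fixed score keeps the sketch self-contained.
fake_judge = lambda prompt: "0.9"
judge_score = llm_as_judge(fake_judge, "Paris", "Paris, France")
exact_score = code_based_eval("Paris", "Paris")
```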

Productionization Stage

The productionization stage only proceeds with prompt templates that have passed evaluation thresholds. This gatekeeping mechanism is a critical LLMOps control that helps prevent poorly-performing prompts from reaching production. Once deployed, the system enables tracking and monitoring of usage in the production environment, with data collection informing further enhancements.

Version Control and Safe Deployment

One of the more sophisticated aspects of the toolkit is its approach to version control and deployment safety. Prompt template iteration follows code-based iteration best practices, requiring code review for every iteration. When changes are approved and landed, a new prompt template revision is created.

The system addresses a subtle but important concern: users may not want their production prompt templates altered with each update, as inadvertent errors in revisions could impact live systems. To solve this, the toolkit supports a deployment naming system where prompt templates can be deployed under arbitrary deployment names, allowing users to “tag” their preferred prompt template for production. This prevents accidental changes to production services.
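The tagging mechanism can be sketched as a mapping from a stable deployment name to a pinned revision, so landing a new revision never changes production until the tag is moved. Class and method names are illustrative assumptions.

```python
# Sketch of deployment-name tagging; names are hypothetical, the
# behavior (new revisions land without touching "prod") follows the post.

class DeploymentTags:
    def __init__(self):
        self._revisions = {}  # template name -> latest revision number
        self._tags = {}       # (template, tag) -> pinned revision

    def land_revision(self, template: str) -> int:
        # A code-reviewed change lands: bump the revision counter.
        rev = self._revisions.get(template, 0) + 1
        self._revisions[template] = rev
        return rev

    def tag(self, template: str, tag: str, revision: int) -> None:
        # Explicitly pin a deployment name to a chosen revision.
        self._tags[(template, tag)] = revision

    def resolve(self, template: str, tag: str) -> int:
        # Production services resolve the tag, not "latest".
        return self._tags[(template, tag)]


tags = DeploymentTags()
r1 = tags.land_revision("summarize_ticket")      # revision 1
tags.tag("summarize_ticket", "prod", r1)         # pin prod to revision 1
r2 = tags.land_revision("summarize_ticket")      # revision 2 lands...
prod_rev = tags.resolve("summarize_ticket", "prod")  # ...prod stays on 1
```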

The deployment mechanism leverages ObjectConfig for what Uber calls “universal configuration synchronization,” ensuring that production services fetch the correct prompt template upon deployment. This approach mirrors configuration management practices from traditional software engineering, adapted for the LLM context.

Advanced Prompting Techniques

The toolkit incorporates several research-backed prompting techniques into its auto-prompt builder. These include Chain of Thought (CoT) prompting for complex reasoning tasks, Auto-CoT for automatic reasoning chain generation, prompt chaining for multi-step operations, Tree of Thought (ToT) for exploratory problem-solving, Automatic Prompt Engineer (APE) for instruction generation and selection, and Multimodal CoT for incorporating both text and vision modalities.
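Two of the listed techniques are simple enough to sketch directly. The zero-shot CoT trigger phrase and the chaining helper below are generic illustrations of these published techniques, not Uber's auto-prompt builder.

```python
# Illustrative sketches of Chain-of-Thought and prompt chaining;
# wording and function names are assumptions.

def with_cot(question: str) -> str:
    # Zero-shot CoT: append a reasoning trigger to elicit step-by-step work.
    return f"{question}\nLet's think step by step."


def chain(llm, prompts: list) -> str:
    # Prompt chaining: feed each step's output into the next step's prompt.
    context = ""
    for prompt in prompts:
        context = llm(prompt.format(previous=context))
    return context


cot_prompt = with_cot("Is 'xX_42_Xx' a plausible human name?")
# An echo "model" keeps the chaining sketch runnable without an API.
echo_llm = lambda p: p.upper()
chained = chain(echo_llm, ["step one {previous}", "step two {previous}"])
```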

By embedding these techniques into the platform’s guidance system, Uber aims to democratize advanced prompting capabilities—enabling users without deep ML expertise to leverage sophisticated approaches. However, it’s worth noting that the blog doesn’t provide quantitative evidence on how effectively these techniques improve outcomes compared to simpler approaches.

Production Use Cases

The blog describes two concrete production use cases. The first is an offline batch processing scenario for rider name validation, which verifies the legitimacy of consumer usernames. The LLM Batch Offline Generation pipeline processes all existing usernames in Uber’s consumer database plus new registrations asynchronously in batches. The prompt template uses dynamic placeholders (e.g., “Is this {{user_name}} a valid human name?”) that get hydrated from dataset columns during processing.
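The placeholder hydration step can be sketched as follows; the template string comes from the post, while the `hydrate` helper and row schema are assumptions about how a batch pipeline might substitute dataset columns.

```python
import re

# Template text from the post; the hydration helper is an assumption.
TEMPLATE = "Is this {{user_name}} a valid human name?"


def hydrate(template: str, row: dict) -> str:
    # Replace each {{column}} token with the row's value for that column.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row[m.group(1)]), template)


rows = [{"user_name": "Alice"}, {"user_name": "xX_42_Xx"}]
batch_prompts = [hydrate(TEMPLATE, row) for row in rows]
```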

The second use case involves online LLM services for customer support ticket summarization. When support contacts are handed off between agents, the system generates summaries so new agents don’t need to review entire ticket histories or ask customers to repeat themselves. This demonstrates a practical application of LLMs to improve operational efficiency.

The online service supports dynamic placeholder substitution using Jinja-based template syntax, with the caller responsible for providing runtime values. The service also supports fan-out capabilities across prompts, templates, and models, allowing for flexible deployment patterns.
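A minimal sketch of Jinja-style substitution with model fan-out, using the `jinja2` library (`pip install jinja2`) to stand in for whatever templating the service uses internally; the `fan_out` helper and model list are illustrative assumptions.

```python
from jinja2 import Template

# Caller supplies runtime values; the template uses Jinja syntax as
# described in the post. The fan-out helper is an assumption.
template = Template("Summarize this support ticket: {{ ticket_text }}")


def fan_out(tmpl: Template, models: list, **runtime_values) -> dict:
    # Render once, then send the same prompt to each configured model.
    prompt = tmpl.render(**runtime_values)
    return {name: llm(prompt) for name, llm in models}


fake_models = [
    ("model_a", lambda p: "short summary"),
    ("model_b", lambda p: "detailed summary"),
]
results = fan_out(template, fake_models, ticket_text="App crashed on login.")
```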

Monitoring and Observability

Production monitoring is treated as a first-class concern in the toolkit. A daily performance monitoring pipeline runs against production traffic to evaluate prompt template performance. Metrics tracked include latency, accuracy, and correctness, among others. Results are displayed in an MES (Machine Learning Experimentation System) dashboard that refreshes daily.

This monitoring approach enables regression detection and continuous quality tracking for each production prompt template iteration. The daily cadence suggests a balance between monitoring freshness and computational overhead, though more latency-sensitive applications might require more frequent monitoring.
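A daily aggregation of the kind described might reduce per-request logs to a dashboard row and flag regressions against a baseline. The record schema, metric names, and tolerance threshold below are assumptions.

```python
from statistics import mean

# Assumed log schema and regression threshold; metric names follow the
# post (latency, accuracy), the rest is illustrative.

def daily_metrics(records: list) -> dict:
    """Aggregate per-request logs into one dashboard row per template."""
    return {
        "avg_latency_ms": mean(r["latency_ms"] for r in records),
        "accuracy": mean(1.0 if r["correct"] else 0.0 for r in records),
        "volume": len(records),
    }


def regressed(today: dict, baseline: dict, tolerance: float = 0.05) -> bool:
    # Flag a regression if accuracy drops more than the tolerance.
    return today["accuracy"] < baseline["accuracy"] - tolerance


logs = [
    {"latency_ms": 120, "correct": True},
    {"latency_ms": 180, "correct": False},
]
row = daily_metrics(logs)
```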

Critical Assessment

While the toolkit represents a thoughtful approach to enterprise LLMOps, several aspects warrant critical consideration. The blog is primarily a technical architecture overview rather than a results-focused case study—there are no quantitative metrics on improved prompt quality, reduced development time, or cost savings. Claims about benefits remain largely theoretical.

The integration with Uber-specific infrastructure (ObjectConfig, MES, internal Langfx service) means the specific implementation isn’t directly transferable to other organizations, though the architectural patterns and lifecycle concepts are broadly applicable.

The safety measures mentioned (hallucination checks, standardized evaluation framework, safety policy) are noted as needs in the introduction but receive limited detail in the technical discussion. Organizations implementing similar systems would need to develop these components more thoroughly.

Future development directions mentioned include integration with online evaluation and RAG for both evaluation and offline generation, suggesting the toolkit is still evolving rather than representing a complete solution.

Conclusion

Uber’s Prompt Engineering Toolkit demonstrates a mature enterprise approach to LLMOps, emphasizing centralization, version control, safe deployment practices, and continuous monitoring. The system bridges development and production stages with appropriate gatekeeping while enabling self-service capabilities for prompt engineers across the organization. While specific outcomes aren’t quantified, the architectural patterns and lifecycle management concepts provide valuable reference points for organizations building similar LLMOps infrastructure.
