This case study covers Lyft's evolution of its ML platform to support generative AI infrastructure, focusing on how the company adapted its existing ML serving stack to handle LLMs and built new components for AI operations. Lyft transitioned from self-hosted models to vendor APIs, implemented a comprehensive evaluation framework, and developed an AI assistants interface, all while maintaining its established ML lifecycle principles. This evolution enabled use cases ranging from customer support automation to internal productivity tools.
This case study comes from a conference talk by Constantine, an engineer at Lyft who has worked on the ML platform and its applications for over four and a half years. The presentation details how Lyft integrated generative AI capabilities into its existing ML infrastructure, treating the effort as an evolution of the platform rather than a ground-up rebuild. Lyft operates a substantial ML footprint: more than 50 engineering teams using models, over 100 GitHub repositories, and more than 1,000 unique models, some handling 10,000+ requests per second. This breadth of ML adoption, which Constantine notes is broader than at many companies of similar size, provided both challenges and opportunities when LLMs gained popularity in 2023.
Lyft’s approach to AI infrastructure was heavily informed by their existing ML platform philosophy. They don’t think of models as something trained once and forgotten, but as entities that exist indefinitely throughout a lifecycle. This lifecycle includes: prototyping in Jupyter notebooks, registering reproducible training jobs, running training in compute environments, deploying trained models, serving in standardized environments, and then monitoring, understanding performance, iterating, and retraining. This same lifecycle thinking became the lens through which they approached AI/LLM infrastructure.
A key design principle that enabled their AI evolution was the concept of a unified Lyft ML model interface. Early on, they recognized that supporting diverse frameworks required a common wrapper interface, which made it much easier to deploy models into unified serving infrastructure. Around 2021, they found that building wrappers for every framework wasn’t scalable, so they adopted a pattern inspired by projects like AWS SageMaker, Seldon, and KServe, allowing developers to bring their own pre- and post-processing code that would run between the company’s ML interface and the trained model.
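The wrapper pattern described above can be sketched roughly as follows. This is a hypothetical reconstruction, not Lyft's actual code: the names `LyftModel` and `WrappedModel` are illustrative, and the pre/post-processing hooks mirror the SageMaker/Seldon/KServe custom-transformer style the talk cites.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Optional


class LyftModel(ABC):
    """Hypothetical unified model interface: every model, regardless of
    framework, is served through the same predict() contract."""

    @abstractmethod
    def predict(self, features: Any) -> Any: ...


class WrappedModel(LyftModel):
    """User-supplied pre/post-processing code runs around the trained
    model, instead of the platform writing a wrapper per framework."""

    def __init__(self, model: Callable,
                 preprocess: Optional[Callable] = None,
                 postprocess: Optional[Callable] = None):
        self.model = model
        self.preprocess = preprocess or (lambda x: x)
        self.postprocess = postprocess or (lambda y: y)

    def predict(self, features: Any) -> Any:
        # pre-process -> model -> post-process, all behind one interface
        return self.postprocess(self.model(self.preprocess(features)))


# Toy example: a "model" that doubles its input, with scaling around it
m = WrappedModel(model=lambda x: 2 * x,
                 preprocess=lambda x: x + 1,
                 postprocess=lambda y: y - 1)
print(m.predict(3))  # (3+1)*2-1 = 7
```

The payoff of this shape is that the serving platform only ever calls `predict()`, so any framework (or, later, any arbitrary code) can sit behind it.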
When LLMs gained popularity in 2023, Lyft’s flexible serving platform enabled them to quickly experiment with self-hosted models. One of their first deployments was Databricks’ Dolly model. However, they quickly discovered that self-hosting wasn’t what most users wanted, and it became clear that Lyft would rely on vendors via API for the bulk of their LLM usage in the foreseeable future.
This led to an interesting architectural decision: Constantine built a prototype where the Lyft ML model interface wrapped a proxy to the OpenAI API, deployed within their model serving system as just another type of Lyft ML model. The key difference was that there was no underlying model binary—the “model” was essentially arbitrary code that proxied requests to external APIs. As Constantine notes somewhat humorously, their previous optimization allowed their platform to run models with arbitrary code around them, but this new optimization discarded the model portion entirely while keeping the code wrapper.
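A minimal sketch of that "model with no model binary" idea, under stated assumptions: `LLMProxyModel` is a hypothetical name, and the transport is injected so the example runs without a real vendor call. In production the transport would be an HTTP client hitting the vendor API.

```python
from typing import Callable, Dict


class LLMProxyModel:
    """Hypothetical sketch of the proxy-as-a-model pattern: predict()
    forwards the request to a vendor API, so the serving platform can
    treat it like any other Lyft ML model."""

    def __init__(self, api_key: str, send: Callable[[Dict, Dict], Dict]):
        self.api_key = api_key  # held server-side, never by clients
        self.send = send        # HTTP transport, injectable for testing

    def predict(self, request: Dict) -> Dict:
        headers = {"Authorization": f"Bearer {self.api_key}"}
        # In a real deployment, standard serving metrics/logging fire here,
        # which is what gives the platform uniform observability.
        return self.send(request, headers)


# Stand-in transport so the sketch runs offline
def fake_send(request: Dict, headers: Dict) -> Dict:
    assert "Authorization" in headers  # key injected server-side
    return {"choices": [{"message": {"content": "hello from the proxy"}}]}


proxy = LLMProxyModel(api_key="sk-internal-only", send=fake_send)
out = proxy.predict({"model": "gpt-4",
                     "messages": [{"role": "user", "content": "hi"}]})
print(out["choices"][0]["message"]["content"])  # hello from the proxy
```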
From a platform standpoint, this proxy approach delivered significant benefits including: standardized observability and operational metrics, security and network infrastructure consistency, simplified model management, and the ability to reason about LLM usage just like any other model in their system.
One of Lyft’s key design decisions was to utilize open-source LLM clients (like the OpenAI Python package) but modify them to interface with their internal proxy. They created wrapper packages that maintained the same interface for constructing requests as the public packages, but overwrote the transport layer to route HTTP requests to their ML serving system, which in turn hosted their proxy.
This dual control over both client-side and server-side code provided significant advantages for building platform features. Concrete benefits included: clients operating without API keys (injected server-side), granular insights about traffic sources (whether from notebooks, laptops, or servers), capturing usernames, service names, and environments, and the flexibility to build additional AI products by modifying either end of the stack. They applied this playbook to more than half a dozen LLM vendors including OpenAI and Anthropic.
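The client-side half of the pattern might look like the sketch below. All names here are assumptions (Lyft's internal packages and proxy URL were not shared): the point is that request construction matches the public client, but the transport routes to the internal proxy with caller metadata attached and no API key.

```python
from typing import Callable, Dict, List

# Assumed internal endpoint; the real URL was not disclosed
INTERNAL_PROXY = "https://ml-serving.internal/llm/openai"


class InternalChatClient:
    """Hypothetical wrapper package: same request-construction surface
    as the public client, but with the transport layer overridden to
    route through the internal proxy."""

    def __init__(self, transport: Callable, source: str,
                 service: str, env: str):
        self.transport = transport
        # granular caller metadata: notebook vs. laptop vs. server, etc.
        self.meta = {"source": source, "service": service, "env": env}

    def chat(self, model: str, messages: List[Dict]) -> Dict:
        payload = {"model": model, "messages": messages,
                   "caller": self.meta}
        # No Authorization header: the key is injected server-side
        return self.transport(f"{INTERNAL_PROXY}/chat/completions", payload)


# Fake transport that echoes what it was asked to send
def fake_transport(url: str, payload: Dict) -> Dict:
    return {"url": url, "caller": payload["caller"]}


client = InternalChatClient(fake_transport, source="notebook",
                            service="eta-research", env="staging")
resp = client.chat("gpt-4", [{"role": "user", "content": "hi"}])
print(resp["caller"]["source"])  # notebook
```

In practice the public OpenAI client already supports pointing at a different base URL and custom HTTP transport, which is what makes this kind of wrapper package feasible without forking the request-building code.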
By early 2024, Lyft was seeing explosive growth in LLM usage. With 100% of traffic going through their proxy, they could see who was using LLMs but lacked tooling to understand how they were being used and whether usage was meaningful. This led to developing an evaluations framework.
Rather than adopting external vendor tooling, Lyft decided to build a lightweight internal evaluation framework that could meet their immediate requirements. They identified three categories of evaluations: online evaluations of inputs (prompts), online evaluations of outputs (responses), and offline quality analysis.
Specific use cases driving these requirements included:
For PII filtering (online input), their security team preferred filtering out personally identifiable information before sending prompts to vendors. In their implementation, when a prompt like “Hello I am Constantine Garski” is received, it gets routed to an internal PII filtering model hosted in Lyft’s infrastructure, which removes PII before the prompt reaches the LLM vendor. The response can optionally have PII reinserted on the return path.
For output guard rails (online output), product teams wanted to ban certain topics or apply response filters.
For quality analysis (offline), product teams deploying LLMs needed to analyze the quality of their applications. The common pattern here was using LLM-as-judge, where another LLM with a tailored prompt evaluates request-response pairs against specific criteria. Examples mentioned include checking whether responses are unhelpful to users or lack information to fully answer inquiries.
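The offline LLM-as-judge pattern can be sketched as follows. The prompt wording, helper names, and `fake_judge` stand-in are illustrative assumptions rather than Lyft's implementation; the real judge would be a vendor LLM called through the proxy.

```python
from typing import Callable, List, Tuple

JUDGE_PROMPT = (
    "You will see a user request and an assistant response.\n"
    "Answer UNHELPFUL if the response does not help the user, "
    "otherwise answer HELPFUL.\n"
)


def judge_pair(request: str, response: str,
               call_llm: Callable[[str], str]) -> bool:
    """Evaluate one request-response pair with a judge LLM, returning
    True if the judge deems the response helpful."""
    prompt = f"{JUDGE_PROMPT}\nRequest: {request}\nResponse: {response}"
    return call_llm(prompt).strip().upper() == "HELPFUL"


# Deterministic stand-in judge so the sketch runs offline
def fake_judge(prompt: str) -> str:
    return "UNHELPFUL" if "I don't know" in prompt else "HELPFUL"


logs: List[Tuple[str, str]] = [
    ("How do I get a receipt?", "Receipts are under Ride History."),
    ("Why was I charged twice?", "I don't know."),
]
scores = [judge_pair(q, a, fake_judge) for q, a in logs]
print(scores)  # [True, False]
```

Run over a sample of logged traffic, aggregate scores like these give product teams the "is this usage meaningful" signal the proxy alone couldn't provide.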
Looking forward, Lyft’s roadmap involves building higher-level interfaces for AI assistants. Their design decision is to create another Lyft ML interface (similar to their model interface) that allows declarative definition of AI applications. This wraps their core LLM functionality, evaluations, proxy, and clients, while adding two key capabilities: knowledge bases and tools.
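A declarative assistant definition in that spirit might look like the following sketch. `AssistantSpec` and its fields are assumed names for illustration; the talk only described the interface at the level of "knowledge bases plus tools."

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AssistantSpec:
    """Hypothetical declarative definition of an AI application,
    wrapping the existing LLM proxy/client/eval stack and adding
    knowledge bases and tools."""
    name: str
    base_model: str
    system_prompt: str
    knowledge_bases: List[str] = field(default_factory=list)
    tools: Dict[str, Callable] = field(default_factory=dict)


support_bot = AssistantSpec(
    name="support-assistant",
    base_model="gpt-4",
    system_prompt="You help riders resolve account issues.",
    knowledge_bases=["help-center-docs"],
    tools={"lookup_ride": lambda ride_id: {"status": "completed"}},
)
print(support_bot.name, list(support_bot.tools))
```

The appeal of a declarative spec is that the platform, not each product team, owns how the spec is turned into serving infrastructure, mirroring how the model interface already works.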
The assistant architecture involves prompts being augmented with relevant knowledge (RAG pattern) to create augmented prompts, along with tool registration that enables LLMs to call tools in a loop. Constantine notes that almost every LLM vendor supports this pattern, as do higher-level libraries like LangChain.
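The augment-then-loop flow can be sketched as below, assuming a toy word-overlap retriever and a scripted stand-in LLM; a real system would use embedding retrieval and the vendor's tool-calling protocol through the proxy.

```python
from typing import Callable, Dict, List


def retrieve(prompt: str, docs: List[str]) -> List[str]:
    """Toy retriever: keep docs sharing any word with the prompt."""
    words = set(prompt.lower().split())
    return [d for d in docs if words & set(d.lower().split())]


def run_assistant(prompt: str, docs: List[str],
                  tools: Dict[str, Callable],
                  call_llm: Callable, max_steps: int = 5) -> str:
    # RAG: augment the prompt with retrieved knowledge
    context = "\n".join(retrieve(prompt, docs))
    messages = [{"role": "user", "content": f"{context}\n\n{prompt}"}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if "tool_call" not in reply:      # model produced a final answer
            return reply["content"]
        call = reply["tool_call"]         # model asked for a tool
        result = tools[call["name"]](**call["args"])
        messages.append({"role": "tool", "name": call["name"],
                         "content": str(result)})
    return "gave up after max_steps"


# Scripted stand-in LLM: first requests a tool, then uses its result
def fake_llm(messages):
    if messages[-1]["role"] == "tool":
        return {"content": f"Your ride is {messages[-1]['content']}."}
    return {"tool_call": {"name": "ride_status",
                          "args": {"ride_id": "r1"}}}


answer = run_assistant("What is my ride status?",
                       docs=["ride status is in the app"],
                       tools={"ride_status": lambda ride_id: "completed"},
                       call_llm=fake_llm)
print(answer)  # Your ride is completed.
```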
Constantine draws interesting parallels between traditional ML model lifecycles and AI assistant lifecycles, observing that several components of the traditional lifecycle become different or less relevant for assistants.
This perspective led to the insight that AI assistants don’t look fundamentally different from ML models when viewed through the right lens, which validated their approach of treating AI as an evolution of their existing platform rather than something entirely separate.
Lyft has deployed LLMs across several use cases, though some details were noted as sensitive and couldn’t be fully shared:
Slack AI Bot: An internal bot that can search over company data. One example discussed was using few-shot prompting to help generate incident reports. When Lyft has incidents (service outages, data drops), they create Slack channels to discuss them and must complete administrative paperwork with sections like initial detection, root cause, remediation, and action items. By providing the Slack bot with examples of well-structured reports, they can generate good first drafts of these documents, expediting the process for developers.
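The few-shot prompting step can be sketched as follows. The example report and template wording are invented for illustration; only the report sections (initial detection, root cause, remediation, action items) come from the talk.

```python
# Hypothetical few-shot examples: well-structured past incident reports
EXAMPLES = [
    "Incident: data pipeline drop\n"
    "Initial detection: alert fired at 02:10\n"
    "Root cause: unannounced schema change\n"
    "Remediation: rollback to previous schema\n"
    "Action items: add schema compatibility tests",
]

TEMPLATE = (
    "Draft an incident report with sections: initial detection, "
    "root cause, remediation, action items. Follow the style of "
    "these examples:\n\n{examples}\n\n"
    "Incident channel transcript:\n{transcript}\n"
)


def build_incident_prompt(transcript: str) -> str:
    """Assemble a few-shot prompt for drafting an incident report."""
    return TEMPLATE.format(examples="\n---\n".join(EXAMPLES),
                           transcript=transcript)


prompt = build_incident_prompt(
    "03:41 pager fired; rides API returning 500s; rolled back v212")
print("Root cause" in prompt and "rolled back v212" in prompt)  # True
```

The prompt would then be sent through the internal LLM proxy, with the model's draft posted back into the incident Slack channel for a developer to edit.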
Customer Support (Flagship Use Case): When a customer support session starts, the first attempt to answer questions uses a RAG-based document search—an LLM plus knowledge base finding relevant documents. If the issue isn’t resolved quickly, human support agents join with better context from the initial AI interaction. This has resulted in faster time to first response and better agent context when human handoff occurs.
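The RAG-first, human-fallback flow described above reduces to a simple control structure, sketched here with assumed function names and stand-in implementations:

```python
from typing import Callable, Dict


def handle_support_session(question: str,
                           answer_with_rag: Callable[[str], str],
                           is_resolved: Callable[[str], bool]) -> Dict:
    """Hypothetical sketch of the support flow: attempt an AI answer
    first; if unresolved, escalate to a human agent who receives the
    AI interaction as context."""
    draft = answer_with_rag(question)
    if is_resolved(draft):
        return {"handled_by": "ai", "answer": draft}
    # Human handoff: the agent joins with better initial context
    return {"handled_by": "human",
            "context": {"question": question, "ai_draft": draft}}


# Illustrative stand-ins for the RAG answerer and resolution check
session = handle_support_session(
    "Where is my receipt?",
    answer_with_rag=lambda q: "Receipts are under Ride History.",
    is_resolved=lambda a: "Receipts" in a,
)
print(session["handled_by"])  # ai
```

The escalation branch is where the reported benefit lives: even when the AI cannot resolve the issue, the human agent starts from the AI draft rather than from scratch.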
Other mentioned use cases include fraud detection and prevention, iterating on performance-review self-reflections, and translation services. The speaker noted that more user-facing products are coming in 2025 but couldn’t share details.
Constantine distilled the evolution of Lyft’s ML platform into an AI platform into three steps: wrapping vendor LLM APIs as just another model behind the existing serving interface, layering an evaluations framework on top of that traffic, and building a higher-level declarative interface for AI assistants.
The theme of expanding model capabilities over time is also relevant—from simple regression models to distributed deep learning to image/text inputs to LLM API proxies to full assistants. The approach suggests a long-term roadmap of supporting more capabilities within their AI container abstraction.
While the presentation provides valuable insights into building LLM infrastructure at scale, some caveats should be noted. The speaker acknowledges that LLM usage growth “tapered off throughout the year” after initial exponential growth in early 2024, suggesting the initial excitement may have exceeded practical adoption. The decision to build custom evaluation tooling rather than use vendors was framed as meeting immediate requirements, but may require ongoing investment to keep pace with rapidly evolving vendor offerings. Additionally, specific metrics around cost savings, latency impacts, or quantified improvements from the customer support use case were not provided, making it difficult to assess the concrete business impact. However, the architectural patterns and lifecycle thinking presented offer practical templates for organizations looking to integrate LLMs into existing ML infrastructure.