This case study covers Lyft's evolution of its ML platform to support generative AI infrastructure, focusing on how the company adapted its existing ML serving stack to handle LLMs and built new components for AI operations. Lyft transitioned from self-hosted models to vendor APIs, implemented a comprehensive evaluation framework, and developed an AI assistants interface, all while maintaining its established ML lifecycle principles. This evolution enabled use cases ranging from customer support automation to internal productivity tools.
This case study comes from a conference talk by Constantine, an engineer at Lyft who has worked on the ML platform and its applications for over four and a half years. The presentation details how Lyft integrated generative AI capabilities into its existing ML infrastructure, treating the effort as an evolution of the platform rather than a ground-up rebuild. Lyft operates a substantial ML footprint: more than 50 engineering teams using models, over 100 GitHub repositories, and more than 1,000 unique models, some handling 10,000+ requests per second. This breadth of ML adoption, which Constantine notes is broader than at many companies of similar size, provided both challenges and opportunities when LLMs gained popularity in 2023.
Lyft’s approach to AI infrastructure was heavily informed by their existing ML platform philosophy. They don’t think of models as something trained once and forgotten, but as entities that exist indefinitely throughout a lifecycle. This lifecycle includes: prototyping in Jupyter notebooks, registering reproducible training jobs, running training in compute environments, deploying trained models, serving in standardized environments, and then monitoring, understanding performance, iterating, and retraining. This same lifecycle thinking became the lens through which they approached AI/LLM infrastructure.
A key design principle that enabled their AI evolution was the concept of a unified Lyft ML model interface. Early on, they recognized that supporting diverse frameworks required a common wrapper interface, which made it much easier to deploy models into unified serving infrastructure. Around 2021, they found that building wrappers for every framework wasn’t scalable, so they adopted a pattern inspired by projects like AWS SageMaker, Seldon, and KServe, allowing developers to bring their own pre- and post-processing code that would run between the company’s ML interface and the trained model.
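The wrapper pattern described above can be sketched roughly as follows. This is a hypothetical reconstruction, not Lyft's actual code: the names `LyftModel` and `WrappedModel` are illustrative, and the pre/post-processing hooks mirror the SageMaker/Seldon/KServe custom-transformer style the talk cites.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Optional


class LyftModel(ABC):
    """Hypothetical unified model interface: every model, regardless of
    framework, is served through the same predict() contract."""

    @abstractmethod
    def predict(self, features: Any) -> Any: ...


class WrappedModel(LyftModel):
    """User-supplied pre/post-processing code runs around the trained
    model, instead of the platform writing a wrapper per framework."""

    def __init__(self, model: Callable,
                 preprocess: Optional[Callable] = None,
                 postprocess: Optional[Callable] = None):
        self.model = model
        self.preprocess = preprocess or (lambda x: x)
        self.postprocess = postprocess or (lambda y: y)

    def predict(self, features: Any) -> Any:
        # pre-process -> model -> post-process, all behind one interface
        return self.postprocess(self.model(self.preprocess(features)))


# Toy example: a "model" that doubles its input, with scaling around it
m = WrappedModel(model=lambda x: 2 * x,
                 preprocess=lambda x: x + 1,
                 postprocess=lambda y: y - 1)
print(m.predict(3))  # (3+1)*2-1 = 7
```

The payoff of this shape is that the serving platform only ever calls `predict()`, so any framework (or, later, any arbitrary code) can sit behind it.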
When LLMs gained popularity in 2023, Lyft’s flexible serving platform enabled them to quickly experiment with self-hosted models. One of their first deployments was Databricks’ Dolly model. However, they quickly discovered that self-hosting wasn’t what most users wanted, and it became clear that Lyft would rely on vendors via API for the bulk of their LLM usage in the foreseeable future.
This led to an interesting architectural decision: Constantine built a prototype where the Lyft ML model interface wrapped a proxy to the OpenAI API, deployed within their model serving system as just another type of Lyft ML model. The key difference was that there was no underlying model binary—the “model” was essentially arbitrary code that proxied requests to external APIs. As Constantine notes somewhat humorously, their previous optimization allowed their platform to run models with arbitrary code around them, but this new optimization discarded the model portion entirely while keeping the code wrapper.
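A minimal sketch of that "model with no model binary" idea, under stated assumptions: `LLMProxyModel` is a hypothetical name, and the transport is injected so the example runs without a real vendor call. In production the transport would be an HTTP client hitting the vendor API.

```python
from typing import Callable, Dict


class LLMProxyModel:
    """Hypothetical sketch of the proxy-as-a-model pattern: predict()
    forwards the request to a vendor API, so the serving platform can
    treat it like any other Lyft ML model."""

    def __init__(self, api_key: str, send: Callable[[Dict, Dict], Dict]):
        self.api_key = api_key  # held server-side, never by clients
        self.send = send        # HTTP transport, injectable for testing

    def predict(self, request: Dict) -> Dict:
        headers = {"Authorization": f"Bearer {self.api_key}"}
        # In a real deployment, standard serving metrics/logging fire here,
        # which is what gives the platform uniform observability.
        return self.send(request, headers)


# Stand-in transport so the sketch runs offline
def fake_send(request: Dict, headers: Dict) -> Dict:
    assert "Authorization" in headers  # key injected server-side
    return {"choices": [{"message": {"content": "hello from the proxy"}}]}


proxy = LLMProxyModel(api_key="sk-internal-only", send=fake_send)
out = proxy.predict({"model": "gpt-4",
                     "messages": [{"role": "user", "content": "hi"}]})
print(out["choices"][0]["message"]["content"])  # hello from the proxy
```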
From a platform standpoint, this proxy approach delivered significant benefits including: standardized observability and operational metrics, security and network infrastructure consistency, simplified model management, and the ability to reason about LLM usage just like any other model in their system.
One of Lyft’s key design decisions was to utilize open-source LLM clients (like the OpenAI Python package) but modify them to interface with their internal proxy. They created wrapper packages that maintained the same interface for constructing requests as the public packages, but overwrote the transport layer to route HTTP requests to their ML serving system, which in turn hosted their proxy.
This dual control over both client-side and server-side code provided significant advantages for building platform features. Concrete benefits included: clients operating without API keys (injected server-side), granular insights about traffic sources (whether from notebooks, laptops, or servers), capturing usernames, service names, and environments, and the flexibility to build additional AI products by modifying either end of the stack. They applied this playbook to more than half a dozen LLM vendors including OpenAI and Anthropic.
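The client-side half of the pattern might look like the sketch below. All names here are assumptions (Lyft's internal packages and proxy URL were not shared): the point is that request construction matches the public client, but the transport routes to the internal proxy with caller metadata attached and no API key.

```python
from typing import Callable, Dict, List

# Assumed internal endpoint; the real URL was not disclosed
INTERNAL_PROXY = "https://ml-serving.internal/llm/openai"


class InternalChatClient:
    """Hypothetical wrapper package: same request-construction surface
    as the public client, but with the transport layer overridden to
    route through the internal proxy."""

    def __init__(self, transport: Callable, source: str,
                 service: str, env: str):
        self.transport = transport
        # granular caller metadata: notebook vs. laptop vs. server, etc.
        self.meta = {"source": source, "service": service, "env": env}

    def chat(self, model: str, messages: List[Dict]) -> Dict:
        payload = {"model": model, "messages": messages,
                   "caller": self.meta}
        # No Authorization header: the key is injected server-side
        return self.transport(f"{INTERNAL_PROXY}/chat/completions", payload)


# Fake transport that echoes what it was asked to send
def fake_transport(url: str, payload: Dict) -> Dict:
    return {"url": url, "caller": payload["caller"]}


client = InternalChatClient(fake_transport, source="notebook",
                            service="eta-research", env="staging")
resp = client.chat("gpt-4", [{"role": "user", "content": "hi"}])
print(resp["caller"]["source"])  # notebook
```

In practice the public OpenAI client already supports pointing at a different base URL and custom HTTP transport, which is what makes this kind of wrapper package feasible without forking the request-building code.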
By early 2024, Lyft was seeing explosive growth in LLM usage. With 100% of traffic going through their proxy, they could see who was using LLMs but lacked tooling to understand how they were being used and whether usage was meaningful. This led to developing an evaluations framework.
Rather than adopting external vendor tooling, Lyft decided to build a lightweight internal evaluation framework that could meet their immediate requirements. They identified three categories of evaluations: online evaluations of inputs (prompts), online evaluations of outputs (responses), and offline quality analysis.
Specific use cases driving these requirements included:
For PII filtering (online input), their security team preferred filtering out personally identifiable information before sending prompts to vendors. In their implementation, when a prompt like “Hello I am Constantine Garski” is received, it gets routed to an internal PII filtering model hosted in Lyft’s infrastructure, which removes PII before the prompt reaches the LLM vendor. The response can optionally have PII reinserted on the return path.
For output guard rails (online output), product teams wanted to ban certain topics or apply response filters.
For quality analysis (offline), product teams deploying LLMs needed to analyze the quality of their applications. The common pattern here was using LLM-as-judge, where another LLM with a tailored prompt evaluates request-response pairs against specific criteria. Examples mentioned include checking whether responses are unhelpful to users or lack information to fully answer inquiries.
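The offline LLM-as-judge pattern can be sketched as follows. The prompt wording, helper names, and `fake_judge` stand-in are illustrative assumptions rather than Lyft's implementation; the real judge would be a vendor LLM called through the proxy.

```python
from typing import Callable, List, Tuple

JUDGE_PROMPT = (
    "You will see a user request and an assistant response.\n"
    "Answer UNHELPFUL if the response does not help the user, "
    "otherwise answer HELPFUL.\n"
)


def judge_pair(request: str, response: str,
               call_llm: Callable[[str], str]) -> bool:
    """Evaluate one request-response pair with a judge LLM, returning
    True if the judge deems the response helpful."""
    prompt = f"{JUDGE_PROMPT}\nRequest: {request}\nResponse: {response}"
    return call_llm(prompt).strip().upper() == "HELPFUL"


# Deterministic stand-in judge so the sketch runs offline
def fake_judge(prompt: str) -> str:
    return "UNHELPFUL" if "I don't know" in prompt else "HELPFUL"


logs: List[Tuple[str, str]] = [
    ("How do I get a receipt?", "Receipts are under Ride History."),
    ("Why was I charged twice?", "I don't know."),
]
scores = [judge_pair(q, a, fake_judge) for q, a in logs]
print(scores)  # [True, False]
```

Run over a sample of logged traffic, aggregate scores like these give product teams the "is this usage meaningful" signal the proxy alone couldn't provide.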
Looking forward, Lyft’s roadmap involves building higher-level interfaces for AI assistants. Their design decision is to create another Lyft ML interface (similar to their model interface) that allows declarative definition of AI applications. This wraps their core LLM functionality, evaluations, proxy, and clients, while adding two key capabilities: knowledge bases and tools.
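A declarative assistant definition in that spirit might look like the following sketch. `AssistantSpec` and its fields are assumed names for illustration; the talk only described the interface at the level of "knowledge bases plus tools."

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AssistantSpec:
    """Hypothetical declarative definition of an AI application,
    wrapping the existing LLM proxy/client/eval stack and adding
    knowledge bases and tools."""
    name: str
    base_model: str
    system_prompt: str
    knowledge_bases: List[str] = field(default_factory=list)
    tools: Dict[str, Callable] = field(default_factory=dict)


support_bot = AssistantSpec(
    name="support-assistant",
    base_model="gpt-4",
    system_prompt="You help riders resolve account issues.",
    knowledge_bases=["help-center-docs"],
    tools={"lookup_ride": lambda ride_id: {"status": "completed"}},
)
print(support_bot.name, list(support_bot.tools))
```

The appeal of a declarative spec is that the platform, not each product team, owns how the spec is turned into serving infrastructure, mirroring how the model interface already works.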
The assistant architecture involves prompts being augmented with relevant knowledge (RAG pattern) to create augmented prompts, along with tool registration that enables LLMs to call tools in a loop. Constantine notes that almost every LLM vendor supports this pattern, as do higher-level libraries like LangChain.
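The augment-then-loop flow can be sketched as below, assuming a toy word-overlap retriever and a scripted stand-in LLM; a real system would use embedding retrieval and the vendor's tool-calling protocol through the proxy.

```python
from typing import Callable, Dict, List


def retrieve(prompt: str, docs: List[str]) -> List[str]:
    """Toy retriever: keep docs sharing any word with the prompt."""
    words = set(prompt.lower().split())
    return [d for d in docs if words & set(d.lower().split())]


def run_assistant(prompt: str, docs: List[str],
                  tools: Dict[str, Callable],
                  call_llm: Callable, max_steps: int = 5) -> str:
    # RAG: augment the prompt with retrieved knowledge
    context = "\n".join(retrieve(prompt, docs))
    messages = [{"role": "user", "content": f"{context}\n\n{prompt}"}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if "tool_call" not in reply:      # model produced a final answer
            return reply["content"]
        call = reply["tool_call"]         # model asked for a tool
        result = tools[call["name"]](**call["args"])
        messages.append({"role": "tool", "name": call["name"],
                         "content": str(result)})
    return "gave up after max_steps"


# Scripted stand-in LLM: first requests a tool, then uses its result
def fake_llm(messages):
    if messages[-1]["role"] == "tool":
        return {"content": f"Your ride is {messages[-1]['content']}."}
    return {"tool_call": {"name": "ride_status",
                          "args": {"ride_id": "r1"}}}


answer = run_assistant("What is my ride status?",
                       docs=["ride status is in the app"],
                       tools={"ride_status": lambda ride_id: "completed"},
                       call_llm=fake_llm)
print(answer)  # Your ride is completed.
```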
Constantine draws interesting parallels between traditional ML model lifecycles and AI assistant lifecycles, observing that several components of the traditional lifecycle become different or less relevant for assistants.
This perspective led to the insight that AI assistants don’t look fundamentally different from ML models when viewed through the right lens, which validated their approach of treating AI as an evolution of their existing platform rather than something entirely separate.
Lyft has deployed LLMs across several use cases, though some details were noted as sensitive and couldn’t be fully shared:
Slack AI Bot: An internal bot that can search over company data. One example discussed was using few-shot prompting to help generate incident reports. When Lyft has incidents (service outages, data drops), they create Slack channels to discuss them and must complete administrative paperwork with sections like initial detection, root cause, remediation, and action items. By providing the Slack bot with examples of well-structured reports, they can generate good first drafts of these documents, expediting the process for developers.
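The few-shot prompting step can be sketched as follows. The example report and template wording are invented for illustration; only the report sections (initial detection, root cause, remediation, action items) come from the talk.

```python
# Hypothetical few-shot examples: well-structured past incident reports
EXAMPLES = [
    "Incident: data pipeline drop\n"
    "Initial detection: alert fired at 02:10\n"
    "Root cause: unannounced schema change\n"
    "Remediation: rollback to previous schema\n"
    "Action items: add schema compatibility tests",
]

TEMPLATE = (
    "Draft an incident report with sections: initial detection, "
    "root cause, remediation, action items. Follow the style of "
    "these examples:\n\n{examples}\n\n"
    "Incident channel transcript:\n{transcript}\n"
)


def build_incident_prompt(transcript: str) -> str:
    """Assemble a few-shot prompt for drafting an incident report."""
    return TEMPLATE.format(examples="\n---\n".join(EXAMPLES),
                           transcript=transcript)


prompt = build_incident_prompt(
    "03:41 pager fired; rides API returning 500s; rolled back v212")
print("Root cause" in prompt and "rolled back v212" in prompt)  # True
```

The prompt would then be sent through the internal LLM proxy, with the model's draft posted back into the incident Slack channel for a developer to edit.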
Customer Support (Flagship Use Case): When a customer support session starts, the first attempt to answer questions uses a RAG-based document search—an LLM plus knowledge base finding relevant documents. If the issue isn’t resolved quickly, human support agents join with better context from the initial AI interaction. This has resulted in faster time to first response and better agent context when human handoff occurs.
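The RAG-first, human-fallback flow described above reduces to a simple control structure, sketched here with assumed function names and stand-in implementations:

```python
from typing import Callable, Dict


def handle_support_session(question: str,
                           answer_with_rag: Callable[[str], str],
                           is_resolved: Callable[[str], bool]) -> Dict:
    """Hypothetical sketch of the support flow: attempt an AI answer
    first; if unresolved, escalate to a human agent who receives the
    AI interaction as context."""
    draft = answer_with_rag(question)
    if is_resolved(draft):
        return {"handled_by": "ai", "answer": draft}
    # Human handoff: the agent joins with better initial context
    return {"handled_by": "human",
            "context": {"question": question, "ai_draft": draft}}


# Illustrative stand-ins for the RAG answerer and resolution check
session = handle_support_session(
    "Where is my receipt?",
    answer_with_rag=lambda q: "Receipts are under Ride History.",
    is_resolved=lambda a: "Receipts" in a,
)
print(session["handled_by"])  # ai
```

The escalation branch is where the reported benefit lives: even when the AI cannot resolve the issue, the human agent starts from the AI draft rather than from scratch.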
Other mentioned use cases include fraud detection and prevention, iterating on performance-review self-reflections, and translation services. The speaker noted that more user-facing products are coming in 2025 but couldn’t share details.
Constantine distilled the evolution of Lyft’s ML platform into an AI platform into three steps: wrapping vendor LLM APIs as just another model behind the existing serving interface, layering an evaluations framework on top of that traffic, and building a higher-level declarative interface for AI assistants.
The theme of expanding model capabilities over time is also relevant—from simple regression models to distributed deep learning to image/text inputs to LLM API proxies to full assistants. The approach suggests a long-term roadmap of supporting more capabilities within their AI container abstraction.
While the presentation provides valuable insights into building LLM infrastructure at scale, some caveats should be noted. The speaker acknowledges that LLM usage growth “tapered off throughout the year” after initial exponential growth in early 2024, suggesting the initial excitement may have exceeded practical adoption. The decision to build custom evaluation tooling rather than use vendors was framed as meeting immediate requirements, but may require ongoing investment to keep pace with rapidly evolving vendor offerings. Additionally, specific metrics around cost savings, latency impacts, or quantified improvements from the customer support use case were not provided, making it difficult to assess the concrete business impact. However, the architectural patterns and lifecycle thinking presented offer practical templates for organizations looking to integrate LLMs into existing ML infrastructure.