DoorDash uses LLMs to enhance their product knowledge graph and search capabilities as they expand into new verticals beyond food delivery. They employ LLM-assisted annotations for attribute extraction, use RAG for generating training data, and implement LLM-based systems for detecting catalog inaccuracies and understanding search intent. The solution includes distributed computing frameworks, model optimization techniques, and careful consideration of latency and throughput requirements for production deployment.
DoorDash, the on-demand delivery platform, has been expanding beyond its core food delivery business into “New Verticals” including groceries, alcohol, and retail. This expansion brought significant ML/AI challenges: managing hundreds of thousands of SKUs across diverse categories, understanding complex user needs, and optimizing a dynamic marketplace. The company presented its approach at the 2024 AI Conference in San Francisco, detailing how it blends traditional ML with Large Language Models to solve these challenges at scale.
This case study is notable because it comes from a major production environment handling real consumer traffic across multiple verticals. The challenges described—cold start problems, annotation costs, catalog quality, and search relevance—are common across e-commerce and marketplace applications. While the source is a company blog post with a recruitment focus, the technical details provided offer genuine insights into LLMOps practices at scale.
One of the key areas where DoorDash applies LLMs is in building and enriching their Product Knowledge Graph. This graph contains structured product information that powers both consumer-facing experiences (helping customers find products) and operational workflows (helping dashers identify items during fulfillment).
Training effective NLP models for attribute extraction traditionally requires large amounts of high-quality labeled data, which is expensive and time-consuming to produce through human annotation. DoorDash addresses this “cold start” problem using LLM-assisted annotation workflows.
The approach reportedly reduces training timelines from weeks to days, though specific metrics on cost savings or quality improvements are not provided in the source material. The technique is particularly valuable when expanding to new product categories where labeled training data doesn’t exist.
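The source does not show the workflow itself, but a minimal sketch of an LLM-assisted annotation loop might look like the following, where the LLM drafts labels cheaply and a human reviewer corrects only what the draft gets wrong. All function names here are hypothetical.

```python
from typing import Callable

def build_training_set(
    products: list[dict],
    draft_label: Callable[[dict], dict],   # in practice, an LLM call
    review: Callable[[dict, dict], dict],  # human-in-the-loop correction
) -> list[dict]:
    """Produce labeled training examples via draft-then-review annotation."""
    training_set = []
    for product in products:
        draft = draft_label(product)    # cheap LLM pre-annotation
        final = review(product, draft)  # human fixes residual errors
        training_set.append({**product, "labels": final})
    return training_set
```

Because the expensive human step only verifies drafts rather than labeling from scratch, this pattern is what typically compresses annotation timelines from weeks to days.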
The attribute extraction model must handle diverse product categories, each with a unique attribute schema; alcohol products, for example, require their own category-specific set of extracted attributes.
This structured extraction powers more intelligent search, recommendations, and filtering capabilities across the platform.
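One common way to keep such extraction trustworthy is to validate the LLM's output against a per-category schema. The sketch below is illustrative only, and the alcohol attribute names in it are assumptions, since the source does not enumerate them.

```python
import json

# Hypothetical per-category schema; the real attribute set is not given
# in the source material.
ALCOHOL_SCHEMA = {"brand", "varietal", "abv", "container_size"}

def parse_attributes(llm_response: str, schema: set[str]) -> tuple[dict, set[str]]:
    """Parse an LLM's JSON response, keeping only schema-approved keys
    and reporting any attributes the model failed to extract."""
    raw = json.loads(llm_response)
    attrs = {k: v for k, v in raw.items() if k in schema}
    missing = schema - attrs.keys()
    return attrs, missing
```

Dropping unexpected keys and flagging missing ones gives downstream search and filtering a predictable structure even when the model's raw output drifts.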
Maintaining catalog accuracy at scale is critical for customer trust. DoorDash uses LLMs to automate detection of catalog inconsistencies, classifying each detected issue into a severity tier.
A P0 issue might be a mismatch between a product’s title and its package image (e.g., wrong flavor shown), requiring immediate correction. P1 issues are addressed promptly, while P2 issues enter a backlog. This prioritization system helps operations teams focus on the most impactful fixes first.
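The source describes the P0/P1/P2 tiers but not the exact mapping from issue type to tier; a minimal triage sketch, with an invented rule table, could look like this:

```python
from enum import Enum

class Priority(Enum):
    P0 = 0  # fix immediately (e.g., title/image mismatch)
    P1 = 1  # fix promptly
    P2 = 2  # backlog

# Hypothetical severity rules, invented for illustration.
SEVERITY_RULES = {
    "title_image_mismatch": Priority.P0,
    "wrong_unit_of_measure": Priority.P1,
    "minor_description_typo": Priority.P2,
}

def triage(issue_type: str) -> Priority:
    """Map a detected catalog issue to a priority tier.
    Unknown issue types fall through to the backlog for human review."""
    return SEVERITY_RULES.get(issue_type, Priority.P2)
```

Defaulting unknown issue types to the lowest tier is a deliberately conservative choice: it avoids paging operations teams on issue classes the rules have never seen.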
While the automation approach is compelling, it’s worth noting that the accuracy of LLM-based detection and classification is not quantified in the source material. Real-world performance would depend heavily on prompt engineering quality and the robustness of the underlying vision-language model capabilities.
Search at DoorDash presents unique challenges due to the multi-intent, multi-entity nature of queries across their marketplace. A search for “apple” could mean fresh fruit from a grocery store, apple juice from a restaurant, or Apple-branded products from retail—the system must disambiguate based on context.
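To make the disambiguation concrete, here is a toy sketch that picks an interpretation of “apple” based on the vertical the user is browsing. The candidate table is invented for illustration; a production system would score interpretations with learned models rather than exact matches.

```python
# Invented candidate interpretations for illustration only.
CANDIDATES = {
    "apple": [
        {"entity": "fresh fruit", "vertical": "grocery"},
        {"entity": "apple juice", "vertical": "restaurant"},
        {"entity": "Apple electronics", "vertical": "retail"},
    ]
}

def disambiguate(query: str, store_vertical: str):
    """Prefer the query interpretation that matches the store context."""
    for cand in CANDIDATES.get(query.lower(), []):
        if cand["vertical"] == store_vertical:
            return cand["entity"]
    return None  # ambiguous: fall back to a general ranking
```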
The search engine is designed to handle this multi-intent, multi-entity complexity across verticals at scale.
Training relevance models traditionally relies on engagement signals, but these can be noisy and sparse for niche or “tail” queries. DoorDash uses LLMs to improve training data quality.
They implement “consensus labeling” with LLMs to ensure precision in their automated labeling process, though specific details on how consensus is achieved (e.g., multiple LLM calls, ensemble approaches) are not elaborated.
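One plausible reading of consensus labeling—sketched here as an assumption, since the mechanism is not elaborated—is majority voting across several labeling calls, keeping a label only when enough of the calls agree:

```python
from collections import Counter

def consensus_label(example: str, labelers, quorum: int):
    """Run several labeling functions (standing in for repeated or varied
    LLM prompts) and keep the majority label only if it reaches quorum.
    Returning None routes the example to human review instead."""
    votes = Counter(fn(example) for fn in labelers)
    label, count = votes.most_common(1)[0]
    return label if count >= quorum else None
```

Raising the quorum trades labeled-data volume for precision, which is exactly the knob a relevance-training pipeline wants on tail queries.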
Search results are personalized based on individual preferences including dietary needs, brand affinities, price sensitivity, and shopping habits. However, the team explicitly addresses the risk of over-personalization by applying guardrails so that personalization does not override baseline relevance.
This balance between personalization and relevance is a common challenge in search systems, and the acknowledgment of this tradeoff reflects mature production thinking.
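A common way to enforce such a guardrail—sketched below with invented weights, not DoorDash's actual scoring formula—is to blend relevance and user affinity but refuse to boost items below a relevance floor:

```python
def final_score(relevance: float, affinity: float,
                w_personal: float = 0.2, min_relevance: float = 0.5) -> float:
    """Blend base relevance with user affinity, capped by a guardrail:
    items below the relevance floor are never boosted by personalization."""
    if relevance < min_relevance:
        return relevance  # guardrail: don't personalize irrelevant results
    return (1 - w_personal) * relevance + w_personal * affinity
```

With these (hypothetical) weights, a clearly relevant item always outranks an irrelevant one no matter how strong the user's affinity for the latter.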
DoorDash mentions leveraging distributed computing frameworks, specifically Ray, to accelerate LLM inference at scale. This suggests they’re running significant LLM workloads that require horizontal scaling.
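The core pattern such frameworks provide is shard-and-scatter batch inference. The sketch below uses only the standard library as a stand-in for Ray (which adds process- and cluster-level scaling on top of the same shape), with a stub in place of the real model call:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(items, size):
    """Split an iterable into fixed-size shards."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def classify_batch(batch):
    """Stand-in for an LLM inference call on one shard of the catalog."""
    return [len(text) for text in batch]

def run_parallel(items, batch_size=2, workers=4):
    """Scatter shards across workers and gather results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(classify_batch, batched(items, batch_size))
    return [r for batch in results for r in batch]
```

In Ray the same structure would use remote tasks over a cluster instead of local threads, which is what makes catalog-scale LLM workloads tractable.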
For domain-specific needs, they employ parameter-efficient fine-tuning techniques such as LoRA (Low-Rank Adaptation).
These techniques allow fine-tuning of large models with reduced computational requirements while maintaining flexibility and scalability.
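The reduction in compute comes from LoRA's core trick: instead of updating a full weight matrix W, it trains two small matrices A (r × d_in) and B (d_out × r), so the adapted layer computes Wx + (α/r)·BAx. A minimal pure-Python illustration:

```python
def matvec(M, v):
    """Matrix-vector product over plain Python lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B A x, where r is the LoRA rank.
    Only A and B are trained; the large base matrix W stays frozen."""
    r = len(A)  # rank = number of rows in A
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]
```

Since B is conventionally initialized to zeros, the adapted model starts out identical to the base model and only drifts as A and B are trained—far fewer parameters than updating W itself.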
To meet real-time latency requirements, DoorDash employs model distillation, compressing large models into smaller ones for online serving.
This creates smaller, more efficient models suitable for online inference without compromising too heavily on performance. The tension between LLM capability and production latency requirements is a core LLMOps challenge, and these approaches represent standard industry practice for addressing it.
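For concreteness, the standard distillation objective trains the small student to match the teacher's temperature-softened output distribution. A self-contained sketch:

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft targets
    (equivalent to KL divergence up to a constant). Temperature > 1
    preserves the teacher's 'dark knowledge' about near-miss classes."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))
```

The loss is minimized exactly when the student reproduces the teacher's distribution, which is what lets a small online model retain most of a large model's behavior.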
Retrieval-Augmented Generation is mentioned as a technique to inject external knowledge into models, enhancing contextual understanding and relevance. While specific implementation details aren’t provided, RAG is used both for generating training annotations and potentially for production inference to ground LLM responses in domain-specific information.
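Since implementation details aren't given, here is a minimal sketch of the retrieval step in such an annotation pipeline, using simple token overlap in place of the embedding search a production system would use:

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank reference documents by token overlap with the query text
    and return the top k. A real system would use vector embeddings."""
    q_tokens = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_tokens & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(product_text: str, docs: list[str]) -> str:
    """Splice retrieved context into a labeling prompt (format is illustrative)."""
    context = "\n".join(retrieve(product_text, docs))
    return f"Context:\n{context}\n\nLabel the attributes of: {product_text}"
```

Grounding the labeling prompt in retrieved domain documents is what lets the LLM annotate categories it was never explicitly trained on.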
DoorDash also outlines several forward-looking initiatives for deepening its use of LLMs.
These aspirations suggest continued investment in LLM capabilities, though they represent future work rather than current production systems.
While the case study provides valuable insights into LLMOps at scale, several caveats should be noted: the source is a company blog post with a recruitment focus, and key metrics such as detection accuracy, cost savings, and labeling quality are not quantified.
That said, the technical approaches described—RAG for data augmentation, LLM-based labeling, model distillation, and fine-tuning with LoRA—represent sound practices for deploying LLMs in production environments. The emphasis on guardrails (for personalization) and priority-based triage (for catalog issues) suggests mature operational thinking about how to integrate LLMs into production workflows safely and effectively.