This case study explores how Prosus builds and deploys AI agents across e-commerce and food delivery businesses serving two billion customers globally. The discussion covers critical lessons learned from deploying conversational agents in production, with a particular focus on context engineering as the most important factor for success—more so than model selection or prompt engineering alone. The team found that successful production deployments require hybrid approaches combining semantic and keyword search, generative UI experiences that mix chat with dynamic visual components, and sophisticated evaluation frameworks. They emphasize that technology has advanced faster than user adoption, leading to failures when pure chatbot interfaces were tested, and success only came through careful UI/UX design, contextual interventions, and extensive testing with both synthetic and real user data.
This case study presents insights from an AI practitioner at Prosus, a global investment firm and technology operator, who has been building production AI agents across multiple verticals including e-commerce and food delivery platforms serving approximately two billion customers worldwide. The discussion focuses on two main classes of agents: productivity tools (like an internal tool called Tokan used by 15,000 employees across finance, design, product management, and engineering) and customer-facing e-commerce agents for online shopping and food ordering.
The speaker works with portfolio companies including OLX (shopping assistant) and food delivery businesses, building agents that help users with complex, ambiguous queries like “I want the latest headphone,” “I’m going for a hiking trip and don’t know what to buy,” or “I want to have a romantic dinner with my wife.” These agents must understand broad user intent, connect to product catalogs, and handle the complexity of real-world e-commerce scenarios. The team is currently reimagining food ordering experiences for the next one to two years, moving beyond simple keyword search to conversational experiences.
The most significant and recurring theme throughout this case study is the emphasis on context engineering over traditional prompt engineering or model selection. The speaker references Andrej Karpathy’s viral tweet advocating for the term “context engineering” and relates it to how data engineering was the unglamorous but essential work underlying data science success—“garbage in, garbage out.”
The practitioner observes that while discussions in the community focus heavily on system prompts, model selection, and tools like MCP (Model Context Protocol), their hard-earned lesson is that context engineering makes the difference between success and failure in production. When comparing two state-of-the-art models (Model A vs Model B), the model with proper context dramatically outperforms the one without, regardless of which specific model is used.
The speaker breaks down context into four essential components:
1. System Prompt: The foundational instructions that everyone discusses, though the speaker notes this gets disproportionate attention relative to its impact.
2. User Message: The dynamic message sent by the user in each interaction.
3. Enterprise Context (The Dirty Data Pipeline): This is described as the most challenging and important component. In real-world e-commerce scenarios, users care about multiple dimensions beyond just product search, such as current promotions and real-time availability.
The core challenge is that enterprise data is messy and scattered across multiple databases. There is no single source of truth that can answer “show me everything on promotion.” The data is distributed, real-time, and difficult to consolidate. The speaker emphasizes that data engineers spend significant time building pipelines to connect these disparate data sources and bring the right context into the prompt at query time. When a user asks “show me sushi on promotion,” the system must kick off data pipelines to retrieve current promotional information and incorporate it into the LLM’s context (see the sketch after the memory discussion below).
4. User History and Memory: This component is critical for creating product stickiness and competitive differentiation. In a crowded market where many companies are building shopping assistants and food ordering agents, the speaker notes they personally have no loyalty to any particular product and switch between ChatGPT and other tools freely. The key differentiator that creates high switching costs is when a product knows the user deeply—their preferences, past orders, browsing history, and conversational context.
The discussion touches on various memory architectures (long-term, short-term, episodic) but emphasizes a pragmatic cold-start solution: leverage existing user data from the current application. For companies like OLX or food delivery platforms, there is already rich data about what users have ordered, browsed, and preferred before any conversational interaction begins. The speaker advises that when launching a new agent, teams should not over-engineer memory systems from day one but should instead use existing behavioral data as initial context. This simple approach “does wonders” and provides a three-month runway while the system begins collecting conversational data and dynamic memory.
The speaker notes that many teams overcomplicate memory from the start when there’s a simpler solution available that allows focus on product-market fit rather than technical optimization.
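Taken together, the four components suggest a simple query-time assembly step. The following is a minimal sketch, not the team's actual code: every helper is a hypothetical stub standing in for a real data pipeline, and the user's initial “memory” is derived from pre-existing order history, exactly as the cold-start advice suggests.

```python
from collections import Counter
from dataclasses import dataclass

SYSTEM_PROMPT = "You are a food-ordering assistant."  # placeholder

# Hypothetical pipeline stubs; in production each would hit a different store.
def fetch_promotions(user_id: str) -> list[str]:
    return ["2-for-1 sushi platter"]             # promotions DB / cache

def fetch_availability(user_id: str) -> list[str]:
    return ["Sushi Bar Kyoto: open, ~25 min"]    # real-time availability feed

def fetch_order_history(user_id: str) -> list[dict]:
    return [{"cuisine": "japanese", "tags": []},
            {"cuisine": "italian", "tags": ["vegetarian"]},
            {"cuisine": "italian", "tags": ["vegetarian"]}]

def seed_memory(orders: list[dict]) -> str:
    """Cold-start 'memory': summarize behavioral data the platform already has."""
    cuisines = [c for c, _ in Counter(o["cuisine"] for o in orders).most_common(3)]
    dietary = sorted({tag for o in orders for tag in o["tags"]})
    return (f"Most-ordered cuisines: {', '.join(cuisines)}. "
            f"Dietary signals: {', '.join(dietary) or 'none observed'}.")

@dataclass
class AgentContext:
    system_prompt: str
    user_message: str
    enterprise_context: str   # output of the "dirty data pipeline"
    user_memory: str          # history-derived context

def build_context(user_id: str, user_message: str) -> AgentContext:
    """Join the scattered enterprise sources into one prompt-ready context."""
    return AgentContext(
        system_prompt=SYSTEM_PROMPT,
        user_message=user_message,
        enterprise_context=(f"Promotions: {fetch_promotions(user_id)}\n"
                            f"Available now: {fetch_availability(user_id)}"),
        user_memory=seed_memory(fetch_order_history(user_id)),
    )

print(build_context("u123", "show me sushi on promotion"))
```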
Search is described as the most fundamental tool for e-commerce and food delivery agents, though it doesn’t apply to all agent types (like agents for suppliers, car dealers, or restaurants). For consumer-facing e-commerce agents, search is the start of the user journey—if search fails, trust is broken immediately, and users will never proceed further in the experience regardless of how good other capabilities are.
Most enterprise search is still keyword-based, which works well for straightforward queries (“burger” → show burger taxonomy results). However, when users interact with conversational agents, especially voice-enabled ones, their queries become fundamentally different and more complex: broad, open-ended requests like “I’m going for a hiking trip and don’t know what to buy” or “a romantic dinner” rather than single keywords.
Such broad, ambiguous queries cannot be handled effectively by keyword search alone. The speaker, who is vegetarian, notes that a keyword search for “vegetarian pizza” only returns items with “vegetarian” explicitly mentioned in titles or descriptions, missing obvious matches like “pizza margherita,” which is vegetarian by nature but not labeled as such.
To address these limitations, the team implements semantic search using embeddings, which can understand that pizza margherita is semantically close to vegetarian even without explicit labeling. However, semantic search also has limitations—it cannot solve inherently ambiguous queries like “romantic dinner” because “romantic” means different things to different people and contexts.
The production solution is a hybrid search system that attempts keyword search first and falls back to semantic search when needed. But this still doesn’t fully solve the problem for the most challenging queries.
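A minimal sketch of this keyword-first, semantic-fallback pattern, with the assumptions made explicit: the naive token match stands in for a production inverted index or BM25, and `embed` is any callable mapping text to a unit-length vector (e.g. a sentence-embedding model).

```python
def keyword_search(query: str, catalog: list[dict]) -> list[dict]:
    """Naive token overlap, standing in for a production inverted index/BM25."""
    terms = set(query.lower().split())
    return [item for item in catalog
            if terms & set(item["title"].lower().split())]

def semantic_search(query: str, catalog: list[dict], embed, k: int = 5) -> list[dict]:
    """Rank by embedding similarity; with unit vectors, dot product = cosine."""
    q = embed(query)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sorted(catalog, key=lambda item: -dot(q, embed(item["title"])))[:k]

def hybrid_search(query: str, catalog: list[dict], embed) -> list[dict]:
    """Keyword first, semantic fallback: 'vegetarian pizza' finds nothing by
    keyword, but embeddings place 'pizza margherita' close to it."""
    return keyword_search(query, catalog) or semantic_search(query, catalog, embed)
```

A production system might blend the two scores rather than hard-switching, but the keyword-first ordering mirrors the fallback behavior described above.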
The team has developed a sophisticated multi-stage search pipeline:
Query Understanding/Personalization/Expansion (Pre-Search): Before search execution, an LLM analyzes the query to understand intent. For “romantic dinner,” the LLM considers user profile data and breaks the abstract concept down into concrete search terms. The speaker humorously notes suggesting “cupcake” as romantic (which drew some mockery), but the principle is that the LLM decomposes ambiguous queries into multiple searchable sub-queries that can be executed against the catalog.
Search Execution: The system runs hybrid keyword and semantic search across the processed queries to retrieve candidate results—potentially thousands of items.
Re-ranking (Post-Search): This step uses another LLM call to re-rank results. While traditional machine learning approaches like LTR (Learning to Rank) are still valuable, the team found they fail on novel query types with rich user context. The LLM-based re-ranking takes the original user query, the thousands of candidate results, and user context to produce a refined set of top results (typically 3-10 items) to present to the user.
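The three stages compose into a single pipeline. The sketch below is illustrative only: `llm` is assumed to be any prompt-to-string chat callable, `hybrid_search` is reused from the previous sketch, and the prompts and JSON contract are invented for the example.

```python
import json

def search_pipeline(user_query: str, user_context: str, llm, catalog, embed,
                    top_k: int = 5) -> list[dict]:
    """Pre-search expansion -> hybrid retrieval -> LLM re-ranking."""
    # 1. Pre-search: decompose an ambiguous query into concrete sub-queries.
    sub_queries = json.loads(llm(
        f"User profile: {user_context}\nQuery: {user_query!r}\n"
        "Return a JSON list of 3-5 concrete catalog search terms."
    ))

    # 2. Search execution: hybrid retrieval over every sub-query, de-duplicated.
    candidates = {item["id"]: item for q in sub_queries
                  for item in hybrid_search(q, catalog, embed)}

    # 3. Post-search: re-rank candidates against the original intent + context.
    ranked_ids = json.loads(llm(
        f"Query: {user_query!r}\nUser context: {user_context}\n"
        f"Candidates: {json.dumps(list(candidates.values()))}\n"
        f"Return a JSON list of the {top_k} best item ids, best first."
    ))
    return [candidates[i] for i in ranked_ids if i in candidates]
```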
The speaker emphasizes that search is difficult, messy, and has “haunted” them in every project. This multi-stage pipeline represents the state of the art in their production systems, and they stress that few people publicly discuss these search challenges despite them being fundamental to e-commerce agent success.
One of the most candid and valuable parts of this case study is the discussion of repeated failures in user adoption and the lessons learned about UI/UX design for AI agents.
The team built what they believed was an excellent shopping assistant—thoroughly tested internally, connected to catalogs, capable of handling complex queries like “furnish my house” with intelligent product recommendations organized by category. The team was excited and confident. They launched it with A/B testing.
The result: “It fell flat on our face. It was terrible.”
The conversion metrics in the A/B test showed the new chatbot experience significantly underperformed the existing UI. Initially, the team thought there must be a data error because the agent was still performing well functionally. The realization was that the problem wasn’t technical capability but user adoption and interface design.
Through extensive user research (the speaker gained “newfound respect for designers and user researchers”), the team identified several key issues:
Friction of New Interfaces: Users are familiar with existing UIs and use them daily. Introducing a completely new interface creates inherent friction. Users will only adopt a new interface if it solves a fundamental problem they’ve struggled with significantly—not for incremental improvements. The value proposition must be immediately obvious within the first 30 seconds.
Lack of Guidance: A blank chatbot interface is inviting but also intimidating. With tools like Alexa, the speaker notes that 8 out of 10 interactions fail because users don’t know the capabilities. When an agent has 20 tools connected behind the scenes, users have no way of discovering what’s possible. Traditional design patterns like onboarding flows, suggested prompts, and tooltips become essential.
Visual Nature of E-commerce: Buying decisions, especially for food and shopping, are highly visual. Users want to scroll, click, swipe, and make decisions based on images. Pure conversation is limiting—an image of food can trigger hunger and purchase intent in ways text cannot.
The most successful approach the team has found is “generative UI”—a hybrid experience that combines conversational interaction with dynamically generated visual interface components.
In this paradigm, the system is multimodal in its input: it tracks both conversational input and user actions on screen (clicks, scrolls, items added to cart). The speaker references the “Jarvis” assistant from Iron Man as the ideal: an agent that watches what you are doing in the environment and responds naturally.
While this creates potential privacy concerns (users worrying about being “watched”), the speaker personally embraces the tradeoff, stating their philosophy is “take my data if you can give me value.” They acknowledge different users have different comfort levels with this approach.
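As a concrete illustration of what multimodal input tracking could look like, here is a minimal sketch of a unified event stream; the event kinds and schema are assumptions for the example, not the team's design.

```python
import time
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class InteractionEvent:
    """One entry in the agent's unified input stream (assumed schema)."""
    kind: Literal["chat", "click", "scroll", "add_to_cart"]
    payload: str                                  # message text or item id
    ts: float = field(default_factory=time.time)

def events_to_context(events: list[InteractionEvent], limit: int = 20) -> str:
    """Render recent conversational *and* on-screen behavior for the prompt,
    so the agent can react to what the user is doing, Jarvis-style."""
    recent = sorted(events, key=lambda e: e.ts)[-limit:]
    return "\n".join(f"[{e.kind}] {e.payload}" for e in recent)
```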
Rather than presenting a chatbot as the universal interface, the team found much better success with contextual, micro-task interventions: small, well-timed agent prompts that appear at the moment they are useful rather than as an always-on chat window.
The speaker compares these contextual interventions to traditional push notifications, noting that if the first few notifications are bad, users silence them or mentally ignore them. The keys are to not overdo it (don’t send 10 messages) and to make each intervention highly personalized using available user data.
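That guidance can be made mechanical with a small gating rule. In this toy sketch, both the relevance threshold and the daily cap are invented parameters standing in for whatever personalization signal a real system would use.

```python
from datetime import datetime, timedelta

MAX_PER_DAY = 2        # assumed cap: "don't send 10 messages"
MIN_RELEVANCE = 0.8    # assumed threshold from a personalization model

def should_intervene(sent_log: list[datetime], relevance: float,
                     now: datetime | None = None) -> bool:
    """Fire a contextual intervention only when it is highly relevant to the
    user's current task and the daily budget is not yet exhausted."""
    now = now or datetime.now()
    sent_today = [t for t in sent_log if now - t < timedelta(days=1)]
    return relevance >= MIN_RELEVANCE and len(sent_today) < MAX_PER_DAY
```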
The speaker makes a striking claim: “If I were a founder, if my system prompt leaks I would not be worried, but if my eval leaks [I would be]. I believe so much in evals. Evals are the real moat of your product, not your system prompt.”
This reflects a deep conviction that systematic evaluation is the differentiator between products that work in production versus those that fail, drawing parallels to the mature software development lifecycle with QA, testing in production, and regression testing.
Pre-Launch (Offline) Evaluation: How do you know the system is good enough before launching?
Post-Launch (Online) Evaluation: Continuous monitoring to detect degradation and handle unexpected user queries.
Mistake #1: Waiting for Real User Data: Many teams wait until after launch to build evaluations because they want real user queries. This is too late—the product may already be failing in production.
Solution: Synthetic Data and Simulation: Start with simple approaches, such as generating synthetic user queries with an LLM and simulating full conversations against the agent before launch.
This allows early identification of failure scenarios and informs thinking about what metrics matter and how to structure LLM-as-judge evaluations.
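A minimal sketch of this approach, assuming `llm` is any prompt-to-text callable and `agent` is the system under test; the persona prompt is purely illustrative.

```python
import json

def generate_synthetic_queries(llm, persona: str, n: int = 20) -> list[str]:
    """Draft realistic user queries with an LLM before any real user arrives."""
    return json.loads(llm(
        f"You are simulating a {persona} using a food-delivery assistant. "
        f"Return a JSON list of {n} realistic queries, ranging from simple "
        '("burger") to broad and ambiguous ("romantic dinner").'
    ))

def simulate(agent, queries: list[str]) -> list[dict]:
    """Run each synthetic query through the agent and keep the transcripts
    for failure review and later LLM-as-judge experiments."""
    return [{"query": q, "response": agent(q)} for q in queries]
```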
Mistake #2: Immediately Jumping to LLM-as-Judge: While LLM-as-judge is popular and relatively easy to implement (take the input, take the agent output, ask an LLM whether it satisfied the user intent), there is often lower-hanging fruit.
Solution: Deterministic Metrics First: Look for objective, deterministic signals, for example whether a conversation ended in a completed order, how many turns it took, or whether any tool call errored.
These deterministic metrics are more reliable than LLM judgments, which can make mistakes. Only after exhausting deterministic metrics should teams move to LLM-as-judge.
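As an illustration, several deterministic signals can be computed straight from conversation logs with no judge model at all. The log schema assumed here (`turns`, `tool_calls`, `order_placed`) is invented for the example.

```python
def deterministic_metrics(conversations: list[dict]) -> dict:
    """Objective per-conversation signals, aggregated; no LLM required."""
    n = len(conversations) or 1
    return {
        "conversion_rate": sum(c["order_placed"] for c in conversations) / n,
        "avg_turns": sum(len(c["turns"]) for c in conversations) / n,
        "tool_error_rate": sum(
            any(t.get("error") for t in c["tool_calls"]) for c in conversations
        ) / n,
    }
```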
The speaker advocates for a hierarchical evaluation strategy rather than immediately diving into complex, multi-dimensional analysis:
Level 1 - High-Level Business Metrics: This first level provides “so much information” and identifies “so many things to fix” that teams often never need to proceed to deeper analysis. It also resonates with business stakeholders, who can understand these metrics without technical knowledge.
Level 2 - Tool-Level Analysis (often unnecessary): The speaker notes they rarely reach this level because Level 1 already reveals enough actionable insights. This reflects the 80/20 (Pareto) principle: two simple evaluations provide 80% of the important information.
One of the most practical and actionable patterns described is the “labeling party”, a recipe the team has used successfully multiple times. The setup: gather a cross-functional group, including non-technical colleagues, and prepare a batch of real or synthetic agent conversations in an easy-to-read annotation interface. The process: the group works through the conversations together in one session, labeling whether each response satisfied the user’s intent. The iteration: feed the findings back into the agent and the evaluation criteria, then run the party again. This approach has multiple benefits: it produces labeled evaluation data quickly, surfaces failure modes that automated metrics miss, and builds a shared, concrete understanding of agent quality across the team.
The speaker emphasizes that how you present conversations for evaluation matters significantly. Tools like LangFuse and LangTrace are excellent for observability but user-unfriendly for non-technical evaluators—they display complex JSONs and nested tool calls that overwhelm business users.
Solution: Build custom annotation tools tailored to each use case using rapid prototyping tools like v0 from Vercel. These custom tools can be built in half a day and dramatically improve the labeling party experience by presenting each conversation in a readable, domain-appropriate format rather than raw JSON.
The speaker notes that “every use case is different” and requires different visualization approaches. For food/shopping conversations, evaluators need to see product images, restaurant availability, and timing context to properly assess whether the agent’s responses were appropriate.
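One way to bridge that gap is a thin transformation layer that flattens raw traces into a transcript evaluators can read at a glance. The span schema below is illustrative and not Langfuse's actual export format.

```python
def flatten_trace(trace: dict) -> str:
    """Render a nested observability trace as a human-readable transcript."""
    lines = []
    for span in trace.get("spans", []):
        if span["type"] == "message":
            lines.append(f"{span['role'].upper()}: {span['content']}")
        elif span["type"] == "tool":
            lines.append(f"  -> tool {span['name']}: {str(span['content'])[:80]}")
    return "\n".join(lines)
```

A custom front end would render the same flattened structure with product images, restaurant availability, and timing context alongside each turn.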
The speaker reflects on how the technology has evolved dramatically over the three years since they started building Tokan (the internal productivity tool). Initially, significant engineering effort went into intent detection and other scaffolding to make LLMs work. Now, “it blows my mind that you now don’t need [these things]—the model itself is so much better.”
This observation highlights that technology has advanced faster than use cases and user adoption. Four years ago, the practitioner felt their ambitions exceeded technological capabilities. Today, they feel “technology is a few steps ahead” and the challenge is figuring out how to apply it effectively.
A recurring theme is the critical importance of designers and user researchers in production AI systems. The speaker gained “newfound respect” for these disciplines after experiencing failures and now considers them essential team members from day one. They emphasize that while technology is one side, “in the end you want to solve user problems,” which requires understanding human behavior, adoption patterns, and interface design.
When putting together projects, the speaker now insists on having designers and user researchers as core team members, not afterthoughts.
Despite the focus on LLMs, the case study acknowledges that traditional machine learning and recommender systems remain relevant. When discussing generative UI and user click behavior, the speaker notes this is “a recommender system problem” and that “the traditional world still applies.” The example of TikTok’s feed and swipe behavior improving recommendations through traditional algorithms illustrates that LLMs augment rather than replace existing ML infrastructure.
Similarly, in search, traditional Learning to Rank (LTR) algorithms are still important, though they now work alongside LLM-based query understanding and re-ranking.
Throughout the discussion, several clear recommendations emerge.

What Works: investing in context engineering and query-time data pipelines; hybrid keyword-plus-semantic search with LLM query expansion and re-ranking; seeding agent memory from existing behavioral data; generative UI that blends conversation with visual components; contextual, personalized micro-task interventions; building evaluations before launch with synthetic data; starting from deterministic metrics and business-level signals; and involving designers and user researchers from day one.

What Doesn’t Work (Anti-Patterns): launching a pure chatbot interface on top of an established UI; relying on keyword search alone for broad, ambiguous queries; over-engineering memory systems before product-market fit; waiting for real user data before building evaluations; jumping straight to LLM-as-judge; and flooding users with unpersonalized interventions.
While not extensively detailed, the discussion mentions using observability tools like LangFuse to capture interactions in production. These tools track the full agent execution flow including tool calls and provide the raw data needed for evaluation. However, as noted, the raw JSON output requires transformation into user-friendly formats for effective human evaluation.
The implication is that production LLM systems at this scale require observability tooling that captures full execution traces, data pipelines that assemble context at query time, human-friendly annotation interfaces built on top of the raw traces, and continuous evaluation both before and after launch.
This case study provides an unusually candid view into the challenges of deploying LLM-based agents at scale in e-commerce and food delivery contexts. The central insight that context engineering is more important than model selection or prompt engineering challenges much of the public discourse around LLM applications.
The repeated emphasis on failures, user adoption challenges, and the gap between technological capability and practical utility offers a valuable counterbalance to the hype often surrounding AI agents. The team’s journey from failed pure-chatbot launches to successful generative UI and contextual interventions illustrates that production success requires deep integration of traditional UX principles, business understanding, and technical sophistication.
The evaluation practices, particularly the labeling party methodology and hierarchical approach starting with business metrics, provide actionable patterns that other teams can adopt. The overarching message is that building production LLM systems is messy, requires cross-functional collaboration, and succeeds through iteration and learning from failures rather than through purely technical optimization.