ZenML

Implementing Product Comparison and Discovery Features with LLMs at Scale

idealo 2023

idealo, a major European price comparison platform, implemented LLM-powered features to enhance product comparison and discovery. They developed two key applications: an intelligent product comparison tool that extracts and compares relevant attributes from extensive product specifications, and a guided product finder that helps users navigate complex product categories. The company focused on using LLMs as language interfaces rather than knowledge bases, relying on proprietary data to prevent hallucinations. They implemented thorough evaluation frameworks and A/B testing to measure business impact.

Industry

E-commerce

Overview

This case study comes from a conference presentation by Christopher, the lead machine learning engineer at idealo, a major European price comparison website operating in six countries with Germany as its largest market. The platform aggregates over 4 million products and 500 million offers from more than 50,000 shops, ranging from major marketplaces like Amazon and eBay to local retailers. The presentation focuses on practical lessons learned from integrating large language models into production product development, offering three key learnings distilled from their experience.

The speaker provides valuable context by positioning generative AI within the technology hype cycle, arguing that as of the presentation, the industry has moved past the peak of inflated expectations and is entering the “trough of disillusionment” where organizations are realizing that creating genuine business value from generative AI is harder than initially anticipated. This honest assessment sets the stage for practical, grounded advice rather than hype-driven recommendations.

Key Learning 1: Focus on User Needs and LLM Strengths

The first and perhaps most fundamental lesson emphasizes the importance of starting with user problems rather than technology capabilities. When ChatGPT emerged, there was significant pressure from management to “do something” with AI, and the obvious answer seemed to be building a shopping chatbot assistant. However, after prototyping and user testing, the team discovered that their users didn’t actually want a chat interface—they didn’t want to type, didn’t want to read lengthy text responses, and had short attention spans that made conversational interfaces suboptimal.

This experience reinforced the importance of mapping LLM capabilities to genuine user needs rather than forcing technology into products, and the team developed a structured approach for identifying such matches.

The emphasis on rapid prototyping is notable. The team used tools like AWS Bedrock’s playground environment for quick experimentation, enabling even product managers to test concepts without coding. Once feasibility was established through quick prototypes, they moved to MVP development and aggressive user testing, including showing prototypes to conference attendees for immediate feedback.

Production Example: Product Comparison Tables

One successful production application addresses the user need to quickly compare similar products. On a product page for an Apple iPad, users face a daunting task: comparing 60-70 product attributes across multiple product versions is cognitively overwhelming and causes users to leave the site. The LLM-powered solution extracts the attributes most relevant to the comparison and surfaces the differences that actually matter for a purchase decision.

This approach leverages the LLM’s language understanding and world knowledge about product preferences without requiring a manually maintained rule-based system that would be difficult to scale and maintain across millions of products.
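The talk does not show idealo's implementation, but the pattern it describes can be sketched as follows: filter the 60-70 attributes down to genuine differences deterministically, then hand the LLM only those values with an instruction to stay grounded in them. The helper names and iPad attribute data below are illustrative assumptions.

```python
# Illustrative sketch only; function names and product data are hypothetical.

def select_differing_attributes(spec_a: dict, spec_b: dict) -> dict:
    """Keep only attributes whose values differ, so the LLM reasons over
    genuine differences instead of 60-70 mostly identical rows."""
    shared = spec_a.keys() & spec_b.keys()
    return {k: (spec_a[k], spec_b[k]) for k in shared if spec_a[k] != spec_b[k]}

def build_comparison_prompt(name_a: str, name_b: str, diffs: dict) -> str:
    """Ground the LLM in the supplied values and forbid outside knowledge."""
    rows = [f"- {k}: {name_a} = {a}, {name_b} = {b}"
            for k, (a, b) in sorted(diffs.items())]
    return ("Using ONLY the attribute values below, pick the differences "
            "that matter most to a shopper and explain each in one short "
            "sentence. Do not add facts that are not listed.\n"
            + "\n".join(rows))

ipad_9 = {"Display": "10.2 inch", "Chip": "A13 Bionic",
          "Connector": "Lightning", "Colors": "2"}
ipad_10 = {"Display": "10.9 inch", "Chip": "A14 Bionic",
           "Connector": "USB-C", "Colors": "4"}

diffs = select_differing_attributes(ipad_9, ipad_10)
prompt = build_comparison_prompt("iPad (9th gen)", "iPad (10th gen)", diffs)
```

Keeping the attribute selection deterministic also makes the step cheap and testable, leaving the LLM responsible only for ranking and explaining the differences in natural language.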

Key Learning 2: Treat LLMs as Language Interfaces, Not Knowledge Databases

This architectural insight is crucial for production reliability. Rather than relying on the LLM’s general world knowledge (which can lead to hallucinations), idealo treats LLMs as language interfaces that operate on their proprietary data. The speaker draws a parallel to ChatGPT’s SearchGPT feature, where the model generates answers based on retrieved web search results rather than its parametric knowledge.

This approach directly addresses trust and hallucination concerns. Users don’t yet trust AI systems for important purchasing decisions where real money is at stake, and convincing hallucinations could severely damage that trust.

Production Example: Guided Product Finder

The team developed a guided shopping experience for category pages (demonstrated with tumble dryers). The problem: users face an overwhelming list of filters (manufacturer, energy efficiency, various specifications) and don’t know how to begin narrowing down results. Paradoxically, idealo has excellent expert content (“Ratgeber”) at the bottom of these pages, but users rarely scroll down to find it.

The solution uses LLMs as an interface to this expert content, creating an interactive question-and-answer flow that guides users through product selection. The prompting architecture chains multiple prompts, each grounded in the expert content rather than in the model's world knowledge.

This RAG-like pattern (using proprietary data as context rather than relying on world knowledge) significantly reduces hallucination risk while leveraging the LLM’s natural language generation capabilities.
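A minimal sketch of this chained-prompt pattern is shown below. The exact prompts, the abridged expert content, and the `llm` callable (any text-completion function) are assumptions, not idealo's actual implementation.

```python
# Sketch of chaining two grounded prompts; all text here is hypothetical.

EXPERT_CONTENT = (
    "Heat-pump dryers are the most energy-efficient but cost more upfront; "
    "condenser dryers are cheaper to buy and need no exhaust hose."
)

def question_prompt(expert_content: str) -> str:
    # Prompt 1: generate a guiding question strictly from the expert content.
    return ("Based ONLY on the expert content below, ask the user one "
            "question (with 2-4 answer options) that best narrows down "
            "tumble dryers.\n\n" + expert_content)

def filter_prompt(expert_content: str, question: str, answer: str) -> str:
    # Prompt 2: translate the user's answer into a concrete catalogue filter.
    return ("Expert content:\n" + expert_content +
            f"\n\nQuestion asked: {question}\nUser answer: {answer}\n"
            "Return the matching product filter, e.g. 'type=heat-pump'.")

def guided_step(llm, expert_content: str, user_answer: str) -> str:
    question = llm(question_prompt(expert_content))
    return llm(filter_prompt(expert_content, question, user_answer))

# A stub LLM shows how the expert content is threaded through both steps:
fake_llm = lambda p: ("type=heat-pump" if "User answer" in p
                      else "Do you prioritize energy efficiency or price?")
result = guided_step(fake_llm, EXPERT_CONTENT, "Low running costs")
```

Because each prompt receives the expert content as context, the model only rephrases and routes information that idealo already vetted, which is what keeps hallucination risk low.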

Key Learning 3: Evaluation is Critical and Often Neglected

The speaker emphasizes that evaluation should be established as a first principle before significant development effort begins, not as an afterthought. This is particularly challenging with generative AI because outputs are often “fuzzy”—how do you measure the quality of a generated question or the relevance of answer options?

The team has found that evaluation consistently becomes the bottleneck when scaling prototypes to production. They recommend investing as much development effort into evaluation frameworks as into the solution itself.

Evaluation Techniques

The presentation outlines several evaluation approaches used at idealo, ranging from manual review of outputs to automated LLM-as-Judge scoring.

The speaker is notably candid about the limitations of LLM-as-Judge approaches, quoting external research that it “works to some degree but is definitely not a silver bullet.” Significant effort is required to tune evaluation prompts to make them genuinely useful.

An example evaluation prompt structure was shared for the product comparison use case, illustrating the inputs the evaluator LLM receives.

Hallucination Testing and Model Selection

During development of the product comparison feature, the team conducted extensive manual review of hundreds of examples to build confidence in hallucination rates. This process also revealed meaningful differences between models—Claude 2 had notable hallucination issues, but Claude 2.1 showed measurable improvement that matched community reports, demonstrating the value of systematic evaluation.

While they acknowledge there’s “no guarantee” of zero hallucinations, the combination of grounding in proprietary data and extensive testing has given them high confidence levels for production deployment.
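Manual review of hundreds of examples can be partially automated with cheap grounding heuristics. The check below (an assumption, not idealo's method) flags any number the model mentions that never occurs in the source attributes, which catches a common class of hallucination before a human ever looks at the example.

```python
import re

# Heuristic grounding check (illustrative, not idealo's actual method):
# flag numbers in the generated text that do not occur in the source data.

def ungrounded_numbers(generated: str, source: str) -> set[str]:
    nums = lambda text: set(re.findall(r"\d+(?:\.\d+)?", text))
    return nums(generated) - nums(source)

source = "Display: 10.9 inch; Weight: 477 g; Battery: 28.6 Wh"
good = "The 10.9 inch display and 477 g weight make it portable."
bad = "Its 12.9 inch display and 10-hour battery stand out."

flagged = ungrounded_numbers(bad, source)  # numbers with no source support
```

A heuristic like this cannot prove an output is hallucination-free, but it cheaply surfaces the suspicious examples so manual review effort concentrates where it matters.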

Production Infrastructure and Practices

Several production-oriented practices emerge from the presentation: rapid experimentation in AWS Bedrock's playground, systematic model selection backed by evaluation (the Claude 2 versus Claude 2.1 comparison above), and A/B testing to tie features to measurable business impact.

Balanced Assessment

The presentation offers a refreshingly honest view of LLM deployment challenges. The speaker explicitly acknowledges that creating business value from generative AI is “a lot harder than a lot of people thought” and positions their learnings as ways to accelerate the journey from hype to productivity.

Key strengths of their approach include the emphasis on user validation before committing to solutions, the pragmatic RAG-like architecture that mitigates hallucination risk, and the acknowledgment that evaluation frameworks require substantial investment.

Areas that could use more detail include specific metrics on business impact (referenced as being measured but not shared), infrastructure costs and latency considerations for production LLM calls, and the team structure and organizational support required for these initiatives.

The case study demonstrates mature LLMOps thinking: focus on measurable user value, architect for reliability by constraining the LLM’s scope, and invest heavily in evaluation infrastructure to enable confident iteration and deployment.
