ZenML

Building and Optimizing a RAG-based Customer Service Chatbot

HDI 2022
HDI, a German insurance company, implemented a RAG-based chatbot system to help customer service agents quickly find and access information across multiple knowledge bases. The system processes complex insurance documents, including tables and multi-column layouts, using various chunking strategies and vector search optimizations. After 120 experiments to optimize performance, the production system now serves 800+ users across multiple business lines, handling 26 queries per second with 88% recall rate and 6ms query latency.

Industry

Insurance

Summary

HDI is a German insurance company that built a production RAG (Retrieval-Augmented Generation) chatbot in collaboration with AWS Professional Services to help customer service agents quickly answer customer queries about insurance coverage. The project addressed a fundamental problem in insurance customer service: when customers call with questions like “Am I insured for this?”, inexperienced agents struggle to find answers quickly because information is scattered across multiple SharePoints, databases, and documents—some exceeding 100 pages in length. Even experienced agents who might have bookmarks or memorized certain resources still face the challenge of navigating through lengthy documents to find specific information.

The solution consolidated all knowledge sources into a unified knowledge base accessible through a natural language chat interface, enabling agents to get precise answers without scrolling through extensive documentation. This case study is particularly valuable because it reflects real-world production learnings from a system that has been live for over a year.

Architecture Overview

The team built a modular RAG architecture on AWS with several key components designed for flexibility and scalability. The architecture follows a typical RAG pattern with distinct ingestion and retrieval pipelines, but with notable customizations. The system was designed with a “Lego block” philosophy, allowing components to be swapped out as needed—for example, switching from Amazon Bedrock to OpenAI or a custom SageMaker endpoint.

Among the key architectural components, the team explicitly chose OpenSearch as the vector store for its scalability (it can store billions of vectors and serve thousands of queries per second), its manageability as an AWS service, and its alignment with their existing AWS infrastructure. They acknowledged, however, that query latency was an area requiring optimization for their specific use case.
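The "Lego block" philosophy can be sketched as components that depend only on interfaces, so a backend can be swapped without touching the pipeline. This is a minimal illustration, not HDI's actual code; the class and method names are hypothetical.

```python
from typing import Protocol


class Generator(Protocol):
    """Interface for the answer-generation 'Lego block'."""

    def generate(self, prompt: str) -> str: ...


class BedrockGenerator:
    """Hypothetical wrapper; in practice this would call the Bedrock API."""

    def generate(self, prompt: str) -> str:
        return f"[bedrock] {prompt}"


class OpenAIGenerator:
    """Hypothetical wrapper for an OpenAI-backed generator."""

    def generate(self, prompt: str) -> str:
        return f"[openai] {prompt}"


def answer(question: str, generator: Generator) -> str:
    # The pipeline depends only on the Generator interface, so Bedrock,
    # OpenAI, or a custom SageMaker endpoint can be dropped in interchangeably.
    return generator.generate(question)
```

The same pattern applies to embedders, retrievers, and rerankers: any component that satisfies the interface can replace another.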

The Experimentation Challenge

One of the most candid and valuable aspects of this case study is the team’s discussion of the overwhelming number of hyperparameters and design decisions in RAG systems. They outlined how initial discussions about prompt engineering, LLM selection, and chunk size quickly expanded to include accuracy metrics, quantization, query rewriting, query expansion, guard rails, few-shot prompting, and evaluation approaches.

The team organized their experimentation across several key areas:

Data Preparation and Ground Truth: Before any optimization could begin, the team needed to establish ground truth datasets. This required engaging domain experts to create question-answer pairs that specified not just what answer was expected, but which documents (and which sections within those documents) contained the relevant information. This was particularly important given that some documents span 300+ pages with relevant information potentially on page 2 and page 300.
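A ground-truth item of the kind described above, recording both the expected answer and where in which documents it lives, might look like the following sketch (field names and values are illustrative assumptions, not HDI's schema):

```python
from dataclasses import dataclass


@dataclass
class GroundTruthItem:
    """One expert-curated QA pair with its evidence locations."""

    question: str
    expected_answer: str
    relevant_documents: list[str]            # which documents hold the answer
    relevant_pages: dict[str, list[int]]     # document -> pages with evidence


item = GroundTruthItem(
    question="Is water damage from a burst pipe covered?",
    expected_answer="Yes, under the household contents policy, subject to the deductible.",
    relevant_documents=["household_terms.pdf"],
    # Evidence can be far apart in one long document, e.g. page 2 and page 300.
    relevant_pages={"household_terms.pdf": [2, 300]},
)
```

Recording evidence at page or section granularity is what later makes passage-level recall measurable, not just document-level recall.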

Chunking Strategy: The team experimented extensively with chunking approaches. They found that German language documents presented unique challenges due to compound words and abbreviations common in German corporate environments. Some embedding models couldn’t handle these linguistic peculiarities effectively. The team used specialized chunking with markdown conversion, preserving document structure including headers (H1, H2, subheadings) and table headers within each chunk to maintain context.
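The idea of preserving the heading hierarchy inside each chunk can be sketched as follows; this is a simplified illustration of the technique, not HDI's parser:

```python
def chunk_markdown(md: str) -> list[str]:
    """Split markdown at headings, prefixing each chunk with its full
    heading path (H1 > H2 > ...) so the chunk keeps its structural context."""
    chunks: list[str] = []
    path: list[str] = []   # current heading hierarchy
    body: list[str] = []   # lines accumulated under the current heading

    def flush() -> None:
        if body:
            chunks.append(" > ".join(path) + "\n" + "\n".join(body))
            body.clear()

    for line in md.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]               # drop deeper/equal headings
            path.append(line.lstrip("# ").strip())
        elif line.strip():
            body.append(line)
    flush()
    return chunks
```

A chunk produced this way carries "Terms > Coverage" style context, so an embedding of the chunk still knows which policy and section it came from.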

Embedding Models: The choice of embedding model was complicated by language requirements. English-focused models like those from Cohere have limitations (e.g., 512 token input limits) that constrain options. The team considered training custom in-house models hosted on SageMaker endpoints to handle German-specific linguistic patterns.

Vector Indexing with HNSW: The team did deep optimization of OpenSearch's HNSW (Hierarchical Navigable Small World) algorithm parameters. They explained the algorithm using a library analogy: finding a book efficiently by following anchor points through levels until reaching the target. They tuned the key HNSW parameters, noting the tradeoffs: higher values improve recall but increase latency and memory usage. They started with defaults that prioritized recall, then adjusted for faster response times as production requirements became clearer.
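In OpenSearch's k-NN plugin, the standard HNSW knobs are `m` (graph out-degree), `ef_construction` (build-time candidate list size), and `ef_search` (query-time candidate list size). The index body below shows where they live; the specific values and field names are illustrative, not the ones HDI settled on.

```python
# Illustrative OpenSearch index body for an HNSW-backed k-NN field.
# Raising ef_search / ef_construction / m improves recall but costs
# query latency (ef_search), indexing time (ef_construction), and
# memory (m), respectively.
index_body = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 256,   # query-time candidate list
        }
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,             # must match the embedding model
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 512,  # build-time candidate list
                        "m": 16,                 # graph out-degree per node
                    },
                },
            }
        }
    },
}
```

This dictionary would be passed to `client.indices.create(...)` with the `opensearch-py` client against a running cluster.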

Experimentation Process and Scale

The team conducted approximately 120 experiments before moving to MVP, a process that took around 3 months. They tracked custom metrics including document recall (which documents are relevant for answering a question) and passage recall (which specific parts within documents contain the answer).

Their experimental results were visualized with candidates for MVP highlighted in green, organized across pre-processing and retriever categories. The team acknowledged that with approximately 30,000+ possible parameter combinations, they had to prioritize based on their use case rather than exhaustive testing. Their approach was to set a threshold (85% recall rate) and move to MVP once that was achieved, then continue iterative improvement with real user feedback.
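The two custom metrics mentioned above are both set-based recall, computed at different granularities. A minimal sketch (item identifiers are made up for illustration):

```python
def recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant items that appear among the retrieved ones."""
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)


# Document recall: did retrieval surface the right documents at all?
doc_recall = recall(["terms.pdf", "faq.pdf"], {"terms.pdf"})

# Passage recall: did it surface the right parts within those documents?
passage_recall = recall(
    ["terms.pdf#page2"],
    {"terms.pdf#page2", "terms.pdf#page300"},
)
```

Averaging these over the expert-built ground-truth set yields the kind of aggregate figure (e.g. the 85% MVP threshold) the team optimized against.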

One key insight from the Q&A session: the team noted that experiments haven’t stopped—they’ve just shifted from large improvements to incremental percentage-point gains after the initial optimization phase.

Document Parsing Challenges

A significant portion of the team's effort went into document parsing, particularly for complex PDFs with tables and multi-column layouts. The team used AWS Textract for layout recognition but acknowledged that table processing remains an unsolved challenge. They discussed several approaches: adding contextual information to each cell, treating cells individually, or summarizing table content.
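The first approach, adding context to each cell, can be sketched as turning every cell into a self-describing snippet that carries its table title, row label, and column header. The exact output format here is an assumption for illustration:

```python
def contextualize_cells(
    headers: list[str], rows: list[list[str]], table_title: str
) -> list[str]:
    """Emit one self-describing text snippet per data cell, so a cell
    remains interpretable after the table is flattened into chunks."""
    snippets = []
    for row in rows:
        row_label = row[0]  # first column treated as the row label
        for header, cell in zip(headers[1:], row[1:]):
            snippets.append(f"{table_title} | {row_label} | {header}: {cell}")
    return snippets


snippets = contextualize_cells(
    headers=["Tariff", "Deductible", "Max coverage"],
    rows=[["Basic", "500 EUR", "100,000 EUR"]],
    table_title="Household contents insurance",
)
```

Each snippet can then be embedded independently, which avoids the problem of a bare "500 EUR" cell losing its meaning once separated from its row and column.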

Hybrid Search and Reranking

The team implemented custom hybrid search combining vector and keyword-based approaches, along with custom Reciprocal Rank Fusion (RRF) for result merging. This was done approximately two years ago when OpenSearch’s native hybrid search support was limited. They noted that newer versions of OpenSearch (2.19+) include improved RRF capabilities that would simplify this today.
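Reciprocal Rank Fusion itself is a small algorithm: each result list contributes a score of 1/(k + rank) per document, and documents are reordered by summed score. A minimal sketch (the constant k = 60 comes from the original RRF paper; the document IDs are made up):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one.

    Each document earns 1/(k + rank) from every list it appears in, so
    documents that rank well in both the vector and keyword results float
    to the top of the fused ranking.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]    # from k-NN search
keyword_hits = ["doc_b", "doc_d", "doc_a"]   # from BM25 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here `doc_b` wins because it ranks highly in both lists, which is exactly the behavior hybrid search is after.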

Beyond hybrid search and result fusion, their retrieval pipeline also includes a reranking stage applied to the fused results.

Production Metrics and Feedback Loop

After over a year in production, the system demonstrates solid performance: it serves 800+ users across multiple business lines, handling 26 queries per second with an 88% recall rate and 6ms query latency.

The feedback loop is a critical production component. Users can provide positive or negative feedback on answers, and negative feedback includes an optional text field for explanations. This feedback feeds back into the experimentation process for continuous improvement.

Lessons Learned and Modern Recommendations

The team offered candid “hot takes” on what they would do differently if starting today:

Start with a Baseline: Use managed services like Amazon Bedrock Knowledge Bases to quickly establish a baseline. These handle ingestion out-of-the-box (files in S3, automatic chunking and embedding), providing a reference point to ensure custom optimizations are actually improvements.

Leverage Modern Parsing Tools: Amazon Bedrock Data Automation can convert PDFs and images to text, summarize content, and help create context-rich chunks—addressing many of the parsing challenges they struggled with manually.

Accelerate Experimentation: The open-source Flowtorch solution (linked in their presentation) can reduce experimentation timelines from months to hours by automating parameter sweeps.

Establish KPIs Early: Align all stakeholders (security, business, technical teams) on expectations and success metrics from the beginning. Create ground truth datasets early in the project.

Broader Impact

The project served as a blueprint for RAG implementations across HDI. Other business lines have adopted the architecture, making modifications as needed (“putting some stuff here and there, removing some stuff”) while maintaining the core modular design. This demonstrates the value of building flexible, well-documented architectures that can be adapted rather than rebuilt for each new use case.

Technical Realism

This case study is notably honest about the complexities of production RAG systems: the speakers acknowledge the overwhelming space of tunable parameters, the still-unsolved problem of table parsing, and the fact that experimentation never really ends after launch.

The modular architecture and the emphasis on continuous improvement through feedback loops reflect mature LLMOps practices that prioritize adaptability and long-term operational sustainability over a polished initial launch.
