Data Engineering Challenges and Best Practices in LLM Production

QuantumBlack 2023

Data engineers from QuantumBlack discuss the evolving landscape of data engineering with the rise of LLMs, highlighting key challenges in handling unstructured data, maintaining data quality, and ensuring privacy. They share experiences dealing with vector databases, data freshness in RAG applications, and implementing proper guardrails when deploying LLM solutions in enterprise settings.

Industry

Consulting

Overview

This case study comes from a podcast discussion featuring data engineering experts from QuantumBlack (part of McKinsey), specifically Anu Adora (Principal Data Engineer with 13 years of experience) and Alice (Associate Partner). The conversation focuses on the often-overlooked but critical role of data engineering in successfully deploying LLMs in production environments, with particular emphasis on financial services and insurance clients.

The core thesis of this discussion is that despite the excitement around LLMs and generative AI, data engineering remains fundamentally important—perhaps more so than ever—and that the transition from MVP/proof-of-concept to production deployments reveals numerous operational challenges that organizations often underestimate.

The Evolving Role of Data Engineering in the LLM Era

The speakers emphasize that data engineering is not being replaced by LLMs but rather is becoming more complex and critical. While there’s been significant hype around technologies like text-to-SQL models that supposedly democratize data access, the reality on the ground is that data engineers’ workloads have increased substantially since the emergence of LLMs. This is because the paradigm has shifted from primarily structured data (where solutions for quality measurement, lineage, and pipeline building were well-established) to unstructured data where many of these problems need to be solved anew.

The conversation draws an interesting historical parallel to the “data lake” era of 2008-2010, when organizations were promised they could store videos, audio, and other unstructured content and derive insights from it. In practice, unstructured data in data lakes often “went to die” because there were no effective tools to process it. Now, with LLMs, that promise is finally being realized—but it comes with new operational challenges.

Key Technical Challenges in LLM Production

Unstructured Data Ingestion and Quality

One of the most significant challenges discussed is ETL for unstructured data. Unlike traditional ETL where you might track event captures from button clicks on a website, ingesting PDFs and other documents requires entirely different quality considerations. The speakers note that there’s a common misconception that you can “just put a PDF in an LLM and get an answer.” In reality, substantial pre-processing is required, including chunking documents appropriately—failure to do so results in prohibitive costs when making LLM API calls.
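To make the chunking step concrete, here is a minimal sketch of splitting extracted document text into overlapping chunks before any LLM call; the chunk size and overlap values are illustrative defaults, not figures from the discussion:

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split extracted document text into fixed-size, overlapping chunks.

    Overlap preserves context that would otherwise be cut mid-sentence
    at chunk boundaries. Parameter values here are illustrative.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` so consecutive chunks share context.
        start = end - overlap
    return chunks
```

Sending each chunk (rather than the whole document) to the model, and retrieving only the relevant chunks at query time, is what keeps per-request token counts and API costs bounded.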

Data quality measurement for unstructured data is fundamentally different from structured data. How do you measure whether a PDF has been correctly parsed? How do you verify that the extracted content is accurate and complete? These are open problems that traditional data quality tools weren’t designed to address.

RAG System Maintenance and Document Versioning

A particularly illustrative example from the discussion involves the challenge of keeping RAG (Retrieval-Augmented Generation) systems up to date. The speakers describe a scenario where an HR chatbot using RAG might retrieve outdated policy information—for instance, referencing an old European vacation policy of 30 days when the company has switched to a different policy. Even if the organization believes they’ve updated their documentation, remnants of old versions can persist in vector databases, creating what the speakers describe as “an absolute landmine.”

This highlights a critical LLMOps consideration: vector databases require active maintenance and versioning strategies. It’s not sufficient to simply embed documents once; organizations need processes for tracking document versions, removing outdated content, and ensuring the retrieval system always surfaces the most current information.
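One way to implement such a versioning strategy is to delete every chunk belonging to a document's earlier versions at re-ingestion time, so stale policy text can never be retrieved. The sketch below uses an in-memory list and keyword matching as stand-ins for a real vector store and embedding similarity search; the class and field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    version: int
    text: str


class VersionedStore:
    """Toy stand-in for a vector store with per-document versioning."""

    def __init__(self) -> None:
        self._chunks: list[Chunk] = []

    def ingest(self, doc_id: str, version: int, texts: list[str]) -> None:
        # Drop ALL chunks from earlier versions of this document first,
        # so remnants of the old policy cannot linger in the index.
        self._chunks = [c for c in self._chunks if c.doc_id != doc_id]
        self._chunks.extend(Chunk(doc_id, version, t) for t in texts)

    def retrieve(self, keyword: str) -> list[Chunk]:
        # Keyword match stands in for embedding similarity search.
        return [c for c in self._chunks if keyword in c.text]
```

In the HR chatbot scenario, re-ingesting the updated vacation policy this way guarantees that only the current version's chunks remain retrievable.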

The Insurance Claim Processing Example

A concrete production case study discussed involves an insurance client using LLMs to help agents process claims. The use case involves large commercial contracts (sometimes hundreds of pages) where the LLM checks whether a claim is covered by the policy. The POC worked well and demonstrated value, but moving to production revealed a critical data quality issue: the contracts were scanned and stored in document repositories, but there was no reliable way to ensure the system was using the latest version of each contract.

The implications are severe—if the system uses a two-year-old contract version where a particular coverage wasn’t included, it could incorrectly deny valid claims. This example illustrates how LLM production deployments must consider the entire data supply chain, not just the model’s capabilities.
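A minimal guard against this failure mode is to resolve each contract to its most recent scan and flag cases where even that scan looks stale. The field names, date semantics, and staleness threshold below are assumptions for illustration, not details from the case study:

```python
from datetime import date


def latest_contract(scans: list[dict], as_of: date,
                    max_age_days: int = 365) -> tuple[dict, bool]:
    """Pick the most recent scan of a contract and flag possible staleness.

    `scans` is a list of records with an `effective_date` field
    (hypothetical schema). Returns (latest_scan, is_stale).
    """
    latest = max(scans, key=lambda s: s["effective_date"])
    is_stale = (as_of - latest["effective_date"]).days > max_age_days
    return latest, is_stale
```

A stale flag would route the claim to a human reviewer rather than letting the LLM answer from a potentially superseded contract version.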

Infrastructure and Architecture Considerations

Vector Databases

The speakers identify vector databases as a critical new component that enterprises need to add to their data stack. Unlike traditional databases, vector databases are specifically designed to store and query embeddings efficiently, which is essential for RAG architectures. The good news, they note, is that many solutions exist either as standalone products or as extensions to existing databases.

LLM Deployment Architectures

Three main architectures are discussed for enterprise LLM deployments:

Data Management Tool Evolution

Traditional data management tools (ETL tools, data catalogs, data quality frameworks) are evolving to support LLM use cases. Examples include data catalogs that can be queried using natural language to generate SQL, and tools that use LLMs to scan documents for PII. The speakers note that their internal tool at QuantumBlack can analyze tables and identify potential data quality issues automatically.

Data Privacy and Security

PII Handling Strategies

The speakers emphasize a “privacy by design” approach: before sending any data to an LLM, organizations should ask whether they actually need to include PII. In 90-95% of cases, the answer is no—data can be anonymized, tokenized, or hashed before being sent to the model. Customer IDs, account numbers, and other sensitive identifiers can often be stripped without affecting the model’s ability to perform its task.
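One simple way to apply this in practice is to replace identifier fields with salted hashes before a record ever reaches the model; because the hash is deterministic, model responses can still be joined back to the original record. This is a sketch with hypothetical field names, not a production anonymization scheme:

```python
import hashlib


def tokenize_pii(record: dict, pii_fields: set[str], salt: str) -> dict:
    """Replace sensitive identifier fields with salted, truncated hashes.

    The model never sees raw customer IDs or account numbers, but the
    deterministic token lets downstream code re-link results to records.
    """
    out = {}
    for key, value in record.items():
        if key in pii_fields:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            out[key] = digest[:16]  # truncated for readability
        else:
            out[key] = value
    return out
```

The salt should be stored separately from the data sent to the model; without it, the tokens cannot be brute-forced back to low-entropy identifiers as easily.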

For the 5% of cases where PII might seem necessary, additional considerations include:

Regulatory Landscape

The European AI Act is mentioned as an emerging regulatory framework that will impose requirements similar to GDPR but specifically for AI applications. Organizations need to prepare for these requirements, which will likely mandate specific guardrails around privacy, quality, and access control.

Trust and Maturity Parallels

The speakers draw an interesting parallel between current concerns about LLM data privacy and historical concerns about cloud computing. Just as organizations were initially hesitant to move data to the cloud but eventually built trust through experience and improved security practices, similar maturity will develop around LLM usage. However, they acknowledge a key difference: with LLMs, there’s an additional layer of uncertainty because we don’t fully understand what happens inside the models themselves.

Production Deployment Strategies

Risk-Value-Feasibility Framework

The speakers advocate for a structured approach to prioritizing LLM use cases along three dimensions: business value, technical feasibility, and risk.

High-value, high-feasibility, low-risk use cases should be prioritized first, allowing organizations to learn and build capabilities before tackling more challenging deployments.

Controlled Rollouts

Several strategies for managing risk in production are discussed:

Three-Layer Quality Framework

For production LLM systems, the speakers recommend quality checks at three levels:

Cost Management

A critical but often overlooked aspect of LLMOps is cost management. The speakers note that while generative AI is often pitched as a cost-reduction technology, organizations need to verify this claim for each use case. LLM API costs can be substantial, especially at scale, and the total cost of ownership includes infrastructure, training, governance, and ongoing maintenance—not just API fees.
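A back-of-the-envelope estimate makes the scale effect concrete; the traffic figures and per-1K-token prices in the example are placeholder assumptions, not quoted rates:

```python
def monthly_api_cost(requests_per_day: float, in_tokens: float,
                     out_tokens: float, price_in_per_1k: float,
                     price_out_per_1k: float) -> float:
    """Rough monthly LLM API spend from per-request token counts.

    Prices are per 1,000 tokens; a 30-day month is assumed. This covers
    API fees only, not infrastructure, governance, or maintenance.
    """
    per_request = (in_tokens / 1000) * price_in_per_1k \
                + (out_tokens / 1000) * price_out_per_1k
    return requests_per_day * 30 * per_request
```

Even modest per-request costs compound quickly: at hypothetical rates, 10,000 requests a day with 2,000 input and 500 output tokens each already lands in the five-figure range per month, before counting the rest of the total cost of ownership.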

LLMs as Tools for Data Engineering

Interestingly, the discussion also covers how LLMs can help data engineers in their work:

However, the speakers caution that LLMs “never take no for an answer”—they will always produce output even when hallucinating, so human oversight remains essential.

Key Takeaways

The overall message is one of cautious optimism: LLMs are powerful tools that can unlock significant value, but production deployment requires careful attention to data engineering fundamentals. Organizations that treat LLMs as magic solutions without addressing data quality, privacy, versioning, and governance will likely face costly failures. Those that approach LLM deployment with the same rigor they would apply to any production data system—plus additional considerations specific to LLMs—will be better positioned for success.
