ZenML

Data Quality Assessment and Enhancement Framework for GenAI Applications

QuantumBlack 2025

QuantumBlack developed AI4DQ Unstructured, a comprehensive toolkit for assessing and improving data quality in generative AI applications. The solution addresses common challenges in unstructured data management by providing document clustering, labeling, and de-duplication workflows. In a case study with an international health organization, the system processed 2.5GB of data, identified over ten high-priority data quality issues, removed 100+ irrelevant documents, and preserved critical information in 5% of policy documents that would have otherwise been lost, leading to a 20% increase in RAG pipeline accuracy.

Industry

Healthcare

Overview

QuantumBlack Labs, the R&D and software development arm of QuantumBlack (AI by McKinsey), has developed AI4DQ Unstructured—a toolkit designed to address data quality challenges specific to unstructured data that underpins generative AI applications. The tool is part of the broader QuantumBlack Horizon product suite, which includes other tools like Kedro (for ML pipeline orchestration) and Brix. This case study presents QuantumBlack’s approach to solving a fundamental but often overlooked challenge in LLMOps: ensuring that the source data feeding into LLM-based applications like RAG pipelines is of sufficient quality.

The motivation behind this tool stems from a common observation: while data quality frameworks for structured data are well-established, unstructured data presents unique challenges that are often overlooked until they cause significant downstream problems. As organizations scale their generative AI initiatives, these data quality issues can manifest as hallucinations, inconsistent outputs, wasted compute resources, and compliance risks from information leakage.

The Problem Space

The case study identifies several categories of data quality issues commonly found in unstructured data that feeds LLM applications, including duplicate and versioned documents that present conflicting information, irrelevant or outdated content that dilutes retrieval quality, and sensitive information at risk of leakage.

The downstream consequences of failing to address these issues include LLM hallucinations, information loss during processing, wasted computational resources, and potential regulatory violations from information leakage.

Technical Solution Architecture

AI4DQ Unstructured employs a three-dimensional approach to data quality assessment, combining automated detection with human-in-the-loop review processes. The toolkit generates data quality scores at the document level, providing practitioners with a quantified view of corpus quality that can guide prioritization and remediation efforts.
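The case study does not publish how document-level scores are computed. A minimal sketch, assuming the score is a weighted combination of simple heuristic penalties (duplication, staleness, and very short content are all plausible signals, but the names and weights here are illustrative):

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str
    is_duplicate: bool
    age_days: int


def quality_score(doc: Document) -> float:
    """Combine heuristic penalties into a 0-1 document quality score.

    Hypothetical weights; the actual AI4DQ scoring method is not
    described in the case study.
    """
    penalty = 0.0
    if doc.is_duplicate:
        penalty += 0.5
    # Older documents are more likely to be stale; cap at one year.
    penalty += 0.3 * min(doc.age_days / 365, 1.0)
    # Very short documents often carry little retrievable content.
    if len(doc.text.split()) < 50:
        penalty += 0.2
    return max(0.0, 1.0 - penalty)
```

A corpus-wide pass over such scores gives the quantified view described above: practitioners can sort documents by score and remediate the worst offenders first.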

Document Clustering and Labeling Workflow

This workflow helps organizations understand the composition and thematic organization of their document corpus. The approach combines traditional NLP techniques with generative AI capabilities:

The system trains custom embeddings on the organization’s specific corpus, moving beyond generic pre-trained embeddings to capture domain-specific semantic relationships. These embeddings are then used to cluster documents based on semantic similarity, grouping related content together regardless of superficial differences in format or structure.
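The embed-then-cluster step can be sketched as follows. TF-IDF vectors stand in for the corpus-specific embeddings the case study describes (in practice a fine-tuned embedding model would replace them), and k-means stands in for whatever clustering algorithm AI4DQ actually uses:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_documents(texts: list[str], n_clusters: int) -> list[int]:
    """Vectorize documents and group them by semantic similarity.

    TF-IDF is a lightweight stand-in for trained embeddings; the
    cluster ids returned here would feed the labeling step.
    """
    embeddings = TfidfVectorizer().fit_transform(texts)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return model.fit_predict(embeddings).tolist()
```

Documents that share vocabulary and topic land in the same cluster even when their formats differ, which is the property the labeling step relies on.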

Each cluster is labeled with a “document type” descriptor that becomes part of the document metadata. Additionally, fine-grained tags are generated for individual documents within each cluster, enriching the metadata available for search and retrieval operations. This metadata can significantly improve the precision of RAG retrieval, as it provides additional signals beyond raw text similarity.

The clustering and labeling can also be performed at the chunk level, allowing for more granular organization of content within large documents. Chunk-level metadata can be cached alongside documents to enable more targeted retrieval, particularly useful when documents cover multiple topics or contain sections with varying relevance to different queries.
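A sketch of attaching cluster-derived metadata at the chunk level. The chunking scheme (fixed word windows) and field names are assumptions for illustration; AI4DQ's actual chunking parameters are not published:

```python
def chunk_with_metadata(doc_id: str, text: str, doc_type: str,
                        tags: list[str], chunk_size: int = 100) -> list[dict]:
    """Split a document into fixed-size word chunks, carrying the
    document-type label and tags on each chunk so retrieval can
    filter on them in addition to raw text similarity."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunks.append({
            "chunk_id": f"{doc_id}-{i // chunk_size}",
            "text": " ".join(words[i:i + chunk_size]),
            "doc_type": doc_type,
            "tags": tags,
        })
    return chunks
```

At query time, a retriever can pre-filter chunks by `doc_type` or `tags` before ranking by embedding similarity, which is the precision gain the paragraph above describes.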

Document De-duplication Workflow

Duplicate and versioned documents represent a significant data quality challenge, as they can cause LLMs to encounter conflicting information or outdated content. The de-duplication workflow employs a multi-step process:

First, the system extracts and creates metadata to describe each document’s content and characteristics. Then, pairwise comparisons identify potential duplicates based on content similarity. These pairwise relationships form a graph where duplicated document sets can be resolved through entity resolution techniques.
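The pairwise-comparison-to-graph step can be sketched with union-find: each high-similarity pair adds an edge, and connected components become duplicate sets. `SequenceMatcher` and the 0.9 threshold are lightweight stand-ins for whatever similarity metric AI4DQ uses in production:

```python
from difflib import SequenceMatcher
from itertools import combinations


def duplicate_groups(docs: dict[str, str], threshold: float = 0.9) -> list[set]:
    """Resolve pairwise similarity edges into duplicate sets using
    union-find, mirroring graph-based entity resolution."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Edge for every pair above the similarity threshold.
    for (id_a, text_a), (id_b, text_b) in combinations(docs.items(), 2):
        if SequenceMatcher(None, text_a, text_b).ratio() >= threshold:
            parent[find(id_a)] = find(id_b)

    # Connected components of size > 1 are candidate duplicate sets.
    groups: dict[str, set] = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), set()).add(doc_id)
    return [g for g in groups.values() if len(g) > 1]
```

Note that exhaustive pairwise comparison is quadratic in corpus size; a production system would typically block or pre-filter candidates first.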

Rather than automatically removing duplicates, the system presents potential duplicates to human reviewers with recommended correction strategies. This human-in-the-loop approach acknowledges that determining the “correct” version often requires domain expertise and judgment that automated systems cannot reliably provide.
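Continuing the sketch, duplicate sets can be turned into a review queue that pairs each set with a recommended action while leaving the final decision to a human. The keep-the-most-recent heuristic is an assumption; the case study does not detail AI4DQ's actual correction strategies:

```python
def build_review_queue(groups: list[set], metadata: dict[str, dict]) -> list[dict]:
    """Turn duplicate groups into human review items, each carrying a
    recommended action (here: keep the most recently modified version,
    an assumed heuristic) and a pending status until a reviewer decides."""
    queue = []
    for group in groups:
        keep = max(group, key=lambda doc_id: metadata[doc_id]["modified"])
        queue.append({
            "candidates": sorted(group),
            "recommendation": f"keep {keep}, archive the rest",
            "status": "pending_review",  # a human makes the final call
        })
    return queue
```

Keeping the system's suggestion separate from the final status is what makes the workflow human-in-the-loop rather than fully automated.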

Reported Impact and Results

QuantumBlack claims several benefits from deploying AI4DQ Unstructured:

The addition of metadata tags on document themes reportedly led to a 20% increase in RAG pipeline accuracy in one project. While this is a significant improvement, it’s worth noting that the specific evaluation methodology and baselines aren’t detailed in the case study, making it difficult to independently verify this claim.

The toolkit is positioned to deliver cost savings by reducing time spent analyzing irrelevant or outdated information, as well as reducing storage and compute overhead. Risk reduction benefits come from avoiding compliance issues related to information leakage and inappropriate data access.

Healthcare Case Study

The most detailed example provided involves an international health organization that wanted to use generative AI to accelerate research and report writing. The organization had 2.5 GB of data across 1,500+ files that needed to be assessed for quality before building an LLM-based application.

AI4DQ Unstructured identified more than ten high-priority data quality issues that were blocking the effectiveness of the proposed gen AI use case. Specific improvements included the removal of more than 100 irrelevant documents from the corpus and the recovery of critical information in roughly 5% of policy documents that would otherwise have been lost.

This case study illustrates the practical value of systematic data quality assessment before deploying LLM applications, though the specific nature of the “high-priority issues” and how they were resolved isn’t detailed.

LLMOps Considerations

This case study highlights an important but often overlooked aspect of LLMOps: data quality management for unstructured data. Several key LLMOps themes emerge:

Pre-deployment data preparation: The emphasis on addressing data quality issues “upstream” before training or deploying gen AI applications reflects a mature understanding that many production issues cannot be easily resolved after the fact. This aligns with traditional ML best practices around data quality, adapted for the unique challenges of unstructured data.

Human-in-the-loop workflows: The toolkit’s design explicitly incorporates human review for critical decisions like duplicate resolution and document categorization validation. This acknowledges the limitations of fully automated approaches and the importance of domain expertise in data quality decisions.

Observability and scoring: The document-level quality scoring mechanism provides practitioners with visibility into data quality across their corpus, enabling data-driven prioritization of remediation efforts.

Integration with broader ML infrastructure: AI4DQ Unstructured is presented as part of the QuantumBlack Horizon product suite, which includes Kedro for pipeline orchestration. This suggests an integrated approach to MLOps and LLMOps tooling.

Critical Assessment

While the case study presents compelling use cases for unstructured data quality management, several aspects warrant scrutiny:

The 20% RAG accuracy improvement claim lacks methodological detail. Without understanding the baseline, evaluation metrics, and test conditions, it’s difficult to assess whether this improvement would generalize to other contexts.

The case study is fundamentally a product marketing piece for QuantumBlack’s AI4DQ product. While the underlying problem—data quality for gen AI applications—is real and significant, the specific value propositions should be evaluated critically against alternative approaches.

The human-in-the-loop design is both a strength and a limitation. While it provides important oversight, it also means the toolkit requires ongoing human effort to operate effectively, which may limit scalability for very large document collections.

Despite these caveats, the case study addresses a genuine gap in LLMOps tooling. Data quality for unstructured data remains an underserved area compared to the attention given to model selection, prompt engineering, and other aspects of LLM application development. The framework presented here—encompassing clustering, labeling, and de-duplication—provides a reasonable starting point for organizations looking to systematically address these challenges.
