ZenML

Data Quality Assessment and Enhancement Framework for GenAI Applications

QuantumBlack 2025

QuantumBlack developed AI4DQ Unstructured, a comprehensive toolkit for assessing and improving data quality in generative AI applications. The solution addresses common challenges in unstructured data management by providing document clustering, labeling, and de-duplication workflows. In a case study with an international health organization, the system processed 2.5GB of data, identified over ten high-priority data quality issues, removed 100+ irrelevant documents, and preserved critical information in 5% of policy documents that would have otherwise been lost, leading to a 20% increase in RAG pipeline accuracy.

Industry

Healthcare

Overview

QuantumBlack Labs, the R&D and software development arm of QuantumBlack (AI by McKinsey), has developed AI4DQ Unstructured—a toolkit designed to address data quality challenges specific to unstructured data that underpins generative AI applications. The tool is part of the broader QuantumBlack Horizon product suite, which includes other tools like Kedro (for ML pipeline orchestration) and Brix. This case study presents QuantumBlack’s approach to solving a fundamental but often overlooked challenge in LLMOps: ensuring that the source data feeding into LLM-based applications like RAG pipelines is of sufficient quality.

The motivation behind this tool stems from a common observation: while data quality frameworks for structured data are well-established, unstructured data presents unique challenges that are often overlooked until they cause significant downstream problems. As organizations scale their generative AI initiatives, these data quality issues can manifest as hallucinations, inconsistent outputs, wasted compute resources, and compliance risks from information leakage.

The Problem Space

The case study identifies several categories of data quality issues commonly found in unstructured data that feeds LLM applications, including duplicate and versioned documents that present conflicting information, irrelevant or outdated content that dilutes retrieval quality, and sensitive information at risk of leakage.

The downstream consequences of failing to address these issues include LLM hallucinations, information loss during processing, wasted computational resources, and potential regulatory violations from information leakage.

Technical Solution Architecture

AI4DQ Unstructured employs a three-dimensional approach to data quality assessment, combining automated detection with human-in-the-loop review processes. The toolkit generates data quality scores at the document level, providing practitioners with a quantified view of corpus quality that can guide prioritization and remediation efforts.
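The case study does not publish how document-level scores are computed. A minimal sketch, assuming the score is a weighted combination of simple heuristic penalties (duplication, staleness, and very short content are all plausible signals, but the names and weights here are illustrative):

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str
    is_duplicate: bool
    age_days: int


def quality_score(doc: Document) -> float:
    """Combine heuristic penalties into a 0-1 document quality score.

    Hypothetical weights; the actual AI4DQ scoring method is not
    described in the case study.
    """
    penalty = 0.0
    if doc.is_duplicate:
        penalty += 0.5
    # Older documents are more likely to be stale; cap at one year.
    penalty += 0.3 * min(doc.age_days / 365, 1.0)
    # Very short documents often carry little retrievable content.
    if len(doc.text.split()) < 50:
        penalty += 0.2
    return max(0.0, 1.0 - penalty)
```

A corpus-wide pass over such scores gives the quantified view described above: practitioners can sort documents by score and remediate the worst offenders first.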

Document Clustering and Labeling Workflow

This workflow helps organizations understand the composition and thematic organization of their document corpus. The approach combines traditional NLP techniques with generative AI capabilities:

The system trains custom embeddings on the organization’s specific corpus, moving beyond generic pre-trained embeddings to capture domain-specific semantic relationships. These embeddings are then used to cluster documents based on semantic similarity, grouping related content together regardless of superficial differences in format or structure.
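The embed-then-cluster step can be sketched as follows. TF-IDF vectors stand in for the corpus-specific embeddings the case study describes (in practice a fine-tuned embedding model would replace them), and k-means stands in for whatever clustering algorithm AI4DQ actually uses:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_documents(texts: list[str], n_clusters: int) -> list[int]:
    """Vectorize documents and group them by semantic similarity.

    TF-IDF is a lightweight stand-in for trained embeddings; the
    cluster ids returned here would feed the labeling step.
    """
    embeddings = TfidfVectorizer().fit_transform(texts)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return model.fit_predict(embeddings).tolist()
```

Documents that share vocabulary and topic land in the same cluster even when their formats differ, which is the property the labeling step relies on.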

Each cluster is labeled with a “document type” descriptor that becomes part of the document metadata. Additionally, fine-grained tags are generated for individual documents within each cluster, enriching the metadata available for search and retrieval operations. This metadata can significantly improve the precision of RAG retrieval, as it provides additional signals beyond raw text similarity.

The clustering and labeling can also be performed at the chunk level, allowing for more granular organization of content within large documents. Chunk-level metadata can be cached alongside documents to enable more targeted retrieval, particularly useful when documents cover multiple topics or contain sections with varying relevance to different queries.
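A sketch of attaching cluster-derived metadata at the chunk level. The chunking scheme (fixed word windows) and field names are assumptions for illustration; AI4DQ's actual chunking parameters are not published:

```python
def chunk_with_metadata(doc_id: str, text: str, doc_type: str,
                        tags: list[str], chunk_size: int = 100) -> list[dict]:
    """Split a document into fixed-size word chunks, carrying the
    document-type label and tags on each chunk so retrieval can
    filter on them in addition to raw text similarity."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunks.append({
            "chunk_id": f"{doc_id}-{i // chunk_size}",
            "text": " ".join(words[i:i + chunk_size]),
            "doc_type": doc_type,
            "tags": tags,
        })
    return chunks
```

At query time, a retriever can pre-filter chunks by `doc_type` or `tags` before ranking by embedding similarity, which is the precision gain the paragraph above describes.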

Document De-duplication Workflow

Duplicate and versioned documents represent a significant data quality challenge, as they can cause LLMs to encounter conflicting information or outdated content. The de-duplication workflow employs a multi-step process:

First, the system extracts and creates metadata to describe each document’s content and characteristics. Then, pairwise comparisons identify potential duplicates based on content similarity. These pairwise relationships form a graph where duplicated document sets can be resolved through entity resolution techniques.
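The pairwise-comparison-to-graph step can be sketched with union-find: each high-similarity pair adds an edge, and connected components become duplicate sets. `SequenceMatcher` and the 0.9 threshold are lightweight stand-ins for whatever similarity metric AI4DQ uses in production:

```python
from difflib import SequenceMatcher
from itertools import combinations


def duplicate_groups(docs: dict[str, str], threshold: float = 0.9) -> list[set]:
    """Resolve pairwise similarity edges into duplicate sets using
    union-find, mirroring graph-based entity resolution."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Edge for every pair above the similarity threshold.
    for (id_a, text_a), (id_b, text_b) in combinations(docs.items(), 2):
        if SequenceMatcher(None, text_a, text_b).ratio() >= threshold:
            parent[find(id_a)] = find(id_b)

    # Connected components of size > 1 are candidate duplicate sets.
    groups: dict[str, set] = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), set()).add(doc_id)
    return [g for g in groups.values() if len(g) > 1]
```

Note that exhaustive pairwise comparison is quadratic in corpus size; a production system would typically block or pre-filter candidates first.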

Rather than automatically removing duplicates, the system presents potential duplicates to human reviewers with recommended correction strategies. This human-in-the-loop approach acknowledges that determining the “correct” version often requires domain expertise and judgment that automated systems cannot reliably provide.
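Continuing the sketch, duplicate sets can be turned into a review queue that pairs each set with a recommended action while leaving the final decision to a human. The keep-the-most-recent heuristic is an assumption; the case study does not detail AI4DQ's actual correction strategies:

```python
def build_review_queue(groups: list[set], metadata: dict[str, dict]) -> list[dict]:
    """Turn duplicate groups into human review items, each carrying a
    recommended action (here: keep the most recently modified version,
    an assumed heuristic) and a pending status until a reviewer decides."""
    queue = []
    for group in groups:
        keep = max(group, key=lambda doc_id: metadata[doc_id]["modified"])
        queue.append({
            "candidates": sorted(group),
            "recommendation": f"keep {keep}, archive the rest",
            "status": "pending_review",  # a human makes the final call
        })
    return queue
```

Keeping the system's suggestion separate from the final status is what makes the workflow human-in-the-loop rather than fully automated.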

Reported Impact and Results

QuantumBlack claims several benefits from deploying AI4DQ Unstructured:

The addition of metadata tags on document themes reportedly led to a 20% increase in RAG pipeline accuracy in one project. While this is a significant improvement, it’s worth noting that the specific evaluation methodology and baselines aren’t detailed in the case study, making it difficult to independently verify this claim.

The toolkit is positioned to deliver cost savings by reducing time spent analyzing irrelevant or outdated information, as well as reducing storage and compute overhead. Risk reduction benefits come from avoiding compliance issues related to information leakage and inappropriate data access.

Healthcare Case Study

The most detailed example provided involves an international health organization that wanted to use generative AI to accelerate research and report writing. The organization had 2.5 GB of data across 1,500+ files that needed to be assessed for quality before building an LLM-based application.

AI4DQ Unstructured identified more than ten high-priority data quality issues that were blocking the effectiveness of the proposed gen AI use case. Specific improvements included the removal of more than 100 irrelevant documents from the corpus and the recovery of critical information in roughly 5% of policy documents that would otherwise have been lost.

This case study illustrates the practical value of systematic data quality assessment before deploying LLM applications, though the specific nature of the “high-priority issues” and how they were resolved isn’t detailed.

LLMOps Considerations

This case study highlights an important but often overlooked aspect of LLMOps: data quality management for unstructured data. Several key LLMOps themes emerge:

Pre-deployment data preparation: The emphasis on addressing data quality issues “upstream” before training or deploying gen AI applications reflects a mature understanding that many production issues cannot be easily resolved after the fact. This aligns with traditional ML best practices around data quality, adapted for the unique challenges of unstructured data.

Human-in-the-loop workflows: The toolkit’s design explicitly incorporates human review for critical decisions like duplicate resolution and document categorization validation. This acknowledges the limitations of fully automated approaches and the importance of domain expertise in data quality decisions.

Observability and scoring: The document-level quality scoring mechanism provides practitioners with visibility into data quality across their corpus, enabling data-driven prioritization of remediation efforts.

Integration with broader ML infrastructure: AI4DQ Unstructured is presented as part of the QuantumBlack Horizon product suite, which includes Kedro for pipeline orchestration. This suggests an integrated approach to MLOps and LLMOps tooling.

Critical Assessment

While the case study presents compelling use cases for unstructured data quality management, several aspects warrant scrutiny:

The 20% RAG accuracy improvement claim lacks methodological detail. Without understanding the baseline, evaluation metrics, and test conditions, it’s difficult to assess whether this improvement would generalize to other contexts.

The case study is fundamentally a product marketing piece for QuantumBlack’s AI4DQ product. While the underlying problem—data quality for gen AI applications—is real and significant, the specific value propositions should be evaluated critically against alternative approaches.

The human-in-the-loop design is both a strength and a limitation. While it provides important oversight, it also means the toolkit requires ongoing human effort to operate effectively, which may limit scalability for very large document collections.

Despite these caveats, the case study addresses a genuine gap in LLMOps tooling. Data quality for unstructured data remains an underserved area compared to the attention given to model selection, prompt engineering, and other aspects of LLM application development. The framework presented here—encompassing clustering, labeling, and de-duplication—provides a reasonable starting point for organizations looking to systematically address these challenges.
