ZenML

Productionizing LLM-Powered Data Governance with LangChain and LangSmith

Grab 2024

Grab enhanced its LLM-powered data governance system (Metasense V2) by improving model performance and operational efficiency. The team tackled challenges in data classification by splitting complex tasks, optimizing prompts, and adopting the LangChain and LangSmith frameworks. These improvements reduced misclassification rates, improved collaboration between teams, and streamlined prompt experimentation and deployment while maintaining robust monitoring and safety measures.

Industry: Tech

Overview

Grab, the leading superapp platform in Southeast Asia providing ride-hailing, food delivery, and various on-demand services, developed an LLM-powered system called Metasense to automate data governance and classification across their enterprise data lake. This case study documents the second iteration of the system (Metasense V2), focusing on improvements made after the initial rollout and the productionisation journey of an LLM-based classification system at enterprise scale.

The core problem Grab faced was the challenge of classifying data entities for governance purposes. Their internal metadata generation service, Gemini, relied on third-party data classification services that had restrictions on customisation and required significant resources to train custom models. The LLM-based approach offered a more affordable and flexible alternative that could scale across the organization.

Initial System and Scale

The first version of the system launched in early 2024 and initially scanned more than 20,000 data entries at an average rate of 300-400 entities per day. The system performed column-level tag classifications which, when combined with Grab’s data privacy rules, determined the sensitivity tiers of data entities. Since launch, the model has grown to cover the vast majority of Grab’s data lake tables, significantly reducing manual classification workload.
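The tag-to-tier logic described above can be sketched as a simple lookup: each column carries a tag, and the table inherits the tier of its most sensitive column. The tag names and tier numbers below are illustrative assumptions, not Grab's actual taxonomy.

```python
# Hypothetical sketch: deriving a table's sensitivity tier from
# column-level tags plus privacy rules. Tier 1 is most sensitive.
TAG_TIER = {
    "pii_phone": 1,
    "pii_email": 1,
    "pii_name": 2,
    "internal_metric": 3,
    "public": 4,
}

def table_sensitivity(column_tags: dict[str, str]) -> int:
    """A table inherits the tier of its most sensitive column;
    unknown tags default to the least sensitive tier."""
    return min(TAG_TIER.get(tag, 4) for tag in column_tags.values())

columns = {"user_phone": "pii_phone", "order_count": "internal_metric"}
tier = table_sensitivity(columns)  # the phone column drives the tier
```

In a real governance pipeline the privacy rules would be richer than a flat lookup, but the min-over-columns pattern captures how a single sensitive column elevates the whole table.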

It’s worth noting that despite automation, the data pipeline still requires human verification from data owners to prevent misclassifications. This is a practical acknowledgment that critical ML workflows, especially in governance contexts, cannot entirely eliminate human oversight—a mature perspective on LLMOps that balances efficiency with risk management.

Model Improvement Through Prompt Engineering

After deployment, the team accumulated substantial feedback from table owners and combined this with manual classifications from the Data Governance Office to create training and testing datasets for model improvements. This post-deployment data collection strategy is a key LLMOps practice that enables continuous model refinement.

The team identified several challenging edge cases that the initial model struggled with.

The team hypothesized that the core issue was model capacity—when given high volumes of classification tasks simultaneously, the model’s effectiveness degraded. The original prompt required the model to distinguish between 21 tags, with 13 of them aimed at differentiating types of non-PII data, which distracted from the primary task of identifying PII.

To address these capacity issues, the team implemented several prompt engineering strategies, notably splitting the classification into smaller subtasks and shortening the prompt by reducing the number of tags the model must weigh at once.

These improvements demonstrate practical prompt engineering techniques for production LLM systems, particularly the recognition that LLMs have limited attention capacity and that thoughtful task design can significantly improve accuracy.
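The task-splitting idea can be illustrated with a two-stage routing sketch: a short, focused PII check first, and a separate non-PII differentiation task only for columns that pass it. The tag lists are hypothetical, and `classify()` is a stub standing in for an LLM call.

```python
# Illustrative sketch: instead of one prompt choosing among all 21 tags,
# stage 1 decides PII vs non-PII with a short tag list, and stage 2
# handles non-PII differentiation as a separate, smaller task.
PII_TAGS = ["pii_name", "pii_email", "pii_phone"]      # hypothetical
NON_PII_TAGS = ["metric", "timestamp", "identifier"]   # hypothetical

def classify(column: str, candidate_tags: list[str]) -> str:
    """Stub for an LLM call; a real system would prompt the model with
    only `candidate_tags`, keeping each prompt short and focused."""
    return next((t for t in candidate_tags if t.split("_")[-1] in column),
                candidate_tags[-1])

def tag_column(column: str) -> str:
    # Stage 1: small, focused PII check.
    pii_tag = classify(column, PII_TAGS + ["not_pii"])
    if pii_tag != "not_pii":
        return pii_tag
    # Stage 2: non-PII differentiation happens in a separate task.
    return classify(column, NON_PII_TAGS)
```

The point of the structure, per the case study, is that each model invocation sees far fewer candidate tags, which keeps its attention on the primary task of spotting PII.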

Tooling and Infrastructure: LangChain and LangSmith Adoption

A significant aspect of the Metasense V2 productionisation was the adoption of LangChain and LangSmith frameworks. This upgrade was motivated by the need to enable rapid experimentation with prompt versions and facilitate collaboration among a diverse team of data scientists and engineers.

LangChain was adopted to streamline the process from raw input to desired outcome by chaining interoperable components. The new backend leverages LangChain to construct an updated model supporting both PII and non-PII classification tasks.
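The "chaining interoperable components" pattern can be shown without the library itself: a pipeline of prompt formatting, a model call, and output parsing, composed into one callable. LangChain expresses this as `prompt | llm | parser`; the plain-function version below is a dependency-free stand-in with a stubbed model.

```python
# Minimal stand-in for LangChain-style composition: raw input flows
# through small interoperable steps to a clean, parsed outcome.
def make_prompt(column: str) -> str:
    return f"Classify the column '{column}' as PII or non-PII."

def fake_llm(prompt: str) -> str:
    # Stub for the model; returns a raw, slightly noisy completion.
    return "  Answer: PII  " if "email" in prompt else "  Answer: non-PII  "

def parse(raw: str) -> str:
    return raw.strip().removeprefix("Answer:").strip()

def chain(*steps):
    """Compose steps left to right, mimicking LangChain's pipe operator."""
    def run(x):
        for step in steps:
            x = step(x)
        return x
    return run

classify_column = chain(make_prompt, fake_llm, parse)
```

Because each stage is swappable, the same pipeline shape can serve both the PII and non-PII classification tasks by substituting a different prompt component, which is the flexibility the team cites.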

LangSmith serves as the unified DevOps platform for the LLM workflow, enabling collaboration among product managers, data scientists, and software engineers while streamlining prompt experimentation, versioning, and deployment.

This tooling choice reflects a broader LLMOps trend toward specialized platforms that bridge the gap between experimentation and production deployment.

Monitoring and Quality Assurance

The case study emphasizes ongoing quality assurance as a critical component of production LLM systems. Despite achieving “exceptionally low misclassification rates,” the team maintains safety measures such as continued human verification by data owners and proactive monitoring for quality degradation over time.

This approach demonstrates mature LLMOps thinking—acknowledging that even high-performing models can degrade over time and that proactive monitoring is essential for maintaining quality in perpetuity.
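One concrete form such proactive monitoring can take is periodically sampling recent model classifications for human re-verification and flagging drift when the disagreement rate crosses a threshold. The field names and the 2% threshold below are illustrative assumptions, not details from the case study.

```python
import random

# Hedged sketch of drift monitoring via sampled human review.
def sample_for_review(records: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample of classifications for human audit."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

def disagreement_rate(reviewed: list[dict]) -> float:
    """Fraction of audited records where the human overturned the model."""
    flips = sum(1 for r in reviewed if r["model_tag"] != r["human_tag"])
    return flips / len(reviewed)

reviewed = [
    {"model_tag": "pii_email", "human_tag": "pii_email"},
    {"model_tag": "not_pii",  "human_tag": "pii_phone"},  # a miss
]
needs_attention = disagreement_rate(reviewed) > 0.02  # illustrative threshold
```

Tracking this rate over time is what lets a team notice gradual degradation before it shows up as a governance incident.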

Practical Considerations and Balanced Assessment

While the case study presents largely positive outcomes, there are several practical considerations worth noting:

The system still requires human verification, indicating that full automation of sensitive classification tasks remains challenging even with LLM improvements. This is a realistic acknowledgment that LLMs, while powerful, are not infallible for critical governance decisions.

The specific misclassification rates and accuracy improvements are not quantified in the case study, making it difficult to assess the precise impact of the improvements. Terms like “exceptionally low misclassification rates” are somewhat vague.

The adoption of LangChain and LangSmith represents a strategic choice toward third-party tooling for LLMOps, which offers benefits in collaboration and rapid deployment but also introduces dependencies on external platforms.

Splitting complex classification tasks and reducing prompt length are practical techniques that other organizations can apply. The recognition that LLM attention capacity is limited and must be managed through thoughtful task design is a valuable insight for production LLM systems.

Key LLMOps Lessons

The Metasense V2 case study offers several transferable lessons for LLMOps practitioners: collect post-deployment feedback to drive continuous refinement, split complex tasks to respect the model’s limited attention capacity, keep humans in the loop for high-stakes governance decisions, and adopt tooling that bridges experimentation and production.

The system represents a practical example of LLM-powered automation at enterprise scale, with thoughtful attention to the operational challenges of deploying and maintaining LLM systems in production environments.
