ZenML

LLMOps Tag: data_cleaning

81 tools with this tag

← Back to LLMOps Database

Common industries

View all industries →

Accelerating Drug Development with AI-Powered Clinical Trial Transformation

Novartis

Novartis partnered with AWS Professional Services and Accenture to modernize their drug development infrastructure and integrate AI across clinical trials with the ambitious goal of reducing trial development cycles by at least six months. The initiative involved building a next-generation GXP-compliant data platform on AWS that consolidates fragmented data from multiple domains, implements data mesh architecture with self-service capabilities, and enables AI use cases including protocol generation and an intelligent decision system (digital twin). Early results from the patient safety domain showed 72% query speed improvements, 60% storage cost reduction, and 160+ hours of manual work eliminated. The protocol generation use case achieved 83-87% acceleration in producing compliant protocols, demonstrating significant progress toward their goal of bringing life-saving medicines to patients faster.

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

AI Agent for Automated Root Cause Analysis in Production Systems

Cleric

Cleric developed an AI agent system to automatically diagnose and root cause production alerts by analyzing observability data, logs, and system metrics. The agent operates asynchronously, investigating alerts when they fire in systems like PagerDuty or Slack, planning and executing diagnostic tasks through API calls, and reasoning about findings to distill information into actionable root causes. The system faces significant challenges around ground truth validation, user feedback loops, and the need to minimize human intervention while maintaining high accuracy across diverse infrastructure environments.

AI Agent Solutions for Data Warehouse Access and Security

Meta

Meta developed a multi-agent system to address the growing complexity of data warehouse access management at scale. The solution employs specialized AI agents that assist data users in obtaining access to warehouse data while helping data owners manage security and access requests. The system includes data-user agents with three sub-agents for suggesting alternatives, facilitating low-risk exploration, and crafting permission requests, alongside data-owner agents that handle security operations and access management. Key innovations include partial data preview capabilities with context-aware access control, query-level granular permissions, data-access budgeting, and rule-based risk management, all supported by comprehensive evaluation frameworks and feedback loops.

AI Agents for Data Labeling and Infrastructure Maintenance at Scale

Plaid

Plaid, a financial data connectivity platform, developed two internal AI agents to address operational challenges at scale. The AI Annotator agent automates the labeling of financial transaction data for machine learning model training, achieving over 95% human alignment while dramatically reducing annotation costs and time. The Fix My Connection agent proactively detects and repairs bank integration issues, having enabled over 2 million successful logins and reduced average repair time by 90%. These agents represent Plaid's strategic use of LLMs to improve data quality, maintain reliability across thousands of financial institution connections, and enhance their core product experiences.

AI Agents in Production: Multi-Enterprise Implementation Strategies

Canva / KPMG / Autodesk / Lightspeed

This comprehensive case study examines how multiple enterprises (Autodesk, KPMG, Canva, and Lightspeed) are deploying AI agents in production to transform their go-to-market operations. The companies faced challenges around scaling AI from proof-of-concept to production, managing agent quality and accuracy, and driving adoption across diverse teams. Using the Relevance AI platform, these organizations built multi-agent systems for use cases including personalized marketing automation, customer outreach, account research, data enrichment, and sales enablement. Results include significant time savings (tasks taking hours reduced to minutes), improved pipeline generation, increased engagement rates, faster customer onboarding, and the successful scaling of AI agents across multiple departments while maintaining data security and compliance standards.

AI-Assisted Database Debugging Platform at Scale

Databricks

Databricks built an agentic AI platform to help engineers debug thousands of OLTP database instances across hundreds of regions on AWS, Azure, and GCP. The platform addresses the problem of fragmented tooling and dispersed expertise by unifying metrics, logs, and operational workflows into a single intelligent interface with a chat assistant. The solution reduced debugging time by up to 90%, enabled new engineers to start investigations in under 5 minutes, and has achieved company-wide adoption, fundamentally changing how engineers interact with their infrastructure.

AI-Powered Automated GraphQL Schema Cleanup

Whatnot

Whatnot, a livestream shopping platform, faced significant technical debt in their GraphQL schema with over 2,600 unused fields accumulated from deprecated features and old endpoints. Manual cleanup was time-consuming and risky, requiring 1-2 hours per field and deep domain knowledge. The engineering team built an AI subagent integrated into a GitHub Action that automatically identifies unused fields through traffic analysis and generates pull requests to safely remove them. The agent follows the same process an engineer wouldโ€”removing schema fields, resolvers, dead code, and updating testsโ€”but operates autonomously in the background. Running daily at $1-3 per execution, the system has successfully removed 24 of approximately 200 unused root fields with minimal human intervention, requiring edits to only three PRs, transforming schema maintenance from a neglected one-time project into an ongoing automated process.

AI-Powered Data Copilot for Autonomous Analysis in IDEs

BlaBlaCar

BlaBlaCar developed an AI-powered Data Copilot to address the inefficient workflow between Software Engineers and Data Analysts, where engineers lacked data warehouse access and analysts were overwhelmed with repetitive queries. The solution embeds an LLM-powered assistant directly in VS Code that connects to BigQuery, provides contextual business logic from curated queries, generates SQL and Python code with unit tests, and enables engineers to perform their own analyses with data health checks as guardrails. The tool leverages a "zero-infrastructure" RAG approach using VS Code's native capabilities and GitHub Copilot, treating analyses as code artifacts in pull requests that analysts review, resulting in faster question resolution (from weeks to minutes) and freeing analysts to focus on high-value modeling work.

AI-Powered Help Desk for Accounts Payable Automation

Xelix

Xelix developed an AI-enabled help desk system to automate responses to vendor inquiries for accounts payable teams who often receive over 1,000 emails daily. The solution uses a multi-stage pipeline that classifies incoming emails, enriches them with vendor and invoice data from ERP systems, and generates contextual responses using LLMs. The system handles invoice status inquiries, payment reminders, and statement reconciliation requests, with confidence scoring to indicate response reliability. By pre-generating responses and surfacing relevant financial data, the platform reduces average handling time for tickets while maintaining human oversight through a review-and-send workflow, enabling AP teams to process high volumes of vendor communications more efficiently.

AI-Powered Marketing Intelligence Platform Accelerates Industry Analysis

CLICKFORCE

CLICKFORCE, a digital advertising leader in Taiwan, faced challenges with generic AI outputs, disconnected internal datasets, and labor-intensive analysis processes that took two to six weeks to complete industry reports. The company built Lumos, an AI-powered marketing analysis platform using Amazon Bedrock Agents for contextualized reasoning, Amazon SageMaker for Text-to-SQL fine-tuning, Amazon OpenSearch for vector embeddings, and AWS Glue for data integration. The solution reduced industry analysis time from weeks to under one hour, achieved a 47% reduction in operational costs, and enabled multiple stakeholder groups to independently generate insights without centralized analyst teams.

AI-Powered Merchant Classification Correction Agent

Ramp

Ramp built an AI agent to automatically fix incorrect merchant classifications that were previously causing customer frustration and requiring hours of manual intervention from support, finance, and engineering teams. The solution uses a large language model backed by embeddings and OLAP queries, multimodal retrieval augmented generation (RAG) with receipt image analysis, and carefully constructed guardrails to validate and process user-submitted correction requests. The agent now handles nearly 100% of requests (compared to less than 3% previously handled manually) in under 10 seconds with a 99% improvement rate according to LLM-based evaluation, saving both customer time and substantial operational costs.

AI-Powered Network Operations Assistant with Multi-Agent RAG Architecture

Swisscom

Swisscom, Switzerland's leading telecommunications provider, developed a Network Assistant using Amazon Bedrock to address the challenge of network engineers spending over 10% of their time manually gathering and analyzing data from multiple sources. The solution implements a multi-agent RAG architecture with specialized agents for documentation management and calculations, combined with an ETL pipeline using AWS services. The system is projected to reduce routine data retrieval and analysis time by 10%, saving approximately 200 hours per engineer annually while maintaining strict data security and sovereignty requirements for the telecommunications sector.

AI-Powered Real Estate Transaction Newsworthiness Detection System

The Globe and Mail

A collaboration between journalists and technologists from multiple news organizations (Hearst, Gannett, The Globe and Mail, and E24) developed an AI system to automatically detect newsworthy real estate transactions. The system combines anomaly detection, LLM-based analysis, and human feedback to identify significant property transactions, with a particular focus on celebrity involvement and price anomalies. Early results showed promise with few-shot prompting, and the system successfully identified several newsworthy transactions that might have otherwise been missed by traditional reporting methods.

AI-Powered Sales Intelligence and Go-to-Market Orchestration Platform

Clay

Clay is a creative sales and marketing platform that helps companies execute go-to-market strategies by turning unstructured data about companies and people into actionable insights. The platform addresses the challenge of finding unique competitive advantages in sales ("go-to-market alpha") by integrating with over 150 data providers and using LLM-powered agents to research prospects, enrich data, and automate outreach. Their flagship agent "Claygent" performs web research to extract custom data points that aren't available in traditional sales databases, while their newer "Navigator" agent can interact with web forms and complex websites. Clay has achieved significant scale, crossing one billion agent runs and targeting two billion runs annually, while maintaining a philosophy that data will be imperfect and building tools for rapid iteration, validation, and trust-building through features like session replay.

Automated Product Attribute Extraction and Title Standardization Using Agentic AI

Delivery Hero

Delivery Hero Quick Commerce faced significant challenges managing vast product catalogs across multiple platforms and regions, where manual verification of product attributes was time-consuming, costly, and error-prone. They implemented an agentic AI system using Large Language Models to automatically extract 22 predefined product attributes from vendor-provided titles and images, then generate standardized product titles conforming to their format. Using a predefined agent architecture with two sequential LLM components, optimized through prompt engineering, Teacher/Student knowledge distillation for the title generation step, and confidence scoring for quality control, the system achieved significant improvements in efficiency, accuracy, data quality, and customer satisfaction while maintaining cost-effectiveness and predictability.

Automating Community Conference Operations with AI Coding Agents

PyCon

A volunteer-run conference organization (PyData/PyConDE) with events serving up to 1,500 attendees faced significant operational overhead in managing tickets, marketing, video production, and community engagement. Over a three-month period, the team experimented with various AI coding agents (Claude, Gemini, Qwen Coder Plus, Codex) to automate tasks including LinkedIn scraping for social media content, automated video cutting using computer vision, ticket management integration, and multi-step workflow automation. The results were mixed: while AI agents proved valuable for well-documented API integration, boilerplate code generation, and specific automation tasks like screenshot capture and video processing, they struggled with multi-step procedural workflows, data normalization, and maintaining code quality without close human oversight. The team concluded that AI agents work best when kept on a "short leash" with narrow use cases, frequent commits, and human validation, delivering time savings for generalist tasks but requiring careful expectation management and not delivering the "10x productivity" improvements often claimed.

Automating Merchant Onboarding with Reinforcement Learning

Doordash

DoorDash faced challenges with menu accuracy during merchant onboarding, where their existing AI system struggled with diverse and messy real-world menu formats. Working with Applied Compute, they developed an automated grading system calibrated to internal expert standards, then used reinforcement learning to train a menu error correction model against this grader as a reward function. The solution achieved a 30% relative reduction in low-quality menus and was rolled out to all USA menu traffic, demonstrating how institutional knowledge can be encoded into automated training signals for production LLM systems.

Building a Commonsense Knowledge Graph for E-commerce Product Recommendations

Amazon

Amazon developed COSMO, a framework that leverages LLMs to build a commonsense knowledge graph for improving product recommendations in e-commerce. The system uses LLMs to generate hypotheses about commonsense relationships from customer interaction data, validates these through human annotation and ML filtering, and uses the resulting knowledge graph to enhance product recommendation models. Tests showed up to 60% improvement in recommendation performance when using the COSMO knowledge graph compared to baseline models.

Building a Global Product Catalogue with Multimodal LLMs at Scale

Shopify

Shopify addressed the challenge of fragmented product data across millions of merchants by building a Global Catalogue using multimodal LLMs to standardize and enrich billions of product listings. The system processes over 10 million product updates daily through a four-layer architecture involving product data foundation, understanding, matching, and reconciliation. By fine-tuning open-source vision language models and implementing selective field extraction, they achieve 40 million LLM inferences daily with 500ms median latency while reducing GPU usage by 40%. The solution enables improved search, recommendations, and conversational commerce experiences across Shopify's ecosystem.

Building a Natural Language Business Intelligence Interface with MCP

Ramp

Ramp built an MCP (Model Context Protocol) server to enable natural language querying of business spend data through their developer API. The initial prototype allowed Claude to generate visualizations and run analyses, but struggled with scale due to context window limitations and high token usage. By pivoting to a SQL-based approach using an in-memory SQLite database with a lightweight ETL pipeline, they enabled Claude to query tens of thousands of transactions efficiently. The solution includes load tools for API data extraction, data transformation capabilities, and query execution tools, allowing users to gain insights into business spend patterns through conversational queries while addressing security concerns through audit logging and OAuth scopes.

Building a Production-Ready Business Analytics Assistant with ChatGPT

Microsoft

A detailed case study on automating data analytics using ChatGPT, where the challenge of LLMs' limitations in quantitative reasoning is addressed through a novel multi-agent system. The solution implements two specialized ChatGPT agents - a data engineer and data scientist - working together to analyze structured business data. The system uses ReAct framework for reasoning, SQL for data retrieval, and Streamlit for deployment, demonstrating how to effectively operationalize LLMs for complex business analytics tasks.

Building a Rust-Based AI Agentic Framework for Multimodal Data Quality Monitoring

Zectonal

Zectonal, a data quality monitoring company, developed a custom AI agentic framework in Rust to scale their multimodal data inspection capabilities beyond traditional rules-based approaches. The framework enables specialized AI agents to autonomously call diagnostic function tools for detecting defects, errors, and anomalous conditions in large datasets, while providing full audit trails through "Agent Provenance" tracking. The system supports multiple LLM providers (OpenAI, Anthropic, Ollama) and can operate both online and on-premise, packaged as a single binary executable that the company refers to as their "genie-in-a-binary."

Building a Unified Data Platform with Gen AI and ODL Integration

MongoDB

TCS and MongoDB present a case study on modernizing data infrastructure by integrating Operational Data Layers (ODLs) with generative AI and vector search capabilities. The solution addresses challenges of fragmented, outdated systems by creating a real-time, unified data platform that enables AI-powered insights, improved customer experiences, and streamlined operations. The implementation includes both lambda and kappa architectures for handling batch and real-time processing, with MongoDB serving as the flexible operational layer.

Building an AI Agent Platform with Cloud-Based Virtual Machines and Extended Context

Manus

Manus AI, founded in late 2024, developed a consumer-focused AI agent platform that addresses the limitation of frontier LLMs having intelligence but lacking the ability to take action in digital environments. The company built a system where each user task is assigned a fully functional cloud-based virtual machine (Linux, with plans for Windows and Android) running real applications including file systems, terminals, VS Code, and Chromium browsers. By adopting a "less structure, more intelligence" philosophy that avoids predefined workflows and multi-role agent systems, and instead provides rich context to foundation models (primarily Anthropic's Claude), Manus created an agent capable of handling diverse long-horizon tasks from office location research to furniture shopping to data extraction, with users reporting up to 2 hours of daily GPU consumption. The platform launched publicly in March 2024 after five months of development and reportedly spent $1 million on Claude API usage in its first 14 days.

Building an AI-Powered Email Writing Assistant with Personalized Style Matching

Ghostwriter

Shortwave developed Ghostwriter, an AI writing feature that helps users compose emails that match their personal writing style. The system uses embedding-based semantic search to find relevant past emails, combines them with system prompts and custom instructions, and uses fine-tuned LLMs to generate contextually appropriate suggestions. The solution addresses two key challenges: making AI-generated text sound authentic to each user's style and incorporating accurate, relevant information from their email history.

Building and Evaluating Maya: An AI-Powered Data Pipeline Generation System

Maia

Matillion developed Maya, a digital data engineer product that uses LLMs to help data engineers build data pipelines more productively. Starting as a simple chatbot co-pilot in mid-2022, Maya evolved into a core interface for the Data Productivity Cloud (DPC), generating data pipelines through natural language prompts. The company faced challenges transitioning from informal "vibes-based" evaluation to rigorous testing frameworks required for enterprise deployment. They implemented a multi-phase approach: starting with simple certification exam tests, progressing to LLM-as-judge evaluation with human-in-the-loop validation, and finally building automated testing harnesses integrated with Langfuse for observability. This evolution enabled them to confidently upgrade models (like moving to Claude Sonnet 3.5 within 24 hours) and successfully launch Maya to enterprise customers in June 2024, while navigating challenges around PII handling in trace data and integrating MLOps skillsets into traditional software engineering teams.

Building and Managing Taxonomies for Effective AI Systems

Adobe

Adobe's Information Architect Jessica Talisman discusses how to build and maintain taxonomies for AI and search systems. The case study explores the challenges and best practices in creating taxonomies that bridge the gap between human understanding and machine processing, covering everything from metadata extraction to ontology development. The approach emphasizes the importance of human curation in AI systems and demonstrates how well-structured taxonomies can significantly improve search relevance, content categorization, and business operations.

Building Production-Ready AI Agents for Internal Workflow Automation

Vercel

Vercel, a web hosting and deployment platform, addressed the challenge of identifying and implementing successful AI agent projects across their organization by focusing on employee pain pointsโ€”specifically repetitive, boring tasks that humans disliked. The company deployed three internal production agents: a lead processing agent that automated sales qualification and research (saving hundreds of days of manual work), an anti-abuse agent that accelerated content moderation decisions by 59%, and a data analyst agent that automated SQL query generation for business intelligence. Their methodology centered on asking employees "What do you hate most about your job?" to identify tasks that were repetitive enough for current AI models to handle reliably while still delivering high business impact.

Building Production-Ready SQL and Charting Agents with RAG Integration

Numbers Station

Numbers Station addresses the challenge of overwhelming data team requests in enterprises by developing an AI-powered self-service analytics platform. Their solution combines LLM agents with RAG and a comprehensive knowledge layer to enable accurate SQL query generation, chart creation, and multi-agent workflows. The platform demonstrated significant improvements in real-world benchmarks compared to vanilla LLM approaches, reducing setup time from weeks to hours while maintaining high accuracy through contextual knowledge integration.

Cloud-Native Synthetic Data Generator for Data Pipeline Testing

GoDaddy

GoDaddy faced challenges in testing data pipelines without production data due to privacy concerns and the labor-intensive nature of manual test data creation. They built a cloud-native synthetic data generator that combines LLM intelligence (via their internal GoCode API) with scalable traditional data generation tools (Databricks Labs Datagen and EMR Serverless). The system uses LLMs to understand schemas and automatically generate intelligent data generation templates rather than generating each row directly, achieving a 99.9% cost reduction compared to pure LLM generation. This hybrid approach resulted in a 90% reduction in time spent creating test data, complete elimination of production data in test environments, and 5x faster pipeline development cycles.

Data and AI Governance Integration in Enterprise GenAI Adoption

Various

A panel discussion featuring leaders from Mercado Libre, ATB Financial, LBLA retail, and Collibra discussing how they are implementing data and AI governance in the age of generative AI. The organizations are leveraging Google Cloud's Dataplex and other tools to enable comprehensive data governance, while also exploring GenAI applications for automating governance tasks, improving data discovery, and enhancing data quality management. Their approaches range from careful regulated adoption in finance to rapid e-commerce implementation, all emphasizing the critical connection between solid data governance and successful AI deployment.

Data Engineering Challenges and Best Practices in LLM Production

QuantumBlack

Data engineers from QuantumBlack discuss the evolving landscape of data engineering with the rise of LLMs, highlighting key challenges in handling unstructured data, maintaining data quality, and ensuring privacy. They share experiences dealing with vector databases, data freshness in RAG applications, and implementing proper guardrails when deploying LLM solutions in enterprise settings.

Data Quality Assessment and Enhancement Framework for GenAI Applications

QuantumBlack

QuantumBlack developed AI4DQ Unstructured, a comprehensive toolkit for assessing and improving data quality in generative AI applications. The solution addresses common challenges in unstructured data management by providing document clustering, labeling, and de-duplication workflows. In a case study with an international health organization, the system processed 2.5GB of data, identified over ten high-priority data quality issues, removed 100+ irrelevant documents, and preserved critical information in 5% of policy documents that would have otherwise been lost, leading to a 20% increase in RAG pipeline accuracy.

Deploying LLM-Based Recommendation Systems in Private Equity

Bainbridge Capital

A data scientist shares their experience transitioning from traditional ML to implementing LLM-based recommendation systems at a private equity company. The case study focuses on building a recommendation system for boomer-generation users, requiring recommendations within the first five suggestions. The implementation involves using OpenAI APIs for data cleaning, text embeddings, and similarity search, while addressing challenges of production deployment on AWS.

Domain-Specific AI Platform for Manufacturing and Supply Chain Optimization

Articul8

Articul8 developed a generative AI platform to address enterprise challenges in manufacturing and supply chain management, particularly for a European automotive manufacturer. The platform combines public AI models with domain-specific intelligence and proprietary data to create a comprehensive knowledge graph from vast amounts of unstructured data. The solution reduced incident response time from 90 seconds to 30 seconds (3x improvement) and enabled automated root cause analysis for manufacturing defects, helping experts disseminate daily incidents and optimize production processes that previously required manual analysis by experienced engineers.

Enterprise Data Extraction Evolution from Simple RAG to Multi-Agent Architecture

Box

Box, a B2B unstructured data platform serving Fortune 500 companies, initially built a straightforward LLM-based metadata extraction system that successfully processed 10 million pages but encountered limitations with complex documents, OCR challenges, and scale requirements. They evolved from a simple pre-process-extract-post-process pipeline to a sophisticated multi-agent architecture that intelligently handles document complexity, field grouping, and quality feedback loops, resulting in a more robust and easily evolving system that better serves enterprise customers' diverse document processing needs.

Enterprise Unstructured Data Quality Management for Production AI Systems

Anomalo

Anomalo addresses the critical challenge of unstructured data quality in enterprise AI deployments by building an automated platform on AWS that processes, validates, and cleanses unstructured documents at scale. The solution automates OCR and text parsing, implements continuous data observability to detect anomalies, enforces governance and compliance policies including PII detection, and leverages Amazon Bedrock for scalable LLM-based document quality analysis. This approach enables enterprises to transform their vast collections of unstructured text data into trusted assets for production AI applications while reducing operational burden, optimizing costs, and maintaining regulatory compliance.

Enterprise-Scale GenAI and Agentic AI Deployment in B2B Supply Chain Operations

Wesco

Wesco, a B2B supply chain and industrial distribution company, presents a comprehensive case study on deploying enterprise-grade AI applications at scale, moving from POC to production. The company faced challenges in transitioning from traditional predictive analytics to cognitive intelligence using generative AI and agentic systems. Their solution involved building a composable AI platform with proper governance, MLOps/LLMOps pipelines, and multi-agent architectures for use cases ranging from document processing and knowledge retrieval to fraud detection and inventory management. Results include deployment of 50+ use cases, significant improvements in employee productivity through "everyday AI" applications, and quantifiable ROI through transformational AI initiatives in supply chain optimization, with emphasis on proper observability, compliance, and change management to drive adoption.

Enterprise-Scale Healthcare LLM System for Unified Patient Journeys

John Snow Labs

John Snow Labs developed a comprehensive healthcare LLM system that integrates multimodal medical data (structured, unstructured, FHIR, and images) into unified patient journeys. The system enables natural language querying across millions of patient records while maintaining data privacy and security. It uses specialized healthcare LLMs for information extraction, reasoning, and query understanding, deployed on-premises via Kubernetes. The solution significantly improves clinical decision support accuracy and enables broader access to patient data analytics while outperforming GPT-4 in medical tasks.

Financial Transaction Categorization at Scale Using LLMs and Custom Embeddings

Mercado Libre

Mercado Libre (MELI) faced the challenge of categorizing millions of financial transactions across Latin America in multiple languages and formats as Open Finance unlocked access to customer financial data. Starting with a brittle regex-based system in 2021 that achieved only 60% accuracy and was difficult to maintain, they evolved through three generations: first implementing GPT-3.5 Turbo in 2023 to achieve 80% accuracy with 75% cost reduction, then transitioning to GPT-4o-mini in 2024, and finally developing custom BERT-based semantic embeddings trained on regional financial text to reach 90% accuracy with an additional 30% cost reduction. This evolution enabled them to scale from processing tens of millions of transactions per quarter to tens of millions per week, while enabling near real-time categorization that powers personalized financial insights across their ecosystem.

From Simple RAG to Multi-Agent Architecture for Document Data Extraction

Box

Box evolved their document data extraction system from a simple single-model approach to a sophisticated multi-agent architecture to handle enterprise-scale unstructured data processing. The initial straightforward approach of preprocessing documents and feeding them to an LLM worked well for basic use cases but failed when customers presented complex challenges like 300-page documents, poor OCR quality, hundreds of extraction fields, and confidence scoring requirements. By redesigning the system using an agentic approach with specialized sub-agents for different tasks, Box achieved better accuracy, easier system evolution, and improved maintainability while processing millions of pages for enterprise customers.

Generative AI Assistant for Agricultural Field Trial Analysis

Agmatix

Agmatix developed Leafy, a generative AI assistant powered by Amazon Bedrock, to streamline agricultural field trial analysis. The solution addresses challenges in analyzing complex trial data by enabling agronomists to query data using natural language, automatically selecting appropriate visualizations, and providing insights. Using Amazon Bedrock with Anthropic Claude, along with AWS services for data pipeline management, the system achieved 20% improved efficiency, 25% better data integrity, and tripled analysis throughput.

GPT Integration for SQL Stored Procedure Optimization in CI/CD Pipeline

Agoda

Agoda integrated GPT into their CI/CD pipeline to automate SQL stored procedure optimization, addressing a significant operational bottleneck where database developers were spending 366 man-days annually on manual optimization tasks. The system provides automated analysis and suggestions for query improvements, index recommendations, and performance optimizations, leading to reduced manual review time and improved merge request processing. While achieving approximately 25% accuracy, the solution demonstrates practical benefits in streamlining database development workflows despite some limitations in handling complex stored procedures.

Healthcare NLP Pipeline for HIPAA-Compliant Patient Data De-identification

Dandelion Health

Dandelion Health developed a sophisticated de-identification pipeline for processing sensitive patient healthcare data while maintaining HIPAA compliance. The solution combines John Snow Labs' Healthcare NLP with custom pre- and post-processing steps to identify and transform protected health information (PHI) in free-text patient notes. Their approach includes risk categorization by medical specialty, context-aware processing, and innovative "hiding in plain sight" techniques to achieve high-quality de-identification while preserving data utility for medical research.

Human-AI Co-Annotation System for Efficient Data Labeling

Appen

Appen developed a hybrid approach combining LLMs with human annotators to address the growing challenges in data annotation for AI models. They implemented a co-annotation engine that uses model uncertainty metrics to efficiently route annotation tasks between LLMs and human annotators. Using GPT-3.5 Turbo for initial annotations and entropy-based confidence scoring, they achieved 87% accuracy while reducing costs by 62% and annotation time by 63% compared to purely human annotation, demonstrating an effective balance between automation and human expertise.

Implementing MCP Remote Server for CRM Agent Integration

HubSpot

HubSpot built a remote Model Context Protocol (MCP) server to enable AI agents like ChatGPT to interact with their CRM data. The challenge was to provide seamless, secure access to CRM objects (contacts, companies, deals) for ChatGPT's 500 million weekly users, most of whom aren't developers. In less than four weeks, HubSpot's team extended the Java MCP SDK to create a stateless, HTTP-based microservice that integrated with their existing REST APIs and RPC system, implementing OAuth 2.0 for authentication and user permission scoping. The solution made HubSpot the first CRM with an OpenAI connector, enabling read-only queries that allow customers to analyze CRM data through natural language interactions while maintaining enterprise-grade security and scale.

Incremental LLM Adoption Strategy in Email Processing API Platform

Nylas

Nylas, an email/calendar/contacts API platform provider, implemented a systematic three-month strategy to integrate LLMs into their production systems. They started with development workflow automation using multi-agent systems, enhanced their annotation processes with LLMs, and finally integrated LLMs as a fallback mechanism in their core email processing product. This measured approach resulted in 90% reduction in bug tickets, 20x cost savings in annotation, and successful deployment of their own LLM infrastructure when usage reached cost-effective thresholds.

Integrating Foundation Models into the Modern Data Stack: Challenges and Solutions

Numbers Station

Numbers Station addresses the challenges of integrating foundation models into the modern data stack for data processing and analysis. They tackle key challenges including SQL query generation from natural language, data cleaning, and data linkage across different sources. The company develops solutions for common LLMOps issues such as scale limitations, prompt brittleness, and domain knowledge integration through techniques like model distillation, prompt ensembling, and domain-specific pre-training.

Large-Scale Aviation Content Classification on Hacker News Using Small Language Models

Skysight

Skysight conducted a large-scale analysis of Hacker News content using small language models (SLMs) to classify aviation-related posts. The project processed 42 million items (10.7B input tokens) using a parallelized pipeline and cloud infrastructure. Through careful prompt engineering and model selection, they achieved efficient classification at scale, revealing that 0.62% of all posts and 1.13% of stories were aviation-related, with notable temporal trends in aviation content frequency.

Large-Scale Enterprise Data Platform Migration Using AI and Generative AI Automation

CommBank

Commonwealth Bank of Australia (CBA), Australia's largest bank serving 17.5 million customers, faced the challenge of modernizing decades of rich data spread across hundreds of on-premise source systems that lacked interoperability and couldn't scale for AI workloads. In partnership with HCL Tech and AWS, CBA migrated 61,000 on-premise data pipelines (equivalent to 10 petabytes of data) to an AWS-based data mesh ecosystem in 9 months. The solution leveraged AI and generative AI to transform code, check for errors, and test outputs with 100% accuracy reconciliation, conducting 229,000 tests across the migration. This enabled CBA to establish a federated data architecture called CommBank.data that empowers 40 lines of business with self-service data access while maintaining strict governance, positioning the bank for AI-driven innovation at scale.

Large-Scale LLM Batch Processing Platform for Millions of Prompts

Instacart

Instacart faced challenges processing millions of LLM calls required by various teams for tasks like catalog data cleaning, item enrichment, fulfillment routing, and search relevance improvements. Real-time LLM APIs couldn't handle this scale effectively, leading to rate limiting issues and high costs. To solve this, Instacart built Maple, a centralized service that automates large-scale LLM batch processing by handling batching, encoding/decoding, file management, retries, and cost tracking. Maple integrates with external LLM providers through batch APIs and an internal AI Gateway, achieving up to 50% cost savings compared to real-time calls while enabling teams to process millions of prompts reliably without building custom infrastructure.

Large-Scale LLM Infrastructure for E-commerce Applications

Coupang

Coupang, a major e-commerce platform operating primarily in South Korea and Taiwan, faced challenges in scaling their ML infrastructure to support LLM applications across search, ads, catalog management, and recommendations. The company addressed GPU supply shortages and infrastructure limitations by building a hybrid multi-region architecture combining cloud and on-premises clusters, implementing model parallel training with DeepSpeed, and establishing GPU-based serving using Nvidia Triton and vLLM. This infrastructure enabled production applications including multilingual product understanding, weak label generation at scale, and unified product categorization, with teams using patterns ranging from in-context learning to supervised fine-tuning and continued pre-training depending on resource constraints and quality requirements.

Leveraging NLP and LLMs for Music Industry Royalty Recovery

Love Without Sound

Love Without Sound developed an AI-powered system to help the music industry recover lost royalties due to incorrect metadata and unauthorized usage. The solution combines NLP pipelines for metadata standardization, legal document processing, and is now expanding to include RAG-based querying and audio embedding models. The system processes billions of tracks, operates in real-time, and runs in a fully data-private environment, helping recover millions in revenue for artists.

Leveraging RAG and LLMs for ESG Data Intelligence Platform

ESGPedia

ESGpedia faced challenges in managing complex ESG data across multiple platforms and pipelines. They implemented Databricks' Data Intelligence Platform to create a unified lakehouse architecture and leveraged Mosaic AI with RAG techniques to process sustainability data more effectively. The solution resulted in 4x cost savings in data pipeline management, improved time to insights, and enhanced ability to provide context-aware ESG insights to clients across APAC.

LLM-Powered Data Classification System for Enterprise-Scale Metadata Generation

Grab

Grab developed an automated data classification system using LLMs to replace manual tagging of sensitive data across their PetaByte-scale data infrastructure. They built an orchestration service called Gemini that integrates GPT-3.5 to classify database columns and generate metadata tags, significantly reducing manual effort in data governance. The system successfully processed over 20,000 data entities within a month of deployment, with 80% user satisfaction and minimal need for tag corrections.

LLM-Powered Data Labeling Quality Assurance System

Uber

Uber AI Solutions developed a production LLM-based quality assurance system called Requirement Adherence to improve data labeling accuracy for their enterprise clients. The system addresses the costly and time-consuming problem of post-labeling rework by identifying quality issues during the labeling process itself. It works in two phases: first extracting atomic rules from client Standard Operating Procedure (SOP) documents using LLMs with reflection capabilities, then performing real-time validation during the labeling process by routing different rule types to appropriately-sized models with optimization techniques like prefix caching. This approach resulted in an 80% reduction in required audits, significantly improving timelines and reducing costs while maintaining data privacy through stateless, privacy-preserving LLM calls.

LLM-Powered In-Tool Quality Validation for Data Labeling

Uber

Uber AI Solutions developed a Requirement Adherence system to address quality issues in data labeling workflows, which traditionally relied on post-labeling checks that resulted in costly rework and delays. The solution uses LLMs in a two-phase approach: first extracting atomic rules from Standard Operating Procedure (SOP) documents and categorizing them by complexity, then performing real-time validation during the labeling process within their uLabel tool. By routing different rule types to appropriate LLM models (non-reasoning models for deterministic checks, reasoning models for subjective checks) and leveraging techniques like prefix caching and parallel execution, the system achieved an 80% reduction in required audits while maintaining data privacy through stateless, privacy-preserving LLM calls.

LLM-Powered Product Catalogue Quality Control at Scale

Amazon

Amazon's product catalogue contains hundreds of millions of products with millions of listings added or edited daily, requiring accurate and appealing product data to help shoppers find what they need. Traditional specialized machine learning models worked well for products with structured attributes but struggled with nuanced or complex product descriptions. Amazon deployed large language models (LLMs) adapted through prompt tuning and catalogue knowledge integration to perform quality control tasks including recognizing standard attribute values, collecting synonyms, and detecting erroneous data. This LLM-based approach enables quality control across more product categories and languages, includes latest seller values within days rather than weeks, and saves thousands of hours in human review while extending reach into previously cost-prohibitive areas of the catalogue.

LLM-Powered Upskilling Assistant in Steel Manufacturing

Gerdau

Gerdau, a major steel manufacturer, implemented an LLM-based assistant to support employee re/upskilling as part of their broader digital transformation initiative. This development came after transitioning to the Databricks Data Intelligence Platform to solve data infrastructure challenges, which enabled them to explore advanced AI applications. The platform consolidation resulted in a 40% cost reduction in data processing and allowed them to onboard 300 new global data users while creating an environment conducive to AI innovation.

Mainframe to Cloud Migration with AI-Powered Code Transformation

Mercedes-Benz

Mercedes-Benz faced the challenge of modernizing their Global Ordering system, a critical mainframe application handling over 5 million lines of code that processes every vehicle order and production request across 150 countries. The company partnered with Capgemini, AWS, and Rocket Software to migrate this system from mainframe to cloud using a hybrid approach: replatforming the majority of the application while using agentic AI (GenRevive tool) to refactor specific components. The most notable success was transforming 1.3 million lines of COBOL code in their pricing service to Java in just a few months, achieving faster performance, reduced mainframe costs, and a successful production deployment with zero incidents at go-live.

MLOps Platform for Airline Operations with LLM Integration

LATAM Airlines

LATAM Airlines developed Cosmos, a vendor-agnostic MLOps framework that enables both traditional ML and LLM deployments across their business operations. The framework reduced model deployment time from 3-4 months to less than a week, supporting use cases from fuel efficiency optimization to personalized travel recommendations. The platform demonstrates how a traditional airline can transform into a data-driven organization through effective MLOps practices and careful integration of AI technologies.

Multi-Agent DBT Development Workflow for Data Engineering Consulting

Mammoth Growth

Mammoth Growth, a boutique data consultancy specializing in marketing and customer data, developed a multi-agent AI system to automate DBT development workflows in response to data teams struggling to deliver analytics at the speed of business. The solution employs a team of specialized AI agents (orchestrator, analyst, architect, and analytics engineer) that leverage the DBT Model Context Protocol (MCP) to autonomously write, document, and test production-grade DBT code from detailed specifications. The system enabled the delivery of a complete enterprise-grade data lineage with 15 data models and two gold-layer models in just 3 weeks for a pilot client, compared to an estimated 10 weeks using traditional manual development approaches, while maintaining code quality standards through human-led requirements gathering and mandatory code review before production deployment.

Multi-modal LLM Platform for Catalog Attribute Extraction at Scale

Instacart

Instacart faced significant challenges in extracting structured product attributes (flavor, size, dietary claims, etc.) from millions of SKUs using traditional SQL-based rules and text-only machine learning models. These approaches suffered from low quality, high development overhead, and inability to process image data. To address these limitations, Instacart built PARSE (Product Attribute Recognition System for E-commerce), a self-serve multi-modal LLM platform that enables teams to extract attributes from both text and images with minimal engineering effort. The platform reduced attribute extraction development time from weeks to days, achieved 10% higher recall through multi-modal reasoning compared to text-only approaches, and delivered 95% accuracy on simpler attributes with just one day of effort versus one week with traditional methods.

Multimodal Healthcare Data Integration with Specialized LLMs

John Snow Labs

John Snow Labs developed a comprehensive healthcare data integration system that leverages multiple specialized LLMs to unify and analyze patient data from various sources. The system processes structured, unstructured, and semi-structured medical data (including EHR, PDFs, HL7, FHIR) to create complete patient journeys, enabling natural language querying while maintaining consistency, accuracy, and scalability. The solution addresses key healthcare challenges like terminology mapping, date normalization, and data deduplication, all while operating within secure environments and handling millions of patient records.

Natural Language Analytics Assistant Using Amazon Bedrock Agents

Skai

Skai, an omnichannel advertising platform, developed Celeste, an AI agent powered by Amazon Bedrock Agents, to transform how customers access and analyze complex advertising data. The solution addresses the challenge of time-consuming manual report generation (taking days or weeks) by enabling natural language queries that automatically collect data from multiple sources, synthesize insights, and provide actionable recommendations. The implementation reduced report generation time by 50%, case study creation by 75%, and transformed weeks-long processes into minutes while maintaining enterprise-grade security and privacy for sensitive customer data.

Natural Language Interface for Healthcare Data Analytics using LLMs

Aachen Uniklinik / Aurea Software

A UK-based NLQ (Natural Language Query) company developed an AI-powered interface for Aachen Uniklinik to make intensive care unit databases more accessible to healthcare professionals. The system uses a hybrid approach combining vector databases, large language models, and traditional SQL to allow non-technical medical staff to query complex patient data using natural language. The solution includes features for handling dirty data, intent detection, and downstream complication analysis, ultimately improving clinical decision-making processes.

Observability Platform's Journey to Production GenAI Integration

New Relic

New Relic, a major observability platform processing 7 petabytes of data daily, implemented GenAI both internally for developer productivity and externally in their product offerings. They achieved a 15% increase in developer productivity through targeted GenAI implementations, while also developing sophisticated AI monitoring capabilities and natural language interfaces for their customers. Their approach balanced cost, accuracy, and performance through a mix of RAG, multi-model routing, and classical ML techniques.

Product Attribute Normalization and Sorting Using DSPy for Large-Scale E-commerce

Zoro UK

Zoro UK, an e-commerce subsidiary of Grainger with 3.5 million products from 300+ suppliers, faced challenges normalizing and sorting product attributes across 75,000 different attribute types. Using DSPy (a framework for optimizing LLM prompts programmatically), they built a production system that automatically determines whether attributes require alpha-numeric sorting or semantic sorting. The solution employs a two-tier architecture: Mistral 8B for initial classification and GPT-4 for complex semantic sorting tasks. The DSPy approach eliminated manual prompt engineering, provided LLM-agnostic compatibility, and enabled automated prompt optimization using genetic algorithm-like iterations, resulting in improved product discoverability and search experience for their 1 million monthly active users.

Production AI Agents with Dynamic Planning and Reactive Evaluation

Hex

Hex successfully implemented AI agents in production for data science notebooks by developing a unique approach to agent orchestration. They solved key challenges around planning, tool usage, and latency by constraining agent capabilities, building a reactive DAG structure, and optimizing context windows. Their success came from iteratively developing individual capabilities before combining them into agents, keeping humans in the loop, and maintaining tight feedback cycles with users.

Productionizing LLM-Powered Data Governance with LangChain and LangSmith

Grab

Grab enhanced their LLM-powered data governance system (Metasense V2) by improving model performance and operational efficiency. The team tackled challenges in data classification by splitting complex tasks, optimizing prompts, and implementing LangChain and LangSmith frameworks. These improvements led to reduced misclassification rates, better collaboration between teams, and streamlined prompt experimentation and deployment processes while maintaining robust monitoring and safety measures.

RAG-Based Industry Classification System for Financial Services

Ramp

Ramp, a financial services company, replaced their fragmented homegrown industry classification system with a standardized NAICS-based taxonomy powered by an in-house RAG model. The old system relied on stitched-together third-party data and multiple non-auditable sources of truth, leading to inconsistent, overly broad, and sometimes incorrect business categorizations. By building a custom RAG system that combines embeddings-based retrieval with LLM-based re-ranking, Ramp achieved significant improvements in classification accuracy (up to 60% in retrieval metrics and 5-15% in final prediction accuracy), gained full control over the model's behavior and costs, and enabled consistent cross-team usage of industry data for compliance, risk assessment, sales targeting, and product analytics.

Scaling AI Systems for Unstructured Data Processing: Logical Data Models and Embedding Optimization

CoActive AI

CoActive AI addresses the challenge of processing unstructured data at scale through AI systems. They identified two key lessons: the importance of logical data models in bridging the gap between data storage and AI processing, and the strategic use of embeddings for cost-effective AI operations. Their solution involves creating data+AI hybrid teams to resolve impedance mismatches and optimizing embedding computations to reduce redundant processing, ultimately enabling more efficient and scalable AI operations.

Scaling Data Infrastructure for AI Features and RAG

Notion

Notion faced challenges with rapidly growing data volume (10x in 3 years) and needed to support new AI features. They built a scalable data lake infrastructure using Apache Hudi, Kafka, Debezium CDC, and Spark to handle their update-heavy workload, reducing costs by over a million dollars and improving data freshness from days to minutes/hours. This infrastructure became crucial for successfully rolling out Notion AI features and their Search and AI Embedding RAG infrastructure.

Scaling Document Processing with LLMs and Human Review

Vendr / Extend

Vendr partnered with Extend to extract structured data from SaaS order forms and contracts using LLMs. They implemented a hybrid approach combining LLM processing with human review to achieve high accuracy in entity recognition and data extraction. The system successfully processed over 100,000 documents, using techniques such as document embeddings for similarity clustering, targeted human review, and robust entity mapping. This allowed Vendr to unlock valuable pricing insights for their customers while maintaining high data quality standards.

Scaling ML Annotation Platform with LLMs for Content Classification

Spotify

Spotify needed to generate high-quality training data annotations at massive scale to support ML models covering hundreds of millions of tracks and podcast episodes for tasks like content relations detection and platform policy violation identification. They built a comprehensive annotation platform centered on three pillars: scaling human expertise through tiered workforce structures, implementing flexible annotation tooling with custom interfaces and quality metrics, and establishing robust infrastructure for integration with ML workflows. A key innovation was deploying a configurable LLM-based system running in parallel with human annotators. This approach increased their annotation corpus by 10x while improving annotator productivity by 3x, enabling them to generate millions of annotations and significantly reduce ML model development time.

Scaling Parallel Agent Operations with LangChain and LangSmith Monitoring

Paradigm

Paradigm (YC24) built an AI-powered spreadsheet platform that runs thousands of parallel agents for data processing tasks. They utilized LangChain for rapid agent development and iteration, while leveraging LangSmith for comprehensive monitoring, operational insights, and usage-based pricing optimization. This enabled them to build task-specific agents for schema generation, sheet naming, task planning, and contact lookup while maintaining high performance and cost efficiency.

Self-Learning Generative AI System for Product Catalog Enrichment

Amazon

Amazon's Catalog Team faced the challenge of extracting structured product attributes and generating quality content at massive scale while managing the tradeoff between model accuracy and computational costs. They developed a self-learning system using multiple smaller models working in consensus to process routine cases, with a supervisor agent using more capable models to investigate disagreements and generate reusable learnings stored in a dynamic knowledge base. This architecture, implemented with Amazon Bedrock, resulted in continuously declining error rates and reduced costs over time, as accumulated learnings prevented entire classes of future disagreements without requiring model retraining.

Semantic Data Processing at Scale with AI-Powered Query Optimization

DocETL

Shreyaa Shankar presents DocETL, an open-source system for semantic data processing that addresses the challenges of running LLM-powered operators at scale over unstructured data. The system tackles two major problems: how to make semantic operator pipelines scalable and cost-effective through novel query optimization techniques, and how to make them steerable through specialized user interfaces. DocETL introduces rewrite directives that decompose complex tasks and data to improve accuracy and reduce costs, achieving up to 86% cost reduction while maintaining target accuracy. The companion tool Doc Wrangler provides an interactive interface for iteratively authoring and debugging these pipelines. Real-world applications include public defenders analyzing court transcripts for racial bias and medical analysts extracting information from doctor-patient conversations, demonstrating significant accuracy improvements (2x in some cases) compared to baseline approaches.

SQL Query Agent for Data Democratization

Prosus

Prosus developed a SQL-generating agent called "Token Data Analyst" to help democratize data access across their portfolio companies. The agent serves as a first-line support for data queries, allowing non-technical users to get insights from databases through natural language questions in Slack. The system achieved a 74% reduction in query response time and significantly increased the total number of data insights generated, while maintaining high accuracy through careful prompt engineering and context management.

Strategic Framework for Generative AI Implementation in Food Delivery Platform

Doordash

DoorDash outlines a comprehensive strategy for implementing Generative AI across five key areas: customer assistance, interactive discovery, personalized content generation, information extraction, and employee productivity enhancement. The company aims to revolutionize its delivery platform while maintaining strong considerations for data privacy and security, focusing on practical applications ranging from automated cart building to SQL query generation.