ZenML

Enterprise-Wide LLM Framework for Manufacturing and Knowledge Management

Toyota 2023

Toyota implemented a comprehensive LLMOps framework to address multiple production challenges, including battery manufacturing optimization, equipment maintenance, and knowledge management. The team developed a unified framework combining LangChain and LlamaIndex capabilities, with special attention to data ingestion pipelines, security, and multi-language support. Key applications include Battery Brain for manufacturing expertise, Gear Pal for equipment maintenance, and Project Cura for knowledge management, all showing significant operational improvements including reduced downtime and faster problem resolution.

Industry

Automotive

Overview

Toyota’s Enterprise AI team presented a comprehensive overview of their generative AI initiatives, focusing on how they have built and deployed multiple LLM-powered applications in production across manufacturing and knowledge management domains. The presentation featured multiple speakers from Toyota’s IT organization, covering their agentic RAG framework, specific production applications (Battery Brain, Gear Pal, and Project Cura), and the operational considerations that come with deploying LLMs at enterprise scale.

The team’s mission is to drive efficiency through AI experimentation and enablement, working across legal, government affairs, cybersecurity, HR, privacy, and various business units within Toyota. Their guiding principles emphasize exploration, experimentation, education, enablement, engagement, and ensuring quality—a particularly important point given their opening anecdote about a vendor chatbot that they “broke in one question” during evaluation.

Quality Assurance and Testing

The presentation began with a cautionary tale about generative AI quality. A business unit had been contacted by a vendor offering a chatbot solution. The business unit was enthusiastic and wanted to put it into production quickly. However, when Toyota’s Enterprise AI team evaluated the solution, they were able to break it with a single prompt. This led to the vendor requesting Toyota’s help to fix their own product—and upon code review, the team found the underlying implementation to be substandard.

This experience shaped their approach to quality assurance in LLM deployments: the team now applies several evaluation strategies to any system before it reaches production.

This evaluation-first mindset is particularly important given the stakes involved. As one speaker noted, if a customer-facing chatbot were hacked or manipulated to state incorrect pricing (like saying a Toyota Tundra costs $1), the company could face legal liability.
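
One concrete form such an evaluation can take is a regression check over adversarial prompts. The sketch below is a hypothetical illustration, not Toyota's actual test harness; the `price_sanity` function and the `floor` threshold are assumptions chosen to mirror the "$1 Tundra" failure mode described above.

```python
import re

def price_sanity(answer: str, floor: float = 10_000.0) -> bool:
    """Return False if the answer quotes any vehicle price below `floor`.

    A crude guard against manipulated outputs like "a Tundra costs $1".
    The floor value is illustrative, not a real business threshold.
    """
    prices = [float(p.replace(",", ""))
              for p in re.findall(r"\$([\d,]+(?:\.\d+)?)", answer)]
    return all(p >= floor for p in prices)

# An eval suite would run checks like this over many adversarial prompts:
assert not price_sanity("Sure! A Toyota Tundra costs $1 today only.")
assert price_sanity("The 2025 Tundra starts at $42,000.")
```

In practice a check like this would sit alongside semantic evaluations, since string-level guards only catch the most blatant failures.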

The Agentic Framework Architecture

A core contribution from Toyota’s team is their custom agentic framework that bridges LangChain and LlamaIndex. Rather than committing to a single framework, they recognized that each has distinct strengths.

Their framework creates a common abstraction layer that leverages the best of both worlds—using LlamaIndex as the document processor and LangChain for retrieval operations. Importantly, the framework is designed to be extensible: if any team develops an improved document ingestion pipeline or retrieval mechanism, that code can be integrated into the shared framework for use across all of Toyota’s generative AI applications.
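
The talk did not show code, but the shape of such an abstraction layer might look like the following sketch. `RAGPipeline`, the ingestor registry, and the keyword retriever are illustrative assumptions; in the real framework the retriever slot would be backed by LangChain and the ingestors by LlamaIndex document processors.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

class RAGPipeline:
    """Common abstraction over pluggable components: one team's improved
    ingestor or retriever is registered once and reused by every app."""

    def __init__(self) -> None:
        self.ingestors: Dict[str, Callable[[str], List[Document]]] = {}
        self.store: List[Document] = []
        # Default retriever: naive keyword overlap, a stand-in for a real
        # vector-store retriever.
        self.retriever: Callable[[str, List[Document]], List[Document]] = (
            lambda q, docs: [d for d in docs if q.lower() in d.text.lower()]
        )

    def register_ingestor(self, suffix: str,
                          fn: Callable[[str], List[Document]]) -> None:
        self.ingestors[suffix] = fn

    def ingest(self, path: str) -> None:
        suffix = path.rsplit(".", 1)[-1].lower()
        self.store.extend(self.ingestors[suffix](path))

    def retrieve(self, query: str, k: int = 3) -> List[Document]:
        return self.retriever(query, self.store)[:k]

# Usage: a hypothetical text ingestor shared across applications.
pipe = RAGPipeline()
pipe.register_ingestor("txt",
    lambda p: [Document("error 2086: spindle fault", {"src": p})])
pipe.ingest("mill_manual.txt")
hits = pipe.retrieve("2086")
```

The design point is the registry: swapping in a better component is a one-line registration, not a per-application rewrite.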

The middle layer of the architecture (highlighted in red during their presentation) represents their research contributions, with all components described as being in production rather than experimental.

Data Ingestion Pipeline Challenges

A recurring theme throughout the presentation was that data ingestion is the fundamental challenge in RAG applications. Toyota’s manufacturing operations generate documents in virtually every format imaginable: PDFs, text documents, videos, Excel files, XDW files, images, and complex nested structures including “tables inside tables inside tables with images.”

The team emphasized that while “everyone is talking about RAG,” the data ingestion pipeline is the real bottleneck. Poor ingestion quality leads to poor retrieval results, regardless of how sophisticated the retrieval and generation components might be.
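
A common way to manage this format zoo is to dispatch each file to a format-specific parser before chunking. The handlers below are labeled stand-ins (real ones would wrap a PDF extractor, a spreadsheet reader, OCR, and so on); only the dispatch-and-chunk structure is the point.

```python
import os

def chunk(text: str, size: int = 200) -> list[str]:
    """Split extracted text into fixed-size chunks for embedding."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Stand-in handlers: each would wrap a real format-specific parser.
HANDLERS = {
    ".pdf": lambda path: f"text extracted from {path}",
    ".xlsx": lambda path: f"flattened rows from {path}",
    ".txt": lambda path: open(path).read(),
}

def ingest(path: str) -> list[str]:
    ext = os.path.splitext(path)[1].lower()
    if ext not in HANDLERS:
        # Fail loudly on unsupported formats instead of silently skipping:
        # bad ingestion poisons every downstream retrieval result.
        raise ValueError(f"no ingestion handler for {ext}")
    return chunk(HANDLERS[ext](path))
```

Formats like XDW or deeply nested tables each earn their own handler over time, which is exactly the kind of component the shared framework lets teams contribute back.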

Prompt Guardian: Security Layer

Security is a critical concern for enterprise LLM deployments, and Toyota addressed this through “Prompt Guardian”—a security layer built using smaller language models. The system is designed to intercept and evaluate incoming prompts before they reach the main application, protecting against prompt injection attacks and other adversarial inputs.

The team provided a concrete example: if Toyota deployed a customer-facing chatbot on toyota.com and an attacker successfully manipulated it to display incorrect pricing, the legal and financial consequences could be severe. Prompt Guardian serves as a defensive barrier against such attacks.
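
The gate's control flow can be sketched as below. The keyword patterns are only a stand-in so the structure is visible; the actual Prompt Guardian uses smaller language models as the classifier, and the function names here are hypothetical.

```python
BLOCK_PATTERNS = [
    "ignore previous instructions",
    "ignore all prior instructions",
    "reveal your system prompt",
    "you are now",
]

def guardian_flags(prompt: str) -> bool:
    """Return True if the prompt looks adversarial. The real system calls
    a small fine-tuned language model here; keyword matching is only a
    stand-in to show where the check sits."""
    lowered = prompt.lower()
    return any(pattern in lowered for pattern in BLOCK_PATTERNS)

def answer(prompt: str) -> str:
    # Every prompt passes through the guardian before the main app.
    if guardian_flags(prompt):
        return "Request blocked by Prompt Guardian."
    return main_application(prompt)

def main_application(prompt: str) -> str:
    # Placeholder for the downstream RAG application.
    return f"(answer to: {prompt})"
```

Putting the check in front of the application, rather than inside it, means every app behind the shared framework inherits the same defense.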

Production Application: Battery Brain

Battery Brain is a RAG application designed to support Toyota’s new battery manufacturing plant in North Carolina. The problem it addresses is well-documented in manufacturing: new production lines have high scrappage rates, and subject matter expertise is crucial for identifying issues early before they cascade into larger problems.

The challenge is that subject matter expertise—deep academic knowledge and practical experience in battery manufacturing—is scarce and typically concentrated in a few individuals. Battery Brain aims to democratize this knowledge by making it accessible to all team members, effectively turning everyone into “their own subject matter expert.”

A key technical aspect of Battery Brain worth highlighting is the OneNote ingestion challenge, a real-world LLMOps issue: training sessions involve participants taking notes in personal styles, mixing images, handwritten notes, typed text, and blank pages. Developing a generalized ingestion strategy for such heterogeneous content required significant engineering effort.

Production Application: Gear Pal

Gear Pal addresses manufacturing line downtime, where every minute of stoppage costs “literally millions of dollars.” The application serves as a centralized, searchable repository of machine manuals and maintenance documentation for powertrain assets.

The problem is significant: some machines date back to the 1980s, maintenance manuals can be thousands of pages long, documentation may be in Japanese (requiring translation), and there are hundreds of machines with thousands of parts. When a machine goes down, it can take days to diagnose and repair.

Gear Pal allows maintenance personnel to quickly search for error codes and receive step-by-step remediation instructions. For example, when a six-axis CNC machine mill throws error 2086, instead of manually searching through documentation, users receive immediate guidance on what the error means and how to resolve it.
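
The error-code fast path can be sketched as a lookup keyed by machine and code, falling back to full retrieval when no indexed section matches. Everything here is hypothetical, including the placeholder remediation steps; the talk did not describe Gear Pal's internal data model.

```python
# Hypothetical manual index: sections keyed by (machine, error code).
MANUAL_INDEX = {
    ("6-axis CNC mill", "2086"): {
        "meaning": "placeholder description of error 2086",
        "steps": ["step 1 from the manual", "step 2 from the manual"],
    },
}

def resolve_error(machine: str, code: str) -> dict:
    """Exact-match lookup handles the common fast path; misses would fall
    back to full RAG retrieval over all ingested manuals (not shown)."""
    section = MANUAL_INDEX.get((machine, code))
    if section is None:
        raise KeyError(f"no indexed section for {machine} error {code}")
    return section
```

A structured index like this is attractive for maintenance use cases because error codes are exact identifiers, where semantic similarity adds noise rather than value.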

Technically, Gear Pal combines translation of Japanese documentation with searchable indexing of error codes across decades-old equipment manuals.

The team shared a compelling user story: a manufacturing line went down on a Sunday, and three people spent 90 minutes searching for the error code. On a whim, they tried Gear Pal and found the answer in 30 seconds. This kind of time-to-resolution improvement represents significant operational value.

The application has been in production for approximately one quarter, with rollout expanding from Kentucky to other manufacturing plants. Projected savings are in the seven figures per quarter per manufacturing line, though the team noted they’re still calibrating actual ROI. Development took approximately nine months total: six months of coding and three months of planning and organizational processes.

Production Application: Product API

The presentation also touched on a Product API application, which appears to be a customer-facing search system for vehicle features. The team encountered a specific problem with semantic search: when users asked about features of the 2025 Camry, distance-based similarity metrics would sometimes retrieve 2024 model information instead—a factually incorrect result with potential legal implications.

Their solution involved constraining retrieval so that model-year correctness takes precedence over raw embedding similarity.
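
One plausible implementation of such a constraint, offered here as an assumption rather than Toyota's confirmed approach, is a hard metadata filter on model year applied before any similarity ranking:

```python
import re

def filter_by_model_year(query: str, docs: list[dict]) -> list[dict]:
    """Hard-filter candidates on model year before similarity ranking,
    so a 2025 Camry question can never surface 2024 content."""
    match = re.search(r"\b(19|20)\d{2}\b", query)
    if match is None:
        return docs  # no year in the query; rank all candidates
    year = match.group(0)
    return [d for d in docs if d.get("model_year") == year]

docs = [
    {"model_year": "2024", "text": "2024 Camry feature sheet"},
    {"model_year": "2025", "text": "2025 Camry feature sheet"},
]
hits = filter_by_model_year("What are the features of the 2025 Camry?", docs)
```

The design insight is that some attributes are categorical, not semantic: "near in embedding space" is exactly the wrong notion of closeness for model years, prices, or part numbers.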

Project Cura: Knowledge Management System

Project Cura (Japanese for “storehouse”) represents Toyota’s approach to capturing, managing, and retrieving institutional knowledge—addressing the universal problem of information silos and repeated questions in large organizations.

The team experimented with hyperscaler solutions and various knowledge base products, finding that none adequately solved their problem. A key insight was that LLMs naturally structure information in question-answer formats, which informed their approach to knowledge capture.

The platform is divided into three components.

The live interview feature is particularly innovative: it automatically identifies speakers, converts utterances into question-answer pairs, and relates content to meeting topics—even inferring questions when they aren’t explicitly stated. A self-service version allows users to record or type answers to questions, with AI-powered rewriting for clarity and question suggestions based on uploaded documents.
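
The utterance-to-QA-pair step can be sketched as follows. This is a minimal illustration under stated assumptions: the real system uses LLM calls for question inference, which is stubbed here with a template, and the function name is hypothetical.

```python
def transcript_to_qa(utterances: list[tuple[str, str]]) -> list[dict]:
    """Pair each question with the reply that follows it; when an answer
    arrives with no explicit question, infer one (an LLM call in a real
    system, stubbed here with a template)."""
    pairs, pending = [], None
    for speaker, text in utterances:
        if text.rstrip().endswith("?"):
            pending = text
        elif pending is not None:
            pairs.append({"question": pending, "answer": text,
                          "speaker": speaker})
            pending = None
        else:
            pairs.append({
                "question": f"What did {speaker} explain?",  # inferred stub
                "answer": text,
                "speaker": speaker,
            })
    return pairs

pairs = transcript_to_qa([
    ("Alice", "How do we reset the line after a fault?"),
    ("Bob", "Power-cycle the controller, then clear the fault code."),
    ("Bob", "Also, log the incident in the shift report."),
])
```

The second Bob utterance shows the interesting case: there is no explicit question, so one has to be inferred to keep the knowledge base in its uniform question-answer shape.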

The team is experimenting with knowledge graphs and developing a “rate of decay” scoring algorithm that measures knowledge relevancy over time based on interaction frequency and cross-reference patterns across team members. Currently, approximately 50 users are actively testing the system.
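
A decay score of this kind might combine exponential time decay with a usage boost, as in the sketch below. The functional form, the half-life, and the cross-reference weighting are all assumptions for illustration, not Toyota's actual algorithm.

```python
import math

def relevancy_score(days_since_last_use: float,
                    interactions: int,
                    cross_references: int,
                    half_life_days: float = 90.0) -> float:
    """Illustrative decay model: the score halves every `half_life_days`
    of disuse, boosted by how often and how widely (across team members)
    the knowledge item is referenced."""
    decay = 0.5 ** (days_since_last_use / half_life_days)
    boost = math.log1p(interactions + 2 * cross_references)
    return decay * boost

fresh = relevancy_score(days_since_last_use=7, interactions=20, cross_references=5)
stale = relevancy_score(days_since_last_use=365, interactions=20, cross_references=5)
```

With these assumed parameters, an item untouched for a year scores well below one used last week, even at identical interaction counts, which is the intended behavior for surfacing living knowledge over dead pages.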

Common Framework Benefits

A key architectural principle emphasized throughout the presentation is the value of a common RAG framework across use cases. Battery Brain, Gear Pal, and other applications share infrastructure, allowing the team to reuse ingestion, retrieval, and security components across applications rather than rebuilding them each time.

The team’s approach to product development in generative AI is notably capability-first rather than use-case-first, acknowledging that the technology is too new to predict all applications. They focus on building extensible capabilities and letting business users discover applications—using the metaphor of a Swiss army knife that can serve many purposes depending on who wields it.
