ZenML

Enterprise-Wide LLM Framework for Manufacturing and Knowledge Management

Toyota 2023

Toyota implemented a comprehensive LLMOps framework to address multiple production challenges, including battery manufacturing optimization, equipment maintenance, and knowledge management. The team developed a unified framework combining LangChain and LlamaIndex capabilities, with special attention to data ingestion pipelines, security, and multi-language support. Key applications include Battery Brain for manufacturing expertise, Gear Pal for equipment maintenance, and Project Cura for knowledge management, all showing significant operational improvements including reduced downtime and faster problem resolution.

Industry

Automotive

Overview

Toyota’s Enterprise AI team presented a comprehensive overview of their generative AI initiatives, focusing on how they have built and deployed multiple LLM-powered applications in production across manufacturing and knowledge management domains. The presentation featured multiple speakers from Toyota’s IT organization, covering their agentic RAG framework, specific production applications (Battery Brain, Gear Pal, and Project Cura), and the operational considerations that come with deploying LLMs at enterprise scale.

The team’s mission is to drive efficiency through AI experimentation and enablement, working across legal, government affairs, cybersecurity, HR, privacy, and various business units within Toyota. Their guiding principles emphasize exploration, experimentation, education, enablement, engagement, and ensuring quality—a particularly important point given their opening anecdote about a vendor chatbot that they “broke in one question” during evaluation.

Quality Assurance and Testing

The presentation began with a cautionary tale about generative AI quality. A business unit had been contacted by a vendor offering a chatbot solution. The business unit was enthusiastic and wanted to put it into production quickly. However, when Toyota’s Enterprise AI team evaluated the solution, they were able to break it with a single prompt. This led to the vendor requesting Toyota’s help to fix their own product—and upon code review, the team found the underlying implementation to be substandard.

This experience shaped their approach to quality assurance in LLM deployments: the team now applies several evaluation strategies to any system before it reaches production.

This evaluation-first mindset is particularly important given the stakes involved. As one speaker noted, if a customer-facing chatbot were hacked or manipulated to state incorrect pricing (like saying a Toyota Tundra costs $1), the company could face legal liability.
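
One concrete form such an evaluation can take is a regression check over adversarial prompts. The sketch below is a hypothetical illustration, not Toyota's actual test harness; the `price_sanity` function and the `floor` threshold are assumptions chosen to mirror the "$1 Tundra" failure mode described above.

```python
import re

def price_sanity(answer: str, floor: float = 10_000.0) -> bool:
    """Return False if the answer quotes any vehicle price below `floor`.

    A crude guard against manipulated outputs like "a Tundra costs $1".
    The floor value is illustrative, not a real business threshold.
    """
    prices = [float(p.replace(",", ""))
              for p in re.findall(r"\$([\d,]+(?:\.\d+)?)", answer)]
    return all(p >= floor for p in prices)

# An eval suite would run checks like this over many adversarial prompts:
assert not price_sanity("Sure! A Toyota Tundra costs $1 today only.")
assert price_sanity("The 2025 Tundra starts at $42,000.")
```

In practice a check like this would sit alongside semantic evaluations, since string-level guards only catch the most blatant failures.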

The Agentic Framework Architecture

A core contribution from Toyota’s team is their custom agentic framework that bridges LangChain and LlamaIndex. Rather than committing to a single framework, they recognized that each has distinct strengths.

Their framework creates a common abstraction layer that leverages the best of both worlds—using LlamaIndex as the document processor and LangChain for retrieval operations. Importantly, the framework is designed to be extensible: if any team develops an improved document ingestion pipeline or retrieval mechanism, that code can be integrated into the shared framework for use across all of Toyota’s generative AI applications.
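
The talk did not show code, but the shape of such an abstraction layer might look like the following sketch. `RAGPipeline`, the ingestor registry, and the keyword retriever are illustrative assumptions; in the real framework the retriever slot would be backed by LangChain and the ingestors by LlamaIndex document processors.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

class RAGPipeline:
    """Common abstraction over pluggable components: one team's improved
    ingestor or retriever is registered once and reused by every app."""

    def __init__(self) -> None:
        self.ingestors: Dict[str, Callable[[str], List[Document]]] = {}
        self.store: List[Document] = []
        # Default retriever: naive keyword overlap, a stand-in for a real
        # vector-store retriever.
        self.retriever: Callable[[str, List[Document]], List[Document]] = (
            lambda q, docs: [d for d in docs if q.lower() in d.text.lower()]
        )

    def register_ingestor(self, suffix: str,
                          fn: Callable[[str], List[Document]]) -> None:
        self.ingestors[suffix] = fn

    def ingest(self, path: str) -> None:
        suffix = path.rsplit(".", 1)[-1].lower()
        self.store.extend(self.ingestors[suffix](path))

    def retrieve(self, query: str, k: int = 3) -> List[Document]:
        return self.retriever(query, self.store)[:k]

# Usage: a hypothetical text ingestor shared across applications.
pipe = RAGPipeline()
pipe.register_ingestor("txt",
    lambda p: [Document("error 2086: spindle fault", {"src": p})])
pipe.ingest("mill_manual.txt")
hits = pipe.retrieve("2086")
```

The design point is the registry: swapping in a better component is a one-line registration, not a per-application rewrite.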

The middle layer of the architecture (highlighted in red during their presentation) represents their research contributions, with all components described as being in production rather than experimental.

Data Ingestion Pipeline Challenges

A recurring theme throughout the presentation was that data ingestion is the fundamental challenge in RAG applications. Toyota’s manufacturing operations generate documents in virtually every format imaginable: PDFs, text documents, videos, Excel files, XDW files, images, and complex nested structures including “tables inside tables inside tables with images.”

The team emphasized that while “everyone is talking about RAG,” the data ingestion pipeline is the real bottleneck. Poor ingestion quality leads to poor retrieval results, regardless of how sophisticated the retrieval and generation components might be.
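
A common way to manage this format zoo is to dispatch each file to a format-specific parser before chunking. The handlers below are labeled stand-ins (real ones would wrap a PDF extractor, a spreadsheet reader, OCR, and so on); only the dispatch-and-chunk structure is the point.

```python
import os

def chunk(text: str, size: int = 200) -> list[str]:
    """Split extracted text into fixed-size chunks for embedding."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Stand-in handlers: each would wrap a real format-specific parser.
HANDLERS = {
    ".pdf": lambda path: f"text extracted from {path}",
    ".xlsx": lambda path: f"flattened rows from {path}",
    ".txt": lambda path: open(path).read(),
}

def ingest(path: str) -> list[str]:
    ext = os.path.splitext(path)[1].lower()
    if ext not in HANDLERS:
        # Fail loudly on unsupported formats instead of silently skipping:
        # bad ingestion poisons every downstream retrieval result.
        raise ValueError(f"no ingestion handler for {ext}")
    return chunk(HANDLERS[ext](path))
```

Formats like XDW or deeply nested tables each earn their own handler over time, which is exactly the kind of component the shared framework lets teams contribute back.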

Prompt Guardian: Security Layer

Security is a critical concern for enterprise LLM deployments, and Toyota addressed this through “Prompt Guardian”—a security layer built using smaller language models. The system is designed to intercept and evaluate incoming prompts before they reach the main application, protecting against prompt injection attacks and other adversarial inputs.

The team provided a concrete example: if Toyota deployed a customer-facing chatbot on toyota.com and an attacker successfully manipulated it to display incorrect pricing, the legal and financial consequences could be severe. Prompt Guardian serves as a defensive barrier against such attacks.
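
The gate's control flow can be sketched as below. The keyword patterns are only a stand-in so the structure is visible; the actual Prompt Guardian uses smaller language models as the classifier, and the function names here are hypothetical.

```python
BLOCK_PATTERNS = [
    "ignore previous instructions",
    "ignore all prior instructions",
    "reveal your system prompt",
    "you are now",
]

def guardian_flags(prompt: str) -> bool:
    """Return True if the prompt looks adversarial. The real system calls
    a small fine-tuned language model here; keyword matching is only a
    stand-in to show where the check sits."""
    lowered = prompt.lower()
    return any(pattern in lowered for pattern in BLOCK_PATTERNS)

def answer(prompt: str) -> str:
    # Every prompt passes through the guardian before the main app.
    if guardian_flags(prompt):
        return "Request blocked by Prompt Guardian."
    return main_application(prompt)

def main_application(prompt: str) -> str:
    # Placeholder for the downstream RAG application.
    return f"(answer to: {prompt})"
```

Putting the check in front of the application, rather than inside it, means every app behind the shared framework inherits the same defense.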

Production Application: Battery Brain

Battery Brain is a RAG application designed to support Toyota’s new battery manufacturing plant in North Carolina. The problem it addresses is well-documented in manufacturing: new production lines have high scrappage rates, and subject matter expertise is crucial for identifying issues early before they cascade into larger problems.

The challenge is that subject matter expertise—deep academic knowledge and practical experience in battery manufacturing—is scarce and typically concentrated in a few individuals. Battery Brain aims to democratize this knowledge by making it accessible to all team members, effectively turning everyone into “their own subject matter expert.”

A key technical aspect of Battery Brain worth highlighting is the OneNote ingestion challenge, a real-world LLMOps issue: training sessions involve participants taking notes in personal styles, mixing images, handwritten notes, typed text, and blank pages. Developing a generalized ingestion strategy for such heterogeneous content required significant engineering effort.

Production Application: Gear Pal

Gear Pal addresses manufacturing line downtime, where every minute of stoppage costs “literally millions of dollars.” The application serves as a centralized, searchable repository of machine manuals and maintenance documentation for powertrain assets.

The problem is significant: some machines date back to the 1980s, maintenance manuals can be thousands of pages long, documentation may be in Japanese (requiring translation), and there are hundreds of machines with thousands of parts. When a machine goes down, it can take days to diagnose and repair.

Gear Pal allows maintenance personnel to quickly search for error codes and receive step-by-step remediation instructions. For example, when a six-axis CNC machine mill throws error 2086, instead of manually searching through documentation, users receive immediate guidance on what the error means and how to resolve it.
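
The error-code fast path can be sketched as a lookup keyed by machine and code, falling back to full retrieval when no indexed section matches. Everything here is hypothetical, including the placeholder remediation steps; the talk did not describe Gear Pal's internal data model.

```python
# Hypothetical manual index: sections keyed by (machine, error code).
MANUAL_INDEX = {
    ("6-axis CNC mill", "2086"): {
        "meaning": "placeholder description of error 2086",
        "steps": ["step 1 from the manual", "step 2 from the manual"],
    },
}

def resolve_error(machine: str, code: str) -> dict:
    """Exact-match lookup handles the common fast path; misses would fall
    back to full RAG retrieval over all ingested manuals (not shown)."""
    section = MANUAL_INDEX.get((machine, code))
    if section is None:
        raise KeyError(f"no indexed section for {machine} error {code}")
    return section
```

A structured index like this is attractive for maintenance use cases because error codes are exact identifiers, where semantic similarity adds noise rather than value.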

Technically, Gear Pal combines translation of Japanese documentation with searchable indexing of error codes across decades-old equipment manuals.

The team shared a compelling user story: a manufacturing line went down on a Sunday, and three people spent 90 minutes searching for the error code. On a whim, they tried Gear Pal and found the answer in 30 seconds. This kind of time-to-resolution improvement represents significant operational value.

The application has been in production for approximately one quarter, with rollout expanding from Kentucky to other manufacturing plants. Projected savings are in the seven figures per quarter per manufacturing line, though the team noted they’re still calibrating actual ROI. Development took approximately nine months total: six months of coding and three months of planning and organizational processes.

Production Application: Product API

The presentation also touched on a Product API application, which appears to be a customer-facing search system for vehicle features. The team encountered a specific problem with semantic search: when users asked about features of the 2025 Camry, distance-based similarity metrics would sometimes retrieve 2024 model information instead—a factually incorrect result with potential legal implications.

Their solution involved constraining retrieval so that model-year correctness takes precedence over raw embedding similarity.
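
One plausible implementation of such a constraint, offered here as an assumption rather than Toyota's confirmed approach, is a hard metadata filter on model year applied before any similarity ranking:

```python
import re

def filter_by_model_year(query: str, docs: list[dict]) -> list[dict]:
    """Hard-filter candidates on model year before similarity ranking,
    so a 2025 Camry question can never surface 2024 content."""
    match = re.search(r"\b(19|20)\d{2}\b", query)
    if match is None:
        return docs  # no year in the query; rank all candidates
    year = match.group(0)
    return [d for d in docs if d.get("model_year") == year]

docs = [
    {"model_year": "2024", "text": "2024 Camry feature sheet"},
    {"model_year": "2025", "text": "2025 Camry feature sheet"},
]
hits = filter_by_model_year("What are the features of the 2025 Camry?", docs)
```

The design insight is that some attributes are categorical, not semantic: "near in embedding space" is exactly the wrong notion of closeness for model years, prices, or part numbers.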

Project Cura: Knowledge Management System

Project Cura (Japanese for “storehouse”) represents Toyota’s approach to capturing, managing, and retrieving institutional knowledge—addressing the universal problem of information silos and repeated questions in large organizations.

The team experimented with hyperscaler solutions and various knowledge base products, finding that none adequately solved their problem. A key insight was that LLMs naturally structure information in question-answer formats, which informed their approach to knowledge capture.

The platform is divided into three components.

The live interview feature is particularly innovative: it automatically identifies speakers, converts utterances into question-answer pairs, and relates content to meeting topics—even inferring questions when they aren’t explicitly stated. A self-service version allows users to record or type answers to questions, with AI-powered rewriting for clarity and question suggestions based on uploaded documents.
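
The utterance-to-QA-pair step can be sketched as follows. This is a minimal illustration under stated assumptions: the real system uses LLM calls for question inference, which is stubbed here with a template, and the function name is hypothetical.

```python
def transcript_to_qa(utterances: list[tuple[str, str]]) -> list[dict]:
    """Pair each question with the reply that follows it; when an answer
    arrives with no explicit question, infer one (an LLM call in a real
    system, stubbed here with a template)."""
    pairs, pending = [], None
    for speaker, text in utterances:
        if text.rstrip().endswith("?"):
            pending = text
        elif pending is not None:
            pairs.append({"question": pending, "answer": text,
                          "speaker": speaker})
            pending = None
        else:
            pairs.append({
                "question": f"What did {speaker} explain?",  # inferred stub
                "answer": text,
                "speaker": speaker,
            })
    return pairs

pairs = transcript_to_qa([
    ("Alice", "How do we reset the line after a fault?"),
    ("Bob", "Power-cycle the controller, then clear the fault code."),
    ("Bob", "Also, log the incident in the shift report."),
])
```

The second Bob utterance shows the interesting case: there is no explicit question, so one has to be inferred to keep the knowledge base in its uniform question-answer shape.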

The team is experimenting with knowledge graphs and developing a “rate of decay” scoring algorithm that measures knowledge relevancy over time based on interaction frequency and cross-reference patterns across team members. Currently, approximately 50 users are actively testing the system.
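
A decay score of this kind might combine exponential time decay with a usage boost, as in the sketch below. The functional form, the half-life, and the cross-reference weighting are all assumptions for illustration, not Toyota's actual algorithm.

```python
import math

def relevancy_score(days_since_last_use: float,
                    interactions: int,
                    cross_references: int,
                    half_life_days: float = 90.0) -> float:
    """Illustrative decay model: the score halves every `half_life_days`
    of disuse, boosted by how often and how widely (across team members)
    the knowledge item is referenced."""
    decay = 0.5 ** (days_since_last_use / half_life_days)
    boost = math.log1p(interactions + 2 * cross_references)
    return decay * boost

fresh = relevancy_score(days_since_last_use=7, interactions=20, cross_references=5)
stale = relevancy_score(days_since_last_use=365, interactions=20, cross_references=5)
```

With these assumed parameters, an item untouched for a year scores well below one used last week, even at identical interaction counts, which is the intended behavior for surfacing living knowledge over dead pages.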

Common Framework Benefits

A key architectural principle emphasized throughout the presentation is the value of a common RAG framework across use cases. Battery Brain, Gear Pal, and other applications share infrastructure, allowing the team to reuse ingestion, retrieval, and security components across applications rather than rebuilding them each time.

The team’s approach to product development in generative AI is notably capability-first rather than use-case-first, acknowledging that the technology is too new to predict all applications. They focus on building extensible capabilities and letting business users discover applications—using the metaphor of a Swiss army knife that can serve many purposes depending on who wields it.
