42Q, a cloud-based Manufacturing Execution System (MES) provider, implemented an intelligent chatbot named Arthur to address the complexity of their system and improve user experience. The solution uses RAG and AWS Bedrock to combine documentation, training videos, and live production data, enabling users to query system functionality and real-time manufacturing data in natural language. The implementation showed significant improvements in user response times and system understanding, while maintaining data security within AWS infrastructure.
42Q is a cloud-based Manufacturing Execution System (MES) provider that developed an AI-powered expert chatbot called “Arthur” to help users navigate their complex software platform. The presentation, given by Claus Müller (Solutions Architect at 42Q) and Kristoff from AWS, details the phased approach to building and deploying this LLM-powered assistant, the technical architecture leveraging AWS services, and the ongoing considerations around expanding the chatbot’s capabilities.
The name “Arthur” is a reference to Douglas Adams’ “The Hitchhiker’s Guide to the Galaxy,” where 42 is famously the “answer to the ultimate question of life, the universe, and everything” - hence the company name 42Q. Arthur Dent, the main character who spends much of the book seeking answers, inspired the chatbot’s name.
MES systems like 42Q present several challenges for users:
These challenges created a clear use case for an intelligent assistant that could serve as an always-available expert on the 42Q system.
The first phase focused on creating a documentation-aware chatbot. The team took the following approach:
Data Ingestion: All existing documentation was loaded into the system. Critically, they also transcribed all training videos where experts explain how 42Q works, providing tips and guidance. This transcription proved particularly valuable as it captured not just the “how” but also the “why” behind various system configurations.
Integration: The chatbot was embedded directly into the 42Q portal, making it immediately available upon login without requiring separate authentication.
Multilingual Support: By design, the chatbot understands almost any language, leveraging the LLM’s inherent multilingual capabilities.
Context Retention: The system maintains conversation context, allowing users to ask follow-up questions without repeating earlier context.
Source Attribution: Every answer includes references to the original sources (documentation or training videos), building trust and enabling deeper exploration.
The second phase extended Arthur’s capabilities to understand not just the software but also the live production data. This was achieved by:
Database Connectivity: Arthur was connected to the 42Q MES database, enabling live queries against the instance the user is connected to.
API Integration: The chatbot uses APIs to retrieve current data including shop orders, part numbers, routes, defects, and other production information.
Dynamic Presentation: Users can request data in various formats - tables for easy copy-paste, flowcharts, summaries, or aggregations - all handled by the LLM’s output formatting capabilities.
A particularly interesting discovery emerged during Phase 2 development. When the team first connected Arthur to the database and received JSON responses, the chatbot immediately began explaining the data, interpreting status codes, suggesting missing information, and providing context. This was not explicitly programmed but emerged from the combination of the LLM’s reasoning capabilities and the training video content, which contained examples and guidance that the model could apply to interpret real data. For instance, when asked about defective components and whether they had been replaced, Arthur correctly interpreted that a “removed zero” indicator meant the component was repaired in place rather than replaced - knowledge it derived from training video examples.
The solution is built entirely on AWS services, which was a deliberate choice for data security and operational simplicity:
Amazon Bedrock: The core service for building the AI solution, providing access to foundation models, agents, and guardrails. The team specifically highlighted Bedrock’s model flexibility - they experimented with different models and found Anthropic’s Claude to provide the best results for their use case.
RAG (Retrieval Augmented Generation): All documentation and transcribed content is stored in S3 buckets and used to augment the LLM’s responses with domain-specific knowledge.
AWS Transcription Service: Used to convert training videos into text for ingestion into the RAG system.
Bedrock Agents: Enable the chatbot to call Lambda functions that interact with 42Q APIs, allowing live data queries.
API Gateway: Controls access to APIs and manages the integration between the chatbot and backend systems.
Lambda Functions: Serve as the bridge between Bedrock Agents and 42Q’s APIs, executing queries and returning data to the chatbot.
Data Security: A key architectural decision was ensuring all data stays within the customer’s AWS account. This addresses growing concerns about data privacy and prevents training data from being used to improve third-party models. The team emphasized this is “as secure as the data itself” since the chatbot infrastructure is co-located with the production data.
The team highlighted Bedrock’s model marketplace as a significant advantage. They can easily switch between models to optimize for:
Currently, Anthropic’s Claude provides the best results for their use case. The ability to run newer models like DeepSeek within their own AWS account, with guarantees that data won’t leave the account or be used for retraining, was noted as particularly valuable for manufacturing customers with strict data governance requirements.
The presentation touched on AWS Bedrock’s guardrails capabilities, which allow implementing responsible AI checks. The team can:
The deployment has generated positive feedback:
The presentation candidly discussed the next logical step: allowing Arthur to not just query and explain, but to take actions within the MES. Examples mentioned include releasing shop orders, manipulating production data, or stopping production lines.
However, the team is deliberately pausing before implementing this capability, raising important questions:
The speakers noted that while the technical capability exists, the organizational and safety considerations require careful deliberation. This represents a mature approach to LLMOps - recognizing that what is technically possible isn’t always what should be deployed immediately.
Several LLMOps best practices emerge from this case study:
Phased Deployment: Rolling out capabilities incrementally (documentation first, then data access, then potentially actions) allows for learning and adjustment at each stage.
Authentication Integration: Leveraging existing portal authentication rather than requiring separate chatbot credentials reduces friction and maintains security boundaries.
Single Unified Interface: Despite multiple data sources (documentation, videos, live data), maintaining a single chatbot interface simplifies the user experience.
Source Attribution: Always linking answers to original sources builds trust and enables verification.
Model Experimentation: The architecture allows easy model switching to optimize performance as new models become available.
Data Locality: Keeping all data within customer AWS accounts addresses enterprise security requirements.
Deliberate Capability Expansion: The team’s hesitation around Phase 3 demonstrates thoughtful consideration of AI’s appropriate role in critical production systems.
The case study demonstrates a pragmatic approach to deploying LLMs in an industrial context, balancing innovation with the reliability requirements of manufacturing environments.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Yahoo! Finance built a production-scale financial question answering system using multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock Agent Core and employs a supervisor-subagent pattern where specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at costs of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.
Toyota Motor North America (TMNA) and Toyota Connected built a generative AI platform to help dealership sales staff and customers access accurate vehicle information in real-time. The problem was that customers often arrived at dealerships highly informed from internet research, while sales staff lacked quick access to detailed vehicle specifications, trim options, and pricing. The solution evolved from a custom RAG-based system (v1) using Amazon Bedrock, SageMaker, and OpenSearch to retrieve information from official Toyota data sources, to a planned agentic platform (v2) using Amazon Bedrock AgentCore with Strands agents and MCP servers. The v1 system achieved over 7,000 interactions per month across Toyota's dealer network, with citation-backed responses and legal compliance built in, while v2 aims to enable more dynamic actions like checking local vehicle availability.