Building a Secure Enterprise AI Assistant with RAG and Custom Infrastructure

Hexagon 2025
Hexagon's Asset Lifecycle Intelligence division developed HxGN Alix, an AI-powered digital worker to enhance user interaction with their Enterprise Asset Management products. They implemented a secure solution using AWS services, custom infrastructure, and RAG techniques. The solution successfully balanced security requirements with AI capabilities, deploying models on Amazon EKS with private subnets, implementing robust guardrails, and solving various RAG-related challenges to provide accurate, context-aware responses while maintaining strict data privacy standards.

Industry

Tech

Summary

Hexagon’s Asset Lifecycle Intelligence division embarked on a journey to develop HxGN Alix, an AI-powered digital worker designed to revolutionize how users interact with their Enterprise Asset Management (EAM) products. This case study represents a comprehensive example of enterprise LLMOps, covering strategy formulation, technology selection, implementation challenges, and operational considerations for deploying LLMs in production environments with stringent security requirements.

The primary motivation was to address the difficulty users faced when navigating extensive PDF manuals to find information about EAM products. The solution needed to operate within high-security environments, maintain data privacy, support multiple languages, and provide accurate, grounded responses to user queries.

Strategic Approach: Crawl, Walk, Run

Hexagon adopted a phased approach to their generative AI implementation, which is a sensible strategy for organizations new to production LLM deployments. The three phases were structured as follows:

The Crawl phase focused on establishing foundational infrastructure with emphasis on data privacy and security. This included implementing guardrails around security, compliance, and data privacy, setting up capacity management and cost governance, and creating the necessary policies, monitoring mechanisms, and architectural patterns for long-term scalability. This foundation-first approach is critical for enterprise LLMOps, as retrofitting security and compliance controls after deployment is significantly more difficult.

The Walk phase transitioned from proof of concept to production-grade implementations. The team deepened their technical expertise, refined operational processes, and gained real-world experience with generative AI models. They integrated domain-specific data to improve relevance while reinforcing tenant-level security for proper data segregation. This phase validated AI-driven solutions in real-world scenarios through iterative improvements.

The Run phase focused on scaling development across multiple teams in a structured and repeatable manner. By standardizing best practices and development frameworks, they enabled different products to adopt AI capabilities efficiently while focusing on high-value use cases.

Technology Stack Selection

Hexagon’s technology selection criteria reflected the balance between control, customization, cost, and compliance that enterprise LLMOps demands.

LLM Selection: Open Source vs. Commercial

The team evaluated multiple criteria for choosing between commercial and open source LLMs. They ultimately selected Mistral NeMo, a 12-billion parameter open source LLM built in collaboration with NVIDIA and released under the Apache 2.0 license. Key factors in this decision included:

Mistral NeMo offered a large context window of up to 128,000 tokens, optimization for function calling, and strong multilingual capabilities across English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.

Infrastructure Choices

For compute and deployment, the team leveraged Amazon EKS (Elastic Kubernetes Service), utilizing their existing production cluster which already had the required safety, manageability, and DevOps integration. This approach allowed them to use existing investments in infrastructure and tooling while maintaining high availability and scalability.

They selected Amazon EC2 G6e.48xlarge instances powered by NVIDIA L40S GPUs, described as the most cost-efficient GPU instances for deploying generative AI models under 12 billion parameters.

Amazon S3 provided secure storage for product documentation and user data, while Amazon Bedrock served as a fallback solution using the Mistral 7B model with multi-Region endpoints to handle Regional failures and maintain service availability.
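The primary/fallback routing described above can be sketched in a few lines. The callables here are hypothetical stand-ins: `primary` would wrap the self-hosted Mistral NeMo endpoint on EKS, and `fallback` the Amazon Bedrock Mistral 7B multi-Region endpoint.

```python
# Sketch of primary/fallback model routing, assuming two callables that
# each take a prompt and return a completion. Any failure on the
# self-hosted endpoint (Regional outage, timeout, capacity error)
# triggers the Bedrock fallback.

def generate_with_fallback(prompt: str, primary, fallback) -> str:
    """Try the self-hosted model first; on any failure, fall back to Bedrock."""
    try:
        return primary(prompt)
    except Exception:
        # e.g. the EKS deployment is unreachable or over capacity
        return fallback(prompt)
```

A production version would narrow the caught exception types and add retries with backoff before failing over, but the shape of the pattern is the same.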

RAG Implementation and Challenges

A significant portion of the case study details the implementation of Retrieval Augmented Generation (RAG), which was essential for grounding the model’s responses in accurate documentation and reducing hallucinations.

Chunking Challenges

The team encountered the common problem of context destruction when chunking documents. Applying standard chunking methods to tables or complex structures risks losing relational data, which can result in critical information not being retrieved. To address this, they used the hierarchical chunking capability of Amazon Bedrock Knowledge Bases, which helped preserve context in the final chunks.
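As a rough illustration, hierarchical chunking is configured on a Bedrock Knowledge Bases data source through its vector ingestion configuration. The token sizes below are example values, not Hexagon's actual settings; parent chunks carry the broad context (for instance, a whole table section) while the smaller child chunks are what gets embedded and retrieved.

```python
# Illustrative hierarchical chunking configuration for a Bedrock Knowledge
# Bases data source (boto3 "bedrock-agent" client). Token sizes are
# example values only.
hierarchical_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [
                {"maxTokens": 1500},  # parent chunks: preserve surrounding context
                {"maxTokens": 300},   # child chunks: the actual retrieval units
            ],
            "overlapTokens": 60,
        },
    }
}

# Passed as part of vectorIngestionConfiguration when creating the data source:
# bedrock_agent.create_data_source(..., vectorIngestionConfiguration=hierarchical_chunking)
```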

Document Format Handling

Hexagon’s product documentation, accumulated over decades, varied greatly in format with many non-textual elements such as tables. Tables are particularly difficult to interpret when directly queried from PDFs or Word documents. The team used the FM parsing capability of Amazon Bedrock Knowledge Bases, which processed raw documents with an LLM before creating final chunks, ensuring data from non-textual elements was correctly interpreted.
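Foundation-model parsing is likewise set through the data source's ingestion configuration. In this sketch the model ARN and parsing prompt are placeholders, not the values Hexagon used; the idea is that an LLM transcribes each raw document, rendering tables and other non-textual elements into text, before chunking runs.

```python
# Illustrative FM parsing configuration for Bedrock Knowledge Bases.
# An LLM pre-processes each raw PDF/Word document into text before
# chunking. Model ARN and prompt are placeholders.
fm_parsing = {
    "parsingConfiguration": {
        "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
        "bedrockFoundationModelConfiguration": {
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/example-parser-model",
            "parsingPrompt": {
                "parsingPromptText": (
                    "Transcribe the document content, rendering tables as "
                    "Markdown so row/column relationships are preserved."
                )
            },
        },
    }
}
```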

Handling LLM Boundaries

User queries sometimes exceeded system capabilities, such as requests for comprehensive lists of product features. Because documentation is split into multiple chunks, the retrieval system might not return all necessary documents. The team created custom documents containing FAQs and special instructions for these edge cases, adding them to the knowledge base as few-shot examples to help the model produce more accurate and complete responses.

Grounding and Hallucination Mitigation

To address the inherent tendency of LLMs to produce potentially inaccurate outputs, the team used a combination of specialized prompts along with contextual grounding checks from Amazon Bedrock Guardrails. This dual approach helps ensure responses are factually grounded in the retrieved documentation.
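A contextual grounding check in Bedrock Guardrails is, in essence, a pair of scored filters; responses scoring below the thresholds are blocked as ungrounded or off-topic. The thresholds below are illustrative, not Hexagon's actual values.

```python
# Sketch of a contextual grounding policy for Amazon Bedrock Guardrails
# (boto3 "bedrock" control-plane client). Thresholds are example values.
grounding_policy = {
    "filtersConfig": [
        {"type": "GROUNDING", "threshold": 0.75},  # answer must be supported by retrieved chunks
        {"type": "RELEVANCE", "threshold": 0.75},  # answer must actually address the query
    ]
}

# Supplied when creating the guardrail, alongside blocked-message text:
# bedrock.create_guardrail(
#     name="alix-grounding",
#     contextualGroundingPolicyConfig=grounding_policy,
#     blockedInputMessaging="...",
#     blockedOutputsMessaging="...",
# )
```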

Conversational Context Management

Users often engage in brief follow-up questions like “Can you elaborate?” or “Tell me more.” When processed in isolation by the RAG system, these queries yield no results. The team tested two approaches:

Prompt-based search reformulation has the LLM first identify user intent and generate a more complete query for the knowledge base. While this requires an additional LLM call, it yields highly relevant results and keeps the final prompt concise.

Context-based retrieval with chat history sends the last five messages from chat history to the knowledge base, allowing broader results with faster response times due to only one LLM round trip.

The first method worked better with large document sets by focusing on highly relevant results, while the second approach was more effective with smaller, focused document sets.
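The two strategies above can be sketched as follows; `llm` and `retrieve` are hypothetical callables standing in for the model endpoint and the knowledge base query, and the prompt wording is illustrative.

```python
# Sketches of the two retrieval strategies described above. `llm` and
# `retrieve` are hypothetical callables: `llm` takes a prompt and returns
# text, `retrieve` takes a search query and returns documents.

def reformulate_then_retrieve(question, history, llm, retrieve):
    """Approach 1: one extra LLM call rewrites a terse follow-up
    ("Tell me more") into a self-contained search query."""
    prompt = (
        "Given the conversation below, rewrite the user's last message "
        "as a self-contained search query.\n\n"
        + "\n".join(history)
        + f"\nUser: {question}\nQuery:"
    )
    return retrieve(llm(prompt))

def retrieve_with_history(question, history, retrieve, window=5):
    """Approach 2: no extra LLM call; send the last `window` messages
    together with the question straight to the knowledge base."""
    return retrieve("\n".join(history[-window:] + [question]))
```

Approach 1 trades an extra LLM round trip for a sharper query, which matches the observation that it performed better against large document sets; Approach 2 is faster but casts a wider net.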

Security and Compliance

Security was paramount throughout the implementation. The team used isolated private subnets to ensure code interacting with models wasn’t connected to the internet, enhancing information protection for users.

Critically, because user interactions are in free-text format and might include personally identifiable information (PII), the team designed the system to not store any user interactions in any format. This approach provides complete confidentiality of AI use, adhering to strict data privacy standards.

Amazon Bedrock Guardrails provided the framework for enforcing safety and compliance, enabling customization of filtering policies to ensure AI-generated responses align with organizational standards and regulatory requirements. The guardrails include capabilities to detect and mitigate harmful content, define content moderation rules, restrict sensitive topics, and establish enterprise-level security for generative AI interactions.
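Such a guardrail combines content filters and restricted topics in a single configuration. The sketch below uses example filter strengths and an invented topic name, not Hexagon's actual policy.

```python
# Illustrative Bedrock Guardrails configuration combining content filters
# and a denied topic. Names, strengths, and messages are example values.
guardrail_config = {
    "name": "alix-safety",
    "blockedInputMessaging": "This request cannot be processed.",
    "blockedOutputsMessaging": "The response was blocked by policy.",
    "contentPolicyConfig": {
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            # Prompt-attack filtering applies to inputs only, so the
            # output strength is NONE.
            {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
        ]
    },
    "topicPolicyConfig": {
        "topicsConfig": [
            {
                "name": "LegalAdvice",
                "definition": "Requests for legal advice about contracts or liability.",
                "type": "DENY",
            }
        ]
    },
}

# bedrock.create_guardrail(**guardrail_config)
```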

Development Lifecycle Adjustments

The case study highlights important considerations for adapting traditional software development practices to generative AI systems:

Testing challenges are significant because generative AI systems cannot rely solely on unit tests. Prompts can return different results each time, making verification more complex. The team had to develop new testing and QA methodologies to ensure consistent and reliable responses.
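One common way to test nondeterministic output, sketched here as an assumption rather than Hexagon's documented method, is to assert on properties of a response instead of an exact string: required facts must appear, forbidden phrases must not.

```python
# Property-based check for nondeterministic LLM output: rather than
# comparing against one exact expected string, assert that required
# facts appear in the response and forbidden content does not.

def check_response(response: str, must_contain: list[str], must_not_contain: list[str]) -> bool:
    text = response.lower()
    return (
        all(fact.lower() in text for fact in must_contain)
        and not any(bad.lower() in text for bad in must_not_contain)
    )
```

In practice such a check would be run against several samples of the same prompt, since any single generation may phrase things differently.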

Performance variability is another concern: unlike traditional APIs with predictable response times, LLM latency varied widely, from 1 to 60 seconds depending on the user's query.

Continuous monitoring was implemented to track performance metrics and user interactions, allowing for ongoing optimization of the AI system.

Amazon Bedrock Prompt Management simplified the creation, evaluation, versioning, and sharing of prompts within the engineering team to optimize responses from foundation models.

Critical Assessment

While the case study presents a comprehensive approach to enterprise LLMOps, it’s worth noting some considerations:

The content is published on AWS’s blog and co-authored by AWS solutions architects, so it naturally emphasizes AWS services. Organizations should evaluate whether equivalent capabilities exist in other cloud providers or open-source alternatives.

The quantitative results are somewhat limited—the case study describes qualitative improvements in user experience and workflow efficiency but doesn’t provide specific metrics on accuracy improvements, cost savings, or productivity gains. This makes it difficult to objectively assess the ROI of the implementation.

The selection of Mistral NeMo as a 12B parameter model is interesting, as it sits in a middle ground between larger commercial models and smaller, more deployable open-source options. The trade-offs between model size, cost, and capability are important considerations that could benefit from more detailed analysis.

Overall, this case study provides valuable insights into the practical challenges of deploying LLMs in enterprise environments, particularly around RAG implementation, security considerations, and the need to adapt development practices for AI systems. The phased approach and emphasis on foundational security before scaling represent sound practices for organizations embarking on similar initiatives.
