A large energy supplier faced challenges with technical help desk operations supporting 5,000 weekly calls from meter technicians in the field, with average handling times exceeding 5 minutes for the top 10 issue categories representing 60% of calls. Infosys Topaz partnered with AWS to build a generative AI solution using Amazon Bedrock's Claude Sonnet model to create a knowledge base from call transcripts, implement retrieval-augmented generation (RAG), and deploy an AI assistant with role-based access control. The solution reduced average handling time by 60% (from over 5 minutes to under 2 minutes), enabled the AI assistant to handle 70% of previously human-managed calls, and increased customer satisfaction scores by 30%.
This case study documents how Infosys Topaz leveraged Amazon Bedrock and AWS services to transform technical help desk operations for a large energy supplier. The energy company operates meter installation, exchange, service, and repair operations where field technicians frequently call technical support agents for guidance on issues they cannot resolve independently. The volume of approximately 5,000 calls per week (20,000 monthly) presented significant operational challenges and costs.
The implementation represents a comprehensive LLMOps solution that addresses the full machine learning operations lifecycle, from data ingestion and preprocessing to model deployment, monitoring, and continuous improvement. The solution demonstrates enterprise-scale deployment of generative AI with proper security controls, access management, and performance optimization.
The solution architecture is an LLMOps pipeline centered on Amazon Bedrock’s Claude Sonnet model. Call transcripts are stored in JSON format in Amazon S3, and an event-driven design triggers AWS Lambda functions whenever new transcripts are uploaded. This automated ingestion keeps the knowledge base current as new calls arrive; the model itself is not retrained, so improvement comes from fresher retrieval content rather than from new training data.
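The upload trigger can be sketched as a small Lambda handler. This is a minimal illustration, not the project's code: the function names are hypothetical, and it only extracts the bucket and key from a standard S3 event notification rather than starting any real workflow.

```python
from urllib.parse import unquote_plus


def transcript_keys(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for each object named in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            # S3 URL-encodes object keys in notifications (spaces arrive as '+').
            pairs.append((bucket, unquote_plus(key)))
    return pairs


def handler(event, context):
    """Hypothetical Lambda entry point: report which transcripts would be processed."""
    objects = transcript_keys(event)
    # A real function would start the preprocessing workflow here.
    return {"queued": [f"s3://{bucket}/{key}" for bucket, key in objects]}
```

Keeping the event parsing in its own function makes it testable without any AWS credentials.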
The preprocessing pipeline uses AWS Step Functions to orchestrate a multi-stage workflow. Raw CSV files containing contact IDs, participant roles (agent or customer), and conversation content pass through several Lambda functions. One stage is an automated conversation classifier, in which Claude Sonnet uses zero-shot chain-of-thought prompting to decide whether a conversation is relevant. This step filters out disconnected or meaningless conversations so that only substantive conversations enter the knowledge base.
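A first preprocessing stage of this kind might group raw rows into per-contact conversations before classification. The sketch below assumes column names `contact_id`, `participant`, and `content`; the source names the fields but not their exact headers, so treat these as placeholders.

```python
import csv
import io
from collections import defaultdict


def group_conversations(csv_text: str) -> dict[str, list[tuple[str, str]]]:
    """Group rows by contact ID, preserving turn order and skipping empty turns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    conversations = defaultdict(list)
    for row in reader:
        content = (row.get("content") or "").strip()
        if content:
            conversations[row["contact_id"]].append((row["participant"], content))
    return dict(conversations)


def render_transcript(turns: list[tuple[str, str]]) -> str:
    """Flatten a conversation into 'role: text' lines, ready to place in a prompt."""
    return "\n".join(f"{role}: {text}" for role, text in turns)
```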
The knowledge base construction process demonstrates sophisticated prompt engineering techniques. Rather than simple summarization, the system generates structured outputs including conversation summaries, problem identification, and resolution steps. This structured approach to knowledge extraction enables more effective retrieval and response generation in production scenarios.
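One way to enforce that structure downstream is to ask the model for JSON and validate it before storing anything. The field names below (`summary`, `problem`, `resolution_steps`) are assumptions chosen to mirror the three outputs described above, not the project's actual schema.

```python
import json
from dataclasses import dataclass


@dataclass
class KnowledgeEntry:
    summary: str
    problem: str
    resolution_steps: list[str]


def parse_entry(model_output: str) -> KnowledgeEntry:
    """Extract and validate the JSON object embedded in a model response."""
    # Models often wrap JSON in extra prose, so slice from the first '{' to the last '}'.
    start, end = model_output.find("{"), model_output.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(model_output[start:end + 1])
    steps = data["resolution_steps"]
    if not isinstance(steps, list) or not steps:
        raise ValueError("resolution_steps must be a non-empty list")
    return KnowledgeEntry(
        summary=str(data["summary"]).strip(),
        problem=str(data["problem"]).strip(),
        resolution_steps=[str(step).strip() for step in steps],
    )
```

Rejecting malformed output at this point keeps incomplete records out of the index, at the cost of needing a retry path for failed generations.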
The solution implements a production-ready retrieval-augmented generation (RAG) system using Amazon OpenSearch Serverless as the vector database. The choice of OpenSearch Serverless provides scalable, high-performing vector storage with real-time capabilities for adding, updating, and deleting embeddings without impacting query performance. This represents a crucial LLMOps consideration for maintaining system availability during model updates and knowledge base expansion.
Embedding generation uses the Amazon Titan Text Embeddings model, which is optimized for text retrieval tasks. The chunking strategy uses a chunk size of 1,000 tokens with overlapping windows of 150-200 tokens, a setting tuned to balance context preservation against retrieval accuracy. The implementation also uses sentence window retrieval to improve result precision.
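The overlapping-window chunking can be illustrated with a short sliding-window function. This is a sketch under the assumption that the text has already been tokenized; a real pipeline would count tokens with the embedding model's own tokenizer, which is not shown here.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 1000, overlap: int = 150) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no trailing chunk is pure overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, each window starts 850 tokens after the previous one, so a sentence cut at one boundary appears whole in the neighboring chunk.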
The knowledge base schema demonstrates thoughtful data modeling for production LLM applications. Each record contains conversation history, summaries, problem descriptions, resolution steps, and vector embeddings, enabling both semantic search and structured query capabilities. This dual-mode access pattern supports different user interaction patterns and improves overall system flexibility.
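A schema along these lines could be expressed as an OpenSearch index body with one `knn_vector` field next to ordinary text fields. The field names and the default dimension are illustrative assumptions; the dimension must match whichever embedding model is actually used.

```python
def index_body(dimension: int = 1024) -> dict:
    """Index definition with text fields plus a k-NN vector field (field names are hypothetical)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "conversation": {"type": "text"},
                "summary": {"type": "text"},
                "problem": {"type": "text"},
                "resolution_steps": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def knn_query(vector: list[float], k: int = 5) -> dict:
    """Semantic search: nearest neighbors of a query embedding."""
    return {"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}}


def problem_query(text: str, size: int = 5) -> dict:
    """Structured search: full-text match against the extracted problem field."""
    return {"size": size, "query": {"match": {"problem": text}}}
```

The two query builders correspond to the two access modes: one searches by meaning through the embedding, the other by keywords in a specific field.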
The LLMOps implementation includes comprehensive production deployment considerations. The user interface, built with Streamlit, incorporates role-based access control through integration with DynamoDB for user management. Three distinct personas (administrator, technical desk analyst, technical agent) have different access levels to conversation transcripts, implemented through separate OpenSearch Serverless collections.
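The persona-to-collection routing might look like the sketch below. The collection names and the shape of the user record are invented for illustration; only the three personas come from the description above.

```python
from enum import Enum


class Role(Enum):
    ADMINISTRATOR = "administrator"
    DESK_ANALYST = "technical_desk_analyst"
    TECHNICAL_AGENT = "technical_agent"


# Hypothetical collection names, one per persona.
COLLECTION_FOR_ROLE = {
    Role.ADMINISTRATOR: "kb-admin",
    Role.DESK_ANALYST: "kb-analyst",
    Role.TECHNICAL_AGENT: "kb-agent",
}


def collection_for(user_record: dict) -> str:
    """Resolve the collection a user may query; deny access when the role is missing or unknown."""
    try:
        role = Role(user_record["role"])
    except (KeyError, ValueError):
        raise PermissionError("unknown or missing role") from None
    return COLLECTION_FOR_ROLE[role]
```

Failing closed on an unrecognized role means a malformed user record never falls through to a default collection.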
Performance optimization includes caching with Streamlit’s st.cache_data decorator, which stores the results of repeated queries so they can be reused across user sessions instead of being recomputed. The FAQ system tracks query frequency: user queries are stored in DynamoDB with counter columns, which identifies the most common questions. This data-driven approach to user experience optimization is a key aspect of production LLMOps.
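The counter-column idea can be sketched as follows. The attribute names (`question_text`, `hit_count`) are assumptions, and the function only builds the arguments one could pass to a DynamoDB `update_item` call; no AWS call is made here.

```python
from collections import Counter


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivially different phrasings share one counter."""
    return " ".join(question.lower().split())


def increment_request(question: str) -> dict:
    """Arguments for an atomic counter increment (ADD creates the attribute if it is absent)."""
    return {
        "Key": {"question_text": normalize(question)},
        "UpdateExpression": "ADD hit_count :one",
        "ExpressionAttributeValues": {":one": 1},
    }


def top_questions(questions: list[str], n: int = 3) -> list[str]:
    """In-memory equivalent of reading back the most frequently asked questions."""
    counts = Counter(normalize(q) for q in questions)
    return [question for question, _ in counts.most_common(n)]
```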
User feedback collection is systematically implemented through like/dislike buttons for each response, with feedback data persisted in DynamoDB. This continuous feedback loop enables model performance monitoring and provides data for future model fine-tuning efforts. The system tracks multiple metrics including query volume, transcript processing rates, helpful responses, user satisfaction, and miss rates.
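Two of these metrics can be derived directly from the stored feedback. The definitions used below are assumptions, since the source does not define them: helpful rate is likes divided by rated responses, and miss rate is unanswered queries divided by all queries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    answered: bool           # False when the assistant found no usable answer
    feedback: Optional[str]  # "like", "dislike", or None when the user gave no rating


def summarize(interactions: list[Interaction]) -> dict:
    """Aggregate feedback into a helpful rate and a miss rate (None when undefined)."""
    total = len(interactions)
    likes = sum(1 for i in interactions if i.feedback == "like")
    dislikes = sum(1 for i in interactions if i.feedback == "dislike")
    misses = sum(1 for i in interactions if not i.answered)
    rated = likes + dislikes
    return {
        "queries": total,
        "helpful_rate": likes / rated if rated else None,
        "miss_rate": misses / total if total else None,
    }
```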
The LLMOps implementation addresses enterprise security requirements through multiple layers of protection. AWS Secrets Manager securely stores API keys and database credentials with automatic rotation policies. S3 buckets utilize AWS KMS encryption with AES-256, and versioning is enabled for audit trails. Personally identifiable information (PII) handling includes encryption and strict access controls through IAM policies and AWS KMS.
The OpenSearch Serverless deployment encrypts data at rest using AWS KMS and in transit using TLS 1.2. Session management includes timeout controls for inactive sessions, requiring re-authentication for continued access. The system maintains end-to-end encryption across the entire infrastructure, with regular auditing through AWS CloudTrail.
The production deployment shows significant operational improvements. The AI assistant now handles 70% of previously human-managed calls. Average handling time for the top 10 issue categories decreased from over 5 minutes to under 2 minutes, a reduction of roughly 60%.
The continuous learning aspect of the LLMOps implementation shows measurable improvement over time. Within the first 6 months after deployment, the percentage of issues requiring human intervention decreased from 30-40% to 20%, demonstrating the effectiveness of the knowledge base expansion and model improvement processes. Customer satisfaction scores increased by 30%, indicating improved service quality alongside operational efficiency gains.
The solution demonstrates sophisticated prompt engineering techniques throughout the pipeline. Classification prompts for conversation relevance use expanded, descriptive language rather than simple keywords. For example, instead of “guidelines for smart meter installation,” the system uses “instructions, procedures, regulations, and best practices along with agent experiences for installation of a smart meter.” This approach to prompt optimization represents a key LLMOps practice for improving model performance in production.
The zero-shot chain-of-thought prompting approach for conversation classification enables the model to first summarize conversations before determining relevance, improving classification accuracy. The structured output generation for problem identification and resolution steps demonstrates how prompt engineering can enforce consistent, useful formats for downstream applications.
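A summarize-then-classify prompt of this kind might look like the following. The prompt wording and the `VERDICT:` output convention are this sketch's own inventions, not the project's actual prompt; no model call is made.

```python
import re

# Hypothetical prompt: no examples are given (zero-shot), and the model is asked
# to reason through a summary before committing to a verdict (chain of thought).
PROMPT_TEMPLATE = """You are reviewing a call between a field technician and a help desk agent.

Transcript:
{transcript}

First, summarize the conversation in two or three sentences.
Then decide whether it contains a real technical problem and its resolution.
End with exactly one line: VERDICT: RELEVANT or VERDICT: NOT_RELEVANT"""

_VERDICT = re.compile(r"^VERDICT:\s*(RELEVANT|NOT_RELEVANT)\s*$", re.MULTILINE)


def build_prompt(transcript: str) -> str:
    """Fill the template with one conversation transcript."""
    return PROMPT_TEMPLATE.format(transcript=transcript.strip())


def parse_verdict(response: str) -> bool:
    """Return True when the model's final verdict line says the conversation is relevant."""
    matches = _VERDICT.findall(response)
    if not matches:
        raise ValueError("response has no VERDICT line")
    # Use the last verdict in case the reasoning text quotes the instruction.
    return matches[-1] == "RELEVANT"
```

Requiring a fixed final line keeps the free-form reasoning while still giving the pipeline something unambiguous to parse.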
The serverless architecture using Lambda functions and Step Functions provides automatic scaling capabilities to handle varying transcript volumes. The event-driven processing model ensures efficient resource utilization while maintaining responsive processing of new call transcripts. OpenSearch Serverless collections enable horizontal scaling of vector search capabilities as the knowledge base grows.
The role-based access control implementation using multiple OpenSearch collections provides both security isolation and performance optimization. Different user roles access different subsets of the knowledge base, reducing query complexity and improving response times. This approach demonstrates how LLMOps considerations must balance functionality, security, and performance in production deployments.
The caching strategy includes both in-memory and persistent storage options with configurable data persistence duration and invalidation policies. Cache updates can occur based on data changes or time intervals, providing flexibility to balance data freshness with performance requirements. This represents a mature approach to production LLM system optimization.
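Time-based expiry and explicit invalidation can be illustrated with a small in-memory cache. This is a stand-in written for illustration, not Streamlit's caching API, and the class and method names are hypothetical.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds or when invalidated."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store = {}

    def put(self, key, value) -> None:
        self._store[key] = (value, self._clock())

    def get(self, key):
        """Return the cached value, or None when it is absent or older than the TTL."""
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            del self._store[key]  # time-based invalidation
            return None
        return value

    def invalidate(self, key=None) -> None:
        """Drop one entry, or everything when the underlying data has changed."""
        if key is None:
            self._store.clear()
        else:
            self._store.pop(key, None)
```

A short TTL favors freshness and a long one favors speed; calling `invalidate()` after a knowledge base update covers the data-change case regardless of the TTL.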
Overall, this case study demonstrates a comprehensive LLMOps implementation that addresses the full lifecycle of generative AI deployment in enterprise environments, from data pipeline automation to production monitoring and continuous improvement. The measurable business impact validates the technical approach and provides a model for similar enterprise AI implementations.
Coinbase, a cryptocurrency exchange serving millions of users across 100+ countries, faced challenges scaling customer support amid volatile market conditions, managing complex compliance investigations, and improving developer productivity. They built a comprehensive Gen AI platform integrating multiple LLMs through standardized interfaces (OpenAI API, Model Context Protocol) on AWS Bedrock to address these challenges. Their solution includes AI-powered chatbots handling 65% of customer contacts automatically (saving ~5 million employee hours annually), compliance investigation tools that synthesize data from multiple sources to accelerate case resolution, and developer productivity tools where 40% of daily code is now AI-generated or influenced. The implementation uses a multi-layered agentic architecture with RAG, guardrails, memory systems, and human-in-the-loop workflows, resulting in significant cost savings, faster resolution times, and improved quality across all three domains.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Rocket Companies, a Detroit-based FinTech company, developed Rocket AI Agent to address the overwhelming complexity of the home buying process by providing 24/7 personalized guidance and support. Built on Amazon Bedrock Agents, the AI assistant combines domain knowledge, personalized guidance, and actionable capabilities to transform client engagement across Rocket's digital properties. The implementation resulted in a threefold increase in conversion rates from web traffic to closed loans, 85% reduction in transfers to customer care, and 68% customer satisfaction scores, while enabling seamless transitions between AI assistance and human support when needed.