ZenML

Optimizing Production LLM Chatbot Performance Through Multi-Model Classification

IDIADA 2025

IDIADA developed AIDA, an intelligent chatbot powered by Amazon Bedrock, to assist their workforce with various tasks. To optimize performance, they implemented specialized classification pipelines using different approaches including LLMs, k-NN, SVM, and ANN with embeddings from Amazon Titan and Cohere models. The optimized system achieved 95% accuracy in request routing and drove a 20% increase in team productivity, handling over 1,000 interactions daily.

Industry

Automotive

Overview

IDIADA (Applus+ IDIADA) is a global partner to the automotive industry with over 30 years of experience in product development, design, engineering, testing, and homologation services. In 2021, they established a Digital Solutions department to drive innovation through digital tools. Within this context, they developed AIDA (Applus Idiada Digital Assistant), an intelligent chatbot powered by Amazon Bedrock that serves as a virtual assistant for IDIADA’s workforce.

This case study focuses on a specific LLMOps challenge: how to efficiently classify and route incoming user requests to appropriate processing pipelines. As AIDA’s usage grew, the diversity of requests expanded from simple queries to complex tasks including document translations, internal service inquiries, and file uploads. The team needed to categorize these interactions to route them to specialized pipelines for more efficient handling.

The Classification Challenge

The core production challenge was developing a robust classifier that could categorize user interactions into three main groups: Conversation (general inquiries), Services (specific functionality requests), and Document_Translation (text translation needs). The goal was to improve response efficiency, accuracy, and resource allocation by having dedicated processing pipelines for each category.

The team worked with a dataset of 1,668 pre-classified human interactions, split 40/60 between training (666 examples) and testing (1,002 examples). This deliberate emphasis on the test set indicates a focus on generalization performance, which is crucial for production deployment.

Technical Approaches Evaluated

LLM-Based Simple Prompt Classifier

The first approach used Anthropic’s Claude 3 Sonnet model through Amazon Bedrock with a carefully crafted prompt. The implementation used LangChain for integration, Boto3 for AWS connectivity, and structured output parsing to extract classifications in JSON format.

Key implementation details include setting max_tokens to 50 and temperature to 0.0 for deterministic output. The prompt included class definitions and a hierarchical classification approach, prioritizing the Conversation class first (since 99% of interactions are simple questions) before considering other classes. The code included error handling that defaulted to ‘Conversation’ if parsing failed, which is a pragmatic production consideration.
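As a rough illustration of this pattern, the sketch below shows how such a classifier might be wired up. The prompt wording, class descriptions, and parsing helper are assumptions for illustration, not IDIADA's actual code; only the inference parameters (max_tokens=50, temperature 0.0), the JSON output contract, and the 'Conversation' fallback come from the case study.

```python
import json

# The three routing categories from the case study.
CLASSES = ["Conversation", "Services", "Document_Translation"]

# Hypothetical prompt; the real class definitions and hierarchy differ.
PROMPT_TEMPLATE = """You are a request classifier. Assign the user message to one of:
- Conversation: general questions and chit-chat (the default, most common case)
- Services: requests for specific internal functionality
- Document_Translation: requests to translate a document or text

Answer with JSON only, e.g. {{"category": "Conversation"}}.

User message: {message}"""

def parse_label(raw: str) -> str:
    """Extract the category from the model's JSON output.

    Falls back to 'Conversation' on any parsing failure, mirroring the
    pragmatic default described in the case study.
    """
    try:
        label = json.loads(raw)["category"]
        return label if label in CLASSES else "Conversation"
    except (json.JSONDecodeError, KeyError, TypeError):
        return "Conversation"

def classify(message: str, client) -> str:
    """Call Claude 3 Sonnet on Amazon Bedrock (client = boto3 'bedrock-runtime')."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 50,      # short, structured answer only
        "temperature": 0.0,    # deterministic output
        "messages": [{"role": "user",
                      "content": PROMPT_TEMPLATE.format(message=message)}],
    })
    resp = client.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=body)
    return parse_label(json.loads(resp["body"].read())["content"][0]["text"])
```

The fallback in `parse_label` means a malformed model response degrades gracefully to the dominant class rather than crashing the routing layer.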

LLM-Based Example Augmented Inference (RAG)

The second approach enhanced the simple prompt with examples for each class, following RAG principles. Instead of relying solely on compressed definitions, the model received quasi-definitions by extension through diverse examples.

The implementation faced significant scalability challenges. The Conversation class was capped at 500 examples due to infrastructure constraints, Services had 26 examples, and Document_Translation had 140. The team noted that while more examples could improve performance, they hit bottlenecks including Amazon Bedrock quota limits, high latency, throttling, and dropped connections. The result was an unacceptable response time of approximately 18 seconds per classification, compared to 1.2 seconds for the simple prompt approach.
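A minimal sketch of how example-augmented prompting of this kind can be assembled (the example texts here are invented). Note that the prompt, and therefore per-request token cost and latency, grows linearly with the number of examples, which is exactly the scaling bottleneck the team ran into.

```python
# Invented illustrative examples; the real dataset had 500 / 26 / 140 per class.
EXAMPLES = {
    "Conversation": ["What time does the canteen open?", "Tell me a fun fact."],
    "Services": ["Open a ticket for a new laptop."],
    "Document_Translation": ["Translate this report into Spanish."],
}

def build_few_shot_prompt(message: str) -> str:
    """Assemble a few-shot classification prompt from labelled examples."""
    lines = ["Classify the user message into one of the classes below.",
             "Each class is illustrated by labelled examples.", ""]
    for label, examples in EXAMPLES.items():
        for ex in examples:
            lines.append(f'"{ex}" -> {label}')
    lines += ["", f'"{message}" ->']
    return "\n".join(lines)
```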

K-NN-Based Classifier with Embeddings

Recognizing that many interactions share similar patterns, the team shifted to distance-based classification using embeddings. They tested two embedding models through Amazon Bedrock:

Amazon Titan Text Embeddings G1: Generates 1,536-dimensional vectors with multilingual support. The k-NN classifier with k=3 performed optimally based on F1 score analysis across different k values.

Cohere Multilingual Embeddings Model: Generates 1,024-dimensional vectors with strong multilingual capabilities. The optimal k value was 11. Notably, Cohere’s model has a 1,500 character limit for input text, which required truncating longer interactions.

The team performed dimensionality reduction to visualize the embedding spaces in 2D, finding that both models produced similar but rotated vector spaces with distinct class clusters, making them well-suited for distance-based classification.
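The resulting pattern can be sketched with scikit-learn. Synthetic clustered vectors stand in here for the real embeddings, which would come from Bedrock API calls (with inputs truncated to 1,500 characters before calling the Cohere model); only the k=3 choice for Titan comes from the case study.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for 1,536-dim Titan embeddings: three well-separated
# clusters play the role of the class regions seen after dimensionality
# reduction in the case study.
def fake_embed(label_id, n, dim=1536):
    center = np.zeros(dim)
    center[label_id] = 20.0  # each class sits along a distinct axis
    return center + rng.normal(scale=0.5, size=(n, dim))

labels = ["Conversation", "Services", "Document_Translation"]
X = np.vstack([fake_embed(i, 30) for i in range(3)])
y = np.repeat(labels, 30)

# k=3 was the F1-optimal choice for Titan embeddings (k=11 for Cohere).
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

query = fake_embed(1, 1)  # a point drawn near the "Services" center
print(knn.predict(query)[0])
```

In production the same `fit`/`predict` flow applies; only `fake_embed` would be replaced by a Bedrock embedding call.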

SVM-Based Classifier

Building on the geometric structure revealed by the embeddings, the team implemented SVM classifiers using scikit-learn. They performed grid search with 10-fold cross-validation using weighted F1 score as the evaluation metric.

The optimal parameters for both embedding models were: C=1 (indicating balanced model complexity), class_weight=None (suggesting minor classes don’t significantly impact the decision boundary), and kernel=‘linear’ (indicating categories are linearly separable in the high-dimensional space).
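Assuming an embedding matrix X and labels y, the search the team describes might look like the following scikit-learn sketch. Synthetic blobs stand in for the embeddings, and the grid values beyond the reported optimum (C=1, class_weight=None, linear kernel) are assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the embedding matrix: three separable clusters.
X, y = make_blobs(n_samples=300, centers=3, n_features=20, random_state=0)

# Grid search with 10-fold CV and weighted F1, as described in the case study.
grid = GridSearchCV(
    SVC(),
    param_grid={
        "C": [0.1, 1, 10],
        "kernel": ["linear", "rbf"],
        "class_weight": [None, "balanced"],
    },
    cv=10,
    scoring="f1_weighted",
)
grid.fit(X, y)
print(grid.best_params_)
```

The `scoring="f1_weighted"` choice matters with imbalanced classes like these: it averages per-class F1 weighted by class support, so the small Services class still contributes to model selection.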

ANN-Based Classifier

The team also evaluated a simple neural network approach with a three-layer architecture (16 neurons → 8 neurons → 3 neurons), using ReLU activations and softmax output. Min/max normalization was applied to input vectors. Training included early stopping based on categorical accuracy (patience=25), and one-hot encoding was used for outputs to avoid ordinal assumptions.

The training process used a multi-threaded approach with 20 repetitions to find optimal weights, running for up to 500 epochs with batch size 64 and learning rate 0.01.
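The case study does not name the framework used. As a sketch under that caveat, scikit-learn's MLPClassifier can approximate the described setup: the 3-unit softmax output layer and one-hot handling are implicit for a 3-class problem, and `n_iter_no_change` stands in for the patience setting. Synthetic blobs again replace the real embeddings.

```python
from sklearn.datasets import make_blobs
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the embedding matrix: three separable clusters.
X, y = make_blobs(n_samples=300, centers=3, n_features=20, random_state=0)

# 16 -> 8 hidden neurons with ReLU, as in the case study's architecture.
model = make_pipeline(
    MinMaxScaler(),                        # min/max normalization of inputs
    MLPClassifier(hidden_layer_sizes=(16, 8),
                  activation="relu",
                  learning_rate_init=0.01,
                  batch_size=64,
                  max_iter=500,
                  early_stopping=True,     # stop on validation plateau
                  n_iter_no_change=25,     # mirrors patience=25
                  random_state=0),
)
model.fit(X, y)
print(round(model.score(X, y), 2))
```

The 20-repetition weight search from the case study would correspond to refitting this model with different random seeds and keeping the best validation score.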

Results and Production Implications

The comprehensive evaluation revealed striking differences in both performance and efficiency:

| Approach | Conversation F1 | Services F1 | Doc Translation F1 | Runtime |
|---|---|---|---|---|
| Simple LLM | 0.81 | 0.22 | 0.46 | 1.2s |
| LLM with examples | 0.86 | 0.13 | 0.68 | 18s |
| k-NN + Titan | 0.98 | 0.57 | 0.88 | 0.35s |
| k-NN + Cohere | 0.96 | 0.72 | 0.72 | 0.35s |
| SVM + Titan | 0.98 | 0.69 | 0.82 | 0.3s |
| SVM + Cohere | 0.99 | 0.80 | 0.93 | 0.3s |
| ANN + Titan | 0.98 | 0.60 | 0.87 | 0.15s |
| ANN + Cohere | 0.99 | 0.77 | 0.96 | 0.15s |

The classical ML approaches with embeddings dramatically outperformed the pure LLM-based classifiers across all metrics. The SVM and ANN approaches with Cohere embeddings achieved the best balance of performance and efficiency, with the ANN being the fastest at 0.15 seconds per inference.

Interestingly, adding examples to the LLM actually hurt performance on the Services class (dropping from 0.22 to 0.13) while significantly increasing latency from 1.2s to 18s. This demonstrates that naive RAG approaches don’t always improve outcomes and can introduce production-breaking latency issues.

Production Deployment Outcomes

The optimized AIDA system now handles over 1,000 interactions per day with a claimed 95% accuracy rate in routing requests to appropriate pipelines. IDIADA reports a 20% increase in team productivity, though it’s worth noting these figures come from the company itself and should be interpreted with appropriate context.

Key LLMOps Insights

This case study provides several valuable insights for LLMOps practitioners:

Embedding + Classical ML can outperform LLM inference for classification: The combination of pre-trained embeddings with simple classifiers achieved better accuracy and 4-60x faster inference than direct LLM classification. This pattern is worth considering when the task is well-defined classification rather than open-ended generation.

RAG approaches have scalability limits: The example-augmented approach hit infrastructure constraints including API quotas, throttling, and latency issues that made it impractical for production use, despite potential accuracy improvements with more examples.

Multilingual considerations matter: Cohere’s multilingual model outperformed Amazon Titan for this multilingual use case, though it came with input length limitations that required handling.

Model evaluation methodology: The team’s use of 40/60 train/test splits, 10-fold cross-validation for hyperparameter tuning, and multi-threaded training repetitions demonstrates production-grade evaluation practices.

Pipeline architecture: The classification layer enables routing to specialized downstream pipelines (conversation handling, service orchestration, translation), showing how LLMOps often involves orchestrating multiple components rather than single model deployment.

Looking forward, IDIADA plans to offer AIDA as an integrated product for customer environments, which would require additional considerations around multi-tenancy, customization, and deployment flexibility that are common in productized AI solutions.
