Convirza transformed its call center analytics platform by replacing large commercial language models with small language models (specifically Llama 3.1 8B) fine-tuned using adapter-based methods. By partnering with Predibase, the company achieved a 10x cost reduction compared to OpenAI while improving accuracy by 8% and throughput by 80%. The system analyzes millions of calls monthly, extracting hundreds of custom indicators for agent performance and caller behavior, with sub-0.1 second inference times using efficient multi-adapter serving on single GPUs.
Convirza is a company with over two decades of experience in phone conversation analytics, having analyzed over a billion calls since its founding in 2001. Its core business involves recording and analyzing phone conversations for clients to extract actionable insights that drive revenue and coaching opportunities. The company has evolved significantly from its origins using analog recording devices and manual human review, becoming an AI-driven digital company starting in 2014. This case study, presented by Convirza’s CTO Moadi and VP of AI Jeppi at what appears to be a Predibase event, details their transition to small language models (SLMs) and their partnership with Predibase to solve significant LLMOps challenges.
The company serves major brands and analyzes millions of calls monthly, measuring hundreds of different data points they call “indicators.” These indicators cover agent performance metrics (proper greetings, asking for the business, scheduling appointments) and caller/customer signals (buying signals, lead quality). Each indicator essentially answers a specific question about what happened on a call and is measured numerically to drive business outcomes like conversion rates.
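To make this concrete, here is a hypothetical sketch of how an indicator might be modeled as a named, numerically weighted question about a call. The field names and values below are illustrative assumptions, not Convirza’s actual schema.

```python
# Hypothetical illustration of an "indicator": a named question about a call,
# answered per transcript and weighted into an overall agent score.
from dataclasses import dataclass

@dataclass(frozen=True)
class Indicator:
    name: str       # stable identifier, e.g. "asked_for_business"
    question: str   # the yes/no question posed about the transcript
    weight: float   # contribution to an aggregate performance score

ASKED_FOR_BUSINESS = Indicator(
    name="asked_for_business",
    question="Did the agent explicitly ask for the caller's business?",
    weight=0.15,
)
```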
Convirza’s customers demanded increasingly sophisticated insights from their phone calls. The arrival of ChatGPT and large language models raised expectations among both external clients and internal stakeholders: capabilities like explaining why certain scores were given, extracting relevant passages, and providing detailed feedback on agent performance became baseline requirements. However, with millions of phone calls processed monthly and a continuously growing number of custom indicators needed for different industries and clients, using large commercial LLMs like OpenAI’s GPT models proved unsustainable from both a cost and an accuracy perspective.
The company also needed to handle unpredictable traffic patterns with seasonal variations and sudden traffic bursts, requiring infrastructure that could scale up and down rapidly. Phone calls themselves vary dramatically in length—from a couple of minutes to an hour—requiring the system to handle variable-length text inputs efficiently.
Convirza’s AI evolution is instructive for understanding their current approach. They were traditionally an AWS shop using SageMaker to power over 60 different models for near real-time data extraction and classification. Their first language model was BERT in 2019, which was state-of-the-art at the time. In 2021, they transitioned to Longformer to handle extended context lengths needed for longer phone call transcripts.
However, this architecture had significant limitations. Each model was deployed on its own auto-scaling infrastructure, which meant that as they scaled to more models (indicators), costs increased significantly and infrastructure management became increasingly complex. Training Longformer models was also extremely time-consuming, taking hours or even days to achieve reasonable accuracy.
About seven months before this presentation (placing it in late 2023 or early 2024), Convirza began researching whether small language models could outperform larger commercial offerings when fine-tuned for their specific use cases. They fine-tuned several SLMs using LoRA (Low-Rank Adaptation) and compared results against OpenAI’s models.
Their findings were significant: fine-tuned small language models were considerably more accurate than OpenAI’s models when trained with high-quality, curated data. Llama 3.1 8B stood out in particular for its exceptional ability to follow instructions out of the box without fine-tuning and for its large context window compared to Longformer. The key insight was that SLMs have substantial world knowledge already “baked in,” meaning they could be trained with fewer epochs on smaller but highly curated datasets to achieve better performance in much shorter time frames.
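As a minimal sketch of what LoRA fine-tuning of an SLM looks like, the following uses Hugging Face’s transformers and peft libraries; the base model checkpoint and hyperparameters are illustrative assumptions rather than Convirza’s actual configuration.

```python
# Minimal LoRA fine-tuning setup (assumed hyperparameters, not Convirza's).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model checkpoint
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small low-rank matrices alongside frozen base weights, which is
# why one adapter per indicator stays cheap to train and store.
lora = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
```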
Convirza initially experimented with the open-source LoRAX project for serving multiple LoRA adapters efficiently. After running a proof of concept (POC) with Predibase for about a month and a half, they determined that a commercial partnership made more sense, allowing them to focus on delivering actionable insights rather than managing complex multi-cloud scalable infrastructure.
The POC had ambitious goals: match or exceed the accuracy of OpenAI’s models, cut inference costs by an order of magnitude, and keep latency low enough for near real-time insights at a volume of millions of calls per month.
Moving to Predibase’s LoRAX-based adapter serving significantly simplified Convirza’s training pipeline. The new workflow runs on commodity hardware, since the heavy lifting happens on Predibase’s infrastructure.
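For illustration, querying a LoRAX-style multi-adapter deployment can look like the following sketch; the endpoint URL and adapter ID are hypothetical placeholders, not Convirza’s production code.

```python
# Sketch of multi-adapter inference with the LoRAX Python client
# (pip install lorax-client); endpoint and adapter IDs are hypothetical.
from lorax import Client

client = Client("http://lorax.example.internal:8080")  # assumed endpoint

transcript = "Agent: Thanks for calling, how can I help you today? ..."
prompt = (
    "Did the agent deliver a proper greeting? Answer yes or no.\n\n"
    f"Transcript:\n{transcript}"
)

# One frozen base model serves every indicator; only adapter_id changes per
# request, so a new indicator adds no new deployment.
response = client.generate(
    prompt,
    adapter_id="convirza/proper-greeting",  # hypothetical adapter ID
    max_new_tokens=4,
)
print(response.generated_text)
```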
This simplified deployment model has subtle but important consequences for their operations. A/B testing and canary releases become trivially easy because newly trained adapters and existing adapters can run simultaneously without incurring additional infrastructure costs, as sketched below. The previous architecture forced difficult decisions about when to decommission old infrastructure while running parallel models.
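A canary rollout under this model can be as simple as routing a small share of requests to the new adapter ID; the split, endpoint, and adapter names below are hypothetical.

```python
# Sketch of a canary release between two adapter versions on one deployment.
import random

from lorax import Client

client = Client("http://lorax.example.internal:8080")  # assumed endpoint

STABLE = "convirza/proper-greeting-v3"   # hypothetical adapter IDs
CANARY = "convirza/proper-greeting-v4"
CANARY_SHARE = 0.10                      # illustrative 10% split

def score_call(prompt: str) -> tuple[str, str]:
    # Both versions are resident on the same GPUs, so the canary adds no
    # infrastructure cost and no decommissioning decision for the old one.
    adapter = CANARY if random.random() < CANARY_SHARE else STABLE
    out = client.generate(prompt, adapter_id=adapter, max_new_tokens=4)
    return adapter, out.generated_text
```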
The production results exceeded their POC expectations significantly: roughly a 10x cost reduction compared to OpenAI, an 8% improvement in accuracy, an 80% improvement in throughput, and sub-0.1 second inference times from multi-adapter serving on single GPUs.
Perhaps most importantly, the cost scaling characteristics are dramatically better. When comparing cost growth as the number of indicators increases, both OpenAI and Longformer costs increase rapidly. Predibase costs grow much more gently—the marginal cost of a single additional adapter is practically zero. Cost increases are primarily related to maintaining the required throughput and latency rather than the number of adapters.
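The shape of that claim can be expressed as a back-of-the-envelope cost model; every number below is a made-up placeholder, not Convirza’s or Predibase’s actual pricing.

```python
import math

def per_model_monthly_cost(indicators: int, deployment: float = 500.0) -> float:
    # Old pattern: each indicator model has its own auto-scaling deployment,
    # so infrastructure cost grows linearly with the number of indicators.
    return indicators * deployment

def multi_adapter_monthly_cost(requests: int,
                               requests_per_gpu: float = 5e7,
                               gpu_month: float = 1500.0) -> float:
    # New pattern: GPUs are sized for aggregate throughput and latency;
    # loading one more LoRA adapter onto the shared base model is ~free.
    return math.ceil(requests / requests_per_gpu) * gpu_month
```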
Convirza enhanced their monitoring capabilities as part of this transition. They monitor throughput, latency, and signals of data drift using a combination of Predibase’s dashboard and AWS tools. The ability to easily run A/B tests and canary releases has improved their operational confidence in deploying new or updated adapters.
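The talk does not specify how drift is computed; one common signal that fits this setup is the population stability index (PSI) over an indicator’s score distribution week over week, sketched below with conventional rule-of-thumb thresholds.

```python
# Sketch of a data-drift signal: PSI between a baseline week's indicator
# scores and the current week's. Thresholds are common rules of thumb.
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(expected, bins=bins)
    e, _ = np.histogram(expected, bins=edges)
    o, _ = np.histogram(observed, bins=edges)
    e = np.clip(e / e.sum(), 1e-6, None)  # avoid division by / log of zero
    o = np.clip(o / o.sum(), 1e-6, None)
    return float(np.sum((o - e) * np.log(o / e)))

# PSI < 0.1 is usually read as stable; > 0.25 as significant drift worth
# investigating or retraining against.
```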
The hybrid infrastructure setup allows Convirza to maintain some GPU instances in their own VPC while leveraging Predibase for additional scale during traffic bursts. The right side of their presentation showed call volume data with seasonal patterns and traffic peaks, demonstrating the need for infrastructure that can scale rapidly to maintain near real-time actionable insights.
While the results presented are impressive, they warrant some scrutiny: the figures are self-reported and were shared at what appears to be a Predibase event, a setting that naturally emphasizes successes over trade-offs.
Nevertheless, the core technical approach—using LoRA adapters with small language models to serve many custom classification/extraction tasks on shared infrastructure—is a well-established pattern that genuinely offers cost and flexibility advantages over deploying separate models or using large commercial APIs for each task.
This case study illustrates several important LLMOps patterns: consolidating many per-task model deployments into LoRA adapters on shared GPUs, fine-tuning small open models on curated data to outperform large commercial APIs on narrow tasks, exploiting adapter co-residency for cheap A/B testing and canary releases, and pairing owned capacity with a managed platform to absorb bursty, seasonal traffic.
Convirza’s journey from 60+ separately deployed models to 60+ adapters on minimal GPU infrastructure represents a significant operational simplification that is becoming increasingly common as organizations mature their LLM deployment strategies.