A panel of experts from several companies explored the challenges and solutions in deploying voice AI agents in production. The discussion covered key aspects of voice AI development, including real-time response handling, emotional intelligence, cultural adaptation, and user retention. The experts shared experiences from the e-commerce, healthcare, and tech sectors, highlighting the importance of proper testing, prompt engineering, and understanding user interaction patterns for successful voice AI deployments.
This case study is derived from a panel discussion on voice AI agents featuring practitioners from multiple organizations: Kiara from Prosus (a global technology company with approximately one billion customers across e-commerce, food delivery, fintech, and classifieds), Monica from Google’s speech synthesis group (with 14+ years of experience), Rex (a solopreneur with a background at Amazon Alexa), and Tom from Canonical (which builds observability tools for voice AI agents). The discussion offers insight into the operational challenges of deploying voice AI agents in production environments, drawing on real-world experience across industries including e-commerce, edtech, and healthcare.
A central theme throughout the discussion is that voice AI agents present fundamentally different challenges compared to text-based LLM applications. As Rex noted, LLMs are already challenging due to their non-deterministic nature, making testing and enforcing predictability difficult. Voice adds multiple additional layers of complexity on top of this foundation.
Kiara shared a concrete example from Prosus’s work building an e-commerce agent that could browse the web and perform tasks for users. When the team reused the prompts that worked for their text-based agent with OpenAI’s Realtime API for voice, the result “failed miserably.” The key insight is that with text, everything is structured and visible, which gives the builder significant control. Voice interactions lack this structure and introduce entirely new dimensions of user experience that must be carefully managed.
One of the most significant production challenges discussed was latency management. Users become frustrated if the agent doesn’t respond immediately, which has major implications for tool calling in LLM-powered agents. When an agent needs to perform an action or retrieve information, the traditional approach of waiting for the tool to complete before responding simply doesn’t work in voice contexts.
The Prosus team developed several strategies to address this:
This represents a significant departure from typical LLM application architectures where tool calls can happen synchronously without immediate user feedback requirements.
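The non-blocking pattern described above can be illustrated with a minimal sketch. It is not the Prosus implementation: `slow_tool`, `handle_turn`, and the `speak` callback are hypothetical names, and the delays are arbitrary. The idea is to start the tool call as a background task and speak a short filler phrase if the result has not arrived within a small time budget.

```python
import asyncio
from typing import Callable


async def slow_tool(query: str, delay: float = 0.05) -> str:
    """Hypothetical tool call standing in for a web lookup."""
    await asyncio.sleep(delay)
    return f"3 results for {query!r}"


async def handle_turn(
    query: str,
    speak: Callable[[str], None],
    filler_after: float = 0.01,
) -> None:
    """Run the tool in the background; fill the silence if it is slow."""
    task = asyncio.create_task(slow_tool(query))
    # Wait briefly for the tool. If it has not finished, acknowledge the user.
    done, _pending = await asyncio.wait({task}, timeout=filler_after)
    if not done:
        speak("One moment while I look that up.")
    # Speak the real answer once the tool result is available.
    speak(await task)


def run_demo() -> list[str]:
    """Collect what the agent would say, in order (assumed test harness)."""
    spoken: list[str] = []
    asyncio.run(handle_turn("running shoes", spoken.append))
    return spoken
```

In a real voice pipeline `speak` would stream audio to the caller rather than append to a list, and the filler budget would be tuned against measured tool latency.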
The panelists identified user input handling as a critical operational concern. Voice input can be vague and difficult to interpret, especially when dealing with postal addresses, names, and spelling. A key observation was that current voice AI systems lack self-awareness about their limitations—they don’t recognize when they’ve misheard or misunderstood input.
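Because the agent cannot tell when it has misheard, one common mitigation is to read error-prone fields back to the user. The sketch below is illustrative only and is not taken from the panel: the field names, the `confidence` score (assumed to come from the speech recognizer), and the 0.85 threshold are all assumptions.

```python
# Fields the panel singled out as hard to capture by voice.
HIGH_RISK_FIELDS = {"address", "name", "postal_code"}


def needs_confirmation(field: str, confidence: float, threshold: float = 0.85) -> bool:
    """Confirm high-risk fields always, and any field with low confidence."""
    return field in HIGH_RISK_FIELDS or confidence < threshold


def read_back(value: str) -> str:
    """Spell the value character by character so the user can catch errors."""
    spelled = " ".join(ch.upper() for ch in value if not ch.isspace())
    return f"I heard {spelled}. Is that correct?"
```

For example, `read_back("sw1a 1aa")` produces a prompt that spells out each character of the postal code, giving the user a chance to correct a misheard letter before the agent acts on it.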
Production strategies to address this include:
Testing emerged as a major topic, with panelists acknowledging that voice agents behave completely differently in production compared to controlled development environments. The gap between demo quality and production quality is substantial.
Kiara described Prosus’s approach to building robust test sets:
Tom from Canonical emphasized the importance of observability in production, noting that his company helps builders “see what’s happening in production when real humans are interacting with their agents.” This reflects the broader challenge that controlled testing cannot fully replicate the unpredictability of real user interactions—as Tom colorfully quoted, “Everybody’s got a plan until they get punched in the face.”
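One way to organize test coverage across languages and emotional states is as a grid of scenarios. The following is a minimal sketch under that assumption. `Scenario` and `build_matrix` are hypothetical names, the example values are invented, and the sketch does not reproduce Prosus's actual test set.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable


@dataclass(frozen=True)
class Scenario:
    """One simulated call to run against the agent."""
    language: str
    emotion: str
    task: str


def build_matrix(
    languages: Iterable[str],
    emotions: Iterable[str],
    tasks: Iterable[str],
) -> list[Scenario]:
    """Enumerate every combination of language, emotional state, and task."""
    return [Scenario(l, e, t) for l, e, t in product(languages, emotions, tasks)]


matrix = build_matrix(
    languages=["en", "pt"],
    emotions=["neutral", "frustrated", "rushed"],
    tasks=["track order", "change address"],
)
```

Each scenario could then drive a scripted or synthetic caller, with production observations feeding new rows back into the grid.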
Monica from Google brought a unique perspective on the technical challenges of voice synthesis and emotional responsiveness. She highlighted that achieving a natural-sounding voice is only the beginning—production agents must also interpret and respond to emotional nuances in user speech.
The challenge extends beyond simple empathy. As Rex noted, while prompts typically include instructions to be “empathetic,” true emotional responsiveness requires a range of appropriate responses. Sometimes empathy isn’t the right response; the agent needs to match the emotional context of the conversation appropriately.
Monica raised an important point about the “uncanny valley” in voice AI—users are currently uncertain about whether they’re talking to humans or AI, and there’s often reluctance to engage fully. She suggested this will change over time but represents a current barrier to adoption.
A recurring theme was the importance of clear product definition before building voice agents. Monica emphasized being “eternally frustrated with very vague product definitions” because voice is so complex that trying to build a comprehensive solution often leads to failure.
The recommended approach is to:
Rex raised an advanced challenge that goes beyond single-call optimization: building voice agents for repeat dialogue. Creating agents that users want to call back represents a much harder problem than optimizing individual conversations. This requires developing something akin to a relationship—the agent needs to remember context across sessions and provide compelling enough experiences that users want to return.
This represents a frontier challenge in voice AI operations, as current systems are primarily optimized for single-session interactions.
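As a rough illustration of remembering context across sessions, the sketch below persists short notes per caller in a JSON file and loads them at the start of the next call. The `SessionMemory` class, the file layout, and the caller identifier are assumptions made for this example. A production system would need proper storage, consent handling, and privacy controls.

```python
import json
from pathlib import Path


class SessionMemory:
    """Hypothetical per-caller note store backed by a single JSON file."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def _read_all(self) -> dict[str, list[str]]:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def load(self, caller_id: str) -> list[str]:
        """Return notes from earlier sessions, oldest first."""
        return self._read_all().get(caller_id, [])

    def remember(self, caller_id: str, note: str) -> None:
        """Append a note so that a later session can refer back to it."""
        data = self._read_all()
        data.setdefault(caller_id, []).append(note)
        self.path.write_text(json.dumps(data), encoding="utf-8")
```

At the start of a call, the loaded notes could be added to the agent's prompt so that it can pick up where the previous conversation ended.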
Rex shared a concrete example of a production voice AI application: an avatar-based system for prostate cancer surgery recovery support. The system used:
This example illustrates how voice AI can provide meaningful value in healthcare contexts where access to human specialists is limited.
The discussion referenced several tools and platforms relevant to voice AI operations:
The ecosystem reflects the specialized tooling required for voice AI applications, which differs significantly from text-based LLM application stacks.
The panel discussion reveals that voice AI agents require specialized operational considerations beyond standard LLM deployment practices. Success requires attention to latency at every step of the pipeline, careful management of tool calling patterns, robust testing across languages and emotional states, and clear product scoping. The technology is described as exciting but immature—significant craftsmanship is required to move from prototype to production-ready systems. The panelists encourage experimentation while acknowledging that the field is rapidly evolving and best practices are still being established.