ZenML

Voice AI Agent Development and Production Challenges

Various (Canonical, Prosus, DeepMind) 2023

Panel discussion with experts from various companies exploring the challenges and solutions in deploying voice AI agents in production. The discussion covers key aspects of voice AI development including real-time response handling, emotional intelligence, cultural adaptation, and user retention. Experts shared experiences from e-commerce, healthcare, and tech sectors, highlighting the importance of proper testing, prompt engineering, and understanding user interaction patterns for successful voice AI deployments.

Industry: Tech

Overview

This case study is derived from a panel discussion on voice AI agents featuring practitioners from multiple organizations: Kiara from Prosus (a global technology company with approximately one billion customers across e-commerce, food delivery, fintech, and classifieds), Monica from Google’s speech synthesis group (with 14+ years experience), Rex (a solopreneur with background at Amazon Alexa), and Tom from Canonical (which builds observability tools for voice AI agents). The discussion provides valuable insights into the operational challenges of deploying voice AI agents in production environments, drawing from real-world experiences across various industries including e-commerce, edtech, and healthcare.

The Fundamental Challenge: Voice is Different from Text

A central theme throughout the discussion is that voice AI agents present fundamentally different challenges compared to text-based LLM applications. As Rex noted, LLMs are already challenging due to their non-deterministic nature, making testing and enforcing predictability difficult. Voice adds multiple additional layers of complexity on top of this foundation.

Kiara shared a concrete example from Prosus’s work building an e-commerce agent that could browse the web and perform tasks for users. When they attempted to use the same text prompts that worked for their text-based agent with OpenAI’s real-time API for voice, it “failed miserably.” The key insight is that with text, everything is nicely structured and visible, allowing for significant control. Voice interactions lack this structure and introduce entirely new dimensions of user experience that must be carefully managed.

Latency and Tool Calling Challenges

One of the most significant production challenges discussed was latency management. Users become frustrated if the agent doesn’t respond immediately, which has major implications for tool calling in LLM-powered agents. When an agent needs to perform an action or retrieve information, the traditional approach of waiting for the tool to complete before responding simply doesn’t work in voice contexts.

The Prosus team developed several strategies to address this, chiefly decoupling tool execution from the spoken response: the agent acknowledges the user immediately while the tool call runs in the background, rather than leaving dead air until the result arrives.

This represents a significant departure from typical LLM application architectures where tool calls can happen synchronously without immediate user feedback requirements.
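The decoupled pattern described above can be sketched with standard Python concurrency. This is a minimal illustration, not Prosus's implementation: `slow_tool_call` and `speak` are hypothetical stand-ins for a real tool integration and TTS output, and the filler text is invented for the example.

```python
import asyncio

FILLER = "One moment while I check that for you."

async def slow_tool_call(query: str) -> str:
    """Stand-in for a real tool call (e.g. a product search) that takes a while."""
    await asyncio.sleep(0.2)
    return f"Found 3 results for '{query}'."

async def speak(text: str, transcript: list[str]) -> None:
    """Stand-in for TTS output; here we just record what would be spoken."""
    transcript.append(text)

async def handle_turn(query: str, transcript: list[str]) -> None:
    # Start the tool call immediately, without awaiting it.
    tool_task = asyncio.create_task(slow_tool_call(query))
    # Speak a filler acknowledgment right away so the user hears something
    # instead of silence while the tool runs.
    await speak(FILLER, transcript)
    # Deliver the real answer once the tool completes.
    await speak(await tool_task, transcript)

transcript: list[str] = []
asyncio.run(handle_turn("wireless headphones", transcript))
print(transcript)
```

The key point is the order of operations: the acknowledgment is spoken before the tool result is awaited, which is exactly what a synchronous text-chat architecture would not do.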

Handling User Input Ambiguity

The panelists identified user input handling as a critical operational concern. Voice input can be vague and difficult to interpret, especially when dealing with postal addresses, names, and spelling. A key observation was that current voice AI systems lack self-awareness about their limitations—they don’t recognize when they’ve misheard or misunderstood input.

Production strategies to address this include explicitly confirming error-prone inputs such as addresses, names, and spellings, reading the value back to the user rather than silently acting on a possible mis-hearing.
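One way to operationalize this is a confidence-gated read-back: when the speech recognizer reports low confidence on a high-risk field, the agent confirms before proceeding. The field names, threshold, and `next_action` helper below are hypothetical, a sketch of the policy rather than any panelist's system.

```python
# Fields that are error-prone in speech recognition and worth confirming.
CONFIRM_FIELDS = {"postal_address", "name", "email"}
CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff; tune against real ASR scores

def next_action(field: str, transcribed_value: str, asr_confidence: float) -> str:
    """Decide whether to accept a transcribed slot value or read it back."""
    if field in CONFIRM_FIELDS and asr_confidence < CONFIDENCE_THRESHOLD:
        # Read the value back and ask for confirmation rather than
        # silently acting on a likely mis-hearing.
        readable = field.replace("_", " ")
        return f"I heard '{transcribed_value}' for your {readable}. Is that right?"
    return "ACCEPT"

print(next_action("postal_address", "12 Baker Street", 0.62))
print(next_action("color_preference", "blue", 0.62))
```

Low-stakes fields pass through unconfirmed, so the extra turn is only spent where a mis-hearing would actually cause harm.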

Testing and Evaluation Strategies

Testing emerged as a major topic, with panelists acknowledging that voice agents behave completely differently in production compared to controlled development environments. The gap between demo quality and production quality is substantial.

Kiara described Prosus's approach to building robust test sets: stressing the agent across multiple languages and emotional states rather than only happy-path inputs, so that production behavior is exercised before real users encounter it.

Tom from Canonical emphasized the importance of observability in production, noting that his company helps builders “see what’s happening in production when real humans are interacting with their agents.” This reflects the broader challenge that controlled testing cannot fully replicate the unpredictability of real user interactions—as Tom colorfully quoted, “Everybody’s got a plan until they get punched in the face.”
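A test set of that shape can be evaluated with a small harness that reports accuracy per language-and-emotion bucket instead of one aggregate number, so regressions in, say, frustrated non-English speech are visible. Everything here is illustrative: `classify_intent` is a toy stand-in for the agent's real pipeline, and the case structure is assumed.

```python
from dataclasses import dataclass

@dataclass
class VoiceTestCase:
    utterance: str          # transcript of the (simulated) user turn
    language: str
    emotion: str            # e.g. "neutral", "frustrated"
    expected_intent: str

def classify_intent(utterance: str) -> str:
    """Toy stand-in for the agent's intent step; a real system calls the model."""
    return "order_status" if "order" in utterance.lower() else "other"

def run_suite(cases: list[VoiceTestCase]) -> dict[str, float]:
    """Report accuracy per (language, emotion) bucket, not just overall."""
    buckets: dict[str, list[bool]] = {}
    for case in cases:
        key = f"{case.language}/{case.emotion}"
        ok = classify_intent(case.utterance) == case.expected_intent
        buckets.setdefault(key, []).append(ok)
    return {k: sum(v) / len(v) for k, v in buckets.items()}

cases = [
    VoiceTestCase("Where is my order?", "en", "neutral", "order_status"),
    VoiceTestCase("I STILL don't have my order!", "en", "frustrated", "order_status"),
]
print(run_suite(cases))
```

Bucketed scores pair naturally with the production observability Tom describes: the same keys can partition live traffic once the agent is deployed.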

Voice Quality and Emotional Responsiveness

Monica from Google brought a unique perspective on the technical challenges of voice synthesis and emotional responsiveness. She highlighted that achieving a natural-sounding voice is only the beginning—production agents must also interpret and respond to emotional nuances in user speech.

The challenge extends beyond simple empathy. As Rex noted, while prompts typically include instructions to be “empathetic,” true emotional responsiveness requires a range of appropriate responses. Sometimes empathy isn’t the right response; the agent needs to match the emotional context of the conversation appropriately.

Monica raised an important point about the “uncanny valley” in voice AI—users are currently uncertain about whether they’re talking to humans or AI, and there’s often reluctance to engage fully. She suggested this will change over time but represents a current barrier to adoption.

Product Definition and Scope Management

A recurring theme was the importance of clear product definition before building voice agents. Monica emphasized being “eternally frustrated with very vague product definitions” because voice is so complex that trying to build a comprehensive solution often leads to failure.

The recommended approach is to start with a narrowly defined, well-specified use case, validate it thoroughly, and expand scope only once that core experience works reliably.

Retention and Multi-Session Interactions

Rex raised an advanced challenge that goes beyond single-call optimization: building voice agents for repeat dialogue. Creating agents that users want to call back represents a much harder problem than optimizing individual conversations. This requires developing something akin to a relationship—the agent needs to remember context across sessions and provide compelling enough experiences that users want to return.

This represents a frontier challenge in voice AI operations, as current systems are primarily optimized for single-session interactions.
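Remembering context across sessions implies some persistent store keyed by caller. The sketch below is a hypothetical minimal version, assuming a JSON file as the backing store and a `caller-42` identifier invented for the example; a production system would need consent handling, expiry, and a real database.

```python
import json
import os
import tempfile
from pathlib import Path

class SessionMemory:
    """Hypothetical cross-session memory keyed by caller ID, persisted as
    JSON so a returning caller's context can be reloaded on the next call."""

    def __init__(self, store_path: str) -> None:
        self.path = Path(store_path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, caller_id: str, fact: str) -> None:
        self.data.setdefault(caller_id, []).append(fact)
        self.path.write_text(json.dumps(self.data))

    def recall(self, caller_id: str) -> list[str]:
        return self.data.get(caller_id, [])

path = os.path.join(tempfile.mkdtemp(), "memory.json")
memory = SessionMemory(path)
memory.remember("caller-42", "prefers delivery updates by phone")

# A later session constructs a fresh object, reloads the store, and can
# greet the returning caller with remembered context.
returning = SessionMemory(path)
print(returning.recall("caller-42"))
```

The design choice worth noting is that memory lives outside the conversation loop entirely, which is what makes the relationship-building Rex describes possible across otherwise stateless calls.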

Real-World Application Example

Rex shared a concrete example of a production voice AI application: an avatar-based system for prostate cancer surgery recovery support, combining a visual avatar with voice interaction to guide patients through recovery.

This example illustrates how voice AI can provide meaningful value in healthcare contexts where access to human specialists is limited.

Platform and Tool Ecosystem

The discussion referenced several tools and platforms relevant to voice AI operations, including OpenAI's real-time API for speech-to-speech interaction and Canonical's observability tooling for monitoring agents in production.

The ecosystem reflects the specialized tooling required for voice AI applications, which differs significantly from text-based LLM application stacks.

Key Takeaways for LLMOps Practitioners

The panel discussion reveals that voice AI agents require specialized operational considerations beyond standard LLM deployment practices. Success requires attention to latency at every step of the pipeline, careful management of tool calling patterns, robust testing across languages and emotional states, and clear product scoping. The technology is described as exciting but immature—significant craftsmanship is required to move from prototype to production-ready systems. The panelists encourage experimentation while acknowledging that the field is rapidly evolving and best practices are still being established.
