Building Production-Ready Conversational AI Voice Agents: Latency, Voice Quality, and Integration Challenges

Deepgram 2024

Deepgram, a leader in transcription services, shares insights on building effective conversational AI voice agents. The presentation covers critical aspects of implementing voice AI in production, including managing latency requirements (targeting a 300ms benchmark), handling endpointing challenges, ensuring voice quality through proper prosody, and integrating LLMs with speech-to-text and text-to-speech services. The company introduces their new text-to-speech product Aura, designed specifically for conversational AI applications with low latency and natural voice quality.

Industry: Tech

Overview

This case study comes from a presentation by Michelle, a Product Manager at Deepgram, who leads their text-to-speech product called Aura. Deepgram has established itself as a leader in the transcription (speech-to-text) space over the past several years and is now expanding into text-to-speech to enable complete conversational AI voice agent solutions. The company works with notable clients including Spotify, NASA, and others, demonstrating significant enterprise adoption of their speech technologies.

The presentation provides valuable insights into the operational challenges of building voice-based conversational AI systems that incorporate LLMs. This is a particularly relevant LLMOps topic as it addresses the integration of LLMs into real-time, latency-sensitive production environments where user experience depends heavily on system responsiveness and output quality.

The Conversational AI Voice Agent Architecture

Michelle describes a typical conversational AI voice agent architecture that connects multiple components in a pipeline. When a customer calls in (for appointment booking, customer support, outbound sales, or interviews, for example), the system must:

- transcribe the caller's speech to text in real time (speech-to-text)
- pass the transcript to an LLM to generate a response
- convert that response back into audio (text-to-speech) and play it to the caller

This architecture creates several LLMOps challenges because the system involves multiple AI components that must work together seamlessly in real-time. One common use case mentioned is customer support triage, where an AI agent first collects information from the caller before routing to a human agent, which requires reliable and natural-sounding conversational capabilities.
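To make the pipeline concrete, here is a minimal sketch of one conversational turn. The helper names (transcribe, generate_response, synthesize) are illustrative stubs, not Deepgram SDK calls; in production each would wrap a streaming speech-to-text, LLM, and text-to-speech service respectively.

```python
import asyncio

# Illustrative stubs: in a real system these would call hosted STT, LLM,
# and TTS services (e.g. Deepgram for STT/TTS and an LLM provider).
async def transcribe(audio: bytes) -> str:
    return "I'd like to book an appointment for Tuesday."

async def generate_response(transcript: str) -> str:
    return "Sure, what time on Tuesday works for you?"

async def synthesize(text: str) -> bytes:
    return text.encode()  # placeholder for synthesized audio

async def handle_turn(caller_audio: bytes) -> bytes:
    """One conversational turn: caller audio in, agent audio out."""
    transcript = await transcribe(caller_audio)   # speech-to-text
    reply = await generate_response(transcript)   # LLM
    return await synthesize(reply)                # text-to-speech

if __name__ == "__main__":
    asyncio.run(handle_turn(b"<caller audio>"))
```

In practice each stage streams into the next rather than completing in full, which is exactly where the latency and chunking questions below come from.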

Latency: The Critical Production Constraint

One of the most significant operational challenges highlighted is latency. Research cited in the presentation indicates that in human two-way conversation, the maximum tolerable latency before the interaction feels unnatural is around 300 milliseconds. This is described as the benchmark to aim for, though the speaker acknowledges this is still difficult to achieve when an LLM is included in the pipeline.

This latency constraint has major implications for LLMOps. Since LLMs are autoregressive models that output tokens word by word, teams must carefully consider how to chunk words and sentences together before sending them to the text-to-speech system, for example by buffering streamed tokens into clauses or complete sentences before synthesis begins.

The choice of chunking strategy is described as model-dependent, suggesting that teams need to experiment and optimize based on their specific LLM and text-to-speech combinations. Deepgram’s new Aura product targets sub-250 millisecond latency specifically for conversational voice applications, acknowledging that this is a key market need.
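A minimal sketch of one such strategy, chunking streamed tokens at sentence boundaries, appears below; the token stream and regex heuristic are illustrative, and real systems would tune the boundary rules per model.

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")

def chunk_for_tts(token_stream):
    """Buffer streamed LLM tokens and yield sentence-sized chunks for TTS.

    Flushing at sentence-final punctuation lets synthesis start before the
    LLM finishes its full response, trading naturalness against latency.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()

# Fake token stream standing in for an LLM's streamed output:
tokens = ["Sure", ",", " I can", " help.", " What", " time", " works?"]
for chunk in chunk_for_tts(tokens):
    print(chunk)  # each chunk would be sent to the TTS service immediately
```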

Endpoint Detection: Knowing When Users Stop Speaking

A nuanced but critical operational challenge is endpoint detection: determining when a user has finished speaking and expects a response. This is more complex than it might initially appear because it is not purely a text-based problem. The speaker notes that endpoint detection must consider audio characteristics such as tone, pacing, and pauses, not just the transcribed words.

During the Q&A portion, an audience member asked whether endpoint detection could simply be handled as a probabilistic task for an LLM. Michelle explained that this isn’t sufficient because the task depends heavily on audio characteristics like tone, not just the text content. A user saying “and then…” with a thinking pause is different from completing a sentence, and text alone cannot capture this distinction.

Deepgram’s speech-to-text product includes endpoint detection capabilities that can be used as part of the transcription pipeline, which helps builders of conversational AI agents address this challenge. This represents an important consideration for LLMOps teams: the pre-processing of inputs before they reach the LLM can significantly impact system behavior and user experience.
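A minimal sketch of how an agent might consume such endpointing signals appears below. The event shape (a transcript plus a speech_final flag) is modeled on what streaming STT services like Deepgram emit, but treat the exact field names as an assumption and check the vendor documentation.

```python
# Accumulate partial transcripts and trigger the LLM only when the STT
# layer signals that the speaker has reached an endpoint.
def on_transcript_event(event: dict, pending: list) -> str | None:
    text = event.get("transcript", "")
    if text:
        pending.append(text)
    if event.get("speech_final"):      # endpoint detected by the STT layer
        utterance = " ".join(pending)
        pending.clear()
        return utterance               # now safe to run LLM inference
    return None                        # caller still speaking; keep buffering

# Example event sequence:
pending: list = []
events = [
    {"transcript": "I'd like to", "speech_final": False},
    {"transcript": "book an appointment", "speech_final": True},
]
for ev in events:
    if (utterance := on_transcript_event(ev, pending)):
        print("send to LLM:", utterance)
```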

Voice Quality and Prosody Optimization

Another key operational consideration is voice quality, which in research terminology is called prosody. This encompasses naturalness elements including rhythm, pitch, intonation, and pauses. Michelle emphasizes that different text-to-speech models are optimized for different use cases: a voice tuned for long-form narration, for instance, behaves quite differently from one tuned for quick conversational turns.

This distinction is operationally important because choosing the wrong voice model for your use case will result in output that sounds unnatural or inappropriate. Teams building conversational AI agents are advised to evaluate candidate voices in the context of their actual application rather than relying on general-purpose quality benchmarks alone.

Making LLM Output Sound Conversational

A significant LLMOps challenge highlighted is that LLM outputs by default do not sound like natural conversation. The text generated by models is typically written-language style, which sounds artificial when converted to speech. Several techniques are mentioned that practitioners have used to address this:

Prompt Engineering for Conversational Tone: Users have found success by prompting the LLM to generate output as if speaking in a conversation rather than writing. This prompt engineering approach helps produce more naturally-spoken language patterns.

Incorporating Pauses and Filler Words: Human conversations naturally include elements like “um,” “uh,” breathing sounds, and thinking pauses. Michelle notes that adding punctuation like “…” (three dots) can signal pauses, and some text-to-speech vendors support breaths and pause markers in the input. This may seem counterintuitive for those focused on making AI responses “clean,” but it actually improves naturalness in voice applications.

Slang and Colloquial Language: An audience question raised the challenge of incorporating slang into conversational AI. Michelle acknowledged this is still a developing area, noting that slang sounds authentic when combined with appropriate accents and requires careful attention to training data. Simply changing the words used may not be sufficient—the entire vocal presentation needs to match.
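A short sketch combining the first two techniques is shown below. The prompt wording is an illustrative example, not something Deepgram recommends verbatim, and the message format assumes a typical chat-completion style API.

```python
# Example system prompt nudging an LLM toward spoken-style output, with
# '...' pause markers the TTS layer can render. Wording is illustrative.
SYSTEM_PROMPT = (
    "You are a voice assistant on a phone call. Respond as if you are "
    "speaking, not writing: use short sentences, contractions, and the "
    "occasional filler word like 'um' or 'let's see'. Mark natural "
    "thinking pauses with '...'."
)

def build_messages(user_transcript: str) -> list:
    """Wrap the caller's transcribed speech in chat-completion messages."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_transcript},
    ]

print(build_messages("Can you check if Tuesday is available?"))
```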

Operational Complexity and Integration Challenges

The Q&A session highlighted just how complex conversational voice AI is in production. The moderator expressed surprise at the “vast jungle of complexity” involved in doing speech properly. This underscores an important LLMOps reality: integrating LLMs into voice applications introduces challenges that go well beyond typical text-based LLM deployments.

Key integration challenges for production systems include:

- orchestrating speech-to-text, LLM, and text-to-speech services in real time
- staying within a tight end-to-end latency budget
- reliably detecting when the user has finished speaking
- shaping LLM output so that it sounds natural when spoken aloud

Commercial Context and Product Positioning

It’s worth noting that this presentation has a commercial aspect—Deepgram is promoting their new Aura text-to-speech product. The claimed specifications include sub-250ms latency and “comparable” cost, though specific benchmarks against competitors aren’t provided. The speaker invites interested parties to contact her for preview access.

While the technical insights shared appear genuine and valuable for practitioners, readers should be aware that the presentation naturally emphasizes challenges that Deepgram’s products are designed to address. Independent validation of the specific performance claims would be advisable for teams evaluating solutions.

Implications for LLMOps Practitioners

This case study highlights several important considerations for teams deploying LLMs in voice-based applications:

The end-to-end latency budget is far more constrained than in typical web-based LLM applications. The 300ms conversational benchmark requires careful optimization across every component in the pipeline; a back-of-the-envelope budget sketch follows this list.

Input pre-processing (including endpoint detection) is as important as LLM prompting and output handling. Getting the boundaries right for when to trigger LLM inference significantly impacts user experience.

LLM outputs need domain-specific post-processing or prompting when consumed as audio rather than text. Techniques that seem counterproductive for text (adding filler words, pauses) may improve the final user experience.

The choice of supporting models and services (speech-to-text, text-to-speech) should be made holistically based on the target use case, not just on individual component benchmarks. A model optimized for narration may perform worse than a conversational-optimized model in a customer support context, even if it scores better on general metrics.
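To make the first point concrete, here is a rough latency budget for one turn. The per-component figures are assumptions for illustration; only the roughly 300ms conversational target and Aura's sub-250ms TTS claim come from the presentation.

```python
# Back-of-the-envelope latency budget for one conversational turn.
# Component figures are illustrative assumptions, not measured numbers.
BUDGET_MS = 300

components_ms = {
    "speech-to-text final transcript + endpointing": 100,
    "LLM time-to-first-token": 150,
    "text-to-speech time-to-first-audio": 100,
}

total = sum(components_ms.values())
for name, ms in components_ms.items():
    print(f"{name:<48} {ms:>4} ms")
verdict = "over" if total > BUDGET_MS else "within"
print(f"{'total':<48} {total:>4} ms ({verdict} the {BUDGET_MS} ms budget)")
```

Even with plausible figures the pipeline overshoots the target, which is why streaming, chunking, and fast time-to-first-token matter at every stage.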

More Like This

Building Production AI Agents with Advanced Testing, Voice Architecture, and Multi-Model Orchestration

Sierra 2025

Sierra, an AI agent platform company, discusses their comprehensive approach to deploying LLMs in production for customer service automation across voice and chat channels. The company addresses fundamental challenges in productionizing AI agents including non-deterministic behavior, latency requirements, and quality assurance through novel solutions like simulation-based testing that runs thousands of parallel test scenarios, speculative execution for voice latency optimization, and constellation-based multi-model orchestration where 10-20 different models handle various aspects of each conversation. Their outcome-based pricing model aligns incentives with customer success, while their hybrid no-code/code platform enables both business and technical teams to collaboratively build, test, and deploy agents. The platform serves large enterprise customers across multiple industries, with agents handling millions of customer interactions in production environments.

customer_support chatbot speech_recognition

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

healthcare fraud_detection customer_support

Building a Multi-Provider GenAI Gateway for Enterprise-Scale LLM Access

Grab 2025

Grab developed an AI Gateway to provide centralized, secure access to multiple GenAI providers (including OpenAI, Azure, AWS Bedrock, and Google VertexAI) for their internal developers. The gateway handles authentication, cost management, auditing, and rate limiting while providing a unified API interface. Since its launch in 2023, it has enabled over 300 unique use cases across the organization, from real-time audio analysis to content moderation, while maintaining security and cost efficiency through centralized management.

content_moderation translation speech_recognition