Company: Deepgram
Title: Building Production-Ready Conversational AI Voice Agents: Latency, Voice Quality, and Integration Challenges
Industry: Tech
Year: 2024
Summary (short): Deepgram, a leader in transcription services, shares insights on building effective conversational AI voice agents. The presentation covers critical aspects of implementing voice AI in production, including managing latency requirements (targeting the 300 ms benchmark), handling endpointing challenges, ensuring voice quality through proper prosody, and integrating LLMs with speech-to-text and text-to-speech services. The company introduces its new text-to-speech product, Aura, designed specifically for conversational AI applications with low latency and natural voice quality.
## Overview

This case study comes from a presentation by Michelle, a Product Manager at Deepgram who leads their text-to-speech product, Aura. Deepgram has established itself as a leader in the transcription (speech-to-text) space over the past several years and is now expanding into text-to-speech to enable complete conversational AI voice agent solutions. The company works with notable clients including Spotify and NASA, demonstrating significant enterprise adoption of its speech technologies.

The presentation provides valuable insights into the operational challenges of building voice-based conversational AI systems that incorporate LLMs. This is a particularly relevant LLMOps topic because it addresses the integration of LLMs into real-time, latency-sensitive production environments where user experience depends heavily on system responsiveness and output quality.

## The Conversational AI Voice Agent Architecture

Michelle describes a typical conversational AI voice agent architecture that connects multiple components in a pipeline. When a customer calls by phone (for example, for appointment booking, customer support, outbound sales, or interviews), the system must:

- Transcribe the caller's voice into text using speech-to-text
- Process and potentially route this text to an LLM to generate a response
- Convert the LLM's text response back to speech using text-to-speech
- Deliver the audio response back to the customer

This architecture creates several LLMOps challenges because the system involves multiple AI components that must work together seamlessly in real time. One common use case mentioned is customer support triage, where an AI agent first collects information from the caller before routing to a human agent; this requires reliable and natural-sounding conversational capabilities.

## Latency: The Critical Production Constraint

One of the most significant operational challenges highlighted is latency.
Research cited in the presentation indicates that in human two-way conversation, the maximum tolerable latency before the interaction feels unnatural is around 300 milliseconds. This is described as the benchmark to aim for, though the speaker acknowledges it is still difficult to achieve when an LLM is included in the pipeline.

This latency constraint has major implications for LLMOps. Because LLMs are autoregressive models that emit output token by token, teams must carefully consider how to chunk words and sentences together before sending them to the text-to-speech system. Some approaches mentioned include:

- Using vendors that offer real-time input streaming capabilities
- Developing custom chunking logic based on conversational data patterns
- Balancing latency optimization against voice quality (sending output word by word can make the speech sound "chopped up")

The choice of chunking strategy is described as model-dependent, suggesting that teams need to experiment and optimize based on their specific LLM and text-to-speech combination. Deepgram's new Aura product targets sub-250-millisecond latency specifically for conversational voice applications, acknowledging that this is a key market need.

## Endpoint Detection: Knowing When Users Stop Speaking

A nuanced but critical operational challenge is endpoint detection: determining when a user has finished speaking and expects a response. This is more complex than it might initially appear because it is not a purely text-based problem. The speaker notes that endpoint detection must consider:

- The user's tone of voice
- The context of the conversation
- Whether pauses represent thinking time or completion of a thought
- Subtle speech patterns like trailing off or hesitation

During the Q&A portion, an audience member asked whether endpoint detection could simply be handled as a probabilistic task for an LLM.
Michelle explained that this isn't sufficient because the task depends heavily on audio characteristics like tone, not just the text content. A user saying "and then..." with a thinking pause is different from one completing a sentence, and text alone cannot capture this distinction. Deepgram's speech-to-text product includes endpoint detection capabilities that can be used as part of the transcription pipeline, which helps builders of conversational AI agents address this challenge. This represents an important consideration for LLMOps teams: the pre-processing of inputs before they reach the LLM can significantly affect system behavior and user experience.

## Voice Quality and Prosody Optimization

Another key operational consideration is voice quality, known in the research literature as prosody. Prosody encompasses the elements that make speech sound natural, including rhythm, pitch, intonation, and pauses. Michelle emphasizes that different text-to-speech models are optimized for different use cases:

- Some are optimized for movie narration
- Some for reading news articles
- Some for video advertisements
- Some (like Deepgram's Aura) for conversational dialogue

This distinction is operationally important because choosing the wrong voice model for a use case will produce output that sounds unnatural or inappropriate. Teams building conversational AI agents are advised to consider:

- Voice branding guidelines for their organization
- Reference voices that exemplify the desired characteristics
- Specific characteristics of tone, emotion, and accent
- The target demographic for the voice persona

## Making LLM Output Sound Conversational

A significant LLMOps challenge highlighted is that LLM output does not, by default, sound like natural conversation. The text generated by models is typically in a written-language style, which sounds artificial when converted to speech.
Several techniques are mentioned that practitioners have used to address this:

**Prompt Engineering for Conversational Tone**: Users have found success by prompting the LLM to generate output as if it were speaking in a conversation rather than writing. This prompt engineering approach helps produce more natural spoken-language patterns.

**Incorporating Pauses and Filler Words**: Human conversations naturally include elements like "um," "uh," breathing sounds, and thinking pauses. Michelle notes that adding punctuation such as "..." (three dots) can signal pauses, and some text-to-speech vendors support breath and pause markers in the input. This may seem counterintuitive for those focused on keeping AI responses "clean," but it actually improves naturalness in voice applications.

**Slang and Colloquial Language**: An audience question raised the challenge of incorporating slang into conversational AI. Michelle acknowledged this is still a developing area, noting that slang sounds authentic only when combined with appropriate accents, and that it requires careful attention to training data. Simply changing the words used may not be sufficient; the entire vocal presentation needs to match.

## Operational Complexity and Integration Challenges

The Q&A session highlighted just how complex conversational voice AI is in production. The moderator expressed surprise at the "vast jungle of complexity" involved in doing speech properly. This underscores an important LLMOps reality: integrating LLMs into voice applications introduces challenges that go well beyond typical text-based LLM deployments.
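The first two techniques above (conversational prompting and pause markers) can be combined in a minimal sketch. The prompt wording, function name, and use of "..." as a pause cue are illustrative assumptions, not details from the presentation; whether a given TTS engine honors ellipses or explicit pause tags is vendor-specific.

```python
import re

# Hypothetical system prompt nudging the LLM toward spoken-style output
# (illustrative wording, not from the presentation).
SYSTEM_PROMPT = (
    "You are a voice agent on a phone call. Respond the way you would "
    "speak, not the way you would write: short sentences, contractions, "
    "and the occasional filler word like 'um' or 'well'."
)

def add_pause_markers(reply: str) -> str:
    """Insert '...' after each sentence boundary so a TTS engine that
    honors ellipses renders a short, natural pause between sentences."""
    return re.sub(r"(?<=[.!?]) (?=[A-Z])", " ... ", reply)

print(add_pause_markers("Sure, I can help. What time works for you?"))
# → "Sure, I can help. ... What time works for you?"
```

In practice this post-processing step would be tuned per engine, since some vendors expect SSML-style break tags rather than punctuation cues.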
Key integration challenges for production systems include:

- Managing the latency budget across multiple components (speech-to-text, LLM processing, text-to-speech)
- Ensuring graceful handling of edge cases in speech recognition and endpoint detection
- Tuning prompts and output formatting for vocal rather than visual consumption
- Selecting and configuring appropriate voice models for the target use case
- Making real-time streaming and chunking decisions that balance latency and quality

## Commercial Context and Product Positioning

It's worth noting that this presentation has a commercial aspect: Deepgram is promoting its new Aura text-to-speech product. The claimed specifications include sub-250 ms latency and "comparable" cost, though specific benchmarks against competitors aren't provided. The speaker invites interested parties to contact her for preview access. While the technical insights shared appear genuine and valuable for practitioners, readers should be aware that the presentation naturally emphasizes challenges that Deepgram's products are designed to address. Independent validation of the specific performance claims would be advisable for teams evaluating solutions.

## Implications for LLMOps Practitioners

This case study highlights several important considerations for teams deploying LLMs in voice-based applications:

The end-to-end latency budget is far more constrained than in typical web-based LLM applications. The 300 ms conversational benchmark requires careful optimization across every component in the pipeline.

Input pre-processing (including endpoint detection) is as important as LLM prompting and output handling. Getting the boundaries right for when to trigger LLM inference significantly impacts user experience.

LLM outputs need domain-specific post-processing or prompting when consumed as audio rather than text. Techniques that seem counterproductive for text (adding filler words and pauses) may improve the final user experience.
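The latency-budget management mentioned above comes down to simple bookkeeping against the 300 ms benchmark. The per-component figures below are illustrative assumptions for a sketch, not measurements from the presentation:

```python
# Target from the research cited in the presentation.
BUDGET_MS = 300

# Assumed time-to-first-output for each pipeline stage (illustrative numbers).
components_ms = {
    "speech_to_text_finalization": 80,
    "llm_time_to_first_token": 120,
    "tts_time_to_first_audio": 90,
}

total_ms = sum(components_ms.values())
headroom_ms = BUDGET_MS - total_ms
print(f"total={total_ms} ms, headroom={headroom_ms} ms")
# → total=290 ms, headroom=10 ms
```

With these assumed numbers the pipeline barely fits: any extra network hop or chunking delay consumes the remaining 10 ms, which illustrates why each component must be optimized with the whole budget in mind rather than in isolation.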
The choice of supporting models and services (speech-to-text, text-to-speech) should be made holistically based on the target use case, not just on individual component benchmarks. A model optimized for narration may perform worse than a conversational-optimized model in a customer support context, even if it scores better on general metrics.
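To make the chunking trade-off from the latency discussion concrete, here is a minimal sketch (function and variable names are hypothetical) that buffers streamed LLM tokens and flushes a chunk to the text-to-speech stage at sentence boundaries rather than per word:

```python
def chunk_for_tts(token_stream):
    """Buffer streamed LLM tokens and yield TTS-ready chunks at sentence
    boundaries; per-sentence flushing trades a little latency for speech
    that sounds less "chopped up" than per-word flushing."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()

tokens = ["Hi", " there", "!", " How", " can", " I", " help", " you", " today", "?"]
print(list(chunk_for_tts(tokens)))
# → ['Hi there!', 'How can I help you today?']
```

A production implementation would also need a timeout-based flush so a long clause without terminal punctuation does not stall the voice, one of the model-dependent tuning decisions the presentation alludes to.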
