ZenML

Building and Evaluating Production Voice Agents: From Custom Infrastructure to Platform Solutions

Nomore Engineering 2023
A team explored building a phone agent system for handling doctor appointments in Polish primary care, initially attempting to build their own infrastructure before evaluating existing platforms. They implemented a complex system involving speech-to-text, LLMs, text-to-speech, and conversation orchestration, along with comprehensive testing approaches. After building the complete system, they ultimately decided to use a third-party platform (Vapi.ai) due to the complexities of maintaining their own infrastructure, while gaining valuable insights into voice agent architecture and testing methodologies.

Industry

Healthcare

This case study examines the challenges of deploying LLM-powered voice agents in production, specifically for healthcare appointment scheduling in Poland. It covers both the journey of building a custom solution and the eventual pragmatic decision to adopt a platform, offering insights into both approaches.

The team identified a critical problem in Polish primary care: phone-based appointment scheduling caused significant bottlenecks, with patients often unable to reach receptionists, especially during peak seasons. They began by evaluating an off-the-shelf solution, bland.ai, which, while impressive in its quick setup, showed limitations in model intelligence, cost, and Polish language support.

The technical architecture they developed consisted of four main components.

One of the most significant challenges was building a robust conversation orchestrator. This component had to handle real-world conversational behavior such as interruptions, mid-sentence pauses, and backchannel cues (“mhm,” “uh-huh”). The team implemented logic to manage these scenarios, which illustrates how much complexity surrounds the LLM in a real-world voice application.
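To make the orchestration problem concrete, the following is a minimal sketch of a turn-taking decision function. It is not the team's implementation: the `BACKCHANNELS` vocabulary, the `TurnState` fields, and the 700 ms end-of-turn threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical backchannel vocabulary; a real system would need a
# language-specific list (the case study targets Polish).
BACKCHANNELS = {"mhm", "uh-huh", "yeah", "ok", "okay", "right"}


class Action(Enum):
    KEEP_SPEAKING = "keep_speaking"      # agent continues its current reply
    STOP_AND_LISTEN = "stop_and_listen"  # user interrupted; stop playback
    WAIT = "wait"                        # user may still be mid-sentence
    RESPOND = "respond"                  # user finished; pass text to the LLM


@dataclass
class TurnState:
    agent_speaking: bool  # is agent audio currently being played?
    transcript: str       # text recognized so far in the user's utterance
    silence_ms: int       # time since the user's last detected speech


def is_backchannel(transcript: str) -> bool:
    """True if the whole utterance is a short acknowledgement."""
    return transcript.lower().strip(" .,!?") in BACKCHANNELS


def decide(state: TurnState, end_of_turn_ms: int = 700) -> Action:
    """Pick the orchestrator's next action (threshold is an assumed value)."""
    text = state.transcript.strip()
    if state.agent_speaking:
        # Silence or "mhm" while the agent talks is not an interruption.
        if not text or is_backchannel(text):
            return Action.KEEP_SPEAKING
        return Action.STOP_AND_LISTEN
    # Agent is silent: wait through short pauses before answering.
    if not text or state.silence_ms < end_of_turn_ms:
        return Action.WAIT
    return Action.RESPOND
```

A fixed silence threshold is the simplest possible end-of-turn rule; it trades responsiveness against the risk of cutting the caller off during a mid-sentence pause.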

The testing approach they developed is particularly noteworthy from an LLMOps perspective. They implemented both manual and automated testing strategies, focusing on two key areas: agent quality and conversation quality.
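As an illustration of what an automated agent-quality check might look like, here is a small sketch of a scripted-patient test. Everything in it is hypothetical: `run_scenario`, `toy_agent`, and `appointment_booked` are stand-ins invented for this example, and the rule-based agent simply takes the place of an LLM-backed one.

```python
from typing import Callable, List, Tuple

# A transcript is a list of (role, text) pairs.
Transcript = List[Tuple[str, str]]
# Assumed agent interface: conversation so far plus the new utterance in,
# reply text out. A real agent would call an LLM here.
Agent = Callable[[Transcript, str], str]


def run_scenario(agent: Agent, patient_turns: List[str]) -> Transcript:
    """Feed scripted patient utterances to the agent and record the dialogue."""
    history: Transcript = []
    for utterance in patient_turns:
        reply = agent(history, utterance)
        history.append(("patient", utterance))
        history.append(("agent", reply))
    return history


def toy_agent(history: Transcript, utterance: str) -> str:
    """Rule-based stand-in so the harness can run without an LLM."""
    text = utterance.lower()
    if "appointment" in text:
        return "Which day works for you?"
    if "monday" in text:
        return "Booked for Monday."
    return "Could you repeat that?"


def appointment_booked(history: Transcript) -> bool:
    """Outcome check: did the agent confirm a booking at any point?"""
    return any(role == "agent" and "booked" in text.lower()
               for role, text in history)


transcript = run_scenario(toy_agent, ["I need an appointment", "Monday please"])
print(appointment_booked(transcript))
```

Checking the outcome of a whole scripted conversation, rather than the wording of individual replies, keeps such a test stable when the underlying model's phrasing changes.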

The team’s experience with the telephony layer, primarily using Twilio’s Media Streams API, highlights the importance of considering infrastructure components when deploying LLMs in production. They built a simulation environment to test their voice agent locally without incurring Twilio costs, showing pragmatic development practices.
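A local simulator of this kind could be sketched as follows. This is an assumption-laden illustration, not the team's tool: the message shapes are a simplified subset modeled on Twilio's Media Streams events (`start`, `media`, `stop`, with base64-encoded 8 kHz mu-law audio) and should be checked against Twilio's documentation, while `simulate_stream` and the `SIM0001` stream ID are hypothetical.

```python
import base64
import json
from typing import Iterator

# 20 ms of 8 kHz, 8-bit mu-law audio is 160 bytes.
FRAME_BYTES = 160


def simulate_stream(mulaw_audio: bytes,
                    stream_sid: str = "SIM0001") -> Iterator[str]:
    """Yield JSON messages resembling a Media Streams session.

    Simplified on purpose: real messages carry additional fields
    (sequence numbers, timestamps, track names) omitted here.
    """
    yield json.dumps({
        "event": "start",
        "streamSid": stream_sid,
        "start": {"mediaFormat": {"encoding": "audio/x-mulaw",
                                  "sampleRate": 8000,
                                  "channels": 1}},
    })
    # Split the recording into fixed-size frames, one message per frame.
    for offset in range(0, len(mulaw_audio), FRAME_BYTES):
        frame = mulaw_audio[offset:offset + FRAME_BYTES]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        })
    yield json.dumps({"event": "stop", "streamSid": stream_sid})


# Usage: 50 ms of mu-law silence (0xFF) becomes three media frames.
for message in simulate_stream(b"\xff" * 400):
    print(json.loads(message)["event"])
```

Feeding these messages to the same handler that would receive the live WebSocket traffic lets the rest of the pipeline be exercised without placing a call.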

After completing their custom solution, they critically evaluated whether maintaining their own platform made sense. They identified three scenarios in which building a custom platform would be justified.

Their evaluation of existing platforms (bland.ai, vapi.ai, and retell.ai) provides valuable insights into the current state of voice agent platforms. They compared these solutions across several critical metrics.

The case study reveals architectural differences between the platforms. Bland.ai’s self-hosted approach achieved better latency (1.5 s versus 2.5 s or more) and availability but sacrificed flexibility in agent logic and evaluation capabilities. Vapi.ai and Retell.ai offered more control over the LLM component but faced challenges with latency and availability due to their multi-provider architecture.

From an LLMOps perspective, their final decision to use Vapi.ai instead of maintaining their custom solution highlights important considerations for production deployments.

Their insights into the future of AI voice agents suggest several important trends for LLMOps practitioners to consider.

This case study effectively demonstrates the complexity of deploying LLMs in production voice applications and the importance of carefully evaluating build-versus-buy decisions in LLMOps projects. It also highlights the critical role of testing and evaluation in ensuring production-ready AI systems.