Company
Elyos AI
Title
Building Low-Latency Voice AI Agents for Home Services
Industry
Tech
Year
2025
Summary (short)
Elyos AI built end-to-end voice AI agents for home services companies (plumbers, electricians, HVAC installers) to handle customer calls, emails, and messages 24/7. The company faced challenges achieving human-like conversation latency (targeting sub-400ms response times) while maintaining reliability and accuracy for complex workflows including appointment booking, payment processing, and emergency dispatch. Through careful orchestration, they optimized speech-to-text, LLM, and text-to-speech components, implemented just-in-time context engineering, state machine-based workflows, and parallel monitoring streams to achieve consistent performance with approximately 85% call automation (15% requiring human involvement).

## Overview

Elyos AI, presented by co-founder and CTO Panos, has developed specialized voice AI agents for the home services vertical, including plumbers, electricians, HVAC installers, and security companies. The system integrates directly with customers' CRM and ERP systems to provide end-to-end customer experiences including appointment booking, invoicing, scheduling, and payment processing. The company emphasizes building highly specialized workflows tailored to their vertical rather than general-purpose solutions.

The motivating use case involves emergency scenarios, such as a family discovering their boiler is broken on a Saturday evening with no hot water or heating and the first available appointment being 36 hours away. Customers need immediate service, often within 1-4 hours. Elyos AI's agents answer all calls, emails, and messages 24/7, performing issue triage (sometimes resolving problems remotely), booking emergency appointments, sending invoices, taking payments, and triggering on-call processes to notify available engineers.

## Latency Challenges and Architecture

The fundamental challenge in building voice agents is achieving human-like conversation latency. Humans typically converse with approximately 100-millisecond turn-taking, but AI agents require significantly more time. Panos explains that voice agents, particularly those using cascade architecture (as opposed to end-to-end real-time models), consist of three main components: speech-to-text (ASR), an LLM for reasoning, and text-to-speech (TTS).

In terms of latency benchmarks, the team found that speech-to-text models consistently perform in the 100-300 millisecond range across providers like Deepgram, Speechmatics, Gladia, and AssemblyAI. LLMs show more variability in time-to-first-token, with different models and providers performing differently—notably, GPT-4o models were observed to be slower than GPT-4.1. Text-to-speech latency ranges from 200-600 milliseconds depending on voice selection, language, and hosting location. In a best-case P90 scenario, the total latency ranges from 500-1200 milliseconds, which presents significant challenges for natural conversation.
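Those per-component ranges stack into a simple additive budget. A quick sketch (the LLM time-to-first-token range below is back-calculated from the quoted 500-1200 ms total, not a measured benchmark):

```python
# Rough latency budget for a cascade voice pipeline (milliseconds).
# STT and TTS ranges are the benchmark figures quoted above; the LLM
# time-to-first-token range is the remainder implied by the quoted
# 500-1200 ms total.
STAGES = {
    "speech_to_text": (100, 300),
    "llm_ttft": (200, 300),
    "text_to_speech": (200, 600),
}

def pipeline_budget(stages):
    """Sum per-stage (min, max) latencies into a pipeline total."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high

print(pipeline_budget(STAGES))  # → (500, 1200)
```

Because the stages are sequential, any saving in one stage comes straight off the total, which is why the orchestration techniques below attack each stage independently.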

## The Four Pillars of Orchestration

Elyos AI's solution centers on sophisticated orchestration organized around four key pillars: latency, consistency, context, and recovery.

### Latency Optimization

Beyond optimizing the core components, several orchestration strategies prove critical. The system must accommodate warm starts, ensuring workers are ready when calls arrive and have already determined which providers to use. Infrastructure management is crucial—deployment strategies must handle graceful connection draining since calls can last 15-20 minutes, making simple fallback mechanisms impractical.

Regional clustering emerged as a key technique, hosting pipeline components close to both customers and telephony providers. Panos emphasizes avoiding situations where, for example, using Twilio's default US region forces traffic to route from Europe to US East and back, adding unnecessary latency. Keeping tools close to the orchestration layer minimizes network latency, even in microservices architectures. When using frameworks like Livekit, reusing code directly from worker definitions rather than making network calls proves beneficial.

LLM provider routing represents another critical optimization. Different providers and even different regional endpoints for the same provider (e.g., OpenAI's native endpoints versus Azure OpenAI endpoints) exhibit varying latency depending on time of day. Implementing monitoring to dynamically route between endpoints based on current performance significantly improves consistency.
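One minimal shape for such routing: keep a rolling window of observed time-to-first-token per endpoint and always pick the currently fastest one. The endpoint names and window size here are illustrative assumptions:

```python
from collections import deque

class LatencyRouter:
    """Route LLM requests to the endpoint with the lowest recent TTFT."""

    def __init__(self, endpoints, window=50):
        # One rolling window of TTFT samples per endpoint.
        self.samples = {ep: deque(maxlen=window) for ep in endpoints}

    def record(self, endpoint, ttft_ms):
        """Feed back an observed time-to-first-token measurement."""
        self.samples[endpoint].append(ttft_ms)

    def pick(self):
        """Return the endpoint with the lowest average recent TTFT;
        endpoints with no samples yet score 0 so they get tried first."""
        def avg(ep):
            s = self.samples[ep]
            return sum(s) / len(s) if s else 0.0
        return min(self.samples, key=avg)

router = LatencyRouter(["openai-native", "azure-uksouth"])
router.record("openai-native", 420)
router.record("azure-uksouth", 250)
print(router.pick())  # → azure-uksouth
```

A production version would decay or expire old samples, since the article notes that endpoint latency shifts with time of day.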

Text-to-speech quality monitoring emerged as unexpectedly important. While teams typically focus on monitoring LLM time-to-first-token, inconsistent TTS generation creates "clanky" conversations that feel unnatural. Elyos AI built in-house systems to assess generated voice quality and monitor all calls post-conversation. Investing early in comprehensive observability enables teams to quickly identify pain points and receive better support from providers when issues arise with specific request IDs and detailed context.

### Consistency and Determinism

While LLMs are inherently stochastic, production voice agents require making them as deterministic as possible. Elyos AI defines expected outcomes for workflows and incorporates humans in the loop when things deviate from expectations. A key principle is minimizing the number of flows per journey—being very concise rather than attempting ten different things in one runtime reduces variability and unpredictable failures.

Interestingly, Panos advises avoiding unnecessary RAG (retrieval-augmented generation). With modern models being fast and capable of reasoning, just-in-time context techniques often work better than complicated RAG systems. The reasoning is that RAG inherently introduces accuracy limitations—fetching the wrong information risks introducing cascading errors throughout the conversation. When context needs can be made deterministic or injected based on identified intent, those approaches prove more reliable.

The team emphasizes benchmarking workflows against human agents rather than assuming human superiority. Head-to-head comparisons often reveal surprising results about where AI agents actually outperform humans in terms of consistency and accuracy.

### Context Engineering

Context engineering represents one of the most sophisticated aspects of Elyos AI's approach. The "just-in-time context" technique starts with minimal prompts, stripping down to the very basics even when using sub-agents. Context is then injected dynamically based on intent classification running behind the scenes. This approach avoids putting everything in context simultaneously, which can overwhelm models and introduce confusion.

The system treats workflows as state machines, carefully managing what context is available at each stage. For example, in an emergency call-out workflow, the first step provides context to identify whether the situation qualifies as an emergency, along with information about the top 5-20 relevant job types. Once the determination is made that it is an emergency, that context is removed and replaced with simply "this is an emergency" as a deterministic fact. The conversation history about how that conclusion was reached is no longer needed.
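The triage step above can be sketched as a state transition that swaps the detailed classification context for a single deterministic fact. The state keys, catalogue entries, and step names are illustrative assumptions:

```python
def advance_after_triage(state, is_emergency):
    """Swap detailed triage context for a single deterministic fact
    once the emergency determination is made."""
    next_state = dict(state)
    # The job-type catalogue and triage turns are no longer needed.
    next_state.pop("job_type_catalogue", None)
    next_state.pop("triage_history", None)
    # Inject the conclusion as a plain fact for downstream prompts.
    next_state["facts"] = state.get("facts", []) + [
        "this is an emergency" if is_emergency else "this is a routine job"
    ]
    next_state["step"] = "dispatch" if is_emergency else "booking"
    return next_state

state = {
    "step": "triage",
    "job_type_catalogue": ["boiler breakdown", "burst pipe", "rewire"],
    "triage_history": ["caller: no hot water since Saturday evening"],
    "facts": [],
}
state = advance_after_triage(state, is_emergency=True)
print(state)  # catalogue and history are gone; only the fact remains
```

The downstream prompt then carries one short fact instead of the whole classification exchange, which both shortens context and removes ambiguity.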

This "just-in-time in, just-in-time out" approach actively cleans context that's no longer necessary. Tool call results represent a particular challenge—teams commonly leave tool call results in context long after they're needed, which can confuse models. When an agent needs to call the same tool again with different parameters, having previous results with their own tool call IDs remaining in context creates ambiguity about which result to use.
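Pruning stale tool results can be done by filtering the message list before each model call. A minimal sketch over OpenAI-style chat messages, keeping only the most recent call/result pair per tool (the keep-last policy and the `check_slots` tool are illustrative assumptions):

```python
def prune_tool_results(messages, tool_name):
    """Keep only the most recent call/result pair for `tool_name`,
    dropping earlier results that would create ambiguity."""
    # Indices of assistant messages that call this tool.
    call_idxs = [i for i, m in enumerate(messages)
                 if m.get("role") == "assistant"
                 and any(tc["function"]["name"] == tool_name
                         for tc in m.get("tool_calls", []))]
    if len(call_idxs) <= 1:
        return messages
    stale = set(call_idxs[:-1])
    stale_ids = {tc["id"] for i in stale
                 for tc in messages[i]["tool_calls"]
                 if tc["function"]["name"] == tool_name}
    pruned = []
    for i, m in enumerate(messages):
        if i in stale:
            continue  # drop the stale call itself
        if m.get("role") == "tool" and m.get("tool_call_id") in stale_ids:
            continue  # drop the stale result
        pruned.append(m)
    return pruned

history = [
    {"role": "assistant", "tool_calls": [
        {"id": "call_1", "function": {"name": "check_slots"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "no slots Monday"},
    {"role": "assistant", "tool_calls": [
        {"id": "call_2", "function": {"name": "check_slots"}}]},
    {"role": "tool", "tool_call_id": "call_2", "content": "Tuesday 9am free"},
]
print(len(prune_tool_results(history, "check_slots")))  # → 2
```

After pruning, only the `call_2` pair remains, so the model cannot accidentally reuse Monday's stale availability.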

Sub-agents introduce additional complexity because the main agent isn't always aware of what happens within sub-agents until tasks complete, risking context loss. Elyos AI maintains control of general context "somewhere"—either in the backend or in a different model—ensuring continuity. They employ smart summarization that understands what's important versus what can be discarded based on the specific scenario, with clean handoffs defining expected outcomes when different workers or sub-agents complete tasks.

For voice specifically, tracking sentiment and tone throughout conversations proves critical for understanding how interactions are progressing and when intervention might be necessary. The system validates that context remains current and relevant as conversations evolve, especially important since context can quickly become stale.

### Recovery and Error Handling

The recovery pillar focuses on graceful degradation and error handling. Elyos AI tracks state in their backend, approaching everything as state machines. Workflows and journeys are implemented as state machines with clear error states that guide recovery processes and increase the probability of successful outcomes.

Post-runtime reconciliation runs after conversations complete. Since the system knows the intended outcomes, one or more models analyze conversations to ensure everything executed correctly and determine reconciliation steps if not. This approach also avoids annoying users when input is unclear—for example, speech-to-text models often struggle with email addresses and UK postcodes. When the system detects difficulty after a couple of attempts, it follows an "uncertainty path" rather than repeatedly asking users to spell out information.
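The "uncertainty path" can be modeled as a capped retry on the capture step. A sketch with a deliberately loose postcode check; the two-attempt cap, validator, and fallback action are illustrative assumptions:

```python
import re

def looks_like_uk_postcode(text):
    """Loose UK postcode shape check (illustrative, not exhaustive)."""
    return re.fullmatch(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}",
                        text.strip().upper()) is not None

def capture_field(transcripts, validate, max_attempts=2):
    """Validate a spoken field across attempts; after max_attempts
    failures, switch to the uncertainty path instead of re-asking."""
    for attempt, text in enumerate(transcripts[:max_attempts], start=1):
        if validate(text):
            return {"status": "captured", "value": text.strip().upper(),
                    "attempts": attempt}
    # e.g. confirm via an SMS link, or defer to post-call reconciliation
    return {"status": "uncertainty_path"}

# First transcription is garbled; the second attempt parses cleanly.
print(capture_field(["sw1 a1aa", "sw1a 1aa"], looks_like_uk_postcode))
# → {'status': 'captured', 'value': 'SW1A 1AA', 'attempts': 2}
```

The key design choice is that failure exits to a different path rather than looping, matching the article's point about not repeatedly asking callers to spell things out.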

Human escalation paths are essential and take two forms. First, direct transfer to human agents, which is particularly important when callers are stressed and want that option. Second, human-in-the-loop approval where a human provides instructions or approves specific steps without taking over the entire conversation. This proves especially valuable for high-value or sensitive operations.

A parallel monitoring stream runs a fast LLM as a judge alongside conversations, continuously verifying that conversations remain on track. If the monitoring model detects deviation, it uses context to pull the primary model back to expected behavior. This real-time quality assurance layer catches issues before they compound.
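A minimal shape for that parallel monitor: a cheap judge call scores each turn against the expected workflow state and, on deviation, produces a corrective system message for the primary model. The judge interface, threshold, and stub below are illustrative assumptions:

```python
def monitor_turn(judge, transcript, expected_state, threshold=0.7):
    """Run a fast LLM-as-judge over the latest turns; return a
    corrective instruction for the primary model if off-track."""
    verdict = judge(transcript, expected_state)
    if verdict["on_track"] >= threshold:
        return None  # conversation is on track, do nothing
    return {
        "role": "system",
        "content": (f"Steer back to the {expected_state['step']} step: "
                    f"{verdict['reason']}"),
    }

# Stub standing in for a fast judge model call.
def stub_judge(transcript, expected_state):
    drifted = "warranty terms" in transcript[-1]
    return {"on_track": 0.3 if drifted else 0.9,
            "reason": "caller asked about warranty; park it and continue booking"}

msg = monitor_turn(stub_judge,
                   ["Agent: When works for you?",
                    "Caller: What about warranty terms?"],
                   {"step": "appointment_booking"})
print(msg is not None)  # → True
```

In production the judge runs concurrently with the main turn so it adds no latency to the caller-facing path; only its corrective message, when produced, enters the primary model's context.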

## Technical Implementation Details

### Cascade vs. End-to-End Architectures

When asked about cascade versus end-to-end (E2E) architectures, Panos notes that real-time E2E models began improving significantly with recent releases of Gemini and GPT-4o-mini from OpenAI. However, earlier versions exhibited poor performance on multi-turn conversations and function calling. Even with improvements, limitations remain for different languages and accents. Operating in the UK, Elyos AI needs to handle diverse accents, and cascade architecture allows them to use custom-trained ASR models that significantly outperform general-purpose models.

While E2E models theoretically should deliver better latency by eliminating round trips, the technology isn't yet mature enough for their requirements. The team maintains close relationships with both OpenAI and Google teams and tracks ongoing improvements.

### Handling Interruptions

Interruptions represent one of the hardest problems in voice AI. The challenge isn't detecting the interruption itself but managing how text-to-speech playback triggers and cancels. Since most agents operate in half-duplex mode (only one participant speaks at a time), the key is making generated voice pause naturally rather than cutting off abruptly mid-word or mid-letter.

Default interruption handling in frameworks like Livekit cancels entire TTS generation immediately. Newer models, including offerings from Cartesia, support better pause behavior. The technique involves implementing an observer pattern that monitors conversations and cancels TTS only when complete words or sentences (depending on desired granularity) have been spoken. This doesn't eliminate interruptions but manages the perception of interruptions, making them feel more natural—similar to human conversation where one person completes a word before the other begins speaking.
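The observer pattern described here can be sketched as a playback wrapper that, on interruption, finishes the word currently being spoken before stopping. The word list stands in for timestamp-aligned audio chunks, which a real TTS stream would need from the provider; that simplification is an assumption:

```python
class GracefulPlayback:
    """Play TTS word by word; an interrupt stops playback after the
    current word instead of cutting audio mid-word."""

    def __init__(self, words):
        self.words = words
        self.interrupted = False
        self.spoken = []

    def interrupt(self):
        """Called by the observer watching the caller's audio stream."""
        self.interrupted = True

    def play(self, on_word=None):
        for word in self.words:
            self.spoken.append(word)  # this word's audio plays in full
            if on_word:
                on_word(word)
            if self.interrupted:
                break  # stop cleanly at the word boundary
        return " ".join(self.spoken)

pb = GracefulPlayback("Your engineer will arrive between nine and eleven".split())
# Simulate the caller interrupting while "arrive" is being spoken.
pb.play(on_word=lambda w: pb.interrupt() if w == "arrive" else None)
print(pb.spoken[-1])  # → arrive
```

Swapping the boundary check from words to sentences gives the coarser granularity the article mentions; the interruption still happens, but the caller hears a complete unit before the agent yields.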

### State Management and Workflows

The workflow engine built in-house combines the capabilities of automation tools like Zapier or n8n with custom-built tool functions. The system gives agents enough freedom to move from step A to step B even when inputs vary slightly while optimizing for high reliability, especially for sensitive operations like invoicing and payments that must be 100% correct.

For sensitive workflows, agents don't automatically update state—that's handled in the backend state machine. Each workflow step defines success criteria and failure handling, with feedback pushed back to the LLM. LLMs excel at reading error messages and adapting behavior, enabling iterative improvement within conversations. This approach combines deterministic steps (executing specific code through functions) with stochastic steps (additional model invocations for reasoning).

A concrete example illustrates the approach: When a customer calls to book a boiler service, the first deterministic step identifies the customer. Without correct identification, proceeding to book an appointment is impossible. After identification, another deterministic step creates the job record. Then a stochastic LLM-driven step analyzes customer location, contract details, and coverage to determine response time and cost. Finally, deterministic steps handle appointment booking, invoicing, and payment processing. This mix of deterministic and stochastic operations provides both flexibility and reliability.
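The boiler-service journey above can be expressed as an ordered list of typed steps, where deterministic steps run code and stochastic steps would invoke a model. The step names, IDs, and values below are illustrative assumptions, with lambdas standing in for real tool functions and model calls:

```python
def run_workflow(steps, ctx):
    """Execute steps in order; each returns (ok, updates). On failure
    the workflow enters a named error state so recovery can take over."""
    for name, _kind, fn in steps:
        ok, updates = fn(ctx)
        if not ok:
            ctx["error_state"] = name
            return ctx
        ctx.update(updates)
    ctx["status"] = "complete"
    return ctx

steps = [
    ("identify_customer", "deterministic",
     lambda c: (c.get("phone") is not None, {"customer_id": "c-42"})),
    ("create_job", "deterministic",
     lambda c: (True, {"job_id": "j-7", "job_type": "boiler_service"})),
    ("quote", "stochastic",  # would call the LLM with contract + location
     lambda c: (True, {"response_hours": 4, "price_gbp": 120})),
    ("book_and_invoice", "deterministic",
     lambda c: (True, {"appointment": "Tue 09:00"})),
]
result = run_workflow(steps, {"phone": "+44 7700 900123"})
print(result["status"], result["appointment"])  # → complete Tue 09:00
```

Because identification is a hard precondition, a call with no usable phone number never reaches booking: the workflow stops in the `identify_customer` error state, where the backend's recovery logic takes over.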

### Multi-Language Support

While Elyos AI primarily operates in English for UK and US customers, they've experimented with other languages including Greek and French. Models generally perform much better in English, and performance varies significantly by language. The cascade architecture advantage extends to language support—different ASR providers excel at different languages (e.g., Gladia performs better on French due to French founders prioritizing that language).

Prompting and model understanding varies substantially across languages, with Greek presenting particular challenges. One successful approach uses real-time translation services like DeepL or similar tools to translate on the fly. This keeps function calling and deterministic logic in English while only translating user-facing speech, significantly improving reliability compared to running everything in non-English languages.
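Keeping the reasoning loop in English while translating only user-facing text can be wrapped around the pipeline like this. The `translate` function stands in for a service such as DeepL; its interface and the stubs below are illustrative assumptions:

```python
def handle_turn(user_text, user_lang, agent, translate):
    """Translate inbound speech to English, run the English-only
    agent (tools, prompts, state), then translate the reply back."""
    english_in = translate(user_text, source=user_lang, target="en")
    english_out = agent(english_in)  # function calling stays in English
    return translate(english_out, source="en", target=user_lang)

# Stub translator and agent for illustration.
def stub_translate(text, source, target):
    table = {("el", "en"): {"θέλω ραντεβού": "I want an appointment"},
             ("en", "el"): {"Booked for Tuesday.": "Κλείστηκε για Τρίτη."}}
    return table.get((source, target), {}).get(text, text)

reply = handle_turn("θέλω ραντεβού", "el",
                    agent=lambda text: "Booked for Tuesday.",
                    translate=stub_translate)
print(reply)  # → Κλείστηκε για Τρίτη.
```

The design choice is that only the surface text crosses the language boundary; intents, tool arguments, and workflow state never leave English, which is where the article says reliability is highest.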

### Training Custom TTS Models

For custom text-to-speech voices, 2-3 hours of high-quality audio typically suffices, though the specific requirement depends on language and intended use. The critical factor is simulating the tone and conversation style where the voice will be used—a customer service voice differs significantly from other applications in terms of tone, interpretation, and pace.

Audio quality must be high, and ideally the recording environment should match deployment conditions. For phone-based agents, recording in similar acoustic environments improves how natural the voice sounds over phone systems. These environmental factors significantly impact tone and flow in ways that matter for production deployment.

### Framework Selection

The team experimented with various orchestration frameworks but found most too slow for voice applications. Many frameworks perform adequately for email or WhatsApp communication where latency tolerance is higher, but they add both complexity and latency to real-time voice pipelines. Frameworks are inherently opinionated, requiring careful evaluation before introducing them to latency-critical paths.

For non-voice applications (email, text communication), using frameworks like LangChain or similar tools is perfectly acceptable. However, for voice agents targeting sub-second response times, the overhead often proves prohibitive. This led Elyos AI to build custom solutions optimized specifically for their latency requirements.

## Key Metrics and Monitoring

Panos emphasizes several critical metrics beyond the commonly tracked time-to-first-token. Groundedness represents a super-critical metric—ensuring the agent doesn't hallucinate and follows defined scripts and workflows. Conversation quality metrics include interruption handling and word repetition. Outcome metrics measure whether the agent actually accomplished its intended task.

General sentiment tracking proves valuable for both understanding caller experience and how callers felt their request was handled. This provides insights for rapid improvement. Tracking the most common failure modes enables teams to focus optimization efforts where they'll have the greatest impact. Automating analysis and improvement based on these failure patterns accelerates quality improvements.

The system monitors all calls post-conversation through their in-house quality assessment tools, enabling continuous learning and refinement of both workflows and model behavior.

## Real-World Performance

In production, approximately 15% of calls require human involvement, meaning roughly 85% full automation. However, "requiring human involvement" doesn't always mean transfer—it includes cases where humans should review the interaction afterward or provide approval for specific steps. The percentage varies significantly by customer based on their systems, workflows, and business models.

Some customers achieve 0% transfers because their systems and workflows enable complete automation. Others prefer human involvement for high-value interactions, such as solar installation quotes involving $20,000-30,000 projects where customers prefer speaking with humans. Most common scenarios requiring human touch include complex complaints, questions about systems not integrated with Elyos AI (e.g., records from five years ago in a legacy system), and other edge cases that deterministic systems cannot easily handle.

## Critical Success Factors and Lessons Learned

Several key insights emerge from Elyos AI's experience building production voice agents for a demanding vertical:

The importance of vertical specialization cannot be overstated. Rather than building horizontal general-purpose agents, focusing deeply on home services workflows enabled optimizations and reliability that generic solutions couldn't achieve. Understanding domain-specific failure modes, typical customer stress levels during emergency calls, and industry-specific terminology and processes all contribute to success.

Orchestration matters more than individual component selection. While choosing good ASR, LLM, and TTS providers is important, how those components are orchestrated—including warm starts, regional clustering, dynamic provider routing, and context management—determines overall system performance and reliability.

The "just-in-time" context approach represents a significant departure from conventional thinking about maintaining comprehensive conversation history. Actively pruning context and injecting only what's needed for the current step reduces model confusion, improves latency, and increases reliability. This requires sophisticated understanding of workflow state and careful prompt engineering.

Treating workflows as state machines rather than relying entirely on LLM autonomy provides the safety and reliability required for production systems, especially those handling payments and other sensitive operations. The combination of deterministic and stochastic steps offers flexibility where it's valuable while maintaining control where it's critical.

Real-time monitoring and parallel quality assessment catch problems before they cascade. Running a fast LLM as a judge in parallel enables course correction during conversations rather than only learning from post-call analysis.

The cascade architecture, despite theoretical advantages of E2E models, currently provides better production reliability due to maturity of components and ability to optimize each independently. This may change as E2E models improve, but for now, the ability to select best-in-class ASR for specific accents and languages, optimize LLM selection by region and time, and fine-tune TTS quality outweighs the latency benefits E2E models promise but don't yet consistently deliver.

Observability from day one is essential. Early investment in comprehensive monitoring enables rapid iteration, better provider support relationships, and data-driven optimization decisions. Understanding exactly where failures occur and why enables targeted improvements rather than guessing.

Finally, benchmarking against human agents rather than theoretical perfection provides realistic performance targets and often reveals areas where AI agents already exceed human consistency and accuracy, validating the approach and identifying opportunities for expanding automation scope.
