ZenML

AI-Powered Voice Agents for Proactive Hotel Payment Verification

Perk 2025

Perk, a business travel management platform, faced a critical problem where virtual credit cards sent to hotels sometimes weren't charged before guest arrival, leading to catastrophic check-in experiences for exhausted travelers. To prevent this, their customer care team was making approximately 10,000 proactive phone calls per week to hotels. The team built an AI voice agent system that autonomously calls hotels to verify and request payment processing. Starting with a rapid prototype using Make.com, they iterated through extensive prompt engineering, call structure refinement, and comprehensive evaluation frameworks. The solution now successfully handles tens of thousands of calls weekly across multiple languages (English, German), matching or exceeding human performance while dramatically reducing manual workload and uncovering additional operational insights through systematic call classification.

Industry

Tech


Overview

Perk is a business travel management platform with a distinctive value proposition: business travelers can book accommodations without using personal credit cards, as Perk handles payment directly through virtual credit cards (VCCs) or credit lines. The company’s mission centers on eliminating “shadow work”—the non-core tasks that consume approximately seven hours per week per person across organizations.

The case study focuses on solving a high-impact but relatively infrequent problem: hotels sometimes fail to note or charge the virtual credit cards Perk sends them. While the process generally works well, when it fails, travelers arrive after exhausting journeys to find their rooms unpaid or even released. This represents one of the most painful possible customer experiences, directly undermining Perk’s core value proposition.

To proactively prevent these scenarios, Perk’s customer care team was making approximately 10,000 calls per week to hotels within 48 hours of guest check-in, verifying that VCCs had been received and requesting immediate payment processing. The team recognized this as an ideal candidate for AI automation and built a voice agent system that now handles tens of thousands of calls weekly across multiple languages.

Initial Exploration and Prototyping

The journey began approximately two years before production deployment when the team experimented with voice AI technologies. Philip, a senior software engineer, recalls seeing a Google demonstration about six or seven years prior showing an AI assistant booking appointments, which planted the seed for voice automation. When ChatGPT and OpenAI’s capabilities exploded around two years ago, the team began experimenting with connecting OpenAI’s language models to text-to-speech services like ElevenLabs and Deepgram, creating proof-of-concept demonstrations like booking breakfast additions to hotel reservations.

This early experimentation proved crucial when the virtual credit card problem surfaced during road mapping sessions with Steve (Product Manager) and Gabby (Engineering Manager). The team immediately recognized the connection between their experimental voice technology and this real production problem. As Gabby noted, the technology had evolved significantly over those two years in terms of reasoning capability, speed, and latency, making it viable for actual production use.

The initial prototype was developed during an AI workshop and hackathon for non-technical employees. Within approximately one to one-and-a-half weeks, the team had a working prototype built entirely on Make.com, a no-code workflow automation platform. This approach was deliberately chosen to accelerate development without requiring backend engineering changes or code deployment to production systems.

Technical Architecture and Integration Approach

The Make.com workflow integrated multiple systems and services: Zendesk for the tickets that trigger and track calls, Twilio for telephony, OpenAI's language models for conversation and classification, and text-to-speech services such as ElevenLabs and Deepgram for voice output.

This architecture’s key advantage was that it operated completely external to Perk’s core platform, essentially replicating human agent workflows without requiring code changes or infrastructure deployment. As Gabby emphasized, they created what amounted to an automation workflow using AI where needed rather than simply “putting AI on everything.”

Prompt Engineering and Conversation Design Evolution

The team’s journey through prompt engineering and conversation design reveals important lessons about production LLM deployment:

Initial Single-Agent Approach

The first iteration used a single comprehensive prompt giving the AI agent its complete mission with general guidelines. Internal testing quickly revealed this approach had poor reliability and success rates. The agent was too unconstrained, leading to unpredictable behavior and frequent failures.

One memorable example: when the team gave the agent a role as “an experienced travel agent,” it took this too literally and began believing it could change bookings, cancel reservations, and exercise authority beyond its intended scope. The team learned that narrowing scope and being extremely specific about permitted actions was critical.

Structured Multi-Stage Approach

The breakthrough came from breaking the call into distinct moments or stages, each with its own focused prompt.

Each stage maintained context from previous stages to ensure conversational coherence while having narrowly defined responsibilities. This structural approach dramatically improved reliability and success rates.
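The staged pattern described above can be sketched as a simple loop: each stage gets a narrow prompt, and the running transcript is passed forward so later stages keep conversational context. A minimal sketch, with illustrative stage names and instructions (the actual stages and prompts are not enumerated in the source):

```python
# Hypothetical sketch of the staged-call pattern: each stage has a narrowly
# scoped prompt, and the accumulated transcript is fed into the next stage
# so the conversation stays coherent. Stage names are assumptions.

STAGES = [
    ("greeting", "Greet the hotel and identify yourself as calling about a booking."),
    ("verify_card", "Ask whether the virtual credit card on file has been received."),
    ("request_charge", "If the card was received, request that it be charged before check-in."),
    ("wrap_up", "Thank the staff member and end the call politely."),
]

def run_call(llm, booking: dict) -> list[dict]:
    """Drive one call through the fixed stages, accumulating context."""
    transcript: list[dict] = []
    for name, instructions in STAGES:
        prompt = (
            f"Booking: {booking}\n"
            f"Conversation so far: {transcript}\n"
            f"Your only task in this stage ({name}): {instructions}"
        )
        reply = llm(prompt)  # placeholder for the real model/voice call
        transcript.append({"stage": name, "agent": reply})
    return transcript
```

The key property is that each stage's prompt restates the narrow objective, which is what the team found dramatically improved reliability over a single broad prompt.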

Text-to-Speech Challenges

The team encountered surprising challenges with voice rendering that text-based LLM work doesn't face, such as pronunciation and pacing. These were addressed through extensive prompt iteration and text-to-speech markup language specifications.

Handling Edge Cases and Off-Ramps

Initial prompts were too mission-focused, creating what the team humorously described as a “Karen” personality—the AI would persistently demand to speak to managers if the initial contact couldn’t help. The team had to build in appropriate “off-ramps” for scenarios where the objective genuinely couldn’t be completed, allowing graceful conversation termination rather than aggressive persistence.
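The off-ramp fix amounts to adding explicit termination rules to the stage prompts. A hypothetical illustration of what such rules might look like (this wording is an assumption, not Perk's actual prompt text):

```python
# Illustrative off-ramp instructions appended to a stage prompt, in the
# spirit of the fix described above. The wording is an assumption.

OFF_RAMP_RULES = """
If the person you reach cannot help and no colleague is available:
- Do NOT insist on speaking to a manager.
- Thank them, note the outcome, and end the call politely.
If the hotel states they will not charge the card, record that answer and end the call.
""".strip()

def stage_prompt(base_instructions: str) -> str:
    return f"{base_instructions}\n\n{OFF_RAMP_RULES}"
```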

Evaluation and Quality Assurance Framework

The evaluation approach demonstrates sophisticated LLMOps practice, evolving through multiple iterations:

Early Manual Evaluation

In the earliest phase with Make.com prototypes, Steve and Gabby would literally sit in a room watching transcripts appear slowly on a screen, unable to even hear the call, just seeing text output. They would manually assess each call, noting failures and patterns.

As call volume increased to handfuls per day, the entire team would listen to every single call, maintaining a Google spreadsheet with detailed annotations about what went right, wrong, and why. This manual phase was crucial for understanding the problem space, identifying failure modes, and building intuition about conversation dynamics.

Classification Evaluation

The team developed success criteria in collaboration with their accommodations team, identifying three primary success outcomes.

They also cataloged numerous failure scenarios: random disconnections, inability to navigate IVR, hotel staff not understanding the request, call getting stuck in loops, and more.

Using this taxonomy, they built an LLM-based classifier that evaluates call transcripts and assigns outcomes. Critically, they created a ground truth dataset from their manually labeled calls, allowing them to run evaluations against known correct classifications whenever they modified prompts or logic.
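The regression-testing loop described above can be sketched in a few lines: run the classifier over the manually labeled calls and compare against the known-correct labels whenever prompts or logic change. `classify` here stands in for the team's LLM-based classifier; any callable from transcript to label fits:

```python
# Minimal sketch of evaluating a call classifier against a manually labeled
# ground-truth dataset. `classify` is a stand-in for the LLM-based classifier.

def evaluate(classify, ground_truth: list[tuple[str, str]]) -> float:
    """Return accuracy of `classify` over (transcript, expected_label) pairs."""
    correct = sum(
        1 for transcript, expected in ground_truth
        if classify(transcript) == expected
    )
    return correct / len(ground_truth)
```

In practice this runs after every prompt or logic change, so a drop in accuracy against the labeled set is caught before the change reaches production calls.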

The classification evaluation initially was extremely strict—only accepting exact phrasings as success. Through iteration and examining false negatives, they learned to recognize natural language variations. For example, “Yeah, I’m on that for you, don’t worry about it” should register as success, as should “We’ll charge it tomorrow” when tomorrow is the check-in date.

A significant evolution was moving from single-reason classification to multi-reason classification. The team discovered that multiple failure modes could occur in a single call (stuck in IVR AND hotel said they won’t charge VCC). Version 2 of their classifier uses multiple flags for different aspects of the call rather than forcing selection of a single reason.
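The v2 multi-flag outcome described above can be modeled as a record of independent booleans rather than one exclusive reason, so a single call can carry several failure modes at once. A sketch with illustrative field names (not Perk's actual schema):

```python
# Sketch of a multi-flag call outcome: independent flags instead of a single
# exclusive failure reason. Field names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class CallOutcome:
    success: bool = False
    stuck_in_ivr: bool = False
    hotel_refused_vcc: bool = False
    disconnected: bool = False
    notes: str = ""

    def failure_reasons(self) -> list[str]:
        """All failure flags set on this call, possibly more than one."""
        flags = {
            "stuck_in_ivr": self.stuck_in_ivr,
            "hotel_refused_vcc": self.hotel_refused_vcc,
            "disconnected": self.disconnected,
        }
        return [name for name, value in flags.items() if value]
```

This is what lets the example from the text (stuck in IVR AND hotel refused the VCC) be recorded faithfully instead of being forced into one bucket.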

Behavioral Evaluation

Beyond success/failure classification, the team implemented behavioral evaluation using LLM-as-judge approaches, scoring every call along dimensions such as repetitiveness and persistence.

Each dimension receives scores that are tracked over time as metrics. This allowed the team to detect when prompt changes inadvertently degraded behavioral quality. In one instance, they made a small change that increased repetitive behavior, caught it in the metrics, and rolled back the change.
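The rollback incident above suggests a simple guard: compare per-dimension mean scores before and after a prompt change and flag any dimension that degrades beyond a tolerance. A sketch, where dimension names and the threshold are assumptions:

```python
# Sketch of detecting behavioral regressions after a prompt change by
# comparing per-dimension judge scores against a baseline. The tolerance
# and the convention "higher score is better" are assumptions.

from statistics import mean

def detect_regression(baseline: dict[str, list[float]],
                      current: dict[str, list[float]],
                      tolerance: float = 0.5) -> list[str]:
    """Return dimensions whose mean score dropped by more than `tolerance`."""
    regressed = []
    for dim, scores in current.items():
        if dim in baseline and mean(baseline[dim]) - mean(scores) > tolerance:
            regressed.append(dim)
    return regressed
```

Tracked as time-series metrics, this is the kind of check that caught the increase in repetitive behavior and prompted the rollback.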

Production Monitoring and Sampling

Even with automated evaluations, the team continues regular sampling of actual calls. They recognize that non-deterministic natural language behavior cannot be fully captured by any dataset, no matter how comprehensive. After deploying changes, they continue listening to sample calls to catch edge cases and unanticipated behaviors.

They maintain a Slack channel where, initially, every call's outcome was posted, enabling daily team review sessions. At scale, with tens of thousands of calls weekly, they now focus on failure cases, reviewing samples to understand what went wrong and whether it represents a pattern requiring prompt iteration.

Ground Truth Evolution

The team recognizes their ground truth dataset must evolve. As they encounter new samples and edge cases not covered by existing data, they add them to maintain evaluation coverage of the actual production distribution.

Philip notes the balance between automated evaluation (fast feedback loops for iteration) and human sampling (catching what automated systems miss). The evaluations build confidence for deployment but don’t replace human judgment entirely.

Production Deployment and Scaling

The production deployment strategy was notably pragmatic:

Gradual Rollout

Rather than a big-bang launch, the team gradually increased call volume. They started with a few calls per day, then dozens, then hundreds, eventually scaling to thousands. This gradual approach allowed them to maintain quality oversight and catch issues before they affected large volumes.

The Make.com workflow’s ticket-reading approach made this easy to control—they could simply adjust how many Zendesk tickets the workflow processed, effectively throttling call volume while monitoring quality.

Multi-Language Expansion

After validating in the US and Canada (English), they expanded to the UK (also English), then Germany (their largest international market). The Germany expansion revealed an important technical insight: they initially kept prompts in English while having the AI speak German. Data quality suffered significantly. When they translated prompts to German with help from a data engineer, results improved dramatically. The team hypothesized that the LLM was translating English prompts to German, generating responses, then translating back to English, losing nuance and accuracy in the process. This learning established a principle: prompts should be in the native language of the desired output.
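The resulting principle can be implemented as a per-locale prompt table rather than on-the-fly translation. A sketch, where the German wording is a rough illustrative translation, not Perk's production prompt:

```python
# Sketch of the "prompt in the output language" principle: one natively
# written prompt per locale, selected by the booking's market, instead of
# an English prompt the model must translate. Prompt text is illustrative.

PROMPTS = {
    "en": "You are calling a hotel to confirm the virtual credit card was received.",
    "de": "Sie rufen ein Hotel an, um zu bestätigen, dass die virtuelle Kreditkarte eingegangen ist.",
}

def prompt_for(locale: str) -> str:
    # Fall back to English for locales without a native prompt yet.
    return PROMPTS.get(locale, PROMPTS["en"])
```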

Spanish is now on the roadmap, with potential for many more languages given Perk's European customer base.

Infrastructure Migration

While the system launched to production using Make.com, the team is now migrating most functionality to their own infrastructure. However, they note that the Make.com “prototype” effectively became the production system and served them well at scale. The migration appears driven by long-term infrastructure control and integration needs rather than Make.com’s inability to handle the load.

Outcomes and Impact

Quantitative Results

The system now handles tens of thousands of calls per week across multiple languages, matching or exceeding human agents' success rates while dramatically reducing the customer care team's manual workload.

Qualitative Benefits

Steve emphasizes the scale advantage: the AI can make far more calls than the overwhelmed customer care team could. Importantly, Perk didn’t have customer care teams of sufficient size in many international markets (Germany, Spain, Sweden, France), so the AI enabled proactive outreach that was previously impossible.

Perhaps unexpectedly, the system generates valuable operational data. By systematically classifying every call, Perk now has rich data about failure modes in their payment process: hotels that didn’t receive VCCs, hotels that don’t accept virtual credit cards, and other process breakdowns they weren’t aware of. This data is driving a roadmap of future improvements to address root causes.

Team and Organizational Impact

The success created immediate internal demand. Steve reports that the day they announced the initial release (handling about 500 calls per week at that point), his Slack filled with messages from teams across the company saying “We have this other problem where half the team is calling places all day.” The solution found clear product-market fit internally, with expanding use cases planned for the next year.

Key Lessons and LLMOps Insights

Rapid Prototyping with No-Code Tools

The Make.com approach allowed the team to build, iterate, and even deploy to production without traditional software development cycles. This dramatically accelerated learning and time-to-value. The ability to connect to existing systems (Zendesk, Twilio) and orchestrate complex workflows visually was crucial for early experimentation.

Importance of Narrowing Scope

A consistent theme throughout the case study is the need to constrain AI agent behavior. Single, broad prompts failed reliably. Breaking tasks into focused stages with specific responsibilities dramatically improved performance. This aligns with broader LLMOps best practices around task decomposition.

Evaluation as a Practice, Not a Phase

The team’s evaluation approach evolved from manual review to automated classification to behavioral scoring, but never eliminated human sampling. They treat evaluation as an ongoing practice that enables confident iteration rather than a one-time validation checkpoint. The ground truth dataset evolves with production experience, and metrics are monitored continuously to detect degradation.

Philip’s journey—learning about evaluations through trial, error, and hard-won lessons—illustrates that rigorous, data-driven iteration separates successful LLM products from failed experiments. The team’s daily practice of reviewing metrics, examining failures, and sampling calls embodies the scientific mindset required for production LLM systems.

Multimodal and Real-Time Challenges

Voice AI introduces challenges beyond text-based systems: pronunciation, pacing, latency, and the unforgiving nature of real-time human conversation. The team had to learn text-to-speech markup language and iterate on voice rendering in ways that wouldn’t occur with text applications. Interestingly, they report that latency hasn’t been a significant problem with current model speeds—the technology has evolved to handle real-time conversation adequately.

Language and Localization

The team’s learning about keeping prompts in native languages rather than English reveals an important technical consideration for multilingual deployments. The hypothesis about translation overhead degrading performance should inform architecture decisions for global products.

Balancing Automation and Human Judgment

Despite sophisticated automation, the team maintains human oversight through sampling and uses multiple evaluation approaches (classification, behavioral scoring, manual review). They recognize the limitations of any single evaluation method and use a portfolio approach to maintain quality confidence.

Product-Market Fit Within Organizations

The rapid internal demand for similar solutions after the initial launch suggests that voice AI agents for routine but voluminous calling tasks represent a significant opportunity across many business functions. The “shadow work” framing is apt—these are tasks that must be done but don’t represent core value creation, making them ideal automation candidates.

Technical Stack Summary

While the full production stack is evolving, the key components include Make.com for workflow orchestration, Zendesk for ticket data, Twilio for telephony, OpenAI language models for conversation and classification, text-to-speech services such as ElevenLabs and Deepgram, and Slack for call-outcome monitoring.

Future Directions

The team is focused on migrating functionality from Make.com to their own infrastructure, expanding language support (Spanish next, with more European languages to follow), extending voice agents to the additional internal use cases surfaced after launch, and using call-classification data to address root causes in the payment process.

The case study represents a mature LLMOps implementation that moved from experimentation to production impact through disciplined iteration, comprehensive evaluation, and pragmatic technology choices that prioritized speed of learning over architectural purity.
