## Overview
Xelix built an AI-powered help desk system designed specifically for accounts payable (AP) teams to handle the overwhelming volume of vendor communications. The company positions itself as an "accounts payable control center" with multiple AI-enabled products, and the help desk is its newest offering—an AP-specific ticketing system that leverages generative AI to automate responses to common vendor inquiries.
The core problem addressed is that AP teams at large organizations receive upwards of 1,000 emails per day from vendors and suppliers, covering diverse requests ranging from invoice status inquiries and payment reminders to statement reconciliation requests and vendor onboarding documentation. These communications lack structure, with vendors often sending duplicate messages, and AP professionals spend considerable time manually responding to each inquiry by looking up invoice statuses in ERP systems, tracing payment workflows internally, and crafting responses.
The team took a measured approach to building this system, starting nearly two years ago with manual analysis of actual customer email data. Product manager Tal and the team spent extensive time manually tagging thousands of emails to understand the distribution of request types, finding that roughly 20% of mailbox volume fell into specific, automatable categories. This manual tagging process served multiple purposes: it built deep empathy for the customer's workload (taking a full day to tag just 250 emails helped them appreciate teams handling 1,000+ daily), provided real production-representative data to inform design decisions, and helped identify which request types were both high-volume and time-consuming to handle.
## Development Methodology and Product Evolution
The help desk product was the first at Xelix to use the Carpaccio development methodology, which emphasizes "daily slices"—production-ready code increments completed each day that validate hypotheses and enable rapid iteration. While not literally shipping customer-facing features daily from day one, this approach meant treating each day's work as a hypothesis to be validated, such as "can we authenticate with an email client?" or "can we retrieve and store emails?" This enabled tight feedback loops and continuous validation of technical decisions.
The team released a beta version of the help desk integrated within their existing statements module, which had an established user base. This strategic decision provided immediate access to real users who could provide feedback on what worked and what didn't, accelerating the product development cycle.
One particularly interesting aspect of the product evolution was the UX journey. Initially, the team designed an interface that closely mimicked Outlook and traditional email clients, reasoning that this familiarity would reduce onboarding friction and training requirements for AP professionals migrating off their existing mailboxes. While this initially worked well for adoption, it brought problematic behaviors from traditional email systems—like losing emails in folders and difficulty tracking conversations—and discouraged users from adopting the AI-enabled and ticketing features that differentiated the platform.
The team pivoted to a dual-interface approach. They maintained the traditional inbox view but developed a more robust ticket-centric display that foregrounded ticketing fields, AI-generated insights, and enriched data rather than the email metaphor. Interestingly, they observed a natural progression where new customers would start with the familiar inbox view but migrate to the more advanced ticket view as they became comfortable with the platform's AI capabilities. This reflects sophisticated thinking about change management and gradual user adoption of new workflows.
## Technical Architecture: The Email Processing Pipeline
The heart of the system is what the team calls the "email processing pipeline," which executes a series of both parallel and sequential tasks when an email arrives. This pipeline represents a carefully orchestrated sequence of AI and traditional ML tasks designed to enrich incoming communications with the context needed to generate accurate responses.
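The talk doesn't show orchestration code, but the shape described is clear: independent enrichment tasks fan out in parallel, then dependent steps run in sequence. A minimal Python sketch of that structure follows; the `Ticket` shape and the stub task bodies are assumptions for illustration, not Xelix's actual implementation.

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical data shapes -- the talk does not describe the real schema.
@dataclass
class Ticket:
    sender: str
    body: str
    attachments: list[str] = field(default_factory=list)
    enrichments: dict = field(default_factory=dict)

async def identify_vendor(ticket: Ticket) -> None:
    # Placeholder for the hierarchical vendor-matching step.
    ticket.enrichments["vendor"] = {"name": None, "confidence": 0.0}

async def classify_request(ticket: Ticket) -> None:
    # Placeholder for the email-type classifier.
    ticket.enrichments["category"] = "invoice_inquiry"

async def extract_invoice_details(ticket: Ticket) -> None:
    # Placeholder for invoice-field extraction from body and attachments.
    ticket.enrichments["invoice_candidates"] = []

async def process_ticket(ticket: Ticket) -> Ticket:
    # Independent enrichment tasks fan out in parallel; matching and
    # response generation depend on their outputs and run afterwards.
    await asyncio.gather(
        identify_vendor(ticket),
        classify_request(ticket),
        extract_invoice_details(ticket),
    )
    # Sequential, dependent steps would follow here:
    #   match_invoices(ticket) -> generate_response(ticket)
    return ticket

if __name__ == "__main__":
    t = Ticket(sender="ap@vendor.example", body="Where is invoice 123?")
    asyncio.run(process_ticket(t))
```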
### Data Enrichment and Preprocessing
When a support ticket (which corresponds to an email thread) arrives, the system first performs preprocessing to extract basic metadata: attachments, sender information, and email content. This initial step feeds into several downstream processes.
One of the most challenging aspects is vendor identification. The system uses a hierarchical approach drawing from three data sources: the email domain, the text content of the email, and attached documents. The team acknowledged that vendor matching is "one of the trickiest problems" they face, complicated by factors like shared email domains that serve multiple vendors, subsidiaries with different names than parent companies, vendor renamings, and naming inconsistencies between how a vendor identifies itself versus how they're recorded in the customer's master vendor data. For example, a vendor might appear as "Company LLC" in the ERP system but sign emails and documents as just "Company."
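To make the hierarchy concrete, here is a rough sketch: domain first, then vendor names found in the email body, then names from attached documents, with legal-suffix normalization to handle the "Company LLC" versus "Company" problem. The vendor schema, fallback order details, and confidence values are hypothetical.

```python
import re

LEGAL_SUFFIXES = re.compile(r"\b(llc|ltd|inc|gmbh|plc|corp)\.?$")

def normalize(name: str) -> str:
    # Strip punctuation, casing, and trailing legal suffixes so that
    # "Company LLC" and "company" compare equal.
    name = re.sub(r"[^\w\s]", "", name).strip().lower()
    return LEGAL_SUFFIXES.sub("", name).strip()

def match_vendor(email_domain, body_names, attachment_names, master_vendors):
    """Return (vendor, confidence) using a hierarchy of signals.

    master_vendors: list of dicts with 'id', 'name', 'domains' drawn from
    the customer's master vendor data (hypothetical schema).
    """
    # 1. An exact domain match is the strongest signal -- unless the
    #    domain is shared across several vendors, in which case fall through.
    domain_hits = [v for v in master_vendors if email_domain in v["domains"]]
    if len(domain_hits) == 1:
        return domain_hits[0], 0.95
    # 2. Normalized-name match against names found in the email body.
    candidates = domain_hits or master_vendors
    by_name = {normalize(v["name"]): v for v in candidates}
    for name in body_names:
        if (v := by_name.get(normalize(name))) is not None:
            return v, 0.8
    # 3. Names extracted from attached documents, a weaker signal.
    for name in attachment_names:
        if (v := by_name.get(normalize(name))) is not None:
            return v, 0.6
    # No confident match: downstream steps must not send financial data.
    return None, 0.0
```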
The vendor identification process is particularly critical because the system integrates with customers' ERP databases to retrieve invoice and payment information. Xelix's unique advantage is having access to this structured financial data through their other product modules, which enables context-augmented generation that general-purpose AI assistants cannot provide. If the system cannot confidently identify the vendor, it won't generate a response, serving as a safety mechanism to prevent sending sensitive financial information to unverified parties.
Beyond vendor identification, the pipeline performs classification to categorize emails into types relevant to accounts payable: invoice reminders, payment status inquiries, statement reconciliation requests, bank detail updates, and others. Currently, the system only generates automated responses for invoice and reminder categories, though it provides data enrichment and supporting information for other categories.
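A minimal sketch of that gating, with an illustrative category taxonomy (the actual classifier and category set aren't described at this level of detail):

```python
from enum import Enum

class Category(Enum):
    INVOICE_INQUIRY = "invoice_inquiry"
    PAYMENT_REMINDER = "payment_reminder"
    STATEMENT_RECONCILIATION = "statement_reconciliation"
    BANK_DETAIL_UPDATE = "bank_detail_update"
    OTHER = "other"

# Only these categories currently get auto-generated replies; the rest
# still receive data enrichment and supporting information.
AUTO_RESPOND = {Category.INVOICE_INQUIRY, Category.PAYMENT_REMINDER}

def should_generate_response(category: Category) -> bool:
    return category in AUTO_RESPOND
```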
### Invoice Matching and Retrieval
For emails categorized as invoice-related inquiries, the system attempts to extract invoice information from the email content and attachments—invoice numbers, amounts, dates, and other identifying details. This extraction step is crucial because the quality of what can be extracted directly impacts the confidence of downstream matching.
The system then performs machine learning-powered matching between the extracted invoice information and records in the customer's ERP system. This matching process generates probability scores indicating confidence in each potential match. Claire, the AI engineer, explained that they assess "given the entire context of what we know about the ticket, what we know about the vendor, what the best matches are, how confident are we that we've matched everything very well."
The system can handle complex scenarios like emails mentioning multiple invoices, statements bundled with multiple invoices, or threads where different emails reference different invoices. It surfaces matched invoices in a side panel with status information, amounts, dates, and confidence indicators. Importantly, when matches are imperfect—such as when an amount or date doesn't align exactly but the invoice appears to be the best match—the system still presents the match but flags it with a warning and lower confidence score.
When no good match can be found, the system provides a fallback status indicating either that the invoice doesn't exist in the system or that it's presented in such a different format that confident matching isn't possible. This transparency helps the AP professional understand what action they need to take.
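Pulling the matching behavior together, a simplified sketch might look like this; the threshold, the field-mismatch penalties, and the candidate schema are invented for illustration and are not Xelix's actual values.

```python
from dataclasses import dataclass, field

@dataclass
class MatchResult:
    invoice_id: str | None
    score: float                 # probability from the matching model
    warnings: list[str] = field(default_factory=list)

MATCH_THRESHOLD = 0.5            # illustrative cutoff

def resolve_match(extracted: dict,
                  scored_candidates: list[tuple[dict, float]]) -> MatchResult:
    """Pick the best-scoring ERP candidate and flag imperfect matches."""
    if not scored_candidates:
        # Fallback: the invoice is absent from the ERP, or formatted too
        # differently for a confident match.
        return MatchResult(None, 0.0, ["no plausible match found"])
    best, score = max(scored_candidates, key=lambda c: c[1])
    warnings = []
    # An imperfect match is still surfaced, but flagged and downweighted.
    if extracted.get("amount") and extracted["amount"] != best["amount"]:
        warnings.append("amount differs from extracted value")
        score *= 0.8
    if extracted.get("date") and extracted["date"] != best["date"]:
        warnings.append("date differs from extracted value")
        score *= 0.9
    if score < MATCH_THRESHOLD:
        return MatchResult(None, score, warnings + ["below confidence threshold"])
    return MatchResult(best["id"], score, warnings)
```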
### Response Generation
The response generation component sits at the end of the pipeline and represents what Claire calls a "context-augmented LLM" rather than a fully autonomous agent. This is a critical architectural decision: the LLM doesn't decide what actions to take or what information to retrieve. Instead, it receives validated, structured data from upstream pipeline steps and is tasked solely with synthesizing that information into a coherent, professional response.
This narrow, constrained approach to the LLM's role reflects the team's prioritization of accuracy and reliability when dealing with financial data. By separating concerns—using dedicated ML models and algorithms for extraction and matching, and reserving the LLM purely for language generation—they can validate and test each component independently and maintain tighter control over the system's behavior.
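A sketch of what "context-augmented" means in practice: the prompt is assembled entirely from data the pipeline has already validated, and the vendor-identification guard blocks generation outright when confidence is low. The field names, the 0.9 cutoff, and the prompt wording are all assumptions.

```python
def build_generation_prompt(ticket) -> str | None:
    """Assemble a prompt from validated upstream data.

    The LLM never decides what to retrieve; it only verbalizes facts the
    pipeline has already resolved. Uses the hypothetical Ticket shape
    from the earlier pipeline sketch.
    """
    vendor = ticket.enrichments.get("vendor") or {}
    if vendor.get("confidence", 0.0) < 0.9:
        # Safety mechanism: without a confident vendor identity, no reply
        # is generated, so financial data never reaches an unverified party.
        return None
    matched = ticket.enrichments.get("matched_invoices", [])
    facts = "\n".join(
        f"- Invoice {m['id']}: status={m['status']}, amount={m['amount']}, "
        f"due={m['due_date']}"
        for m in matched
    )
    return (
        "You are drafting a reply from an accounts payable team.\n"
        f"Vendor: {vendor['name']}\n"
        f"Verified invoice facts from the ERP:\n{facts}\n\n"
        "Write a short, professional reply using ONLY the facts above. "
        "Do not speculate about invoices not listed."
    )
```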
The system pre-generates responses so users don't have to wait the roughly 12 seconds that generation takes. When an AP professional opens a ticket, they immediately see both the vendor's inquiry and a ready-to-send draft response. This design choice reflects careful attention to user experience and the importance of minimizing perceived latency in production systems.
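Pre-generation is conceptually simple: generate during ingest, cache the draft, and make opening a ticket a cache read. A toy sketch, with an in-memory dict standing in for whatever persistence the real system uses:

```python
# Hypothetical cache keyed by ticket id; in production this would be a
# persisted field on the ticket record, not an in-memory dict.
_draft_cache: dict[str, str] = {}

async def pregenerate_response(ticket_id: str, prompt: str, llm_call) -> None:
    # Runs as part of the ingest pipeline, not on page load, so the
    # ~12-second generation latency never sits on the user's critical path.
    _draft_cache[ticket_id] = await llm_call(prompt)

def get_draft(ticket_id: str) -> str | None:
    # Called when the AP professional opens the ticket: effectively instant.
    return _draft_cache.get(ticket_id)
```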
### Confidence Scoring
A central element of the production system is the confidence scoring mechanism that helps users decide whether to send a generated response as-is, edit it, or discard it and write their own. The confidence score aggregates multiple factors (a sketch of one possible aggregation follows the list):
- **Extraction quality**: How much relevant information was present in the original email? Was it just a vague "where's my invoice 123?" or did it include a PDF with vendor name, amount, date, and other identifying details?
- **Match confidence**: What probability did the ML matching models assign to the invoice matches?
- **Thread complexity**: How complicated is the email thread? Are there multiple tickets, multiple invoices mentioned, or a convoluted conversation history?
- **Overall context completeness**: Do we have all the information needed to handle this ticket comprehensively?
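One plausible way to fold these factors into a single score; the weights and the multiplicative complexity penalty are purely illustrative, since the actual formula isn't disclosed:

```python
def aggregate_confidence(extraction_quality: float,
                         match_confidence: float,
                         thread_complexity: float,
                         context_completeness: float) -> float:
    """Combine the four factors into one 0-1 score.

    All inputs are assumed normalized to [0, 1]; weights are invented.
    """
    base = (0.35 * extraction_quality
            + 0.40 * match_confidence
            + 0.25 * context_completeness)
    # A convoluted thread erodes trust in every upstream signal, so it
    # scales the whole score down rather than averaging in.
    return base * (1.0 - 0.5 * thread_complexity)
```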
This multi-dimensional confidence score represents a practical solution to the inherent uncertainty in LLM-powered systems. Rather than trying to achieve 100% accuracy before deployment, the team built transparency into the system so users can make informed decisions about trust and verification.
## Email Threading and State Management
An often-overlooked complexity in building email-based systems is threading and state management. The team maintains synchronization with customers' Outlook inboxes, ensuring eventual consistency between the help desk platform and the email client. This bidirectional sync means emails sent or received in either system appear in both.
The system must intelligently determine which email thread an incoming message belongs to and which ticket within that thread it relates to. This is complicated by the concept of closed tickets: a single email thread might spawn multiple tickets over time. For example, if a vendor inquiry is resolved and the ticket marked as completed, but then a month later the vendor replies to the same email thread with a new question, the system needs to recognize this as a new issue and create a separate ticket rather than reopening the old one.
The team emphasized that despite assumptions that certain edge cases "surely don't happen," they invariably do—like colleagues starting entirely new conversations by replying to old email threads. Building robust handling for these scenarios is essential for production reliability.
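A sketch of the routing rule this implies, with a hypothetical ticket schema: a reply lands on an open ticket if one exists on its thread; otherwise a new ticket is created rather than reopening a closed one.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TicketRecord:
    ticket_id: str
    thread_id: str
    status: str                       # "open" or "closed"
    closed_at: datetime | None = None

def route_incoming_email(thread_id: str,
                         tickets_by_thread: dict[str, list[TicketRecord]]
                         ) -> TicketRecord:
    """Attach an email to an open ticket on its thread, or open a new one."""
    thread_tickets = tickets_by_thread.setdefault(thread_id, [])
    open_tickets = [t for t in thread_tickets if t.status == "open"]
    if open_tickets:
        return open_tickets[-1]       # continue the active conversation
    # All prior tickets on this thread are closed (or none exist):
    # a reply to a resolved thread is a fresh issue, never a reopen.
    new = TicketRecord(
        ticket_id=f"{thread_id}-{len(thread_tickets) + 1}",
        thread_id=thread_id,
        status="open",
    )
    thread_tickets.append(new)
    return new
```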
## Evaluation and Monitoring
The team employs multiple layers of evaluation and monitoring to assess system performance and guide iterative improvements.
### Technical Metrics
On the technical side, they track:
- **Extraction accuracy**: How well does the system extract invoice details from emails?
- **Match quality**: What are the confidence distributions of ML matching models?
- **Coverage**: What percentage of tickets can the system handle versus those that must be handled entirely by a human?
### Usage Metrics
For understanding user behavior and adoption, they monitor:
- **Generation usage rate**: How often are generated responses being used?
- **Edit distance**: When users send a generated response, how much did they modify it? If they send it verbatim, that indicates high trust and utility. If they heavily edit or discard it, that signals opportunities for improvement. (A minimal metric sketch follows this list.)
- **Feature adoption**: Which customers are using AI features and how extensively?
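The edit-distance signal is straightforward to compute; below is a minimal sketch using normalized Levenshtein distance (the system's exact metric isn't specified):

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_fraction(generated: str, sent: str) -> float:
    """0.0 means sent verbatim (high trust); values near 1.0 mean the
    draft was essentially rewritten."""
    if not generated and not sent:
        return 0.0
    return levenshtein(generated, sent) / max(len(generated), len(sent))
```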
These metrics feed back into product development conversations. For instance, if certain customers consistently heavily edit responses, the team can reach out to understand what's not working and iterate on the phrasing or information presentation.
### Product Metrics
Beyond the AI-specific metrics, the team tracks broader product health indicators:
- **Platform stickiness**: What percentage of responses are sent through the help desk versus through Outlook or other channels? This indicates whether the platform is becoming the default workspace.
- **Ticket handling metrics**: Average handling time, resolution rates, and identification of stalled tickets. A key question is whether AI feature adoption correlates with reduced handling times.
- **Automatic closure rate**: What percentage of tickets never need to be opened because they're automatically identified as spam or low-priority and resolved via custom rules?
This multi-layered evaluation approach reflects mature thinking about LLMOps—recognizing that technical accuracy metrics alone don't capture the full picture of whether an AI system is delivering value in production.
## Workflow Automation Beyond Response Generation
While response generation gets the most attention, the platform also automates complete workflows for certain request types. For statement reconciliation requests, the system can automatically kick off and perform the entire reconciliation, matching statement lines to invoices, and present the completed results to a user who only needs to review and send them to the vendor. This is a more ambitious form of automation that goes beyond drafting text to executing business logic on the user's behalf.
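A simplified sketch of that reconciliation logic, bucketing statement lines into matched, amount-mismatched, and missing; the schemas and exact-match keying are simplifications of whatever fuzzier matching the real system performs.

```python
def reconcile_statement(statement_lines: list[dict],
                        erp_invoices: dict[str, dict]) -> dict:
    """Match each statement line to an ERP invoice and classify the result.

    Line/invoice schemas are illustrative; keys by exact invoice number,
    whereas production matching would tolerate formatting differences.
    """
    results = {"matched": [], "amount_mismatch": [], "missing_from_erp": []}
    for line in statement_lines:
        inv = erp_invoices.get(line["invoice_number"])
        if inv is None:
            results["missing_from_erp"].append(line)
        elif inv["amount"] != line["amount"]:
            results["amount_mismatch"].append((line, inv))
        else:
            results["matched"].append((line, inv))
    # The user reviews this summary and sends it to the vendor; the
    # line-by-line work has already been done.
    return results
```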
## Scoping and Iteration Strategy
The team's approach to scoping reflects pragmatic wisdom about building AI products. Rather than attempting to handle every possible email type from day one—which would represent an overwhelming "blank page problem"—they focused narrowly on invoice and payment reminder inquiries. This category was selected because it:
- Represents significant volume (roughly 20% of incoming emails)
- Takes considerable time for humans to handle (requiring lookups across multiple systems)
- Aligns with Xelix's unique strengths (access to ERP data and AP domain expertise)
- Has relatively structured information requirements
The team explicitly chose not to build a "does it all AI" but rather a vertical, narrowly scoped system that could achieve high reliability within its domain. Claire emphasized they're building "multiple particular responders" rather than a single general-purpose agent, reflecting an architecture that trades breadth for depth and accuracy.
This scoping strategy enabled faster time to value and more focused iteration. The team can now expand by building additional vertical pipelines for other email categories, potentially using more agentic architectures to route emails to the appropriate specialized pipeline.
## Challenges and Tradeoffs
The team was remarkably candid about challenges and tradeoffs:
**Vendor identification remains hard**: Despite sophisticated multi-source matching, edge cases abound. The system sometimes can't determine which vendor from a set of candidates is correct, requiring fallback to invoice information or human verification.
**Threading complexity**: Email threading and multi-ticket scenarios introduce significant state management complexity that's easy to underestimate.
**Balancing familiarity and innovation**: The UX evolution from Outlook-like to ticket-centric views illustrates the tension between minimizing change management friction and surfacing new capabilities.
**Code quality versus velocity**: The Carpaccio methodology's emphasis on daily shipping creates pressure around maintaining code quality, proper feature flagging, and reliable releases as the team scales.
**Resource efficiency**: Currently the system generates responses for all invoice/reminder emails, but many of these are for tickets users never open (identified as spam or low-priority). The team recognizes the need to be more targeted about which emails warrant the computational expense of running the full pipeline.
## Future Directions
Looking ahead, the team identified three main evolution paths:
**More agentic flows**: Moving from the current highly structured vertical pipeline toward systems that can make more autonomous decisions about what information to retrieve and what actions to take, while still maintaining the narrow vertical approach by building multiple specialized responders rather than one general system.
**Expanded coverage with targeted generation**: Building out additional response pipelines for more email categories while simultaneously becoming smarter about predicting which emails users actually care about responding to, avoiding wasted computation on emails destined for spam folders.
**Custom categories and similarity matching**: Enabling customers to define their own email categories and using similarity matching to handle novel request types that don't fit predefined categories.
## LLMOps Maturity
This case study reflects relatively mature LLMOps practices for a production system:
- **Systematic scoping**: Using real data analysis to identify high-value use cases rather than chasing maximum breadth
- **Multi-stage pipelines**: Separating extraction, matching, and generation into discrete, testable components rather than relying on monolithic LLM calls
- **Confidence scoring**: Building transparency and human oversight into the system through multi-factor confidence metrics
- **Pre-generation**: Optimizing user experience by generating responses proactively rather than on-demand
- **Comprehensive monitoring**: Tracking technical, usage, and business metrics to understand system performance holistically
- **Iterative development**: Using rapid feedback loops (the Carpaccio methodology) and beta releases to validate assumptions quickly
- **Safety mechanisms**: Implementing fallbacks like refusing to generate responses when vendor identification fails
- **Constrained LLM usage**: Using LLMs for language synthesis rather than decision-making, keeping control over business logic
The team's emphasis on accuracy over breadth, human-in-the-loop workflows, and transparent confidence scoring demonstrates an appropriate level of caution when deploying LLMs in financial workflows where errors have real business consequences. Their willingness to discuss challenges and tradeoffs provides valuable insights for others building similar systems.