Fujitsu North America tackled the critical problem of after-call work inefficiency in contact centers, where operators spent nearly as much time (6.3 minutes) on administrative documentation as on actual customer calls (6.5 minutes). The solution implemented a four-stage low-latency pipeline that captures raw audio from telephony systems, performs high-accuracy speech-to-text transcription with channel separation, applies orchestrated LLM-based summarization and intent extraction with structured prompt engineering, and automatically syncs the results to CRM systems via APIs. This architecture reduced after-call work time by 50% to 3.1 minutes, improved data quality through standardized categorization, and reduced operator cognitive load and stress-related turnover.
Fujitsu North America developed a comprehensive production LLM system to address critical operational inefficiencies in contact center operations. The presentation by Deepu Singh at the AI Engineer 2026 online track details how the company engineered a solution to transform messy, real-world audio streams into structured business intelligence with minimal human intervention. The case study is particularly valuable as it addresses the often-overlooked reality that production AI systems must handle imperfect inputs, including overlapping speech, emotional content, and multi-channel audio, rather than the clean text prompts commonly featured in demonstrations.
The business problem was stark and well-quantified. Contact centers face severe operational challenges, with over 50% identifying hiring, training, and productivity as critical barriers. High stress is the primary reason operators leave the profession, creating a negative spiral: understaffing raises stress for the remaining operators, which drives more turnover and deeper understaffing. The specific inefficiency targeted was after-call work, where operators spent an average of 6.3 minutes on post-processing tasks like typing notes, summarizing calls, and selecting disposition codes for every 6.5-minute call. This nearly 1:1 ratio meant operators spent as much time on administrative data entry as on actually serving customers. Additionally, the quality of this documentation was highly inconsistent, since it relied on individual operator memory and writing skills.
The solution architecture consists of four major stages that form a low-latency pipeline transforming conversational audio into structured business intelligence. Each stage addresses specific engineering challenges encountered when deploying LLMs in production environments dealing with audio data.
The first stage handles audio intake with a focus on quality assurance, since poor audio quality directly degrades downstream LLM performance. The system performs real-time audio intake from telephony systems while applying noise filters to remove background office chatter and other noise sources. Audio level normalization ensures consistent input quality across different calls and recording conditions.
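As a rough illustration of the level-normalization step, here is a minimal peak-normalization sketch over 16-bit PCM samples; the talk does not describe Fujitsu's actual filter chain or gain logic, so the constants below are assumptions:

```python
def normalize_peak(samples, target_peak=0.9, full_scale=32767):
    """Scale 16-bit PCM samples so the loudest one sits at target_peak of full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = (target_peak * full_scale) / peak
    return [int(s * gain) for s in samples]

# A very quiet snippet is boosted to a consistent level before transcription.
quiet = [100, -200, 150, 50]
loud = normalize_peak(quiet)
```

In production this would run over streaming buffers rather than whole clips, with the noise filter applied first so the gain is not driven by background chatter.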
The most critical aspect of this stage is channel mapping, which splits stereo audio to isolate the agent on one channel and the customer on another. This separation is essential for downstream processing because if audio is mixed into a single mono track with overlapping speakers, the AI struggles to attribute statements correctly, which ruins the quality of the summary. This represents an important production consideration often overlooked in prototype systems.
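Conceptually, the channel-mapping step is a deinterleave of the stereo stream into two mono tracks. A toy sketch (a real deployment would read frames from the telephony stack or a WAV buffer rather than a Python list):

```python
def split_channels(interleaved):
    """Deinterleave stereo samples [L, R, L, R, ...] into two mono tracks."""
    agent = list(interleaved[0::2])     # left channel: the operator
    customer = list(interleaved[1::2])  # right channel: the caller
    return agent, customer

agent, customer = split_channels([10, -3, 12, -4, 9, -2])
```

With each speaker on a clean track, the transcription stage can label every utterance by speaker without diarization guesswork.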
Security measures are implemented at this early stage through buffer management and early-stage PII masking techniques. Since audio streams can contain credit card numbers, passwords, and other personally identifiable information, the system ensures sensitive data never reaches LLM memory banks. While this adds some architectural complexity and latency overhead, it represents a non-negotiable requirement for production deployment in regulated industries.
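A minimal sketch of what early-stage PII masking can look like, assuming a simple regex rule for card-like digit runs; the talk does not disclose Fujitsu's actual patterns, which would also need to cover passwords, SSNs, and spoken digit sequences:

```python
import re

# Card-like runs of 13-16 digits, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def mask_pii(text):
    """Replace card-like numbers before the text ever reaches an LLM or a log."""
    return CARD_RE.sub("[CARD]", text)

masked = mask_pii("My card is 4111 1111 1111 1111, thanks.")
```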
The second stage converts audio to text against a specific accuracy target. Through empirical testing, Fujitsu determined that speech-to-text accuracy must exceed 90% for the downstream LLM summarization to be effective. The system employs advanced acoustic modeling to map phonemes accurately across regional dialects, addressing the reality that contact centers handle diverse caller populations.
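Holding the pipeline to a 90%+ accuracy gate implies measuring word error rate against human reference transcripts. The metric itself is standard; only the 90% threshold is Fujitsu's empirical finding. A textbook WER implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / reference length (classic DP)."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Accuracy here is roughly 1 - WER, so the 90% gate corresponds to keeping WER below 0.10 on representative calls.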
Domain-specific language modeling is applied using specialized dictionaries. For example, in insurance contexts, the engine must distinguish between near-homophones like “term life” and “term,” which sound nearly identical but carry completely different meanings. This domain adaptation is crucial for maintaining accuracy in vertical-specific deployments.
Post-processing includes inverse text normalization and automatic punctuation. When customers say “$5,000,” the engine outputs numerical formatting rather than spelling out the words. This numerical formatting dramatically improves the LLM’s ability to extract entities in subsequent stages, demonstrating how seemingly minor formatting decisions in early pipeline stages can significantly impact overall system performance.
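A toy inverse-text-normalization sketch for spelled-out dollar amounts; real ITN engines use weighted grammars and also cover dates, times, and ordinals, so the tiny rule set below is a deliberate simplification:

```python
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "five": 5, "ten": 10}
MULTIPLIERS = {"hundred": 100, "thousand": 1000}

def itn_dollars(phrase):
    """Convert e.g. 'five thousand dollars' to '$5,000'."""
    amount, current = 0, 0
    for token in phrase.lower().split():
        if token in NUMBER_WORDS:
            current += NUMBER_WORDS[token]
        elif token in MULTIPLIERS:
            amount += max(current, 1) * MULTIPLIERS[token]
            current = 0
        elif token in ("dollar", "dollars"):
            amount += current
            current = 0
    return f"${amount:,}"
```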
The third stage represents the brain of the system where LLM-based intelligence extraction occurs. Rather than simply throwing raw transcripts at an LLM with basic summarization requests, Fujitsu engineered a highly orchestrated approach with multiple specialized layers.
The orchestration layer uses specific prompt templates based on empirical findings that basic summarization requests produce messy narrative paragraphs. Instead, the system employs few-shot learning libraries to instruct the LLM to output separate bullet points, with one list for customer inquiries and a separate list for operator actions. This structured prompting ensures outputs match expected formats and can be reliably parsed downstream.
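The template below is illustrative only; Fujitsu's actual few-shot library is not public. The point is that the worked example pins the LLM to two labelled bullet lists instead of a free-form paragraph:

```python
FEW_SHOT = """\
Transcript: Customer asked about a late fee. Agent waived it once.
Customer inquiries:
- Late fee on the current statement
Operator actions:
- Waived the fee as a one-time courtesy
"""

def build_summary_prompt(transcript):
    """Assemble a few-shot prompt that forces two labelled bullet lists."""
    return (
        "Summarize the call as two bullet lists, exactly as in the example.\n\n"
        + FEW_SHOT
        + "\nTranscript: " + transcript + "\n"
        + "Customer inquiries:"
    )

prompt = build_summary_prompt("Customer asked to update their mailing address.")
```

Ending the prompt at the first list header nudges the model to continue in the demonstrated format, which keeps downstream parsing trivial.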
The reasoning layer handles intent extraction by providing the LLM with predefined taxonomies of customer call reasons such as cancellation, new application, or claim status inquiry. The system instructs the LLM to not only classify the transcript but also output the reasoning behind its classification choice. This explainability component is critical for production systems where humans need to validate and trust AI decisions.
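A sketch of the taxonomy-constrained classification, with label names assumed from the examples in the talk; the parse step rejects any intent the LLM invents outside the list:

```python
import json

INTENT_TAXONOMY = ["cancellation", "new_application", "claim_status_inquiry", "other"]

def build_intent_prompt(transcript):
    """Ask for a classification plus the reasoning behind it, as JSON."""
    labels = ", ".join(INTENT_TAXONOMY)
    return (
        f"Classify the call into exactly one of: {labels}.\n"
        "Answer as JSON with keys 'intent' and 'reasoning'.\n\n"
        f"Transcript: {transcript}"
    )

def parse_intent(raw):
    """Validate the LLM's JSON answer against the predefined taxonomy."""
    data = json.loads(raw)
    if data.get("intent") not in INTENT_TAXONOMY:
        raise ValueError(f"intent outside taxonomy: {data.get('intent')!r}")
    return data
```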
The trust layer implements two key production safeguards. Token optimization keeps latency low, which is essential for a system that needs to process calls in near-real-time to maintain workflow efficiency. Automated hallucination checks ensure generated summaries remain strictly grounded in the actual transcript content rather than introducing fabricated information. This represents a critical LLMOps consideration since hallucinations could introduce false information into business records and CRM systems.
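One cheap grounding check, offered purely as an assumed illustration of the idea: every number or dollar amount the summary asserts must appear verbatim in the transcript, otherwise the summary is flagged for regeneration or review:

```python
import re

def ungrounded_facts(summary, transcript):
    """Return numeric claims in the summary that never occur in the transcript."""
    facts = re.findall(r"\$?\d[\d,.]*", summary)
    return [fact for fact in facts if fact not in transcript]

issues = ungrounded_facts(
    "Refunded $5,000 to account 99123",
    "Customer requested a refund of $5,000 to account 99123",
)
```

Production checks would go further, for example entailment scoring of each bullet against the transcript, but even literal matching catches fabricated amounts before they reach a business record.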
The workflow proceeds in four steps. Transcripts arrive from earlier stages with time indexing and confidence scoring. Speaker separation then leverages the channel mapping from stage one to stitch the dialogue together as a logical conversation flow. Next, context deduction has the LLM identify entities such as account numbers and product names while running sentiment analysis and intent recognition. Finally, the stage produces structured output as clean JSON schemas matching predefined CRM templates, with information categorized into bullet points rather than walls of unstructured text.
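The JSON contract can be enforced with a small schema check so malformed LLM output fails fast instead of corrupting records downstream. The field names here are guesses at the shape of the CRM template, not Fujitsu's real schema:

```python
import json
from dataclasses import dataclass

@dataclass
class CallSummary:
    intent: str
    sentiment: str
    customer_inquiries: list
    operator_actions: list

def parse_llm_output(raw):
    """Raise on any missing or unexpected key rather than pass bad data along."""
    return CallSummary(**json.loads(raw))

summary = parse_llm_output(
    '{"intent": "cancellation", "sentiment": "negative",'
    ' "customer_inquiries": ["cancel policy"],'
    ' "operator_actions": ["confirmed cancellation"]}'
)
```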
The final stage translates AI insights into actionable business system updates. An API gateway acts as a schema mapper, taking JSON output from the LLM and mapping fields like customer intent and resolution status directly to corresponding fields in the company’s CRM system via REST APIs.
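The schema-mapping step reduces to a field-by-field translation before the REST call; the field names on both sides below are assumptions for illustration:

```python
# LLM-output field -> CRM field (names are hypothetical).
FIELD_MAP = {
    "intent": "crm_call_reason",
    "resolution_status": "crm_disposition",
    "summary": "crm_notes",
}

def to_crm_payload(llm_output):
    """Build the body the gateway would POST to the CRM's REST endpoint."""
    return {
        crm_field: llm_output[src]
        for src, crm_field in FIELD_MAP.items()
        if src in llm_output
    }

payload = to_crm_payload({
    "intent": "cancellation",
    "resolution_status": "resolved",
    "summary": "Customer cancelled the policy; refund issued.",
})
```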
Importantly, the system retains human oversight rather than full automation. Operators see AI-generated summaries auto-populated on their screens, perform quick visual validation, make minor edits if necessary, and confirm. This human-in-the-loop design balances automation efficiency with the need for accountability and accuracy in customer records.
Simultaneously, structured data flows into business intelligence models that aggregate voice-of-customer data for management dashboards and automatically flag candidates for new FAQ entries. This demonstrates how LLM-based systems can serve multiple downstream purposes beyond the immediate use case, creating compound value.
The operational impact was immediate and measurable. After-call work time dropped from 6.3 minutes to 3.1 minutes, representing a 50% reduction in processing time. Across 500 seats handling thousands of daily calls, this translates to massive operational savings equivalent to reclaiming dozens of full-time equivalent headcounts purely from efficiency gains.
Data entry quality improved dramatically, moving from highly subjective and variable outputs dependent on individual operator mood and memory to standardized, uniform outputs generated through logic-based processing. Inquiry categorization and call reason tagging became consistent, creating a reliable voice-of-customer dataset for management analysis.
The cognitive load reduction from eliminating repetitive administrative typing tasks directly addressed the stress factors identified as primary drivers of operator turnover. While specific retention metrics were not provided, the connection between reduced stress and improved staff stability was explicitly noted as an outcome.
The presentation candidly acknowledged ongoing constraints and limitations, which provides valuable insight into production LLMOps realities. Speech-to-text accuracy remains a continuous challenge because the entire generative AI summary depends on transcript quality. When the STT engine fails with heavy accents or poor audio quality, the LLM has nothing reliable to work with. This represents an ongoing optimization battle where improvements in one component can yield multiplicative benefits downstream.
Initial setup costs present economic challenges despite strong long-term ROI. Running complex LLM reasoning operations on long transcripts, potentially 20 minutes or more, consumes significant API tokens. During initial scaling phases, these costs can be substantial. The team continuously works on token optimization techniques to reduce per-call processing costs without sacrificing output quality.
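One common token-optimization tactic for long calls, offered as an assumption since the talk does not detail Fujitsu's approach: split the transcript into overlapping chunks, summarize each, and merge the partial summaries, so no single request carries the whole 20-minute transcript:

```python
def chunk_transcript(turns, max_turns=50, overlap=5):
    """Split dialogue turns into overlapping chunks for map-reduce summarization."""
    chunks, start = [], 0
    while start < len(turns):
        chunks.append(turns[start:start + max_turns])
        if start + max_turns >= len(turns):
            break
        start += max_turns - overlap  # overlap preserves context across boundaries
    return chunks
```

The overlap trades a few duplicated tokens for continuity, so a topic that straddles a chunk boundary is still visible to at least one summarization call.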
Security and compliance requirements add architectural complexity. Robust PII masking before data hits cloud endpoints is non-negotiable, but the additional security layers increase latency and add overhead. The team continues working to streamline these protections without compromising security posture, illustrating the inherent tension between security requirements and performance optimization in production LLM systems.
The roadmap demonstrates how successful initial deployments can expand into adjacent use cases. Phase one focuses on explainable AI for operator coaching, analyzing audio post-call to provide instant private feedback on soft skills, empathy levels, and information accuracy. This moves beyond summarization into continuous improvement and training.
Phase two targets predictive staffing by feeding categorized intent data into time series analytics. This enables workforce management to accurately forecast call volume spikes based on specific topics and optimize shift scheduling, demonstrating how LLM-generated structured data can feed traditional analytical systems.
Phase three addresses operator well-being through real-time sentiment and acoustic analysis to detect when customers become abusive. The system can trigger alerts to supervisors or seamlessly transfer calls to AI voice agents to protect human operator mental health during difficult conversations. This represents an important human-centric application of AI that goes beyond pure efficiency metrics.
This case study exemplifies several important LLMOps principles for production deployments. The system demonstrates end-to-end pipeline thinking where each stage is engineered to support downstream components, with decisions about audio processing and formatting directly impacting LLM performance. The emphasis on structured outputs through careful prompt engineering ensures outputs are database-ready rather than requiring additional parsing and transformation.
The human-in-the-loop design acknowledges that full automation may not be appropriate for business-critical record keeping, while still capturing most efficiency gains. The attention to hallucination detection and output grounding reflects mature thinking about LLM limitations in production contexts. The candid discussion of ongoing constraints around accuracy, cost, and security illustrates real-world tradeoffs rather than presenting an idealized view.
The multi-purpose use of generated data for both immediate operator efficiency and broader business intelligence demonstrates how thoughtful schema design can create compound value. The roadmap showing evolution from summarization to coaching to predictive analytics illustrates how initial LLM deployments can expand organically into adjacent use cases once foundational infrastructure is established.
While the presentation naturally emphasizes successes and may overstate some benefits, the technical architecture appears sound and the metrics provided are specific and verifiable. The acknowledgment of constraints and ongoing optimization work lends credibility to the overall case study.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models, which hallucinated non-existent insurance products 15-45% of the time.
DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.
LinkedIn developed Hiring Assistant, an AI agent designed to transform the recruiting workflow by automating repetitive tasks like candidate sourcing, evaluation, and engagement across 1.2+ billion profiles. The system addresses the challenge of recruiters spending excessive time on pattern-recognition tasks rather than high-value decision-making and relationship building. Using a plan-and-execute agent architecture with specialized sub-agents for intake, sourcing, evaluation, outreach, screening, and learning, Hiring Assistant combines real-time conversational interfaces with large-scale asynchronous execution. The solution leverages LinkedIn's Economic Graph for talent insights, custom fine-tuned LLMs for candidate evaluation, and cognitive memory systems that learn from recruiter behavior over time. The result is a globally available agentic product that enables recruiters to work with greater speed, scale, and intelligence while maintaining human-in-the-loop control for critical decisions.