## Overview
Swisscom, Switzerland's leading telecommunications and IT provider, presented a comprehensive case study on deploying production LLMs in their customer service contact centers in collaboration with AWS. The presentation was delivered by Oliver (Senior Data Scientist at AWS), Parvas (Senior Data Scientist at Swisscom), and Marcel (Senior Software Engineer at Swisscom), reflecting a close partnership over several months. This case study is particularly valuable as it demonstrates a complete LLMOps journey from experimentation through production deployment in a highly latency-sensitive, high-volume environment.
The business context centers on Swisscom's vision, called "Stellar Bridge," which aims to provide proactive, intelligent customer service available anytime and anywhere. The company needed to transform its contact center operations to offer personalized assistance while maintaining the quality of human interactions. The challenge was significant: handling hundreds of requests per minute during peak times and outages, maintaining sub-second latency for voice interactions (where users are inherently impatient), and ensuring high accuracy across diverse customer queries. Unlike many GenAI deployments where latency can be more forgiving, voice-based customer service demands near-instantaneous responses, making this a particularly challenging production environment.
## Requirements and Model Selection Framework
Swisscom established several critical requirements that shaped their entire approach. Customizability was paramount—they needed models tailored to excel at specific dialogue tasks with high accuracy. Latency consistency with defined upper limits was non-negotiable for maintaining reliable performance in voice interactions. Scalability to handle hundreds of requests per minute at peak times was essential. Additionally, they wanted complete control over updates and ownership of the model lifecycle, allowing them to decide when to upgrade to newer model versions and avoid forced prompt adjustments due to provider-driven model deprecations. Access to open-source models provided flexibility to adapt to evolving needs and market changes.
The presenters outlined a thoughtful three-category framework for LLM selection that goes beyond simply choosing the latest frontier model. The first category focuses on model characteristics: parameter count (more parameters generally mean more capability but impact latency), fine-tunability, commercial licensing (proprietary vs. open-source), training data composition (particularly important for language-specific services), and expected input/output lengths. The second category examines the use case itself: task complexity (simple classification vs. multi-page document processing with complex outputs) and inference load patterns (occasional queries vs. continuous high-volume traffic). The third category considers team capabilities and skills: whether the team has expertise in fine-tuning, model maintenance, and production deployment, or whether they're better suited to API-based solutions. This holistic framework is a valuable contribution as it acknowledges that the "best" model isn't always the largest or most capable on public benchmarks, but rather the one that best fits the specific production requirements.
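To make the framework concrete, the three categories could be captured as a lightweight checklist that a team fills in per candidate model. The sketch below is purely illustrative; every field name is an assumption rather than something the presenters showed.

```python
from dataclasses import dataclass

@dataclass
class ModelCharacteristics:
    """Category 1: properties of the candidate model itself."""
    parameter_count_b: float        # billions of parameters: capability vs. latency trade-off
    fine_tunable: bool              # can the weights be adapted (e.g., via LoRA)?
    license: str                    # "open-source" or "proprietary"
    training_languages: list[str]   # matters for language-specific services
    max_context_tokens: int         # expected input/output lengths must fit

@dataclass
class UseCaseProfile:
    """Category 2: the task and its traffic pattern."""
    task_complexity: str            # e.g., "simple classification" vs. "multi-page extraction"
    peak_requests_per_minute: int   # occasional queries vs. continuous high volume
    latency_budget_ms: int          # hard upper limit for voice interactions

@dataclass
class TeamCapabilities:
    """Category 3: skills available to operate the chosen approach."""
    can_fine_tune: bool
    can_operate_endpoints: bool     # model maintenance and production deployment
    prefers_managed_api: bool       # teams without the above skills may be better served by an API
```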
## Data Integration and Synthetic Data Generation
The presenters discussed various methods for injecting business data and logic into LLMs, emphasizing that multiple approaches can be combined. They covered continued pre-training for domain-specific vocabulary (relevant for telecommunications terminology), retrieval augmented generation (RAG) for document-based knowledge, agentic approaches for connecting data sources and APIs, supervised fine-tuning with prompt-completion pairs, and prompt engineering with few-shot learning. Importantly, they noted that these methods aren't mutually exclusive—Swisscom employed prompt engineering, fine-tuning, and was evaluating RAG approaches concurrently.
The architecture for their voice channel reveals a sophisticated orchestration approach. Customer voice input is routed through a cloud-based contact center for transcription. The transcribed text is sent to a dialogue engine that acts as the orchestrator, holding business logic and process descriptions. The LLM is delegated the specific task of predicting the next dialogue action, while the dialogue engine remains responsible for performing API calls, injecting RAG responses, and accessing other data sources. This separation of concerns is architecturally sound—the LLM focuses on dialogue prediction while the orchestration layer handles integration complexity. The response is converted back to speech using text-to-speech (TTS) services.
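The presentation does not include code, but the separation of concerns could look roughly like the sketch below, in which the dialogue engine owns the prompt, the API calls, and the RAG injection, and delegates only next-action prediction to a SageMaker endpoint. The endpoint name, payload shape, and action schema are all assumptions.

```python
import json
import boto3

# All names here are hypothetical; the real endpoint name, prompt format, and
# action schema are not disclosed in the presentation.
ENDPOINT_NAME = "dialogue-action-predictor"
smr = boto3.client("sagemaker-runtime")

def predict_next_action(transcript: str, state: dict) -> dict:
    """Ask the fine-tuned LLM only for the next dialogue action, nothing else."""
    prompt = f"Dialogue state: {json.dumps(state)}\nCustomer said: {transcript}\nNext action:"
    response = smr.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 64}}),
    )
    # Assumes the serving container returns the predicted action as a JSON object.
    return json.loads(response["Body"].read())

def call_backend_api(action: dict) -> dict:
    return {}  # placeholder for the orchestrator's own API integrations

def run_rag_lookup(action: dict) -> str:
    return ""  # placeholder for the RAG retrieval the team was evaluating

def handle_turn(transcript: str, state: dict) -> str:
    """The dialogue engine stays responsible for executing the predicted action."""
    action = predict_next_action(transcript, state)
    if action.get("type") == "api_call":
        state["last_api_result"] = call_backend_api(action)
    elif action.get("type") == "retrieve":
        state["retrieved_context"] = run_rag_lookup(action)
    return action.get("utterance", "")  # text handed back to the TTS service
```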
For training data generation, Swisscom employed a teacher-student approach using synthetic data. They started with production-like data that had been anonymized and stored securely in S3. A larger LLM acted as a "teacher" to generate synthetic prompt-completion pairs that effectively infused business logic into the smaller student model. This approach addresses the common challenge of obtaining sufficient high-quality training data while maintaining customer privacy. The use of synthetic data generation is a pragmatic solution, though the presentation doesn't detail how they validated that the synthetic data adequately represented real-world dialogue complexity or how they addressed potential distribution drift between synthetic training data and actual production queries.
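A minimal sketch of one such teacher-student generation step is shown below, here assuming Amazon Bedrock's Converse API and a stand-in teacher model; the actual teacher model, prompting strategy, and bucket layout were not disclosed.

```python
import json
import boto3

# Hypothetical setup: the presentation names neither the teacher model nor the data layout.
s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")
TEACHER_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # stand-in teacher

def generate_pair(anonymized_dialogue: str) -> dict:
    """Ask the teacher model to produce a prompt/completion pair encoding business logic."""
    instruction = (
        "Given this anonymized contact-center dialogue, write a training example as JSON "
        'with keys "prompt" (dialogue context) and "completion" (the correct next action).\n\n'
        + anonymized_dialogue
    )
    response = bedrock.converse(
        modelId=TEACHER_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": instruction}]}],
    )
    # Assumes the teacher follows the instruction and returns pure JSON; a real pipeline
    # would validate and filter these outputs.
    return json.loads(response["output"]["message"]["content"][0]["text"])

def write_training_file(dialogues: list[str], bucket: str, key: str) -> None:
    """Store the generated prompt-completion pairs as JSONL for supervised fine-tuning."""
    lines = [json.dumps(generate_pair(d)) for d in dialogues]
    s3.put_object(Bucket=bucket, Key=key, Body="\n".join(lines).encode("utf-8"))
```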
## Model Experimentation and Fine-tuning Strategy
The experimentation phase leveraged Amazon SageMaker's capabilities extensively. Data was stored in S3, and importantly, AWS provided pre-built container images on ECR (Elastic Container Registry), eliminating the need to tediously build, test, and rebuild Docker images. This significantly accelerated prototyping. Using Jupyter notebooks, the team loaded data, evaluated open-source models for accuracy, and critically, also tested inference performance including latency and concurrent request handling capacity. This dual focus on both model accuracy and operational performance metrics early in the experimentation phase is a best practice often overlooked in ML projects.
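One simple way to capture this operational dimension from a notebook is to drive the candidate endpoint with concurrent requests and record latency percentiles. The sketch below uses the SageMaker Python SDK's Predictor; the endpoint name, payload shape, and concurrency level are hypothetical.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Hypothetical endpoint name; the payload shape depends on the serving container.
predictor = Predictor(
    endpoint_name="llama-31-8b-experiment",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

def timed_call(_) -> float:
    payload = {"inputs": "Customer asks about their invoice. Next action:",
               "parameters": {"max_new_tokens": 32}}
    start = time.perf_counter()
    predictor.predict(payload)
    return (time.perf_counter() - start) * 1000  # milliseconds

# Probe latency under concurrent load, not just single-request accuracy.
with ThreadPoolExecutor(max_workers=8) as pool:
    latencies = list(pool.map(timed_call, range(200)))

q = statistics.quantiles(latencies, n=100)
print(f"p50={q[49]:.0f} ms  p95={q[94]:.0f} ms")
```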
The fine-tuning approach employed several efficiency optimizations. Rather than full fine-tuning, they implemented Low-Rank Adaptation (LoRA), which adds approximately 1% of additional weights to the original model while keeping base model weights frozen. Research has demonstrated that LoRA approaches the accuracy of full fine-tuning while offering substantial efficiency gains in training time and compute requirements. They used supervised fine-tuning with prompt-completion pairs, teaching the model in a manner analogous to traditional supervised machine learning. The flexibility of instance selection in SageMaker allowed them to experiment with different instance sizes, finding the optimal balance between training speed and cost. During experimentation, they achieved latencies around 400 milliseconds, providing confidence they would meet their stringent latency requirements in production.
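The talk does not show training code, but a LoRA setup of this kind is commonly expressed with the Hugging Face PEFT library; the rank, target modules, and dropout below are illustrative assumptions, not Swisscom's hyperparameters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA configuration; values are assumptions, not Swisscom's settings.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                        # low-rank dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)  # base weights stay frozen
model.print_trainable_parameters()               # typically on the order of ~1% of total weights
```

In practice this wrapped model would then be trained on the synthetic prompt-completion pairs with a standard supervised trainer, matching the SFT approach described above.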
## Production Pipeline and Model Registry
Recognizing that Jupyter notebooks aren't production-ready, the team built repeatable model building pipelines using SageMaker Pipelines. This orchestration framework coordinated training jobs, evaluation jobs, and model registration to the SageMaker Model Registry. The use of pre-built images from ECR continued into production pipelines, maintaining consistency with experimentation.
Key benefits of this pipeline approach included consistency and repeatability through configuration files. When Swisscom releases new products or wants to add new dialogue flows, they can simply point to new datasets via config files rather than manually reconfiguring training. The full lineage tracking provided by SageMaker Pipelines ensures they know exactly which data version produced which model version, enabling proper model governance and debugging capabilities. The Model Registry allows version comparison and performance metric tracking over time, which is essential for understanding model drift and regression. This infrastructure directly addresses their requirement for lifecycle management control—they can decide when to retrain, which model version to promote, and when to deploy updates without being subject to external provider decisions.
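A condensed sketch of what such a pipeline could look like with the SageMaker Python SDK is shown below (the evaluation step is omitted for brevity). Dataset locations, framework versions, instance types, and the model package group name are placeholders, not values from the presentation.

```python
from sagemaker.huggingface import HuggingFace, HuggingFaceModel
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import TrainingStep

session = PipelineSession()
role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder

# New products or dialogue flows are added by pointing this parameter at a new dataset,
# mirroring the config-file-driven retraining described above.
train_data = ParameterString(name="TrainData",
                             default_value="s3://<bucket>/synthetic-pairs/train.jsonl")

estimator = HuggingFace(
    entry_point="train.py", source_dir="scripts",
    instance_type="ml.g5.2xlarge", instance_count=1,
    transformers_version="4.36", pytorch_version="2.1", py_version="py310",
    role=role, sagemaker_session=session,
)
train_step = TrainingStep(name="FineTune", step_args=estimator.fit({"train": train_data}))
# (An evaluation step between training and registration is omitted here for brevity.)

model = HuggingFaceModel(
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    transformers_version="4.37", pytorch_version="2.1", py_version="py310",
    role=role, sagemaker_session=session,
)
register_step = ModelStep(
    name="Register",
    step_args=model.register(
        model_package_group_name="dialogue-action-models",
        content_types=["application/json"], response_types=["application/json"],
        inference_instances=["ml.g6e.2xlarge"], transform_instances=["ml.g6e.2xlarge"],
        approval_status="PendingManualApproval",  # promotion stays an explicit human decision
    ),
)

pipeline = Pipeline(name="dialogue-llm-build", parameters=[train_data],
                    steps=[train_step, register_step], sagemaker_session=session)
pipeline.upsert(role_arn=role)  # creates or updates the pipeline definition
```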
## Deployment Infrastructure and Tooling
Marcel's presentation on deployment revealed a sophisticated infrastructure-as-code approach. They had two primary deployment goals: validating performance against the real application and establishing an easy, reproducible deployment process. SageMaker deployment involves two main configurations: model config (specifying inference image location, model artifacts, and model parameters) and endpoint config (defining instance types, instance counts, networking/VPC settings, and scaling options). Rather than managing these configurations manually through "click ops," they adopted the AWS Cloud Development Kit (CDK).
AWS CDK allows infrastructure to be defined as code using familiar programming languages—they chose Python to leverage existing team skills, though CDK supports multiple languages. Infrastructure is defined as objects, can be unit tested like application code, and CDK synthesizes these definitions into CloudFormation templates (YAML files). CloudFormation then handles the actual provisioning of these "stacks." This approach brings software engineering best practices to infrastructure management, enabling version control, code review, automated testing, and reproducible deployments.
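Expressed as a CDK sketch in Python using the SageMaker L1 constructs, the model-config / endpoint-config split might look as follows; the image URI, model artifact location, role ARN, and endpoint name are placeholders rather than Swisscom's actual values.

```python
from aws_cdk import App, Stack
from aws_cdk import aws_sagemaker as sagemaker
from constructs import Construct

class LlmEndpointStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Model config: inference image, model artifacts, execution role.
        model = sagemaker.CfnModel(
            self, "DialogueModel",
            execution_role_arn="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",
            primary_container=sagemaker.CfnModel.ContainerDefinitionProperty(
                image="<account-id>.dkr.ecr.<region>.amazonaws.com/<inference-image>:<tag>",
                model_data_url="s3://<bucket>/models/dialogue-llm/model.tar.gz",
            ),
        )

        # Endpoint config: instance type, instance count, traffic weighting.
        endpoint_config = sagemaker.CfnEndpointConfig(
            self, "DialogueEndpointConfig",
            production_variants=[
                sagemaker.CfnEndpointConfig.ProductionVariantProperty(
                    model_name=model.attr_model_name,
                    variant_name="AllTraffic",
                    instance_type="ml.g6e.2xlarge",   # G6e family, as in the talk
                    initial_instance_count=2,         # the always-on two-instance baseline
                    initial_variant_weight=1.0,
                ),
            ],
        )

        sagemaker.CfnEndpoint(
            self, "DialogueEndpoint",
            endpoint_config_name=endpoint_config.attr_endpoint_config_name,
            endpoint_name="dialogue-action-predictor",
        )

app = App()
LlmEndpointStack(app, "LlmEndpointStack")
app.synth()  # emits a CloudFormation template, which CloudFormation then provisions
```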
The production deployment currently runs on G6e-family instance types with two instances continuously available. The fine-tuned Llama 3.1 8B model utilizes approximately 90% of available GPU memory on these instances. The application itself runs on Amazon EKS (Elastic Kubernetes Service), and they use IAM (Identity and Access Management) with pod identity for security. They implement security best practices, have monitoring through CloudWatch (with the ability to export to internal monitoring systems), and can scale to handle traffic variability. Marcel acknowledged areas for future improvement, particularly making the currently fixed two-instance setup more dynamic with autoscaling, and potentially exploring LoRA adapters for even greater efficiency. His philosophy of "starting simple and adding complexity" is pragmatic for production systems.
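As a sketch of the autoscaling improvement Marcel alluded to, a SageMaker endpoint variant can be registered with Application Auto Scaling so it grows beyond the fixed baseline under load; the resource names, capacities, and thresholds below are illustrative, not the team's configuration.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/dialogue-action-predictor/variant/AllTraffic"  # hypothetical names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,      # never drop below the current always-on baseline
    MaxCapacity=6,      # headroom for outage-driven traffic spikes
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,   # invocations per instance per minute to hold (illustrative)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```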
## Production Performance and Validation
The production results are impressive and validate their approach. Initial results during gradual traffic ramping were promising, and they now serve 50% of all contact center voice channel traffic through the fine-tuned model. Most significantly, they achieved a median latency below 250 milliseconds in production—substantially better than the 400ms achieved during experimentation and well within the requirements for voice interactions. As Marcel noted, users calling a hotline won't tolerate 10-second waits, making this latency achievement critical for user experience.
The model achieves accuracy comparable to the larger teacher model despite being much smaller (8B parameters vs. presumably a much larger frontier model). The cost model is infrastructure-based with hourly charging rather than per-token or per-request pricing, which provides predictable costs and can be more economical for high-volume use cases. They've mitigated the risk of forced prompt adjustments due to provider-driven model deprecations, maintaining stability and control. The deployment offers flexibility to switch instance types, change model architectures, and scale for unexpected traffic peaks such as service outages when contact center volume spikes dramatically.
## Model Lifecycle and Continuous Improvement
Swisscom views this production deployment as the beginning rather than the end of their journey. Their approach to continuous improvement involves monitoring model behavior in production, identifying unforeseen behaviors or edge cases, and feeding these observations back into the training data generation loop. This creates a continuous learning system where production experience directly improves model training, helping maintain high accuracy across diverse and evolving use cases. This feedback loop is essential for production LLM systems but is often overlooked in favor of one-time training approaches.
The combination of pipeline automation, model registry, and infrastructure-as-code means they can efficiently iterate on models. When new training data is generated from production learnings, they can retrain models through automated pipelines, compare new versions against existing baselines in the model registry, and deploy updates through version-controlled CDK configurations. This represents a mature MLOps approach adapted for LLMs.
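Operationally, the comparison and promotion steps can be driven through the Model Registry API while keeping approval an explicit human action; the sketch below uses boto3 with a hypothetical model package group name.

```python
import boto3

sm = boto3.client("sagemaker")
GROUP = "dialogue-action-models"  # hypothetical model package group name

# List registered versions, newest first, to compare a candidate against the baseline.
versions = sm.list_model_packages(
    ModelPackageGroupName=GROUP,
    SortBy="CreationTime",
    SortOrder="Descending",
)["ModelPackageSummaryList"]

candidate = sm.describe_model_package(ModelPackageName=versions[0]["ModelPackageArn"])
# Metrics attached during the pipeline run appear under ModelMetrics; compare them
# against the currently approved production version before promoting.
print(candidate["ModelApprovalStatus"], candidate.get("ModelMetrics", {}))

# Promotion stays a deliberate step: flipping the approval status is done explicitly.
sm.update_model_package(
    ModelPackageArn=versions[0]["ModelPackageArn"],
    ModelApprovalStatus="Approved",
)
```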
## Monitoring and Operations
Both model building and deployment pipelines are version-controlled in Git, ensuring complete traceability of changes. They apply functional testing, load testing, and selective manual testing between development and production stages. CloudWatch provides out-of-the-box monitoring with metrics visualization and alerting, with the flexibility to export to internal monitoring systems for centralized observability. Security best practices are implemented throughout, including IAM with pod identity for service-to-service authentication from EKS to SageMaker endpoints.
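As an example of the out-of-the-box alerting available here, a CloudWatch alarm can be placed directly on the endpoint's ModelLatency metric; the alarm name, threshold, and SNS topic below are illustrative, not the team's actual setup.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="dialogue-endpoint-model-latency-high",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",              # reported in microseconds per invocation
    Dimensions=[
        {"Name": "EndpointName", "Value": "dialogue-action-predictor"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=250_000,                      # 250 ms expressed in microseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:<region>:<account-id>:<alerts-topic>"],  # placeholder topic
    TreatMissingData="notBreaching",
)
```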
The operational model reflects a balance between automation and control. While they have automated pipelines for training and deployment, they maintain human oversight at critical decision points such as model version promotion and production deployment. This is appropriate for a customer-facing system where errors could directly impact customer experience and brand reputation.
## Key Insights and Critical Assessment
The presenters offered valuable takeaways from their experience. First, they emphasized defining requirements and scope before selecting an LLM. The tendency to default to frontier models based on public benchmark performance doesn't necessarily lead to the best solution for specific use cases; their success with an 8B-parameter model, in a setting where many would assume a much larger model is necessary, demonstrates the point. Second, integrating business data is essential for moving beyond individual productivity gains to actual business workflow automation. Third, at some point teams must commit to a technology stack and LLM choice rather than constantly chasing the newest models, but should remain open to re-evaluating once stable in production.
From a critical perspective, this case study presents a compelling success story but leaves some questions unanswered. The reliance on synthetic data generated by a teacher model raises questions about data quality and distribution shift—how do they validate that synthetic training data adequately captures real-world complexity? How do they detect and address cases where synthetic data diverges from actual customer interactions? The presentation doesn't detail their evaluation methodology beyond accuracy metrics; understanding how they assess dialogue quality, appropriateness, and edge case handling would be valuable.
The cost comparison claims should be viewed with some nuance. While they assert that infrastructure-based hourly pricing is more cost-efficient than per-token pricing, this depends heavily on utilization patterns. For truly high-volume, sustained traffic, this is likely true, but for variable or lower-volume workloads, the economics might differ. The fixed two-instance deployment means they're paying for capacity even during low-traffic periods, though Marcel acknowledged this as an area for improvement with autoscaling.
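To make the utilization argument concrete, a back-of-the-envelope comparison looks like this; every number below is a placeholder chosen for illustration, not a real price or a figure from the talk.

```python
# Purely illustrative values to show the break-even logic; none are real AWS or
# provider prices, and Swisscom did not disclose theirs.
hourly_instance_cost = 2.50          # USD per instance-hour (placeholder)
instances = 2
requests_per_minute = 300            # sustained contact-center volume (placeholder)
tokens_per_request = 800             # prompt + completion (placeholder)
api_price_per_1k_tokens = 0.003      # USD per 1,000 tokens (placeholder)

infra_cost_per_hour = hourly_instance_cost * instances
api_cost_per_hour = (requests_per_minute * 60 * tokens_per_request / 1000
                     * api_price_per_1k_tokens)

print(f"self-hosted: ${infra_cost_per_hour:.2f}/h, per-token API: ${api_cost_per_hour:.2f}/h")
# With these placeholder values the always-on endpoint is cheaper at sustained high volume,
# but at a small fraction of this traffic the per-token API would win while the endpoint
# keeps accruing its fixed hourly cost, which is exactly the utilization caveat above.
```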
The accuracy claim of matching the teacher model is presented without detailed metrics or evaluation methodology. What specific metrics were used? How was the evaluation set constructed? Were there certain dialogue types or customer intents where the smaller model underperformed? More transparency here would strengthen the case study's credibility.
Nevertheless, this case study demonstrates sophisticated LLMOps practices: synthetic data generation with teacher-student approaches, efficient fine-tuning with LoRA, comprehensive experimentation including operational metrics, production-grade pipelines with lineage tracking, infrastructure-as-code for reproducible deployments, and continuous improvement through production feedback loops. The sub-250ms latency achievement in a high-volume production environment is genuinely impressive and demonstrates that smaller, fine-tuned models can meet stringent operational requirements when properly optimized. The holistic approach covering model selection, training, deployment, monitoring, and lifecycle management provides a valuable reference architecture for organizations deploying LLMs in latency-sensitive, high-volume production environments.