## Overall Summary
This case study presents two complementary national-scale AI deployments in the UK public sector, both serving a population of roughly 67 million citizens. Capita, a major government services provider, transformed contact center operations using Amazon Connect and Amazon Bedrock to automate customer service interactions, while the Government Digital Service (GDS) built GOV.UK Chat, a retrieval augmented generation (RAG) system described as the UK's first national-scale implementation of its kind. Both organizations faced the challenge of deploying AI in high-stakes environments where accuracy, safety, and trust are non-negotiable and where mistakes can have life-changing consequences for citizens.
The presentations were delivered at AWS re:Invent 2025 by Daniel Temple (Head of Architecture for UK Public Sector at AWS), Nikki Powell from Capita, and Gemma Hyde from GDS. Their combined experiences offer valuable insights into the operational realities of deploying LLMs in production at true national scale, with particular emphasis on the tradeoffs between speed, safety, and citizen trust.
## Capita's Contact Center Transformation
### Business Context and Problem Statement
Capita operates contact centers serving UK government services and was facing significant operational challenges. Before its AI transformation, 75% of customers found its IVR (Interactive Voice Response) systems frustrating, 67% abandoned calls before reaching a human agent, and costs ranged from £5 to £9 per contact, unsustainable figures for public sector budgets. The contact centers handled 100,000+ daily interactions with traditional human-only approaches that were both costly and inconsistent in quality.
The organization needed to dramatically reduce costs while improving service quality, but they recognized that technology alone wouldn't solve the problem. They adopted a "people-empowered AI philosophy" that emphasizes augmenting human teams rather than replacing them entirely. This is particularly important in public sector work where vulnerable users and complex cases require human judgment and empathy.
### Technical Implementation
Capita's AI stack is built entirely on AWS services with an "AWS unless" philosophy—only looking at alternatives if AWS cannot meet a specific client requirement. Their architecture includes several key layers:
**Core Infrastructure**: Amazon Bedrock serves as the foundation, integrated with Claude models for conversational AI capabilities. Amazon Connect provides the contact center orchestration layer, handling call routing, virtual agents, and agent assistance features.
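As a purely illustrative sketch (not Capita's code), a Claude model hosted on Amazon Bedrock can be called from Python with the boto3 Converse API; the region, model ID, and prompts below are placeholders.

```python
import boto3

# Illustrative only: invoke a Claude model on Amazon Bedrock via the Converse API.
# Region, model ID, and prompts are placeholders, not Capita's configuration.
bedrock = boto3.client("bedrock-runtime", region_name="eu-west-2")

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    system=[{"text": "You are a contact-centre assistant. Answer only from approved policy guidance."}],
    messages=[{"role": "user", "content": [{"text": "How do I renew my driving licence?"}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

answer = response["output"]["message"]["content"][0]["text"]
print(answer)
```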
**Conversational AI Pipeline**: They replaced static, menu-based IVR systems with conversational AI that can understand natural language across multiple languages to serve diverse UK populations. The system handles speech recognition and intent understanding, routing calls appropriately between virtual agents and human agents based on complexity and user needs.
**Agent Assistance Tools**: For interactions that do require human agents, Capita deployed Amazon Connect Contact Lens for real-time call analytics and Amazon Q in Connect for agent assistance. These tools provide human agents with policy and process information in real time, allowing agents to focus on communication skills rather than memorizing procedures.
**Quality Assurance**: They implemented automated quality evaluations that provide real-time feedback to agents during and after calls, dramatically improving the speed and consistency of quality management.
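A hedged sketch of what such an automated quality evaluation could look like: a Bedrock-hosted model scores a call transcript against a rubric and returns structured feedback for the agent. The rubric, model ID, and output format are assumptions rather than Capita's implementation.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-2")

# Illustrative rubric; the model is asked to reply with JSON only.
RUBRIC = (
    "Score the agent's handling of this call from 1-5 on greeting, accuracy, empathy, and resolution. "
    'Reply with JSON only: {"scores": {"greeting": n, "accuracy": n, "empathy": n, "resolution": n}, "feedback": "..."}'
)

def evaluate_call(transcript: str) -> dict:
    """Score a call transcript against the rubric (illustrative, not Capita's QA system)."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
        system=[{"text": RUBRIC}],
        messages=[{"role": "user", "content": [{"text": transcript}]}],
        inferenceConfig={"maxTokens": 400, "temperature": 0.0},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])
```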
**Data Integration and Analytics**: Capita uses Amazon MQ for message queuing and integrates data from multiple sources into Snowflake for cross-channel reporting and continuous improvement insights. This allows them to pull together data from client systems, their own operations, and AWS services to identify patterns and opportunities for optimization.
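For the data-integration step, a minimal sketch of publishing an interaction event to a RabbitMQ broker on Amazon MQ (via the `pika` client) for downstream loading into Snowflake; the broker URL, credentials, queue name, and event shape are all assumptions.

```python
import json
import pika

# Illustrative only: publish a contact-centre interaction event to a RabbitMQ broker
# hosted on Amazon MQ, for later ingestion into Snowflake reporting pipelines.
params = pika.URLParameters("amqps://user:password@b-example.mq.eu-west-2.amazonaws.com:5671")
connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.queue_declare(queue="interaction-events", durable=True)

event = {
    "contact_id": "abc-123",            # hypothetical event fields
    "channel": "voice",
    "handle_time_seconds": 312,
    "resolved_first_contact": True,
}
channel.basic_publish(exchange="", routing_key="interaction-events", body=json.dumps(event))
connection.close()
```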
### Deployment Approach and Phased Rollout
Capita followed a careful phased approach to deployment. They started with limited scope—initially restricting the service to 1,000 calls per day to validate the technology and processes. This foundation phase focused on proving the concept and building confidence. They then moved to a limited release phase, opening up a single line of service in the contact center to real customer interactions while monitoring closely. The scaling phase involved expanding to multi-service capabilities across different contact center lines, and finally they reached an optimization phase where they're using generative AI to generate insights across business lines and create new service offerings.
### Results and Impact
The results have been substantial, though the presentation acknowledged that these are claims from the service provider that should be evaluated in context. Capita reports:
- 35% productivity improvements already achieved, with targets to reach 50% by 2027
- 40% reduction in case management handling time
- 20% improvement in average handle time
- 20% increase in retention and upsell conversion rates
- 20% improvement in first contact resolution (meaning more issues resolved on the first contact)
- 15% improvement in customer satisfaction scores
- Customer satisfaction (CSAT) scores are at their highest levels in nine years
They're targeting 95% automation rate and 94% of customers directed to self-service channels by 2027 (with Nikki Powell joking that her boss wants 2026, highlighting the pressure to deliver results quickly).
### Cultural and Operational Challenges
Perhaps most insightful were the discussions around organizational culture and change management. Capita emphasized that their delivery team includes not just technical experts but also people who worked in contact centers themselves. This operational perspective proved crucial for understanding how changes would impact frontline workers and citizens. The cultural shift from "technical teams deliver, operations teams deal with it" to "operational teams lead from the front with technical alongside" was described as massive but essential.
They also embraced a "fail fast, fix faster" mentality, acknowledging that you cannot plan for absolutely everything when deploying AI at scale. The key is recognizing issues quickly and addressing them, rather than treating any failure as catastrophic.
## GDS GOV.UK Chat Implementation
### Business Context and Vision
The Government Digital Service maintains GOV.UK, the official UK government website that serves as the single source of truth for all government information. The site contains 850,000+ pages of content covering everything from driving licenses to tax guidance to business support, and it receives millions of visits weekly. While having consolidated government information in one place has been valuable (GOV.UK is 13 years old and considered a global benchmark for government digital services), the sheer volume presents its own challenges.
GDS's vision for GOV.UK Chat is rooted in their "Blueprint for Modern Digital Government" launched in January 2025. Their goals are straightforward: make lives easier for citizens by saving them time and reducing the effort needed to interact with government, and harness AI for public good. They emphasize that they are not chasing trends but solving real problems with real government content using people's everyday language. As Gemma Hyde put it, they talk about reducing the "time tax": the week and a half, on average, that citizens spend each year interacting with government.
### Technical Architecture
GOV.UK Chat represents the UK's first national-scale RAG (Retrieval Augmented Generation) implementation using Amazon Bedrock knowledge bases. The architecture involves several sophisticated components:
**Query Processing and Intent Classification**: When a user asks a question, the first step involves an LLM classifying the incoming query into predefined categories or intents. This classification determines the appropriate response strategy. They implement intent-aware routing with hard-coded responses for simple greetings, hard blocks for controversial or inappropriate attempts, redirects for requests that need to go to different channels (like Freedom of Information requests), and multi-turn flows for clarification and guidance. At this stage, any personal information is also removed.
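As an illustration of this routing step (not GDS's actual code), the sketch below assumes a Bedrock-hosted model labels each query with one of a small set of intents, after which the application applies hard-coded replies, hard blocks, or redirects before any retrieval happens; the intent names, prompt, and model ID are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-2")

# Hypothetical intent labels for illustration only.
INTENTS = ["greeting", "harmful_or_controversial", "foi_request", "needs_clarification", "govuk_question"]

def classify_intent(user_query: str) -> str:
    """Ask the model to choose exactly one intent label for the query (illustrative prompt)."""
    response = bedrock.converse(
        modelId="anthropic.claude-sonnet-4-20250514-v1:0",  # placeholder model ID
        system=[{"text": f"Classify the user's message into one of: {', '.join(INTENTS)}. Reply with the label only."}],
        messages=[{"role": "user", "content": [{"text": user_query}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()

def route(user_query: str) -> str:
    """Apply intent-aware routing before any retrieval or generation."""
    intent = classify_intent(user_query)
    if intent == "greeting":
        return "Hello! Ask me a question about GOV.UK guidance."          # hard-coded response
    if intent == "harmful_or_controversial":
        return "Sorry, I can't help with that."                           # hard block
    if intent == "foi_request":
        return "Freedom of Information requests should go through the relevant department's FOI route."  # redirect
    if intent == "needs_clarification":
        return "Could you tell me a little more about what you're trying to do?"  # multi-turn clarification
    return "RETRIEVE_AND_ANSWER"  # hand off to the RAG pipeline
```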
**Content Retrieval**: The vector store contains hundreds of thousands of GOV.UK pages, and the underlying content can change daily. Content is split into chunks according to its semantic hierarchy to improve relevance and granularity. Amazon OpenSearch serves as the search index for storing content and querying for chunks semantically similar to user questions. The index is populated with GOV.UK content delivered over Amazon MQ from the GOV.UK publishing API, ensuring that the knowledge base stays current as content changes.
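A hedged sketch of the retrieval step, assuming Titan embeddings and an OpenSearch k-NN index; the endpoint, index name, field names, and model ID are illustrative assumptions rather than GDS's configuration (authentication is omitted).

```python
import json
import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-2")
search = OpenSearch(hosts=[{"host": "example-opensearch-endpoint", "port": 443}], use_ssl=True)  # auth omitted

def embed(text: str) -> list[float]:
    """Create an embedding with Amazon Titan Text Embeddings V2 (model ID assumed)."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def retrieve_chunks(question: str, k: int = 5) -> list[dict]:
    """k-NN search over semantically chunked content (index and field names are hypothetical)."""
    results = search.search(
        index="govuk-content-chunks",
        body={"size": k, "query": {"knn": {"embedding": {"vector": embed(question), "k": k}}}},
    )
    return [hit["_source"] for hit in results["hits"]["hits"]]
```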
**Answer Generation**: They currently use two distinct models on Amazon Bedrock: Claude Sonnet 4 for answer generation and Amazon Titan Text Embeddings V2 for creating embeddings. The system generates answers based only on the retrieved authoritative GOV.UK content.
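Continuing the retrieval sketch above (and reusing its `bedrock` client), a minimal illustration of grounded generation: only the retrieved GOV.UK extracts are passed to the model, and the prompt instructs it to decline when the extracts are insufficient. The prompt wording and chunk fields are assumptions.

```python
def generate_answer(question: str, chunks: list[dict]) -> str:
    """Generate an answer from retrieved GOV.UK extracts only (illustrative prompt and fields)."""
    context = "\n\n".join(f"Source: {c['url']}\n{c['text']}" for c in chunks)  # assumed chunk fields
    system_prompt = (
        "Answer using ONLY the GOV.UK extracts provided. "
        "If the extracts do not contain the answer, say you cannot answer."
    )
    response = bedrock.converse(
        modelId="anthropic.claude-sonnet-4-20250514-v1:0",  # placeholder model ID
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": f"Extracts:\n{context}\n\nQuestion: {question}"}]}],
        inferenceConfig={"maxTokens": 700, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]
```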
**Quality and Safety Guardrails**: Before any answer reaches a user, it passes through multiple safety checks. The LLM evaluates responses against predefined quality and safety standards using Amazon Bedrock Guardrails. They perform detailed analysis of questions and answer data to ensure high-quality responses. Critically, they have a philosophy that "the best answer or no answer"—if they cannot provide an accurate answer based on authoritative content, they don't provide one at all, which is markedly different from consumer LLM applications that try to always provide some response.
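Amazon Bedrock also exposes an ApplyGuardrail API that can screen generated text against a configured guardrail before it reaches the user. The sketch below (again reusing the `bedrock` client from the earlier snippets) assumes a guardrail has already been created; its identifier and version are placeholders.

```python
def check_output(answer: str) -> bool:
    """Return True if the configured guardrail allows the answer through (IDs are placeholders)."""
    result = bedrock.apply_guardrail(
        guardrailIdentifier="example-guardrail-id",
        guardrailVersion="1",
        source="OUTPUT",
        content=[{"text": {"text": answer}}],
    )
    # "GUARDRAIL_INTERVENED" means the answer was blocked or modified.
    return result["action"] == "NONE"
```

Chained together, the illustrative flow is classify, retrieve, generate, then guardrail-check, with the system returning no answer whenever any stage declines.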
**User Interface**: The final answer is presented to users with careful design considerations around trust, including visual cues that make clear this is an AI-generated response, clear signposting of source content so users can verify information, and deliberate friction to ensure appropriate levels of trust rather than blind acceptance.
### Safety, Trust, and Guardrails
GDS spent enormous effort on safety and trust considerations, which they view as fundamental to their mission. They identified several key concerns that kept them up at night:
**Zero Tolerance for Harmful Content**: They have zero tolerance for offensive language, bias, hate speech, or any harmful content. This is non-negotiable in a government context.
**Adversarial and Off-Topic Queries**: They know people will use the system both intentionally and unintentionally in ways not intended, and they saw "quite a few" controversial attempts during their pilots. Their guardrails and intent classification system protect against this.
**Appropriate Trust Balance**: Interestingly, they faced a challenge where initial testing showed users had very high trust in results simply because they came from GOV.UK. While trust is essential, they needed to balance this with clarity about what the technology can and cannot do. They don't want blind trust but rather informed, appropriate trust. They've carefully designed the user experience to achieve this balance through visual cues, clear source attribution, and messaging about limitations.
**Life-Changing Stakes**: The information on GOV.UK can literally be life-changing for citizens, making accuracy paramount. Sometimes the right answer genuinely is no answer if they cannot confidently provide accurate information from authoritative sources.
**Transparency Requirements**: The UK has the Algorithmic Transparency Recording Service (ATRS) which requires government organizations to publish information about how they use algorithmic tools. This adds a layer of public accountability to their deployment.
They worked extensively with red teaming partners, particularly the UK's AI Security Institute (described as world-renowned), to uncover safety, usability, and performance issues throughout development.
### Evaluation and Testing Approach
GDS implemented a rigorous, evidence-backed approach to evaluation across three pillars:
**Automated Evaluation**: This serves as the backbone of iterative development, testing changes against metrics and sometimes using LLMs as judges to identify the best system configuration and impact on KPIs. This allows them to rapidly iterate on technical improvements.
**Manual Evaluation**: This provides deeper insights through diverse expert review. They conduct red teaming with security experts, work with subject matter experts from various government departments to validate content accuracy, and perform detailed error analysis to understand root causes—whether issues stem from the question asked, intent recognition, or content accuracy.
**Continuous Monitoring**: In live use, they monitor diagnostics and insights into performance and user behavior in real-time. They're working on refining monitoring systems to automatically flag answers that might need deeper human review and implementing systematic categorization of errors to identify patterns and track recurring issues.
Throughout their experiments, they've demonstrated progress and "hardening" on accuracy and hallucination metrics, which was crucial for building confidence to scale.
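To make the automated-evaluation pillar above concrete, here is a hedged sketch of an LLM-as-judge harness: candidate answers from the system under test are scored against expert reference answers and averaged, so different system configurations can be compared. The dataset shape, prompt, and model ID are assumptions, not GDS's actual metrics.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-2")

JUDGE_PROMPT = (
    "You are grading a chatbot answer against an expert reference answer. "
    'Reply with JSON only: {"accuracy": 1-5, "hallucination": true or false}.'
)

def judge(question: str, candidate: str, reference: str) -> dict:
    """Score one candidate answer with an LLM judge (illustrative, not GDS's metrics)."""
    response = bedrock.converse(
        modelId="anthropic.claude-sonnet-4-20250514-v1:0",  # placeholder model ID
        system=[{"text": JUDGE_PROMPT}],
        messages=[{"role": "user", "content": [{"text": f"Question: {question}\nReference: {reference}\nCandidate: {candidate}"}]}],
        inferenceConfig={"maxTokens": 100, "temperature": 0.0},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])

def evaluate(dataset: list[dict]) -> float:
    """Average judge accuracy over {question, candidate, reference} records for one configuration."""
    scores = [judge(d["question"], d["candidate"], d["reference"])["accuracy"] for d in dataset]
    return sum(scores) / len(scores)
```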
### Deployment Phases and User Testing
GDS followed a careful progression through discovery, public beta, and pilot phases, with each step guided by data. They developed a custom interface and invited over 10,000 internal users to test GOV.UK Chat. They conducted many rounds of iterative research including usability testing, internal accuracy validation with subject matter experts from various departments, diary studies, benchmarking, and analytics incorporating thousands of data points. They replatformed onto Amazon Bedrock for robust model hosting and orchestration and to allow nimble switching between models. Most recently, they piloted GOV.UK Chat within the GOV.UK mobile app, which launched in 2025 and which Gemma Hyde is responsible for; she describes calling it "AI-enabled" as a serious commitment rather than a tagline.
### User Feedback and Impact
User testing revealed positive results. Users reported that GOV.UK Chat provided a quick and easy way to find information, made understanding requirements simpler, and reduced feelings of overwhelm. Specific user quotes included: "It's been a lot more learning than looking. I've found what would have taken me maybe up to an hour before in 15 minutes" and "Avoid the phone queues. Saves time searching the website."
These results align with their goal of reducing the time tax—the excessive time citizens spend interacting with government.
### Future Roadmap
GDS is excited about several areas for future development:
- Rolling out GOV.UK Chat to more citizens through the GOV.UK mobile app
- Strengthening multi-turn conversation capabilities, as real-world interactions involve back-and-forth rather than single question-answer pairs
- Exploring agentic AI with different agents supporting citizens in various ways, which they describe as "very, very cool" in early concept stages
- Systematically categorizing errors to operate effectively at scale
- Refining monitoring to automatically flag answers needing human review
## Common Patterns and Lessons Learned
### Shared Architectural Layers
Both implementations share critical architectural patterns:
**Core Foundation**: Both use Amazon Bedrock for foundational models and RAG capabilities, with Amazon Bedrock Guardrails for security and policy-compliant responses.
**Integration Layer**: Both implement enterprise-driven architectures with real-time protection and use services such as Amazon API Gateway to stitch components together.
**Security and Monitoring**: Both have comprehensive monitoring providing full visibility into services, ensuring every interaction is verified, monitored, and logged.
### Four-Phase Deployment Pattern
Both organizations followed similar deployment progressions:
**Foundation Phase**: Start small with limited scope—Capita with 1,000 calls per day, GDS with internal testing.
**Limited Release Phase**: Expand to controlled production use—Capita with a single contact center service line, GDS with 10,000 internal users.
**Scaling Phase**: Broaden significantly—Capita with multi-service capabilities, GDS with public pilot in the mobile app.
**Optimization Phase**: Use AI to generate new insights and services across the organization.
### Key Lessons Across Dimensions
**Technology Lessons**:
- Start early with guardrails rather than trying to retrofit them later
- Build for 100x scale from the start even if you're not there yet, as it speeds deployment and prevents painful rebuilds
- Monitor everything and automate all responses from day one
**Process Lessons**:
- Phase rollouts by complexity rather than volume—supporting more users with simpler use cases can be better than fewer users with complex cases
- Always test with real users to discover outliers and unexpected behaviors
- Bias for action and momentum over perfection—as one international government colleague told Gemma, "That's not a reason not to progress"
**People and Culture Lessons**:
- Maintain human in the loop for oversight, especially for vulnerable users and complex decisions
- Build transparency about how technology is used to gain trust and drive adoption
- Include operational teams from the start, not just technical experts—having contact center agents on delivery teams proved crucial for Capita
- Embrace "fail fast, fix faster" mentality—failure to fix is the real failure, not the initial problem
- Keep focus on the value to citizens and the country; don't lose sight of why you're doing this
### Balancing Competing Concerns
Both organizations highlighted important tradeoffs:
**Speed vs. Safety**: GDS specifically discussed how streaming answers would improve perceived latency, but their guardrail requirements make streaming challenging. They chose safety over speed.
**Trust vs. Appropriate Skepticism**: GDS wrestled with the challenge that users highly trusted GOV.UK-branded content, but they needed users to understand technology limitations. They designed deliberate friction into the experience.
**Automation vs. Human Touch**: Capita emphasized that while targeting 95% automation, humans remain essential for vulnerable users and complex cases. This isn't about replacing people but empowering them to focus where they add most value.
**Comprehensive Coverage vs. Quality**: GDS's "best answer or no answer" principle means they sometimes don't answer questions users expect answers to, prioritizing accuracy over coverage.
## Critical Assessment and Balanced Perspective
While both presentations demonstrated impressive technical implementations and reported strong results, it's important to note some caveats:
**Vendor Context**: This presentation occurred at an AWS event with AWS as a partner, so there's natural incentive to present positive results and emphasize AWS services. The reported metrics should be viewed as claims that would benefit from independent validation.
**Early Stage Results**: Both implementations are relatively recent. Capita's results are "already achieved" but they're still scaling, and GDS is in pilot phase. Long-term sustainability and performance remain to be proven.
**Selection Bias**: User feedback from pilots may not represent the full population's experience once these systems reach true national scale with more diverse users and use cases.
**Complexity of Measurement**: Metrics like "customer satisfaction" and "productivity improvements" can be measured in various ways, and the presentations didn't detail methodologies. A 35% productivity improvement sounds dramatic but depends heavily on how it's calculated.
**Challenge of No Answer**: GDS's approach of providing no answer when they lack confidence is admirable from a safety perspective, but the user experience impact at scale remains to be seen. Users may become frustrated if they frequently receive no response.
That said, the presentations demonstrated genuine thoughtfulness about the challenges of production LLM deployment, particularly around safety, trust, and human oversight. The emphasis on phased rollouts, extensive testing, red teaming, and continuous monitoring reflects mature LLMOps practices. The cultural and organizational insights about including operational teams and embracing appropriate failure were particularly valuable and often overlooked in technical presentations.
The UK government's willingness to share detailed technical approaches, including challenges and considerations, provides valuable learning for others deploying AI at scale in high-stakes environments. Their emphasis on transparency through mechanisms like ATRS and their openness about tradeoffs demonstrates a responsible approach to government AI deployment.