Two UK government organizations, Capita and the Government Digital Service (GDS), deployed large-scale AI solutions to serve millions of citizens. Capita implemented Amazon Connect and Amazon Bedrock with Claude to automate contact center operations handling 100,000+ daily interactions, achieving 35% productivity improvements and targeting 95% automation by 2027. GDS launched GOV.UK Chat, the UK's first national-scale RAG implementation using Amazon Bedrock, providing instant access to 850,000+ pages of government content for 67 million citizens. Both organizations prioritized safety, trust, and human oversight while scaling AI solutions to handle millions of interactions with zero tolerance for errors in this high-stakes public sector environment.
This case study presents two complementary national-scale AI deployments in the UK public sector, both operating at massive scale to serve 67 million citizens. Capita, a major government services provider, transformed contact center operations using Amazon Connect and Amazon Bedrock to automate customer service interactions, while the Government Digital Service (GDS) built GOV.UK Chat, a RAG-based information retrieval system that represents the UK's first national-scale knowledge base implementation. Both organizations faced the unique challenge of deploying AI in high-stakes environments where accuracy, safety, and trust are non-negotiable, and where mistakes could have life-changing consequences for citizens.
The presentations were delivered at AWS re:Invent 2025 by Daniel Temple (Head of Architecture for UK Public Sector at AWS), Nikki Powell from Capita, and Gemma Hyde from GDS. Their combined experiences offer valuable insights into the operational realities of deploying LLMs in production at true national scale, with particular emphasis on the tradeoffs between speed, safety, and citizen trust.
Capita operates contact centers serving UK government services and was facing significant operational challenges. Prior to their AI transformation, 75% of customers found their IVR (Interactive Voice Response) systems frustrating, 67% of customers abandoned calls before reaching a human agent, and costs ranged between £5 and £9 per contact—unsustainable figures for public sector budgets. The contact centers were handling 100,000+ daily interactions with traditional human-only approaches that were both costly and inconsistent in quality.
The organization needed to dramatically reduce costs while improving service quality, but they recognized that technology alone wouldn’t solve the problem. They adopted a “people-empowered AI philosophy” that emphasizes augmenting human teams rather than replacing them entirely. This is particularly important in public sector work where vulnerable users and complex cases require human judgment and empathy.
Capita’s AI stack is built entirely on AWS services with an “AWS unless” philosophy—only looking at alternatives if AWS cannot meet a specific client requirement. Their architecture includes several key layers:
Core Infrastructure: Amazon Bedrock serves as the foundation, integrated with Claude models for conversational AI capabilities. Amazon Connect provides the contact center orchestration layer, handling call routing, virtual agents, and agent assistance features.
Conversational AI Pipeline: They replaced static, menu-based IVR systems with conversational AI that can understand natural language across multiple languages to serve diverse UK populations. The system handles speech recognition and intent understanding, routing calls appropriately between virtual agents and human agents based on complexity and user needs.
Agent Assistance Tools: For interactions that do require human agents, Capita deployed Amazon Connect Contact Lens for real-time call analytics and Amazon Q in Connect for agent assistance. These tools provide human agents with policy and process information in real time, allowing agents to focus on communication skills rather than memorizing procedures.
Quality Assurance: They implemented automated quality evaluations that provide real-time feedback to agents during and after calls, dramatically improving the speed and consistency of quality management.
Data Integration and Analytics: Capita uses Amazon MQ for message queuing and integrates data from multiple sources into Snowflake for cross-channel reporting and continuous improvement insights. This allows them to pull together data from client systems, their own operations, and AWS services to identify patterns and opportunities for optimization.
Capita followed a careful phased approach to deployment. They started with limited scope—initially restricting the service to 1,000 calls per day to validate the technology and processes. This foundation phase focused on proving the concept and building confidence. They then moved to a limited release phase, opening up a single line of service in the contact center to real customer interactions while monitoring closely. The scaling phase involved expanding to multi-service capabilities across different contact center lines, and finally they reached an optimization phase where they’re using generative AI to generate insights across business lines and create new service offerings.
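The phased progression above can be thought of as a gate in front of the virtual agent: each phase widens the daily call cap and the set of service lines eligible for AI handling. The sketch below is a hypothetical illustration of that idea; the phase names, caps, and service-line labels are assumptions, not Capita's actual configuration.

```python
# Hypothetical phased-rollout gate: each phase widens the daily call cap
# and the service lines routed to the virtual agent. All names and limits
# here are illustrative, not Capita's real configuration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    name: str
    daily_call_cap: int            # max calls/day the AI may handle in this phase
    service_lines: frozenset       # contact-center lines eligible for AI handling

PHASES = {
    "foundation":      Phase("foundation", 1_000, frozenset({"general-enquiries"})),
    "limited_release": Phase("limited_release", 10_000, frozenset({"general-enquiries"})),
    "scaling":         Phase("scaling", 100_000,
                             frozenset({"general-enquiries", "billing", "appointments"})),
}

def route_to_ai(phase: Phase, service_line: str, calls_handled_today: int) -> bool:
    """True if this call may go to the virtual agent; otherwise it falls
    back to a human agent."""
    return (service_line in phase.service_lines
            and calls_handled_today < phase.daily_call_cap)
```

During the foundation phase, a call on an out-of-scope line or beyond the 1,000-call cap simply routes to a human, which is what makes this kind of rollout low-risk to expand.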
The results have been substantial, though these figures are the service provider's own claims and should be evaluated in context. Capita reports 35% productivity improvements from the work so far.
They’re targeting 95% automation rate and 94% of customers directed to self-service channels by 2027 (with Nikki Powell joking that her boss wants 2026, highlighting the pressure to deliver results quickly).
Perhaps most insightful were the discussions around organizational culture and change management. Capita emphasized that their delivery team includes not just technical experts but also people who worked in contact centers themselves. This operational perspective proved crucial for understanding how changes would impact frontline workers and citizens. The cultural shift from “technical teams deliver, operations teams deal with it” to “operational teams lead from the front with technical alongside” was described as massive but essential.
They also embraced a “fail fast, fix faster” mentality, acknowledging that you cannot plan for absolutely everything when deploying AI at scale. The key is recognizing issues quickly and addressing them, rather than treating any failure as catastrophic.
The Government Digital Service maintains GOV.UK, the official UK government website that serves as the single source of truth for all government information. The site contains 850,000+ pages of content covering everything from driving licenses to tax guidance to business support, and it receives millions of visits weekly. While having consolidated government information in one place has been valuable (GOV.UK is 13 years old and considered a global benchmark for government digital services), the sheer volume presents its own challenges.
GDS’s vision for GOV.UK Chat is rooted in their “Blueprint for Modern Digital Government” launched in January 2025. Their goals are straightforward: make lives easier for citizens by saving them time and reducing effort to interact with government, and harness AI for public good. They emphasize that they are not chasing trends but solving real problems with real government content using people’s everyday language. As Gemma Hyde stated, they talk about reducing the “time tax”—the approximately week and a half that citizens spend per year interacting with government on average.
GOV.UK Chat represents the UK’s first national-scale RAG (Retrieval Augmented Generation) implementation using Amazon Bedrock knowledge bases. The architecture involves several sophisticated components:
Query Processing and Intent Classification: When a user asks a question, the first step involves an LLM classifying the incoming query into predefined categories or intents. This classification determines the appropriate response strategy. They implement intent-aware routing with hard-coded responses for simple greetings, hard blocks for controversial or inappropriate attempts, redirects for requests that need to go to different channels (like Freedom of Information requests), and multi-turn flows for clarification and guidance. At this stage, any personal information is also removed.
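This routing step can be sketched as a small dispatcher. In the sketch below, a keyword stub stands in for the LLM classifier GDS actually uses, and the intent names, banned terms, and PII pattern are all illustrative assumptions; only the overall shape (scrub, classify, then dispatch to canned reply, hard block, redirect, or the RAG pipeline) comes from the talk.

```python
# Sketch of intent-aware routing for a GOV.UK Chat-style system.
# The keyword classifier is a toy stand-in for an LLM classifier;
# intent labels and patterns are illustrative assumptions.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub_pii(text: str) -> str:
    """Remove personal information before further processing (emails only here)."""
    return EMAIL_RE.sub("[redacted]", text)

def classify_intent(query: str) -> str:
    q = query.lower()
    if q.strip() in {"hi", "hello", "thanks"}:
        return "greeting"
    if "freedom of information" in q or q.startswith("foi"):
        return "redirect_foi"
    if any(w in q for w in ("idiot", "hack")):   # toy stand-in for a safety classifier
        return "blocked"
    return "answerable"

def route(query: str) -> dict:
    query = scrub_pii(query)
    intent = classify_intent(query)
    if intent == "greeting":
        return {"intent": intent, "action": "canned_reply"}
    if intent == "blocked":
        return {"intent": intent, "action": "refuse"}
    if intent == "redirect_foi":
        return {"intent": intent, "action": "redirect", "channel": "FOI request form"}
    return {"intent": intent, "action": "rag_pipeline", "query": query}
```

Only queries classified as answerable ever reach the retrieval and generation stages, which is what lets hard blocks and redirects stay cheap and deterministic.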
Content Retrieval: The vector store contains hundreds of thousands of GOV.UK pages, which can change daily. Content is split into chunks according to its semantic hierarchy to improve relevance and granularity. Amazon OpenSearch Service provides the search index for storing and querying content semantically similar to user questions. The index is populated with GOV.UK content delivered via Amazon MQ (message queue) from the GOV.UK publishing API, ensuring that the knowledge base stays current as content changes.
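Splitting "according to semantic hierarchy" means chunk boundaries follow the page's heading structure rather than a fixed character count, so each chunk carries its place in the page. The sketch below assumes simple markdown-style headings for illustration; a real pipeline would parse GOV.UK's structured content formats.

```python
# Illustrative chunker that splits a page along its heading hierarchy,
# attaching the heading path to each chunk. Assumes markdown-style "#"
# headings; GOV.UK's actual content formats are richer.
def chunk_by_headings(page_title: str, body: str) -> list[dict]:
    chunks, heading, buf = [], page_title, []

    def flush():
        if buf:
            chunks.append({"heading": heading, "text": "\n".join(buf).strip()})

    for line in body.splitlines():
        if line.startswith("#"):
            flush()                      # close out the previous section
            buf = []
            heading = f"{page_title} > {line.lstrip('#').strip()}"
        else:
            buf.append(line)
    flush()
    return [c for c in chunks if c["text"]]
```

Each chunk's heading path ("Driving licence > Eligibility") would then be embedded alongside its text, improving retrieval granularity for questions that target one section of a long page.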
Answer Generation: They currently use two distinct models on Amazon Bedrock: Claude Sonnet 4 for answer generation and Amazon Titan Embeddings v2 for creating embeddings. The system generates answers based only on the retrieved authoritative GOV.UK content.
Quality and Safety Guardrails: Before any answer reaches a user, it passes through multiple safety checks. The LLM evaluates responses against predefined quality and safety standards using Amazon Bedrock Guardrails. They perform detailed analysis of questions and answer data to ensure high-quality responses. Critically, they have a philosophy that “the best answer or no answer”—if they cannot provide an accurate answer based on authoritative content, they don’t provide one at all, which is markedly different from consumer LLM applications that try to always provide some response.
User Interface: The final answer is presented to users with careful design considerations around trust, including visual cues that make clear this is an AI-generated response, clear signposting of source content so users can verify information, and deliberate friction to ensure appropriate levels of trust rather than blind acceptance.
GDS spent enormous effort on safety and trust considerations, which they view as fundamental to their mission. They identified several key concerns that kept them up at night:
Zero Tolerance for Harmful Content: They have zero tolerance for offensive language, bias, hate speech, or any harmful content. This is non-negotiable in a government context.
Adversarial and Off-Topic Queries: They know people will use the system both intentionally and unintentionally in ways not intended, and they saw “quite a few” controversial attempts during their pilots. Their guardrails and intent classification system protect against this.
Appropriate Trust Balance: Interestingly, they faced a challenge where initial testing showed users had very high trust in results simply because they came from GOV.UK. While trust is essential, they needed to balance this with clarity about what the technology can and cannot do. They don’t want blind trust but rather informed, appropriate trust. They’ve carefully designed the user experience to achieve this balance through visual cues, clear source attribution, and messaging about limitations.
Life-Changing Stakes: The information on GOV.UK can literally be life-changing for citizens, making accuracy paramount. Sometimes the right answer genuinely is no answer if they cannot confidently provide accurate information from authoritative sources.
Transparency Requirements: The UK has the Algorithmic Transparency Recording Service (ATRS) which requires government organizations to publish information about how they use algorithmic tools. This adds a layer of public accountability to their deployment.
They worked extensively with red teaming partners, particularly the UK’s AI Security Institute (described as world-renowned), to uncover safety, usability, and performance issues throughout development.
GDS implemented a rigorous, evidence-backed approach to evaluation across three pillars:
Automated Evaluation: This serves as the backbone of iterative development, testing changes against metrics and sometimes using LLMs as judges to identify the best system configuration and impact on KPIs. This allows them to rapidly iterate on technical improvements.
Manual Evaluation: This provides deeper insights through diverse expert review. They conduct red teaming with security experts, work with subject matter experts from various government departments to validate content accuracy, and perform detailed error analysis to understand root causes—whether issues stem from the question asked, intent recognition, or content accuracy.
Continuous Monitoring: In live use, they monitor diagnostics and insights into performance and user behavior in real-time. They’re working on refining monitoring systems to automatically flag answers that might need deeper human review and implementing systematic categorization of errors to identify patterns and track recurring issues.
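The monitoring loop GDS describes—categorising interactions and surfacing the ones needing human review—can be sketched as below. The category names, record fields, and alert threshold are assumptions for illustration.

```python
# Sketch of continuous monitoring: categorise logged interactions and
# flag those that may need deeper human review. Categories, record
# fields, and thresholds are illustrative assumptions.
from collections import Counter

def categorise(record: dict) -> str:
    if record.get("answer") is None:
        return "no_answer"
    if record.get("guardrail_triggered"):
        return "guardrail_block"
    if record.get("user_feedback") == "thumbs_down":
        return "negative_feedback"
    return "ok"

def flag_for_review(records: list[dict], no_answer_rate_threshold: float = 0.2) -> dict:
    counts = Counter(categorise(r) for r in records)
    flagged = [r for r in records if categorise(r) != "ok"]
    # Alert when the no-answer rate climbs past the threshold, since that
    # pattern suggests a content gap or retrieval regression.
    alert = counts["no_answer"] / max(len(records), 1) > no_answer_rate_threshold
    return {"counts": dict(counts), "flagged": flagged, "no_answer_alert": alert}
```

Systematically categorising errors this way is what turns one-off failures into trackable patterns—the "recurring issues" the team says it wants to identify.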
Throughout their experiments, they’ve demonstrated progress and “hardening” on accuracy and hallucination metrics, which was crucial for building confidence to scale.
GDS followed a careful progression through discovery, public beta, and pilot phases, with each step guided by data. They developed a custom interface and invited over 10,000 internal users to test GOV.UK Chat. They conducted many rounds of iterative research, including usability testing, internal accuracy validation with subject matter experts from various departments, diary studies, benchmarking, and analytics incorporating thousands of data points. They replatformed to Amazon Bedrock for robust model hosting and orchestration and for the ability to switch nimbly between models. Most recently, they piloted within the GOV.UK mobile app, which launched in 2025 and which Gemma Hyde is responsible for; she described calling it "AI-enabled" as a serious commitment rather than a tagline.
User testing revealed positive results. Users reported that GOV.UK Chat provided a quick and easy way to find information, made understanding requirements simpler, and reduced feelings of overwhelm. Specific user quotes included: “It’s been a lot more learning than looking. I’ve found what would have taken me maybe up to an hour before in 15 minutes” and “Avoid the phone queues. Saves time searching the website.”
These results align with their goal of reducing the time tax—the excessive time citizens spend interacting with government.
GDS is excited about several areas for future development.
Both implementations share critical architectural patterns:
Core Foundation: Both use Amazon Bedrock for foundational models and RAG capabilities, with Amazon Bedrock Guardrails for security and policy-compliant responses.
Integration Layer: Both implement enterprise-driven architectures with real-time protection and use services like API Gateway to seamlessly stitch services together.
Security and Monitoring: Both have comprehensive monitoring providing full visibility into services, ensuring every interaction is verified, monitored, and logged.
Both organizations followed similar deployment progressions:
Foundation Phase: Start small with limited scope—Capita with 1,000 calls per day, GDS with internal testing.
Limited Release Phase: Expand to controlled production use—Capita with a single contact center service line, GDS with 10,000 internal users.
Scaling Phase: Broaden significantly—Capita with multi-service capabilities, GDS with public pilot in the mobile app.
Optimization Phase: Use AI to generate new insights and services across the organization.
The lessons learned spanned three areas: technology, process, and people and culture.
Both organizations highlighted important tradeoffs:
Speed vs. Safety: GDS specifically discussed how streaming answers would improve perceived latency, but their guardrail requirements make streaming challenging. They chose safety over speed.
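The conflict is structural: a post-hoc guardrail needs the complete answer before anything is shown, so tokens must be buffered rather than streamed. The toy sketch below illustrates the all-or-nothing release this forces; `check` stands in for the guardrail evaluation.

```python
# Toy illustration of why post-hoc guardrails preclude token streaming:
# the full answer must be buffered and checked before anything is shown.
# `check` is a stand-in for the real guardrail evaluation.
from typing import Callable, Iterable, Optional

def buffered_release(tokens: Iterable[str], check: Callable[[str], bool]) -> Optional[str]:
    """Accumulate the whole answer, run the guardrail once, then release
    everything or nothing — trading perceived latency for safety."""
    full = "".join(tokens)
    return full if check(full) else None
```

Streaming would show tokens before `check` could run, which is exactly the risk GDS declined to take.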
Trust vs. Appropriate Skepticism: GDS wrestled with the challenge that users highly trusted GOV.UK-branded content, but they needed users to understand technology limitations. They designed deliberate friction into the experience.
Automation vs. Human Touch: Capita emphasized that while targeting 95% automation, humans remain essential for vulnerable users and complex cases. This isn’t about replacing people but empowering them to focus where they add most value.
Comprehensive Coverage vs. Quality: GDS’s “best answer or no answer” principle means they sometimes don’t answer questions users expect answers to, prioritizing accuracy over coverage.
While both presentations demonstrated impressive technical implementations and reported strong results, it’s important to note some caveats:
Vendor Context: This presentation occurred at an AWS event with AWS as a partner, so there’s natural incentive to present positive results and emphasize AWS services. The reported metrics should be viewed as claims that would benefit from independent validation.
Early Stage Results: Both implementations are relatively recent. Capita’s results are “already achieved” but they’re still scaling, and GDS is in pilot phase. Long-term sustainability and performance remain to be proven.
Selection Bias: User feedback from pilots may not represent the full population’s experience once these systems reach true national scale with more diverse users and use cases.
Complexity of Measurement: Metrics like “customer satisfaction” and “productivity improvements” can be measured in various ways, and the presentations didn’t detail methodologies. A 35% productivity improvement sounds dramatic but depends heavily on how it’s calculated.
Challenge of No Answer: GDS’s approach of providing no answer when they lack confidence is admirable from a safety perspective, but the user experience impact at scale remains to be seen. Users may become frustrated if they frequently receive no response.
That said, the presentations demonstrated genuine thoughtfulness about the challenges of production LLM deployment, particularly around safety, trust, and human oversight. The emphasis on phased rollouts, extensive testing, red teaming, and continuous monitoring reflects mature LLMOps practices. The cultural and organizational insights about including operational teams and embracing appropriate failure were particularly valuable and often overlooked in technical presentations.
The UK government’s willingness to share detailed technical approaches, including challenges and considerations, provides valuable learning for others deploying AI at scale in high-stakes environments. Their emphasis on transparency through mechanisms like ATRS and their openness about tradeoffs demonstrates a responsible approach to government AI deployment.