Company
British Telecom
Title
Autonomous Network Operations Using Agentic AI
Industry
Telecommunications
Year
2025
Summary (short)
British Telecom (BT) partnered with AWS to deploy agentic AI systems for autonomous network operations across their 5G standalone mobile network infrastructure serving 30 million subscribers. The initiative addresses major operational challenges including high manual operations costs (up to 20% of revenue), complex failure diagnosis in containerized networks with 20,000 macro sites generating petabytes of data, and difficulties in change impact analysis with 11,000 weekly network changes. The solution leverages AWS Bedrock Agent Core, Amazon SageMaker for multivariate anomaly detection, Amazon Neptune for network topology graphs, and domain-specific community agents for root cause analysis and service impact assessment. Early results focus on cost reduction through automation, improved service level agreements, faster customer impact identification, and enhanced change efficiency, with plans to expand coverage optimization, dynamic network slicing, and further closed-loop automation across all network domains.
## Overview

This case study documents British Telecom's comprehensive deployment of agentic AI systems to achieve autonomous network operations at scale, representing one of the telecommunications industry's most ambitious production implementations of LLM-based operational intelligence. The partnership between BT and AWS demonstrates how large language models and multi-agent architectures can transform traditional network operations from manual, siloed processes into data-driven, intent-based autonomous systems.

BT operates critical national infrastructure in the UK, including emergency services, serving approximately 22.5 million daily mobile users across 30 million provisioned subscribers. Their network consists of 20,000 macro sites with multi-carrier deployments, extensive small cell networks, and a fully containerized 5G standalone core running on Kubernetes distributed across the UK. The network generates petabytes of operational data with 4,000 KPIs monitored per tower, creating enormous operational complexity that has historically required manual intervention and large operational teams.

The business imperative driving this initiative is substantial: telecommunications operators spend up to one-fifth of their revenue on network operations, much of which involves manual processes, slow failure diagnosis, and reactive rather than proactive management. BT's vision centers on three pillars (build, connect, and accelerate), with the AI operations work falling squarely in the acceleration pillar focused on cost reduction, faster service development, and improved customer experience.

## Strategic Context and Challenges

The telecommunications industry faces three fundamental challenges that make autonomous networks both critical and achievable. First, the operational cost burden remains unsustainable as networks grow more complex with 5G standalone architectures. Second, 5G promised dynamic, programmable networks with visibility and control capabilities that have not been fully realized; agentic AI represents a potential path to fulfilling that promise. Third, operators possess vast amounts of underutilized data from network elements, devices, and user behavior that could drive both cost reduction and new revenue through data products and personalized services.

BT's specific operational challenges illustrate the complexity. Their 100% containerized core network means constant chatter from nodes going up and down in the Kubernetes environment. When changes occur, and BT executes hundreds daily, totaling 11,000 weekly, even simple configuration adjustments such as SRV record changes can cascade into failures elsewhere in the network. The team monitors a "museum of tools" accumulated over decades, with each network element vendor providing its own element manager and network management services, creating data silos and fragmented operational views.

The company identified four critical operational problem areas requiring transformation: understanding what caused failures when something goes wrong (root cause analysis), assessing the impact of failures and which customers are affected (service impact analysis), learning from data about node behavior and proper operation patterns, and automating responses to prevent recurrence of issues. These challenges span the network lifecycle from planning and engineering through deployment, service fulfillment, and ongoing operations, though the initial agentic AI work focuses primarily on operations and service fulfillment.
BT's journey began with their "DDOps" (Data-Driven Operations) initiative, recognizing that data quality and consolidation must precede effective AI deployment. They acknowledged making early mistakes by jumping into AI use cases without a proper data foundation. The five pillars of DDOps focus on identifying what happened when failures occur, understanding impact and affected parties, learning from data patterns, automating remediation, and embedding continuous improvement mindsets so that issues resolved once do not recur.

## AWS Partnership and Technology Stack

BT selected AWS as their partner based on three factors: cultural alignment around customer-obsessed thinking (symbolized by AWS's empty chair for the customer in meetings), access to advanced AI infrastructure and services that would take years to build independently, and AWS's telecommunications domain expertise enabling rapid mutual understanding of technical requirements. This last point proved particularly valuable: when BT engineers discussed RAN optimization or specific KPIs, AWS teams could engage meaningfully without extensive translation, accelerating solution development.

The technology architecture leverages multiple layers of the AWS AI stack. At the infrastructure level, BT utilizes custom silicon options including AWS Trainium and Inferentia for cost-effective inference, alongside access to Nvidia GPUs when needed. Amazon SageMaker provides the model training and fine-tuning environment for custom machine learning workloads specific to telecommunications patterns.

The critical middle layer consists of AI and agent development software, particularly Amazon Bedrock and Bedrock Agent Core. Agent Core, announced months before this presentation and continuing to evolve, provides primitives for building production-grade agentic systems including session isolation, identity management, external system integration through the Model Context Protocol (MCP), agent-to-agent communication (A2A), and enterprise-grade security and reliability features. This suite of primitives enables BT to construct complex multi-agent systems that can scale to their operational requirements.

The architecture follows a three-layer model designed to turn data into insights and intent into action. The bottom network layer contains AI-native network elements providing pre-curated data from machine learning models and agents positioned close to network infrastructure. The middle layer implements AI-powered data product lifecycles, ingesting raw data, curating it through data management primitives, and creating consumable data products. The top layer runs data-driven AI applications, including both generative AI and hyper-optimized machine learning models serving specific use cases, coordinated through an agentic layer that translates operational intent into concrete actions.

## Data Architecture and Engineering

The foundation of BT's autonomous network initiative rests on comprehensive data engineering, reflecting hard-won lessons about attempting AI without proper data infrastructure. The raw data sources span performance counters, alarms, network topology, configurations, incidents, changes, and knowledge repositories; in the analogy presented, these are the "flour and raw ingredients" that must be prepared before creating valuable insights. The AI-powered data product lifecycle management layer implements several sophisticated capabilities.
Agentic data engineering and feature engineering accelerate the traditionally labor-intensive ML lifecycle phases of data curation and feature extraction. An innovative agentic semantic layer provides a single source of truth for KPI definitions, alarm definitions, and correlation rules, allowing data products to reference these definitions at runtime rather than duplicating logic across systems. This semantic layer prevents the proliferation of inconsistent metric calculations that plagued previous architectures.

Data storage leverages open formats, specifically Apache Iceberg on Amazon S3, providing vendor-neutral data lakes with ACID transaction guarantees. Time series data is tiered across Amazon Redshift and ClickHouse based on temperature: cold, hot, and super-hot data are each stored optimally for their access patterns and cost. This tiering approach balances query performance requirements against storage economics at petabyte scale.

Network topology data resides in Amazon Neptune graph databases and Neptune Analytics, recognizing that telecommunications networks are fundamentally connected graph structures. Graph databases enable efficient traversal of network relationships and execution of graph analytics algorithms including breadth-first search, depth-first search, community detection, and centrality measures. These algorithms prove essential for understanding alarm propagation, identifying blast radius during incidents, and performing correlation analysis across network domains.

The architecture also incorporates Amazon Aurora for structured alarm and event data, geospatial representation of network infrastructure, and, critically, vector stores for retrieval-augmented generation (RAG) patterns. The vector store indexes unstructured operational documentation, runbooks, historical incident reports, and tribal knowledge accumulated over decades of network operations. This vectorized knowledge base enables LLM-based agents to access relevant context when reasoning about current network conditions.

Data products emerging from this infrastructure include RAN and core KPIs and alarms, customer experience metrics from core network analytics, cross-domain network and service topology views, and vectorized operational documentation. A particularly interesting higher-order data product is the network digital twin: a comprehensive model combining network topology, service models, current performance metrics, and historical views. This digital twin provides agents with a queryable representation of network state for simulation, what-if analysis, and impact prediction.

## Machine Learning for Anomaly Detection

The first major production use case addresses multivariate anomaly detection across BT's radio access network. The existing approach relied on univariate anomaly detection with dynamic thresholds applied to individual KPIs. While functional, this method generated excessive noise with high false positive rates, creating alert fatigue for operations teams and obscuring genuine issues among spurious anomalies.

The enhanced approach employs temporal pattern clustering to group cells exhibiting similar behavioral characteristics. Cells in dense urban environments behave differently from macro cells in rural areas or small cells providing capacity infill; recognizing these patterns allows optimization of model architectures and training strategies for each scenario.
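To make the clustering idea concrete, here is a minimal sketch assuming each cell is summarized by its average 24-hour throughput profile; the KPI column names, the feature choice, and the cluster count are hypothetical placeholders rather than BT's actual pipeline.

```python
# Illustrative sketch: cluster cells by their 24-hour KPI profiles so that anomaly
# detection models can be specialized per behavioral group. Column names, the chosen
# feature, and the cluster count are assumptions for illustration only.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def cluster_cells_by_daily_profile(kpi_df: pd.DataFrame, n_clusters: int = 6) -> pd.Series:
    """kpi_df: one row per (cell_id, timestamp) with a 'throughput_mbps' KPI;
    'timestamp' must be a datetime column."""
    # Build a 24-dimensional daily profile per cell (mean throughput per hour of day).
    profiles = (
        kpi_df
        .assign(hour=kpi_df["timestamp"].dt.hour)
        .pivot_table(index="cell_id", columns="hour", values="throughput_mbps", aggfunc="mean")
        .fillna(0.0)
    )
    # Normalize so clustering captures the shape of the daily pattern, not absolute volume.
    scaled = StandardScaler().fit_transform(profiles.values)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(scaled)
    return pd.Series(labels, index=profiles.index, name="behavior_cluster")
```

Each resulting cluster (for example, dense-urban cells with daytime peaks versus rural cells with evening peaks) could then be served by its own anomaly detection model and thresholds, which is the kind of specialization the case study describes.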
The clustering analysis considers network topology, understanding which parts of the network should exhibit correlated behavior based on their physical and logical relationships. Multiple model architectures are trained and compared for different scenarios: LSTM (Long Short-Term Memory) networks for sequential time series patterns, autoencoders for dimensionality reduction and reconstruction-error-based anomaly detection, and transformer models for learning complex interdependencies between KPIs. The models learn which performance metrics correlate under normal operation and how they deviate during various failure modes, creating implicit causal graphs of KPI relationships.

The infrastructure supporting this capability is entirely serverless and managed. AWS Lambda handles data preparation orchestration, Amazon MSK (Managed Streaming for Apache Kafka) processes streaming telemetry, and Amazon EMR executes batch processing jobs for historical analysis. Cell clustering and KPI clustering algorithms run within Amazon SageMaker using both analytics and machine learning approaches for temporal analysis. Model training, versioning, and registry management occur in SageMaker, with models stored as artifacts on S3 in Iceberg format alongside training metadata and lineage information. Inference operates through SageMaker endpoints providing autoscaling based on load.

Model evaluation generates objective metrics on false positive and false negative rates, feature importance rankings showing which KPIs most significantly contribute to detected anomalies, and performance characteristics across different network conditions. Critically, the system incorporates feedback loops from operational subject matter experts who validate detected anomalies, with their assessments feeding into supervised retraining cycles that progressively improve model accuracy.

This approach demonstrates important LLMOps principles around model lifecycle management, though notably these are traditional ML models rather than LLMs. The discipline of versioning, evaluation, feedback integration, and continuous improvement establishes patterns that extend to the LLM-based agentic components built atop this foundation.

## Agentic Root Cause Analysis and Service Impact Assessment

The most sophisticated LLM deployment addresses the perennial challenge of turning a "sea of red" alarm dashboard into actionable insights that identify root causes and affected services. Traditional approaches relied on brittle rule-based correlation engines requiring constant maintenance as network topology evolved, or on supervised ML models that performed poorly due to insufficient quality training data covering the long tail of failure scenarios.

The solution introduces a novel architectural pattern called "domain-specific community agents." This design partitions the agent system along two dimensions: network domains (5G core, 5G RAN, and transport layers such as IP/MPLS and DWDM) and communities within each domain. Communities represent affinity groups of network nodes with close connectivity, essentially the blast radius within which failures propagate and alarms cascade. These communities often align with how networks are designed for resilience, with deliberate boundaries to contain failure impact. Each community deploys dedicated agents that develop specialized knowledge of that network segment's behavior, typical failure modes, and alarm correlation patterns.
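As an illustration of how such communities could be derived from topology, the following sketch runs standard modularity-based community detection over an in-memory graph using networkx; the node names and edges are invented, and in BT's architecture this role is described as being served by Neptune and Neptune Analytics rather than a local graph library.

```python
# Illustrative sketch: derive "communities" (closely connected groups of network nodes)
# from a topology graph, so each community can be assigned its own agent. Node names
# and edges below are hypothetical examples, not real BT topology.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def partition_into_communities(topology_edges: list[tuple[str, str]]) -> list[set[str]]:
    """topology_edges: (node_a, node_b) adjacency pairs from the network topology graph."""
    graph = nx.Graph()
    graph.add_edges_from(topology_edges)
    # Greedy modularity maximization groups nodes that are densely connected to each
    # other, approximating the blast radius within which alarms tend to cascade.
    return [set(community) for community in greedy_modularity_communities(graph)]

if __name__ == "__main__":
    edges = [
        ("ran-site-001", "agg-router-1"), ("ran-site-002", "agg-router-1"),
        ("agg-router-1", "core-upf-1"), ("ran-site-101", "agg-router-2"),
        ("agg-router-2", "core-upf-1"),
    ]
    for i, community in enumerate(partition_into_communities(edges)):
        print(f"community {i}: {sorted(community)}")
```

The resulting groups give each community agent a bounded scope of nodes whose failures and alarms are likely to be related, matching the partitioning described above.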
These agents collaborate with peer agents in adjacent communities within the same domain to correlate alarms across community boundaries. Inter-domain agents coordinate across network layers, which is essential for scenarios such as transport failures causing cascading alarms in radio access networks or core network components.

The agent implementation leverages Amazon Bedrock Agent Core for the runtime environment, identity management, gateway functions, observability, and memory management. When alarms and anomalies arrive via MSK streaming, agents retrieve relevant network topology from Amazon Neptune graph databases. Graph analytics algorithms identify connected groups of alarms (nodes experiencing simultaneous or temporally related issues), providing the spatial context for correlation.

Agents then perform retrieval-augmented generation against two primary knowledge bases. The operations knowledge base contains vectorized runbooks, standard operating procedures, vendor documentation, and historical incident reports. The root cause analysis knowledge base grows over time through supervised learning, capturing validated RCA outcomes and the reasoning paths that led to correct diagnoses. This creates an institutional memory that accumulates tribal knowledge and makes it accessible to agents addressing future incidents.

The LLMs, currently foundation models from Bedrock with plans for domain-specific fine-tuning, apply reasoning capabilities to synthesize alarm patterns, topology relationships, and retrieved knowledge into root cause hypotheses. The multi-agent architecture enables parallel exploration of multiple hypotheses across network domains and communities, with inter-agent communication consolidating findings into coherent explanations.

Service impact analysis builds on root cause identification by correlating affected network elements with customer experience metrics from the 5G core network. The system identifies how many subscribers are impacted, what types of services they are using (voice, data, specific applications), and the severity of degradation they are experiencing. This enables prioritization of remediation efforts and proactive customer communication rather than waiting for complaints.

The architecture integrates with trouble ticketing systems through APIs, automatically creating, updating, and closing tickets based on agent findings. Alarms are persisted in Amazon RDS and service impact metrics in S3, providing queryable history for trend analysis and compliance reporting. The observability capabilities built into Agent Core provide detailed tracing of agent reasoning, decision points, and inter-agent communications, which is essential for debugging agent behavior and building operator trust in autonomous decisions.

## Optimization Use Cases and Closed-Loop Automation

Beyond reactive troubleshooting, the agentic architecture supports optimization workloads that proactively improve network performance. Coverage analysis and optimization is a key use case leveraging the hexagonal cell structure of mobile networks. The system analyzes signal strength, interference patterns, capacity utilization, and quality metrics within each cell's coverage area, identifying opportunities to adjust parameters for better performance.
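As a hedged illustration of the kind of cell-level screening such coverage analysis implies, the sketch below flags candidate cells from per-cell KPIs; the thresholds, column names, and scoring rule are invented placeholders, not BT's actual logic.

```python
# Illustrative sketch: flag cells that are candidates for coverage/capacity optimization
# by combining per-cell KPIs. Thresholds, column names, and the priority formula are
# hypothetical assumptions for demonstration purposes.
import pandas as pd

def find_optimization_candidates(cell_kpis: pd.DataFrame) -> pd.DataFrame:
    """cell_kpis columns: cell_id, prb_utilization_pct, avg_rsrp_dbm, drop_rate_pct."""
    candidates = cell_kpis[
        (cell_kpis["prb_utilization_pct"] > 80)      # congested: high resource-block usage
        | (cell_kpis["avg_rsrp_dbm"] < -110)         # weak signal at the cell edge
        | (cell_kpis["drop_rate_pct"] > 1.0)         # quality degradation
    ].copy()
    # Rank so an operator (or an optimization agent) can address the worst cells first.
    candidates["priority_score"] = (
        candidates["prb_utilization_pct"] / 100
        + candidates["drop_rate_pct"]
        + (-110 - candidates["avg_rsrp_dbm"]).clip(lower=0) / 10
    )
    return candidates.sort_values("priority_score", ascending=False)
```

Cells flagged this way are the sort of target that an intent such as "optimize this network sector for capacity" would ultimately resolve to, which is where the intent-based orchestration described next comes in.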
Intent-based orchestration allows operations teams to express high-level goals like "optimize this network sector for capacity" or "reduce interference in this region" rather than manually calculating specific parameter changes across potentially hundreds of configuration items. Agents translate these intents into specific actions: adjusting antenna tilt angles, modifying power levels, reconfiguring carrier aggregation combinations, or altering scheduling algorithms.

Dynamic network slicing represents a future application particularly relevant to 5G standalone networks. The vision is to provision network slices automatically based on application requirements and user subscriptions: when a gaming subscriber launches a game, the network detects this and assigns them to a gaming-optimized slice with appropriate latency guarantees and bandwidth prioritization. Similarly, applications requiring enhanced security automatically receive slices with additional security controls. This requires real-time intent recognition, slice orchestration, and policy enforcement across distributed network functions.

The roadmap toward increased autonomy progresses from the current Level 4 maturity (closed-loop operations with AI-powered decision making but human oversight) toward Level 5 (fully autonomous operations). Each use case begins with agent recommendations reviewed by human operators before execution. As confidence builds through validated outcomes, the automation boundary gradually expands to include more decisions executed without human intervention, though always with comprehensive logging and rollback capabilities.

## Model Fine-Tuning and Cost Optimization

An important evolution in the LLMOps journey involves domain-specific fine-tuning of foundation models. While initial deployments use base models from Amazon Bedrock, the team recognizes opportunities to improve both accuracy and economics through fine-tuning on telecommunications-specific data. Network operations involve specialized vocabulary, abbreviations, and conceptual relationships that general-purpose LLMs handle suboptimally.

Fine-tuning objectives include reducing token consumption for common reasoning patterns, improving accuracy on telecommunications-specific terminology and concepts, reducing latency through more efficient inference with smaller fine-tuned models, and potentially enabling deployment of smaller language models for specific agent functions where full foundation model capabilities are not required. The case study references ongoing experimentation and proof-of-concepts in this area, suggesting this represents active work rather than a completed deployment.

The fine-tuning strategy must balance multiple considerations. Training data curation requires carefully selecting examples that represent desired agent behavior while avoiding bias toward overrepresented failure scenarios. Evaluation frameworks need domain-specific metrics beyond standard LLM benchmarks: does the model correctly identify network topology relationships, accurately interpret alarm codes, and provide reasoning aligned with expert network engineers? Data privacy and security considerations are paramount given that training data may contain customer information or network security details.

The architecture supports experimentation through Amazon SageMaker's model training and versioning capabilities. Multiple fine-tuned variants can be evaluated in A/B testing scenarios, with performance metrics feeding back into model selection decisions.
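A minimal sketch of what such an offline A/B comparison could look like follows, assuming a small SME-labeled evaluation set and a `diagnose_fn` callable that wraps whichever model variant is under test; the exact-match scoring is a deliberate simplification.

```python
# Illustrative sketch: offline A/B comparison of two candidate models (for example, a
# base foundation model versus a fine-tuned variant) on a labeled root-cause evaluation
# set. `diagnose_fn` stands in for whatever agent/model invocation is actually used;
# the dataset format and exact-match scoring are simplifying assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    alarm_summary: str        # condensed alarm/topology context fed to the model
    expected_root_cause: str  # SME-validated ground-truth label

def evaluate_variant(diagnose_fn: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases where the predicted root cause matches ground truth."""
    correct = sum(
        1 for case in cases
        if diagnose_fn(case.alarm_summary).strip().lower() == case.expected_root_cause.lower()
    )
    return correct / len(cases) if cases else 0.0

def compare_variants(variant_a: Callable[[str], str],
                     variant_b: Callable[[str], str],
                     cases: list[EvalCase]) -> None:
    print(f"base model accuracy:       {evaluate_variant(variant_a, cases):.2%}")
    print(f"fine-tuned model accuracy: {evaluate_variant(variant_b, cases):.2%}")
```

In practice, exact string matching is too strict for free-text root cause output, so rubric-based or SME-reviewed judgments (drawing on the validated RCA knowledge base described earlier) would likely replace the simple comparison loop shown here.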
The Bedrock deployment model allows seamless substitution of custom fine-tuned models in place of foundation models through the same API interfaces, minimizing application changes when transitioning to optimized models.

## Observability, Evaluation, and Trust

Building operator trust in autonomous agent decisions requires comprehensive observability and explainability. The Agent Core primitives provide detailed tracing of agent execution, including which tools agents invoked, what information they retrieved from knowledge bases, how they reasoned about that information, and what actions they recommended or executed.

This tracing enables several critical capabilities. First, debugging agent behavior when outcomes do not match expectations requires understanding the decision pathway. Operators can review execution traces to identify where agents misinterpreted data, retrieved irrelevant context, or applied flawed reasoning. These insights directly inform refinements to agent prompts, knowledge base curation, or tool implementations.

Second, building confidence requires demonstrating that agents reach correct conclusions through valid reasoning rather than lucky guesses. Even when agents produce correct root cause identifications, operators may distrust "black box" decisions. Providing visibility into the reasoning process, showing how the agent correlated specific alarm patterns with topology relationships and matched them to historical incident patterns, builds trust through transparency.

Third, continuous evaluation requires metrics beyond simple accuracy. The system tracks mean time to detect anomalies, mean time to identify root causes, accuracy of service impact predictions, false positive and false negative rates, the percentage of incidents requiring human escalation, and operator satisfaction ratings with agent recommendations. These metrics provide multidimensional visibility into system performance and inform prioritization of improvement efforts.

The feedback loop from operational SMEs proves essential. When agents produce root cause hypotheses, operators validate whether those hypotheses led to successful remediation. This validation data feeds into the growing RCA knowledge base and into evaluation datasets for fine-tuning efforts. Over time, the system learns from its mistakes and from operator corrections, progressively improving performance through supervised learning cycles.

## Deployment Architecture and Operational Patterns

The production deployment architecture reflects enterprise requirements for security, scalability, and reliability. Data ingestion from on-premises data centers uses hybrid connectivity, with Amazon MSK for streaming and EMR for batch transfers. Event-driven architectures built on AWS Lambda and EventBridge enable responsive processing as network conditions change, scaling compute resources dynamically based on event volume.

The data catalog provides crucial governance capabilities, maintaining metadata about data lineage, quality metrics, and access controls. As data flows through ingestion, curation, and product creation pipelines, the catalog tracks transformations and dependencies. This enables impact analysis when schema changes occur, compliance reporting for data usage, and quality monitoring to detect degradation in source data.

Security considerations permeate the architecture. Identity and access management integrates with BT's enterprise identity systems, ensuring agents operate with appropriate permissions.
Data encryption at rest and in transit protects sensitive network information and customer data. AWS Nitro Enclaves provide hardware-isolated compute environments for particularly sensitive processing workloads where even AWS operators cannot access data.

The multi-region, distributed nature of BT's core network influences the architecture. Network functions running in Kubernetes clusters across the UK generate data locally, requiring distributed data collection and aggregation strategies. The architecture balances centralized analytics, where comprehensive correlation across the entire network provides maximum insight, against edge processing for latency-sensitive use cases where waiting for centralized processing would delay critical decisions.

Cost optimization remains an important consideration at petabyte scale. The tiered storage strategy uses ClickHouse for super-hot data (immediate operational queries), Redshift for hot data (recent historical analysis), and S3 with Iceberg for cold data (long-term retention and batch analytics), reflecting careful analysis of access patterns and cost tradeoffs. Query optimization, appropriate indexing strategies, and lifecycle policies that automatically transition data between tiers keep storage costs manageable while maintaining required access performance.

## Results, Benefits, and Future Roadmap

While the case study focuses more on architectural details and capabilities than on quantified outcomes, several benefit categories are emphasized. Cost reduction targets removing operational expense through automation of manual tasks, consolidation of fragmented monitoring tools, and reduction in staffing requirements for routine network operations. Given that operations costs can reach up to 20% of revenue, the potential financial impact is substantial if fully realized.

Service level agreement improvements come from faster detection of anomalies, quicker identification of root causes, and more accurate prediction of which issues will impact customers. Reducing mean time to detect and mean time to repair directly translates to improved uptime and customer experience. The proactive service impact analysis enables customer communication before complaints arise, potentially reducing support costs and churn.

Change efficiency gains address the risk inherent in BT's 11,000 weekly network changes. Better understanding of dependencies and potential cascading impacts allows more confident change execution with reduced rollback rates. The digital twin capabilities enable "what-if" simulation of changes before execution, identifying potential issues in a safe environment rather than discovering them in production.

The transformation extends beyond technology to people and processes. BT explicitly acknowledges the need to evolve from traditional network engineering teams to software engineering teams that operate networks, and further to teams proficient in AI engineering for network operations. This cultural and skills transformation represents a multi-year journey requiring training, hiring, and organizational restructuring.

The roadmap emphasizes expanding coverage of autonomous capabilities across all network lifecycle phases.
Initial focus on operations and troubleshooting will extend to planning and engineering (where should we place cell towers, what equipment should we procure), deployment and configuration (automated network element provisioning and testing), and advanced service fulfillment (dynamic slice creation and modification based on real-time demand).

Specific upcoming capabilities include enhanced coverage analysis and optimization leveraging the hexagonal cell structure, automated RAN parameter tuning for capacity and coverage, dynamic network slicing with application-aware slice selection, and expansion beyond mobile networks to fixed networks including fiber and legacy copper infrastructure. The vision encompasses a fully autonomous network that heals itself, optimizes continuously based on usage patterns, and dynamically adapts to application requirements without human intervention.

## Critical Assessment and Open Questions

This case study represents an ambitious and technically sophisticated approach to LLMOps in a complex operational environment, though several areas warrant balanced consideration. The presentation emphasizes capabilities and architecture over quantified results, making it difficult to assess actual operational impact. Claims about cost reduction and improved SLAs lack specific metrics or validation data, which is understandable for early-stage deployments but limits assessment of effectiveness.

The complexity of the solution raises questions about operational sustainability. The architecture spans numerous AWS services requiring specialized expertise: SageMaker, Bedrock, Agent Core, Neptune, EMR, MSK, Redshift, ClickHouse, and others. While AWS provides managed services that reduce infrastructure burden, the overall system complexity could create new operational challenges even as it solves old ones. BT's acknowledgment of needing to transform their workforce toward software and AI engineering skills reflects this reality.

The domain-specific community agent architecture is innovative but unproven at scale. While the conceptual approach of partitioning agents by network domain and community makes intuitive sense, the practicalities of coordinating potentially hundreds or thousands of specialized agents, managing their knowledge boundaries, and ensuring consistent behavior across agent populations present significant engineering challenges. The case study does not detail how these coordination challenges are addressed.

The reliance on base foundation models rather than fine-tuned alternatives represents a temporary state with acknowledged plans to evolve toward domain-specific models. This suggests the current system may not yet achieve optimal accuracy or cost-effectiveness, positioning this as work in progress rather than a mature production deployment. The ongoing experimentation with fine-tuning indicates recognition of these limitations.

Data quality challenges receive acknowledgment through the DDOps initiative and the emphasis on data engineering, but the difficulty of maintaining clean, accurate network topology data and consistent alarm definitions across a heterogeneous multi-vendor network should not be underestimated. The "museum of tools" problem that led to fragmented data likely requires years of consolidation work to fully resolve, potentially limiting what autonomous agents can achieve in the interim.

Trust and explainability remain open challenges despite the observability capabilities.
Operations teams accustomed to deterministic rule-based systems may resist trusting probabilistic AI recommendations, particularly for high-risk changes affecting critical infrastructure like emergency services. The transition from human-in-the-loop to truly autonomous operation requires not just technical capability but regulatory acceptance and organizational confidence that may take longer to achieve than the technology development itself.

Nevertheless, this case study demonstrates serious production deployment of agentic AI in a genuinely complex operational environment rather than a controlled proof-of-concept. The partnership between BT and AWS brings together operational domain expertise and AI platform capabilities in ways that should accelerate learning and iteration. The architectural patterns around domain-specific community agents, semantic data layers, and digital twins represent potentially reusable approaches for other large-scale operational AI deployments beyond telecommunications.
