ZenML

Journey Towards Autonomous Network Operations with AI/ML and Dark NOC

BT

BT is undertaking a major transformation of their network operations, moving from traditional telecom engineering to a software-driven approach with the goal of creating an autonomous "Dark NOC" (Network Operations Center). The initiative focuses on handling massive amounts of network data, implementing AI/ML for automated analysis and decision-making, and consolidating numerous specialized tools into a comprehensive intelligent system. The project involves significant organizational change, including upskilling teams and partnering with AWS to build data foundations and AI capabilities for predictive maintenance and autonomous network management.

Industry

Telecommunications

Overview

This case study presents a conversation between Ajay Ravindranathan, a Principal Solutions Architect at AWS’s Telecom Industries Business Unit, and Reza Rahnama, the Managing Director of Mobile Networks at British Telecom (BT). The discussion centers on BT’s ambitious vision called the “Dark NOC” — a journey toward autonomous network operations where the network can self-heal and self-optimize with minimal human intervention.

BT operates one of the largest mobile networks in the UK, encompassing thousands of radio sites, thousands of gateways, a massive IP network, and distributed core and IMS (IP Multimedia Subsystem) networks. The organization faces the classic telecommunications challenge: despite technological evolution from 2G through 5G, operational practices have remained largely unchanged. The same functional teams that managed HLR (Home Location Register) in 2G days now manage HSS (Home Subscriber Server) in 5G, with similar manual processes and siloed expertise.

The Problem Space

The core challenge BT faces is multi-dimensional. First, the network generates enormous amounts of data — described as “petabytes and petabytes” — flowing from various element managers into network management services. Currently, this data is monitored via basic SNMP polling and analyzed through manual syslog reviews by siloed teams with specialized expertise. This approach is neither scalable nor efficient for modern network operations.

Second, BT operates with “hundreds and hundreds of tools that are very bespoke for a function.” Each module or function in the network has its own dedicated monitoring tools, creating fragmentation and preventing holistic network intelligence. The lack of integration means that when issues occur, it’s extremely difficult to correlate events across different network domains to understand root causes.

Third, there’s a significant organizational and skills gap. The team needs to transform from a traditional network engineering organization into what Rahnama describes as “a software team that runs the mobile network.” This requires upskilling personnel to understand both the domain expertise (radio, core, transmission, IMS, devices) and new technology paradigms around data engineering, ML, and AI.

The Solution Approach and AI/ML Strategy

BT’s approach with AWS follows a phased methodology that prioritizes foundational work before advanced AI implementation. This is a notably pragmatic approach that contrasts with many organizations that rush to implement AI without proper data foundations.

Phase 1: Data Foundation

The first and most critical phase involves “fixing the data.” This means taking the raw data output from the network, cleaning it, and creating “a sensible data architecture.” As Rahnama emphasizes, this work happens “way before we predict anything” — a crucial acknowledgment that AI and ML are only as good as the data they’re trained on.

AWS is helping BT create what they call a “network graph” or “knowledge graph” — essentially a unified topology that represents the relationships between different network elements, supported by AWS services for topology, discovery, and analytics.
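The case study does not show BT's actual graph model, but the core idea — a unified topology that lets you trace which elements depend on which — can be sketched with a plain adjacency map and a breadth-first walk. All element names below (RAN-0042, GW-01, etc.) are hypothetical illustrations, not BT identifiers.

```python
from collections import deque

# Hypothetical topology: each element maps to its downstream dependents.
# A real network graph would be discovered automatically, not hand-written.
topology = {
    "RAN-0042": ["GW-01"],
    "GW-01": ["CORE-A"],
    "CORE-A": ["IMS-1", "IMS-2"],
    "IMS-1": [],
    "IMS-2": [],
}

def impacted_elements(root: str) -> set:
    """Breadth-first walk listing everything downstream of a faulty element."""
    seen, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for dep in topology.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# A fault at a radio site propagates through the gateway, core, and IMS.
print(sorted(impacted_elements("RAN-0042")))
```

This kind of reachability query is what turns raw topology data into service-impact and root-cause answers: the same graph, walked in the opposite direction, points from a symptom back toward candidate causes.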

Phase 2: The AI Continuum and Agentic Framework

The AWS representative introduces an important concept: the “AI continuum.” This recognizes that different types of AI models serve different purposes, and that production systems need to orchestrate multiple approaches — from traditional ML models through to generative AI — within an agentic framework.

The philosophy articulated is to “use the right tool for the right job” — acknowledging that some problems are better solved with traditional ML while others benefit from generative AI capabilities. This is a mature LLMOps perspective that avoids the trap of treating Gen AI as a universal solution.
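A minimal sketch of the “right tool for the right job” idea, assuming a router that sends numeric telemetry to a classical statistical model and unstructured text to a generative model. The `llm_summarize` function here is a stand-in stub, not a real LLM call.

```python
def classical_anomaly_score(samples):
    """Classical ML stand-in: z-score of the latest sample vs. the series mean."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    std = var ** 0.5 or 1.0  # avoid division by zero on flat series
    return abs(samples[-1] - mean) / std

def llm_summarize(ticket_text):
    """Placeholder for a generative-AI call; a real system would invoke an LLM."""
    return "summary: " + ticket_text[:40]

def route(task):
    """Right tool for the right job: numeric series go to classical ML,
    unstructured text goes to the generative model."""
    if isinstance(task, list) and all(isinstance(x, (int, float)) for x in task):
        return ("classical", classical_anomaly_score(task))
    return ("generative", llm_summarize(str(task)))

print(route([100, 101, 99, 250]))          # dispatched to classical scoring
print(route("link flap observed on GW-01"))  # dispatched to the LLM stub
```

An agentic framework generalizes this dispatch step: an orchestrator decomposes a task and hands each sub-task to whichever model or tool suits it.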

Target Use Cases

The discussion outlines several specific applications for the AI system:

Automated Data Analysis: Currently, humans spend significant time “looking at huge amount of data and trying to figure out what’s wrong with the network.” The goal is to automate this analysis, having the system identify problems and potentially make decisions autonomously.

Root Cause Analysis: When network events occur, determining the underlying cause is currently extremely difficult given the network’s complexity. AI can help correlate events across network domains to pinpoint issues faster.

Service Impact Analysis: Understanding how network issues affect customer services and applications.

Anomaly Detection: Particularly important for both operational stability and cybersecurity. The example given involves detecting if “a container just got access through something that maybe a basic RBACs could stop it.”

Predictive Maintenance: The “holy grail” according to Rahnama — predicting failures before they occur and preventing them. This is explicitly tied to continuous service availability.

Network Optimization: Using predictive analysis to optimize the radio network performance.

Application-Aware Networking: An advanced vision where “the network adjusts itself for the applications that you use” — essentially dynamic network behavior based on workload requirements.
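Of these use cases, anomaly detection on KPI time series is the most mechanically concrete. A minimal sketch, assuming a trailing-window z-score test (the KPI values are invented for illustration):

```python
from statistics import mean, stdev

def detect_anomalies(series, window=5, threshold=3.0):
    """Flag indices whose value deviates from the trailing window's mean
    by more than `threshold` standard deviations."""
    flagged = []
    for i in range(window, len(series)):
        ref = series[i - window:i]
        mu, sigma = mean(ref), stdev(ref)
        if sigma == 0:
            sigma = 1e-9  # flat window: avoid division by zero
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# A sudden spike in an otherwise stable counter is flagged.
kpi = [100, 101, 99, 100, 102, 100, 250, 101]
print(detect_anomalies(kpi))  # → [6]
```

Production systems layer seasonality handling, multivariate correlation, and alert deduplication on top of this kind of baseline, but the statistical core is the same.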

Infrastructure and Technology Context

Several technical aspects of BT’s infrastructure are relevant to the LLMOps context:

Containerized Core Network: BT has moved to a “huge containerized network” which enables capabilities like moving traffic from faulty nodes to working nodes. This modern infrastructure is more amenable to automated management and provides better observability for AI systems.

AWS Partnership: The cloud partnership provides both the data platform capabilities and AI/ML services. AWS is specifically mentioned as providing services around network graph, topology, discovery, and analytics.

Consolidation Strategy: Rather than building AI capabilities for each existing tool, the vision is to create “a bigger network monitoring system” that consolidates hundreds of tools into unified intelligence.

Organizational Transformation

A critical but often overlooked aspect of LLMOps is the organizational change required. BT’s approach explicitly addresses this:

Skills Development: The team needs to understand “how all these new technology works” — specifically combining domain expertise (radio engineering, core engineering) with data and software skills.

Process Automation: Before applying AI, basic processes need to be cleaned and automated. Once processes are well-defined, some decisions can be made with very basic rule-based logic rather than ML at all.

Human-Machine Division: The vision clarifies that machines handle immediate decisions “to keep the network going” while humans “go figure out why it failed, was it a design issue or not.” This is a thoughtful approach to human-AI collaboration.
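The human-machine division described above can be illustrated as a simple triage rule: the machine takes the immediate stabilizing action, and anything requiring diagnosis is routed to people. The alarm fields and action names here are hypothetical.

```python
def immediate_action(alarm):
    """Rule-based triage: the machine keeps the network going;
    humans investigate the root cause afterwards."""
    if alarm.get("severity") == "critical" and alarm.get("standby_available"):
        return "failover"        # move traffic to a working node now
    if alarm.get("severity") == "critical":
        return "page_oncall"     # no standby path: escalate to a human immediately
    return "queue_for_review"    # non-urgent: humans analyse later

print(immediate_action({"severity": "critical", "standby_available": True}))
```

The point is that the automated path only ever takes reversible, pre-approved actions; the "why did it fail, was it a design issue" question stays with engineers.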

Balanced Assessment

It’s important to note that this case study represents a vision and early-stage work rather than a deployed production system, and the discussion includes several honest acknowledgments to that effect.

The partnership with AWS positions this as a vendor-customer relationship where AWS is promoting its services, so the positive framing should be considered in that context. However, the phased approach and emphasis on data foundations before AI implementation reflects industry best practices.

Key LLMOps Lessons

Several themes from this case study are relevant for LLMOps practitioners:

Data Quality First: The importance of data quality and accessibility cannot be overstated — even the most sophisticated Gen AI systems require clean, governed, accessible data.

The AI Continuum: Production AI systems often need multiple model types working together, not just LLMs.

Agentic Orchestration: The agentic framework approach suggests orchestrating different AI capabilities rather than relying on single models.

Human-in-the-Loop Design: This remains crucial even in “autonomous” visions — the goal is augmentation, not complete replacement.

Organizational Transformation: Skills, processes, and culture are just as important as technology choices for successful AI deployment at scale.
