## Overview
Cisco's Outshift incubation group tackled a critical networking industry problem: high failure rates during network change management in production environments. This case study demonstrates a comprehensive LLMOps implementation that combines multiple AI agents, knowledge graphs, and digital twin technology to automate and improve network change validation processes.
The project reflects Cisco's broader strategy of using emerging AI technologies to accelerate traditional business unit roadmaps through its incubation process: customer problem identification, prototyping, A/B testing, MVP delivery, and eventual graduation to a production Cisco business unit once product-market fit is achieved.
## Problem Context and Customer Requirements
The customer pain point centered around frequent failures during network change management processes. Network operations teams struggled with the complexity of modern networking environments, which include multiple vendors, diverse device types (firewalls, switches, routers), and various data formats across different systems. The challenge was particularly acute because network engineers needed to assess potential impacts of changes, create comprehensive test plans, and validate changes before production deployment.
From a product requirements perspective, the team identified several critical needs: multimodal flexibility to handle different data formats (key-value pairs, JSON documents, relationship data); low-latency responses to node information queries regardless of where a device sits in the network; operational flexibility through a consolidated schema framework; vector indexing to support semantic search; and ecosystem stability, meaning support for multiple vendors without heavy integration effort on the customer's side.
## Technical Architecture and LLMOps Implementation
The solution architecture consists of three primary components working in concert. The first is a natural language interface that enables both human network operations teams and automated systems to interact with the platform. Notably, this interface extends beyond human users to include agent-to-agent communication, such as integration with ITSM tools like ServiceNow where agents on both systems communicate directly.
The second component is a sophisticated multi-agent system where each agent is specialized for specific tasks. The system includes an assistant agent that serves as the primary orchestrator and planner, coordinating activities across all other agents. Additional agents handle specific functions like impact assessment, test plan generation, configuration analysis, and test execution. All agents except the orchestrator utilize ReAct (Reasoning and Acting) loops, enabling them to reason through problems and take appropriate actions iteratively.
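The talk did not share implementation code, but the ReAct pattern used by the specialized agents is straightforward to sketch. The minimal loop below is illustrative only: the `call_llm` function, the tool registry, and the JSON step format are assumptions, not Cisco's actual implementation.

```python
# Minimal ReAct-style loop for one specialized agent (illustrative sketch).
# `call_llm` and the tool registry are placeholders, not Cisco's code.
import json
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call; returns the model's next step as JSON."""
    raise NotImplementedError("wire up an LLM provider here")

TOOLS: Dict[str, Callable[[str], str]] = {
    "query_knowledge_graph": lambda q: "...graph results...",  # stub tool
    "run_config_diff":       lambda q: "...diff results...",   # stub tool
}

def react_agent(task: str, max_steps: int = 8) -> str:
    """Reason -> act -> observe until the model emits a final answer."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        # Expected step format: {"thought", "action", "action_input"} or {"final_answer"}
        step = json.loads(call_llm(transcript))
        if "final_answer" in step:
            return step["final_answer"]
        observation = TOOLS[step["action"]](step["action_input"])
        transcript += (
            f"Thought: {step['thought']}\n"
            f"Action: {step['action']}({step['action_input']})\n"
            f"Observation: {observation}\n"
        )
    return "Stopped: step budget exhausted."
```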
The third critical component is the network knowledge graph and digital twin implementation. This represents one of the most technically sophisticated aspects of the solution, as it creates a comprehensive digital representation of production networks that can be queried and manipulated by AI agents.
## Knowledge Graph and Data Pipeline Architecture
The knowledge graph implementation faced significant technical challenges due to the heterogeneous nature of network data sources. The team had to design an ingestion pipeline capable of handling data from network controllers, device agents, configuration management systems, SIEM systems like Splunk, and direct device telemetry. These sources output data in various formats including YANG models, JSON configurations, and streaming telemetry data.
After evaluating multiple graph database options including Neo4j and ArangoDB, the team selected ArangoDB primarily due to historical usage in security-related recommendation systems within Cisco, though they continue exploring Neo4j for future use cases. The decision-making process involved comprehensive analysis across multiple criteria including performance characteristics, schema flexibility, and integration capabilities.
The final architecture implements an ETL (Extract, Transform, Load) process that standardizes all network data into OpenConfig schema. This choice proved strategically important because OpenConfig is well-documented online, making it easily understood by large language models. The knowledge graph is structured in layers corresponding to different network abstraction levels, allowing agents to query specific layers based on their needs rather than traversing the entire graph structure.
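As an illustration of this normalization step, the sketch below maps a hypothetical vendor interface export into an OpenConfig-shaped document and writes it into a layer-specific graph collection. The source fields, collection names, and the `db` handle (for example, a python-arango database object) are assumptions rather than Cisco's actual schema.

```python
# Illustrative ETL step: normalize a vendor-specific interface record into an
# OpenConfig-style document before loading it into the knowledge graph.
# Field names loosely follow the public openconfig-interfaces model; the source
# format and the collection names are assumptions.

def to_openconfig_interface(vendor_record: dict) -> dict:
    """Map a hypothetical vendor export to an OpenConfig-shaped dict."""
    return {
        "name": vendor_record["ifName"],
        "config": {
            "name": vendor_record["ifName"],
            "mtu": vendor_record.get("mtu", 1500),
            "enabled": vendor_record.get("adminStatus", "up") == "up",
            "description": vendor_record.get("descr", ""),
        },
    }

def load_into_layer(db, layer: str, device: str, doc: dict) -> None:
    """Insert the normalized document into a layer-specific collection
    (e.g. 'raw_config', 'control_plane'), keyed by device and interface."""
    db.collection(layer).insert(
        {"_key": f"{device}:{doc['name']}", "device": device, **doc},
        overwrite=True,
    )
```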
For example, configuration drift testing requires access only to raw configuration file layers, while reachability testing needs multiple layers including raw configuration, data plane, and control plane information. This layered approach optimizes query performance and reduces computational overhead.
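A layer-scoped query makes this concrete: a configuration-drift check only needs the raw-configuration layer, so a single-collection AQL query suffices instead of a full graph traversal. The collection and attribute names below are assumptions, and the connection details are placeholders.

```python
# Illustrative layer-scoped query against the assumed 'raw_config' collection.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("network_twin", username="root", password="changeme")

DRIFT_QUERY = """
FOR doc IN raw_config
  FILTER doc.device == @device AND doc.config_hash != @intended_hash
  RETURN {device: doc.device, interface: doc.name}
"""

drifted = list(db.aql.execute(
    DRIFT_QUERY,
    bind_vars={"device": "edge-router-01", "intended_hash": "abc123"},
))
```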
## Agent Fine-tuning and Performance Optimization
A particularly noteworthy aspect of this LLMOps implementation involves the fine-tuning of the query agent, which serves as the primary interface between the multi-agent system and the knowledge graph. Initially, the team attempted to use Retrieval-Augmented Generation (RAG) approaches for knowledge graph querying, but this proved ineffective for their specific use case.
The performance issues were significant: agents were consuming excessive tokens because AQL (ArangoDB Query Language) queries were traversing all knowledge graph layers during reasoning loops, resulting in slow response times and high computational costs. To address this, the team implemented a targeted fine-tuning approach for the query agent, incorporating schema information and example queries into the training process.
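The presentation did not describe the training data format, but a plausible shape for a fine-tuning record, assuming an OpenAI-style chat JSONL layout, pairs the relevant schema slice and a natural-language question with the target AQL query:

```python
# Illustrative fine-tuning record for the query agent: the system message
# carries the graph schema slice, the user message is the question, and the
# assistant message is the target AQL. Format and field names are assumptions.
import json

record = {
    "messages": [
        {"role": "system",
         "content": "Schema: raw_config(device, name, config.mtu, config.enabled), "
                    "control_plane(device, protocol, neighbor). "
                    "Answer with a single AQL query."},
        {"role": "user",
         "content": "Which interfaces on edge-router-01 are administratively down?"},
        {"role": "assistant",
         "content": "FOR d IN raw_config "
                    "FILTER d.device == 'edge-router-01' AND d.config.enabled == false "
                    "RETURN d.name"},
    ]
}

with open("query_agent_train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```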
This fine-tuning effort yielded dramatic improvements in system performance. The team observed substantial reductions in both token consumption and query response times. This represents a practical example of when fine-tuning approaches may be more effective than RAG methods for specific technical domains and use cases, highlighting the importance of evaluating multiple approaches during LLMOps implementation.
## Production Integration and Workflow Automation
The system integrates tightly with existing enterprise workflows through its ITSM connectivity. When network engineers submit change requests through ServiceNow, the AI system automatically processes these tickets using natural language understanding capabilities. The workflow progresses through several automated stages that traditionally require significant manual effort.
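The talk did not show the integration code, but the retrieval side of such a workflow can be sketched against the public ServiceNow Table API; the instance URL, credentials, and state filter below are placeholders.

```python
# Illustrative pull of pending change requests via the ServiceNow Table API.
import requests

INSTANCE = "https://example.service-now.com"

resp = requests.get(
    f"{INSTANCE}/api/now/table/change_request",
    params={"sysparm_query": "state=assess", "sysparm_limit": 10},
    auth=("api_user", "api_password"),
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()

for ticket in resp.json()["result"]:
    # Hand the free-text description to the orchestrator agent for
    # summarization, impact assessment, and test planning.
    print(ticket["number"], ticket["short_description"])
```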
First, the system synthesizes and summarizes the change request information to ensure all agents understand the scope and requirements. Next, it generates comprehensive impact assessments by analyzing the proposed changes against the current network state represented in the knowledge graph. These assessments identify potential implications beyond the immediate target area of the change.
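A hedged sketch of such an assessment, reusing the `db` handle from the drift-check example above, is a short graph traversal that walks one to two hops out from the changed device to surface neighbors the change could affect; the graph name, collections, and attributes are assumptions.

```python
# Illustrative blast-radius query over an assumed named graph 'topology'.
IMPACT_QUERY = """
FOR v, e IN 1..2 ANY @start GRAPH 'topology'
  RETURN DISTINCT {device: v.device, via: e.link_type}
"""

affected = list(db.aql.execute(
    IMPACT_QUERY,
    bind_vars={"start": "devices/edge-router-01"},
))
```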
The system then creates detailed test plans by reasoning through extensive test plan documentation and best practices available online, customized to the specific intent extracted from the change request. This addresses a critical customer pain point where network teams often miss important test cases during manual planning processes.
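How the prompt for that step is assembled was not shown; a minimal, assumed version might combine the extracted intent, the impacted devices, and retrieved best-practice excerpts into a single planning prompt:

```python
# Illustrative prompt assembly for test-plan generation. The wording and the
# downstream LLM call are assumptions, not Cisco's actual prompt.
def build_test_plan_prompt(intent: str, impacted: list[str], best_practices: str) -> str:
    return (
        "You are a network change-validation planner.\n"
        f"Change intent: {intent}\n"
        f"Potentially impacted devices: {', '.join(impacted)}\n"
        f"Relevant best practices:\n{best_practices}\n"
        "Produce a numbered test plan covering reachability, configuration "
        "drift, and control-plane convergence, with pass/fail criteria."
    )
```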
For test execution, the system integrates with version control systems like GitHub to pull configuration changes, combines this with real-time network snapshots from the knowledge graph, and executes comprehensive validation testing using network engineering tools such as Batfish and Routenet. All test results are automatically compiled into detailed reports with pass/fail status and remediation recommendations for failed tests.
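The validation step can be illustrated with pybatfish, one of the tools named above; the host, snapshot path, and the specific question run below are assumptions rather than details from the talk.

```python
# Illustrative validation with pybatfish: load the candidate configuration
# snapshot (pulled from version control) and run a built-in sanity question.
from pybatfish.client.session import Session

bf = Session(host="localhost")
bf.set_network("change_validation")
bf.init_snapshot("snapshots/candidate_configs", name="candidate", overwrite=True)

undefined = bf.q.undefinedReferences().answer().frame()
report = {
    "test": "undefined_references",
    "status": "pass" if undefined.empty else "fail",
    "details": undefined.to_dict(orient="records"),
}
print(report["status"])
```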
## Open Standards and Ecosystem Approach
Cisco's implementation emphasizes open standards and ecosystem interoperability through their participation in an open source collective that includes partners like LangChain and Galileo. This collective, AGNTCY (agntcy.org), aims to create standardized frameworks for agent communication and composition, reducing the integration overhead typically required when connecting different agent systems.
The framework addresses several critical aspects of agent system operationalization: identity management for agents, schema frameworks for defining agent skills and capabilities, directory services for agent discovery and registration, semantic and syntactic composition methods, and comprehensive observability for agent interactions and performance monitoring.
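The collective's actual schemas were not shown in the talk, but a directory record for agent discovery under such a framework might carry identity, declared skills, and supported protocols along these lines (all field names are illustrative):

```python
# Illustrative agent directory record: identity, declared skills, and
# supported protocols. Not the collective's actual schema.
agent_record = {
    "id": "urn:agent:outshift:test-plan-generator",
    "version": "0.3.1",
    "skills": [
        {"name": "generate_test_plan",
         "input_schema": {"intent": "string", "impacted_devices": "array<string>"},
         "output_schema": {"test_plan": "array<string>"}},
    ],
    "protocols": ["MCP", "A2A"],
    "endpoint": "https://agents.example.internal/test-plan-generator",
}
```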
This approach reflects a clear understanding of enterprise LLMOps challenges, particularly the need for systems that can integrate with existing toolchains without requiring extensive custom development work. The framework supports multiple communication protocols including MCP (Model Context Protocol) and A2A (Agent-to-Agent), ensuring broad compatibility with emerging industry standards.
## Evaluation Methodology and Metrics
The evaluation approach demonstrates mature thinking about LLMOps assessment, focusing on extrinsic rather than intrinsic metrics to ensure measurements align with customer value delivery. The team evaluates multiple system components including individual agent performance, knowledge graph accuracy and completeness, digital twin fidelity, and overall system effectiveness.
For knowledge graph evaluation, the focus on extrinsic metrics means measuring how well the graph serves the actual use case rather than abstract graph quality measures. This includes assessing query response accuracy, completeness of network representation, and the graph's ability to support agent reasoning processes effectively.
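In code, an extrinsic check of this kind scores the graph (and the query agent in front of it) by whether its answers match ground truth collected from the live network, rather than by structural graph metrics; the helper functions below are assumptions used only to show the shape of such a harness.

```python
# Illustrative extrinsic evaluation harness for the knowledge graph + query agent.
def evaluate(cases, answer_from_graph, ground_truth_from_network) -> float:
    """Return the fraction of benchmark questions answered correctly."""
    correct = 0
    for question in cases:
        predicted = answer_from_graph(question)        # agent + knowledge graph
        actual = ground_truth_from_network(question)   # polled from live devices
        correct += int(predicted == actual)
    return correct / len(cases)
```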
Agent evaluation encompasses both individual agent performance and multi-agent coordination effectiveness. This includes measuring reasoning loop efficiency, task completion rates, and the quality of inter-agent communication and coordination.
The digital twin evaluation focuses on how accurately the system can predict real-world network behavior and identify potential issues before they impact production systems. This represents a critical validation point for the entire system's value proposition.
## Current Status and Future Development
The system currently operates as an MVP (Minimum Viable Product) in customer environments, representing a successful transition from Cisco's incubation process to production deployment. The team continues learning from real-world usage and iterating on system capabilities based on customer feedback and performance data.
Key insights from the current deployment include the critical importance of the knowledge graph foundation and the open agent framework for building scalable customer solutions. The modular architecture allows for incremental improvements and capability additions without requiring system-wide changes.
The presenter noted that while this represents a functional production system, it remains an early-stage implementation with significant opportunities for enhancement. The team continues exploring additional use cases and expanding the system's capabilities based on customer requirements and emerging technologies.
## Technical Challenges and Lessons Learned
Several significant technical challenges emerged during development that offer valuable insights for other LLMOps implementations. The heterogeneous nature of network data sources required sophisticated ETL processes and careful schema design to ensure agent effectiveness. The decision to standardize on OpenConfig schema proved crucial for enabling LLM understanding of network concepts and relationships.
The fine-tuning versus RAG decision for knowledge graph querying highlights the importance of empirical testing rather than assuming that popular approaches will work for all use cases. The team's willingness to pivot from RAG to fine-tuning based on performance results demonstrates good LLMOps engineering practices.
The multi-agent coordination challenges required careful design of communication protocols and task distribution mechanisms. The ReAct reasoning loop implementation for specialized agents while using a different approach for the orchestrator agent shows sophisticated understanding of when different architectural patterns are appropriate.
Integration with existing enterprise systems like ITSM tools and version control systems required significant engineering effort but proved essential for customer adoption. The bidirectional communication between ServiceNow and the AI system demonstrates the complexity of real-world enterprise integrations.
This case study represents a comprehensive example of enterprise LLMOps implementation, showcasing sophisticated technical architecture, practical performance optimization strategies, and thoughtful approaches to evaluation and ecosystem integration in a production environment.