Company
Microsoft
Title
Scaling AI Infrastructure: Network Architecture and Communication Optimization at Microsoft
Industry
Tech
Year
2023
Summary (short)
Microsoft's AI infrastructure team tackled the challenges of scaling large language models across massive GPU clusters by optimizing network topology, routing, and communication libraries. Their approaches include rail-optimized cluster designs, communication libraries such as TACCL and MSCCL, and validation frameworks such as SuperBench and IB Pulse, enabling reliable training across hundreds of thousands of GPUs while achieving top rankings in supercomputing and MLPerf benchmarks.
## Overview

This presentation, delivered by a Microsoft engineer (Judin), provides a deep technical dive into the infrastructure challenges of building and operating large-scale AI training clusters at Microsoft Azure. While the talk centers on Azure, the content is relevant to any organization operating LLMs at scale, as it addresses the fundamental infrastructure layer that enables large model training. The talk focuses on the "AI platform side" rather than model optimization or parameter tuning, specifically covering networking infrastructure and communication libraries. This is a critical but often overlooked aspect of LLMOps: the physical and logical infrastructure that lets models train across thousands of GPUs reliably and efficiently.

## The Scale Challenge

The presentation opens with a reflection on how dramatically scale requirements have changed. What was once considered a large cluster (1-2K GPUs) is now routine, with CEOs publicly announcing clusters of "multiple hundreds of thousands of GPUs." This exponential growth in scale brings correspondingly complex challenges in network design, validation, and reliability. Microsoft's AI clusters have achieved notable benchmarks, including ranking #1 in the cloud (and #3 overall) on the TOP500 supercomputer list, as well as #1 in the cloud (and #2 overall) in MLPerf benchmarks. These results came on Azure NDv4 and NDv5 SKUs (based on NVIDIA A100 and H100 GPUs respectively). The driving forces behind this scaling push include evolving markets, increasingly diverse and large datasets, and more complex models. There is also competitive pressure: the "race is on" to complete training first while maintaining accuracy and performance targets.

## Network Topology Design

A significant portion of the presentation addresses network topology decisions, which have profound implications for workload performance and cost. The speaker discusses several topology approaches:

- **Rail-Optimized Non-Oversubscribed Clusters**: For public cloud clusters supporting diverse workloads, Microsoft focuses on rail-optimized designs without oversubscription, where any node can communicate with any other node without performance degradation. The presentation shows bandwidth distribution data demonstrating a "single band" of performance across all node pairs, enabling support for various communication patterns (all-reduce, reduce-scatter, all-to-all) without topology-specific limitations.
- **Multi-Pod Architecture**: Within a pod there is zero oversubscription and full-speed communication between any nodes, while some oversubscription is acceptable between pods. This provides flexibility in cluster design while managing costs.
- **Rail-Only Topology**: A cost-optimization approach where communication is limited to within a single rail. This works well for many collective operations and provides "a lot of cost savings" by eliminating the top layer of switches. However, the speaker notes this may not work for all workloads.
- **Multiplane Designs**: NICs are split into multiple planes, potentially combined with multi-rail approaches.

The key consideration the speaker emphasizes is "fungibility": how well the network design serves the actual workloads it will run.

## Routing Challenges and Solutions

The presentation reveals some surprising routing issues discovered during cluster bring-up that are not obvious from small-scale testing:

- **On-off vs. Continuous Flow Differences**: Between two hardware generations, the newer generation took additional time to reach peak bandwidth with certain traffic patterns.
- **Multiple Flow Startup Delays**: The first flow in a multi-flow scenario took significantly longer to reach peak bandwidth, a problem that only manifested at scale.
- **Path Overprovisioning Issues**: Counterintuitively, when more paths were available than communicating pairs, per-pair performance actually dropped below the theoretical peak. Reducing the number of links improved individual pair performance.

These issues were resolved through collaboration with networking partners, but they highlight the importance of scale-specific validation. For routing strategies, Microsoft is exploring several approaches: "Simple NIC + Smart Switch," "Smart NIC + Simple Switch," and more recently "Smart NIC + Simple Switch + Smart CCL." The key insight is that the communication library (CCL) can actively participate in routing decisions, either with the NIC selecting optimal routing schemes or with hints flowing from the CCL to the network. A "snooping agent" can provide feedback to a control agent that translates decisions to the communication library.

## Cluster Validation Frameworks

Large clusters take weeks or months to build and cannot remain idle during construction (new sections are constantly being added while training runs on completed sections), so efficient validation is critical. Microsoft has developed two main tools:

- **SuperBench**: A node-level monitoring and classification framework. It observes node behavior, models expected performance, generates targeted benchmarks, and classifies nodes as "good" or "bad" for remediation. The framework is customizable and extensible to different node types and architectures.
- **IB Pulse**: An in-house benchmarking framework for network validation that can target specific areas of the network or topology levels. It outputs lists of good/bad nodes, links, or NICs based on expected vs. actual bandwidth thresholds.

The goal is to reduce the time spent on validation during cluster bring-up while maintaining quality.
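The core check behind both tools is the same: compare what a node, link, or NIC actually delivers against what the topology model says it should deliver, and emit good/bad lists for remediation. The minimal sketch below illustrates that idea; the class, function names, thresholds, and data layout are illustrative assumptions, not the actual SuperBench or IB Pulse interfaces.

```python
from dataclasses import dataclass


@dataclass
class LinkMeasurement:
    # Hypothetical record of one benchmarked link (names are illustrative).
    src: str              # e.g. "node-0017:mlx5_2"
    dst: str              # e.g. "leaf-switch-03"
    measured_gbps: float  # bandwidth observed by the benchmark
    expected_gbps: float  # modeled peak for this link / topology level


def classify_links(measurements, tolerance=0.10):
    """Split links into 'good' and 'bad' based on expected vs. actual bandwidth.

    A link is flagged bad when it delivers less than (1 - tolerance) of the
    bandwidth the topology model says it should sustain.
    """
    good, bad = [], []
    for m in measurements:
        if m.measured_gbps >= m.expected_gbps * (1.0 - tolerance):
            good.append(m)
        else:
            bad.append(m)
    return good, bad


if __name__ == "__main__":
    runs = [
        LinkMeasurement("node-0017:mlx5_2", "leaf-03", 391.0, 400.0),  # healthy
        LinkMeasurement("node-0042:mlx5_0", "leaf-07", 212.5, 400.0),  # degraded
    ]
    good, bad = classify_links(runs)
    for m in bad:
        print(f"REMEDIATE {m.src} -> {m.dst}: {m.measured_gbps}/{m.expected_gbps} Gbps")
```

In practice the expected values would come from a per-topology-level performance model, and the tolerance would be tuned so that validation catches real faults without flagging normal run-to-run variance.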
## Communication Library Optimization

The speaker notes that popular communication libraries are typically tuned for specific hardware and topology combinations, leaving room for improvement in heterogeneous or novel environments. Key optimization efforts include:

- **TACCL (Topology Aware Collective Communication Library)**: Uses hierarchical algorithms targeting mixture-of-experts (MoE) models. Instead of a flat all-to-all, it performs 2D hierarchical data movement: data is shuffled over NVLink first, then gathered and sent over the network in larger chunks. This improves network utilization by coalescing data rather than "spraying smaller packets across the network."
- **MSCCL (Microsoft Collective Communication Library)**: Provides tuned configurations for specific clusters. Users can specify their Azure SKU, scale, and message size to receive optimal tuning parameters. It also allows custom HTSL-based collective algorithms.
- **QP (Queue Pair) Tuning and Protocol/Algorithm Tuning**: Fine-grained optimization for the message sizes that matter most for training workloads, demonstrably lowering latency and increasing bandwidth.
- **Non-RTT-Based Algorithms**: For multi-region communication (to maximize available power across regions), round-trip-time-based algorithms don't work effectively. Microsoft is investing in efficient transfer mechanisms that don't require acknowledgments.

The communication library work connects back to topology: rail-only architectures become more viable when algorithms like TACCL can keep communication within the rail while maintaining performance.
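To make the coalescing argument concrete, the sketch below compares the cross-node message profile of a flat all-to-all with a 2D hierarchical one in which chunks are first exchanged inside the node (over NVLink) so that each GPU then sends one aggregated message per remote node. This is back-of-the-envelope arithmetic under that assumed scheme, not TACCL's actual algorithm; all names are illustrative.

```python
def alltoall_message_profile(num_nodes, gpus_per_node, chunk_mb):
    """Compare cross-node message count/size for flat vs. 2D hierarchical all-to-all.

    Assumes every GPU has one chunk_mb-sized chunk destined for every other GPU
    (the typical MoE dispatch pattern). Illustrative model only.
    """
    total_gpus = num_nodes * gpus_per_node

    # Flat all-to-all: every GPU sends a separate small message to every GPU
    # on every other node.
    flat_msgs = total_gpus * (total_gpus - gpus_per_node)
    flat_msg_mb = chunk_mb

    # 2D hierarchical: chunks are first shuffled inside the node over NVLink so
    # that each GPU holds everything destined for its same-local-rank peers,
    # then one aggregated message goes to each remote node.
    hier_msgs = total_gpus * (num_nodes - 1)
    hier_msg_mb = chunk_mb * gpus_per_node

    return (flat_msgs, flat_msg_mb), (hier_msgs, hier_msg_mb)


if __name__ == "__main__":
    (f_msgs, f_size), (h_msgs, h_size) = alltoall_message_profile(
        num_nodes=128, gpus_per_node=8, chunk_mb=1.0
    )
    print(f"flat:         {f_msgs:>9} cross-node messages of {f_size:.1f} MB")
    print(f"hierarchical: {h_msgs:>9} cross-node messages of {h_size:.1f} MB")
```

The total volume crossing the network is unchanged, but with 8-GPU nodes the NICs handle 8x fewer, 8x larger messages, which is the "coalescing rather than spraying" effect described above. In this assumed scheme the inter-node phase runs between same-local-rank GPUs, which is the property that makes the rail-only topologies mentioned earlier more workable.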
## Reliability and Fault Tolerance

Training jobs run for extended periods (potentially months), making reliability crucial. The presentation examines the factors that drive training downtime:

- **Link Flaps**: The primary reliability concern, split into server-to-switch and switch-to-switch categories. For server-to-switch links, when the flap duration falls below a threshold, proper tuning of the NIC library allows jobs to sustain link-down events with only performance degradation rather than failure. Smart NIC-based approaches can maintain sustained performance even during link flaps.
- **Proportional Degradation**: When links are taken down one by one, the target is a proportional performance drop while avoiding job failure (a minimal model of this target appears in the sketch at the end of this write-up). Key metrics include CCL reaction time to link events and time to return to peak bandwidth.
- **Smart Switch vs. Smart NIC**: Smart switch designs handle these scenarios well, but feedback-based smart NIC or CCL-driven designs can handle failures even more gracefully.
- **Targeted Link Testing**: Server-to-switch links are identified as "the most impactful," but testing them efficiently without wasting cluster capacity is challenging. Microsoft is working with networking partners on intelligent tools that can validate links even before they are integrated into the full fabric.

## Key Insights and Forward-Looking Directions

The presentation concludes with observations about emerging trends:

- Shifts in network topologies are accelerating, with combinations of multiplane and multi-rail approaches becoming common
- Fine-grained routing is becoming increasingly important
- Communication libraries are playing a more central role in routing decisions, not just data movement
- The boundary between network intelligence and application intelligence is blurring

## Relevance to LLMOps

While this presentation focuses on infrastructure rather than model development, it directly impacts LLMOps practitioners in several ways:

- Understanding infrastructure constraints helps in model parallelism strategy decisions
- Communication patterns in training (all-reduce, all-to-all for MoE) are directly impacted by these topology choices
- Reliability engineering for long-running training jobs depends on these fault tolerance mechanisms
- Cost optimization through topology choices affects the economics of large-scale training

The scale described here (hundreds of thousands of GPUs, months-long training runs, multi-region coordination) represents the current frontier of AI infrastructure, relevant to any organization training or fine-tuning large language models at scale.
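Returning to the proportional-degradation target referenced in the reliability section: with k of n parallel links down, a well-behaved NIC/CCL stack should still deliver roughly (n - k)/n of peak bandwidth rather than failing the job. The short sketch below turns that expectation into a check that could run after induced link-down events; the names, numbers, and slack value are illustrative assumptions, not Microsoft tooling.

```python
def expected_bandwidth_gbps(peak_gbps, total_links, links_down):
    """Proportional-degradation target: losing k of n links should cost ~k/n of peak."""
    if links_down >= total_links:
        return 0.0
    return peak_gbps * (total_links - links_down) / total_links


def degradation_is_proportional(measured_gbps, peak_gbps, total_links, links_down,
                                slack=0.05):
    """True if measured bandwidth stays within `slack` of the proportional target."""
    target = expected_bandwidth_gbps(peak_gbps, total_links, links_down)
    return measured_gbps >= target * (1.0 - slack)


if __name__ == "__main__":
    # Example: an 8-NIC node with 400 Gbps per NIC, with one or two links flapped down.
    peak, n_links = 8 * 400.0, 8
    for k, measured in [(1, 2760.0), (2, 1900.0)]:
        ok = degradation_is_proportional(measured, peak, n_links, k)
        print(f"{k} link(s) down: measured {measured} Gbps -> "
              f"{'proportional' if ok else 'worse than proportional'}")
```

The two timing metrics called out in the talk, CCL reaction time to the link event and time back to peak bandwidth, would sit alongside this bandwidth check in a real harness.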
