## Overview
This case study comes from a presentation by two Meta network engineers (Jnana and Abhishek Gopalan) discussing how AI workloads have transformed the requirements for Meta's global backbone network infrastructure. The presentation provides valuable insights into the often-overlooked infrastructure challenges that arise when running AI and LLM workloads at hyperscale. While not directly about LLMOps in the narrow sense of model development and deployment pipelines, it offers crucial context about the underlying infrastructure that makes production AI systems possible at companies like Meta.
Meta operates one of the largest backbone networks in the world, connecting over 25 data centers and 85 points of presence through millions of miles of fiber in both terrestrial and subsea routes. This infrastructure enables their social products (Facebook, WhatsApp, Instagram) to deliver real-time experiences globally. The backbone capacity has grown over 30% year-over-year for the last five years, yet even this growth rate has been challenged by AI demands.
## The AI Transition and Its Impact
The presentation marks 2022 as a critical inflection point. Prior to this, GPU demand from product groups was limited and came in requests for small cluster sizes. Most AI-related traffic stayed within a single data center or at most within a data center region. The infrastructure team had focused primarily on one stage of the AI lifecycle: reading from storage to feed into GPUs.
However, starting in 2022, the landscape changed dramatically. GPU demand from product groups grew over 100% year-over-year, and requests came for much larger cluster sizes. More significantly, the team observed a "higher than anticipated uptick in growth in traffic on the backbone" that they were not prepared for. They had missed several critical elements of the AI lifecycle that would generate significant cross-region traffic.
The presentation includes a chart showing two curves: AI-driven traffic and non-AI traffic on the backbone over several years. These curves started at parity but diverged significantly, with AI-driven traffic growing not only in volume but also in rate of growth. Critically, AI-driven traffic patterns are described as "quite volatile and hard to predict because of how dynamic the landscape is and which models kind of drive different needs."
## The AI Data Lifecycle Through a Network Lens
A key contribution of this presentation is articulating the complete AI data lifecycle as it impacts network infrastructure:
**Data Ingestion**: New data is constantly generated by users and machines. Given the global nature of data collection, data consistency becomes critical. This is the initial stage where data arrives in Meta's infrastructure.
**Data Preparation and Placement**: This stage involves deciding which data to place where, ensuring data protection for reliability, and managing data freshness. The team must determine how many copies of data are needed, including primary copies, secondary copies, warm copies, and cold storage copies.
**Data Replication**: This is where the backbone impact becomes most significant. While data replication existed before AI, AI workloads compound the need substantially. Data must move from the regions where it is collected (green regions in their diagram) to the regions where training jobs run (orange regions), and then to the regions from which AI traffic is served (purple regions); these flows are sketched just after the lifecycle stages below.
**Data Flow to Compute Clusters**: Only after managing the earlier stages does data actually flow to the GPU clusters where AI training occurs. Even during training, the backbone is used for operations like checkpointing to ensure reliability if clusters or regions go down.
**Model Serving/Inference**: After models are trained, they must be served to users. The presenters note this is "somewhat modest on the backbone" today but is "a fast growing space" they're watching closely.
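To make the backbone impact of these stages more concrete, here is a minimal sketch that enumerates the cross-region transfers implied by a single training run: dataset replication from collection regions into a training region, then checkpoint copies out of it for resilience. The region names, sizes, and the `plan_transfers` helper are invented for illustration; the presentation does not describe any such tooling.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical region roles matching the lifecycle stages above: data is collected
# in some regions, trained on in others, and served from a third set. The names
# and sizes are illustrative only.
COLLECTION_REGIONS = ["rgn-a", "rgn-b"]   # where user/machine data lands
TRAINING_REGIONS = ["rgn-c"]              # where GPU clusters run training jobs
SERVING_REGIONS = ["rgn-d", "rgn-e"]      # where inference traffic is served

@dataclass
class Transfer:
    src: str
    dst: str
    purpose: str      # "replication" or "checkpoint"
    size_tb: float

def plan_transfers(dataset_size_tb: float, checkpoint_size_tb: float) -> list[Transfer]:
    """Enumerate the cross-region backbone flows implied by one training run."""
    transfers = []
    # Replication stage: move collected data to every training region.
    for src, dst in product(COLLECTION_REGIONS, TRAINING_REGIONS):
        transfers.append(Transfer(src, dst, "replication", dataset_size_tb))
    # Checkpointing stage: copy checkpoints out of the training region so a
    # cluster or region failure does not lose training progress.
    for src, dst in product(TRAINING_REGIONS, SERVING_REGIONS):
        transfers.append(Transfer(src, dst, "checkpoint", checkpoint_size_tb))
    return transfers

if __name__ == "__main__":
    for t in plan_transfers(dataset_size_tb=500.0, checkpoint_size_tb=2.0):
        print(f"{t.src} -> {t.dst}: {t.purpose}, {t.size_tb} TB over the backbone")
```

Even this toy enumeration shows the fan-out: every dataset is copied into each training region, and checkpoint traffic recurs for the lifetime of the job.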
## Key Challenges for AI on the Backbone
The presentation identifies several unique challenges that AI brings to backbone infrastructure:
**Data Freshness Requirements**: AI workloads often require very fresh data, which drives cross-region reads. The presenters showed that networking costs for shuttling fresh data across regions were "starting to exceed compute costs" even before AI projections were factored in. Without intentional management, this would lead to "exponential growth" in backbone demand.
**Over-Replication at Planetary Scale**: AI requires data to be replicated many times across regions, increasing complexity due to high requirements for data freshness, the number of times data must be moved, and the sheer volume (exabytes of data).
**Data Placement Complexity**: Co-locating data and AI training optimally is described as "a very complex problem." Demand signals from product groups are volatile as AI needs constantly evolve. Supply constraints come from construction delays, geopolitical and regulatory environments, and market availability. The optimization problem involves dozens of product groups with different demand flavors, multiple dozens of physical sites with multiple buildings each, and a dozen hardware SKUs, leading to "millions of variables and millions of constraints"; a toy sketch of the placement problem appears below.
**Hardware Heterogeneity Issues**: AI workloads are less fungible and don't work as well with hardware heterogeneity. For example, A100 and H100 servers have different network, hardware, and product preferences. This makes migration more complex when upstream demand and supply signals change.
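As a rough illustration of why placement is described as so complex, the sketch below greedily assigns product-group GPU demand to regions under per-region capacity and SKU-compatibility constraints. Every name and number is invented, and a greedy loop stands in for what would at Meta's scale be a formal optimization with millions of variables and constraints.

```python
# Toy placement problem: assign product-group GPU demand to regions subject to
# per-region capacity and hardware-SKU compatibility. All figures are hypothetical.

# Remaining capacity per (region, SKU), in GPUs.
capacity = {
    ("rgn-a", "A100"): 8_000, ("rgn-a", "H100"): 2_000,
    ("rgn-b", "A100"): 1_000, ("rgn-b", "H100"): 6_000,
}

# Demands: (product_group, gpus_needed, compatible_skus). Heterogeneity means
# not every workload can run on every SKU.
demands = [
    ("ranking", 4_000, {"A100", "H100"}),
    ("genai", 5_000, {"H100"}),        # e.g. only runs well on the newer SKU
    ("integrity", 2_000, {"A100"}),
]

def place(demands, capacity):
    placements, unmet = [], []
    for group, need, skus in sorted(demands, key=lambda d: -d[1]):   # biggest first
        # Prefer the compatible region/SKU pair with the most free capacity.
        for (region, sku), free in sorted(capacity.items(), key=lambda kv: -kv[1]):
            if sku in skus and free >= need:
                capacity[(region, sku)] -= need
                placements.append((group, region, sku, need))
                break
        else:
            unmet.append((group, need))
    return placements, unmet

placed, unmet = place(demands, capacity)
print("placed:", placed)
print("unmet :", unmet)   # unmet demand is what drives new space, power, and fiber
```

In this toy form the problem is trivial; the point is only to show how demand volatility, supply constraints, and SKU preferences interact, and why unmet demand turns into the supply-side work described in the next section.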
## Solutions Implemented
The Meta team describes their approach in terms of "bending the demand curve" and "expanding the supply curve."
**Cross-Stack Optimization**: A key learning was that addressing AI infrastructure challenges "really takes a holistic view of working across the stack"—not just backbone or even network, but working with compute, data, and storage teams. For the fresh data challenge specifically, they worked with storage teams to implement more caching and better data placement strategies, reducing cross-region reads to just two-thirds of what they would otherwise be.
**Better Observability**: The team is building "better instrumentation and observability of how data flows" to understand which data sets are actually useful. Sometimes data is shuttled across regions but doesn't get used or waits unnecessarily.
**Differentiated Quality of Service**: They operate "a differentiated class of service network on the backbone with different QoS guarantees for different workloads." Not all AI workloads need the same latency guarantees, and recognizing this allows for more efficient backbone utilization.
**Temporal Optimization**: They tap into "temporal opportunities on the backbone"—the ebbs and flows of network usage—to schedule workloads that can harness available capacity.
**Supply Expansion**: Complementing demand management, they're expanding supply by "buying space and power, procuring fiber, and ensuring network infrastructure is ready for demand." They design the backbone "to allow for more flexible demand patterns as well as allow for more workload optionality" to serve potential spikes or changes.
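A minimal sketch of how the differentiated QoS and temporal ideas above might combine: each transfer is tagged with a service class, and only the deferrable classes wait for an assumed off-peak window. The class names, priorities, and window boundaries below are assumptions for illustration, not Meta's actual configuration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical service classes. The presentation only says that different AI
# workloads get different QoS guarantees, not what the classes actually are.
QOS_CLASSES = {
    "inference":        {"priority": 0, "deferrable": False},  # latency-sensitive
    "fresh-data-read":  {"priority": 1, "deferrable": False},
    "checkpoint":       {"priority": 2, "deferrable": True},
    "bulk-replication": {"priority": 3, "deferrable": True},   # can ride the ebbs
}

# Assumed off-peak window on a backbone segment, in UTC hours (illustrative).
OFF_PEAK_START_UTC = 2
OFF_PEAK_END_UTC = 6

def schedule(workload: str, now: datetime) -> datetime:
    """Return the earliest start time for a transfer of the given class.

    Deferrable classes wait for the off-peak window; everything else starts now.
    """
    if not QOS_CLASSES[workload]["deferrable"]:
        return now
    if OFF_PEAK_START_UTC <= now.hour < OFF_PEAK_END_UTC:
        return now
    start = now.replace(hour=OFF_PEAK_START_UTC, minute=0, second=0, microsecond=0)
    if now.hour >= OFF_PEAK_END_UTC:
        start += timedelta(days=1)   # today's window already passed
    return start

now = datetime(2024, 5, 1, 15, 30, tzinfo=timezone.utc)
for workload in QOS_CLASSES:
    print(f"{workload:17s} -> start at {schedule(workload, now).isoformat()}")
```

The design point is simply that latency-sensitive classes are never delayed, while bulk replication and checkpoint traffic soak up otherwise idle backbone capacity.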
## Important Caveats and Observations
It's worth noting that this presentation is focused on infrastructure rather than the LLMOps practices of model development, evaluation, and deployment. However, it provides essential context for understanding the hidden complexity behind running AI at scale. Many LLMOps discussions focus on the model development pipeline while overlooking the massive infrastructure requirements.
The presenters acknowledge that the "impact of large clusters and GenAI is yet to be learned" and that they "haven't yet fully fleshed out what that means on the backbone." This suggests that even Meta, with its extensive infrastructure, is still learning how to best support the largest generative AI workloads.
The presentation is fundamentally honest about Meta's initial miscalculation—they expected AI traffic to stay local but were "caught off-guard" by cross-region requirements. This kind of candid admission is valuable for other organizations planning their AI infrastructure.
The focus on inference/serving as a "fast growing space" that is currently "modest on the backbone" suggests that as LLM inference becomes more widespread, backbone requirements will continue to evolve. This has implications for any organization planning to serve LLMs at scale.
## Implications for LLMOps Practitioners
While this case study is from a hyperscaler perspective, it offers lessons applicable to LLMOps at various scales:
- The complete AI/LLM lifecycle involves far more than just training and inference; data movement and placement are critical considerations that impact costs and performance
- Data freshness requirements for AI models can drive unexpected infrastructure costs
- Hardware heterogeneity creates real challenges for AI workloads, requiring careful capacity planning
- Cross-functional collaboration (network, storage, compute teams) is essential for optimizing AI infrastructure
- Observability and instrumentation of data flows are necessary to identify optimization opportunities
- Different AI workloads have different service requirements, and recognizing this enables more efficient resource utilization