Company
Meta
Title
Meta's Hardware Reliability Framework for AI Training and Inference at Scale
Industry
Tech
Year
2025
Summary (short)
Meta addresses the critical challenge of hardware reliability in large-scale AI infrastructure, where hardware faults significantly impact training and inference workloads. The company developed comprehensive detection mechanisms including Fleetscanner, Ripple, and Hardware Sentinel to identify silent data corruptions (SDCs) that can cause training divergence and inference errors without obvious symptoms. Their multi-layered approach combines infrastructure strategies like reductive triage and hyper-checkpointing with stack-level solutions such as gradient clipping and algorithmic fault tolerance, achieving industry-leading reliability for AI operations across thousands of accelerators and globally distributed data centers.
Meta operates one of the world's largest AI infrastructures, supporting the training of large-scale models like Llama 3 and advanced AI applications including text-to-image generation and object segmentation. This case study examines how Meta has tackled the critical challenge of hardware reliability in production AI systems, where even minor hardware faults can have cascading effects on training efficiency and inference quality. Meta's AI infrastructure consists of thousands of hardware components and servers connected via network fabric across globally distributed data centers. The setup integrates storage, compute, and network architectures with unique file systems and PyTorch applications tailored for training and inference workloads. The scale of operations is immense: training large-scale models involves thousands of accelerators operating in a synchronous environment, where any component failure can interrupt or completely halt the training process.

From its experience running the Llama 3 family of models, Meta found that hardware failures in components such as SRAMs, HBMs, processing grids, and network switch hardware significantly impact AI cluster reliability, accounting for over 66% of training interruptions. This finding underscores a critical challenge in LLMOps: accelerators tend to be less reliable than traditional CPUs because of their complexity and limited telemetry, network complexity can lead to misattributed failures, and errors within the GPU software stack may require extensive configuration corrections.

Meta has identified three distinct categories of hardware faults that affect its AI production systems. Static errors are binary device states: the hardware either powers on or it does not. They are the most straightforward to identify in large-scale fleets through simple health checks, though they become more frequent as configurations and device counts grow in large training clusters. While easier to triage and repair, their increased frequency at scale makes them a significant operational concern.

Transient errors present a more complex challenge: they are difficult to reproduce and include load-dependent or partially observable faults, such as device issues caused by thermal runaway or random crashes from uncorrectable errors. Meta's mitigation approach is to understand the specific conditions under which these errors manifest, using the fleet's scale to aid triage and pattern matching. The team sets traps for those conditions and marks devices for mitigation or repair when a trap triggers (a simple trap is sketched below). Advances in Reliability, Availability, and Serviceability (RAS) telemetry in hyperscale infrastructure have greatly improved this detection process.

The most insidious category is silent errors, or silent data corruptions (SDCs), which occur when hardware miscomputes without leaving detectable traces, so applications consume incorrect results. These errors, often caused by silicon defects, can go unnoticed for extended periods unless significant deviations are observed. The challenge is particularly acute for AI systems that rely heavily on accurate data for both training and inference, and detecting SDCs requires extensive engineering effort and costly telemetry systems to trace data corruption back to specific devices.
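As a rough illustration of the trap-based handling of transient errors described above, the following sketch flags a device when hypothetical telemetry thresholds are crossed. The names (TelemetrySample, evaluate_trap) and the thresholds are assumptions made for the example, not Meta's internal tooling.

```python
# Hypothetical sketch of a transient-error "trap": when telemetry shows the conditions
# under which transient faults tend to manifest, the device is marked for repair.
from dataclasses import dataclass

@dataclass
class TelemetrySample:
    host: str
    die_temp_c: float           # accelerator die temperature
    uncorrectable_errors: int   # RAS counter since the last sample

# Assumed trap thresholds; real values would come from fleet experience.
TEMP_LIMIT_C = 95.0
UE_LIMIT = 1

def evaluate_trap(sample: TelemetrySample, repair_queue: list) -> None:
    """Append the host to the repair queue when a trap condition triggers."""
    if sample.die_temp_c >= TEMP_LIMIT_C or sample.uncorrectable_errors >= UE_LIMIT:
        repair_queue.append(sample.host)

# Example usage
queue = []
evaluate_trap(TelemetrySample("host-0421", 97.3, 0), queue)
print(queue)  # ['host-0421']
```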
Meta's research has revealed that with increased silicon density in modern accelerators, silent data corruptions now occur at approximately one fault per thousand devices, significantly higher than the historical rate of cosmic-ray-induced soft errors (one fault per million devices). This dramatic increase in SDC frequency has made hardware reliability a first-order concern for AI operations at scale. To combat these challenges, Meta has developed three detection mechanisms that work in concert to provide comprehensive fleet coverage.

Fleetscanner captures performance outliers at scale using targeted micro-benchmarks for identifying hardware defects, and the benchmarks' signatures are integrated into telemetry systems for non-benchmark-based detection. Directed tests run during maintenance operations such as firmware upgrades and hardware repairs, and are scheduled periodically so that the entire fleet is covered every 45 to 60 days. While this provides dedicated testing capabilities, it may be too slow for detecting some rapidly manifesting SDCs.

Ripple takes a more agile approach, co-locating with production workloads and executing tests in milliseconds to seconds, allowing fleet-wide coverage in days rather than weeks. It overlaps test instructions across cores and threads, providing significantly faster detection than Fleetscanner and enabling near real-time monitoring of hardware health without significantly impacting production workloads.

Hardware Sentinel introduces a novel, test-and-architecture-agnostic approach that evaluates application exceptions in kernel space. It identifies core-based anomalies as potential silent data corruption without requiring dedicated test allocations, operating solely in the analytical plane. This system has outperformed testing-based methods by 41% across different architectures, applications, and data centers. The combination of these three mechanisms provides what Meta considers one of the best in-fleet coverage solutions at scale for detecting and protecting infrastructure against SDCs.

The impact of SDCs on AI training workloads presents unique challenges that go beyond traditional computing environments. In training scenarios, SDCs lead to incorrect computations affecting both forward and backward passes, resulting in divergence from the intended training path and significantly impacting training efficacy. While AI training workloads are sometimes considered self-resilient to SDCs, Meta's experience shows this is true only for a limited subset of SDC manifestations. In most realistic scenarios, self-resilience is inadequate because SDCs persist across iterations, and the quantization of data values in AI training, which increases information density per bit, actually exacerbates the impact of SDCs.

Meta has identified two primary manifestations of training divergence due to SDCs. The first, Not-a-Number (NaN) propagation, occurs when an SDC pushes a representable value into an incorrect representation, generating NaN values during training computations. Once created, NaNs propagate through subsequent computations, affecting the training iteration, the accelerator domain, the host domain, and eventually the entire cluster. This widespread NaN contagion can lead to complete cluster halts, with the source often being just a few specific computations on a single accelerator that may be extremely difficult to trace amid the cluster's massive scale (a sketch of one way to localize the first NaN follows below).
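To make the tracing problem concrete, here is a minimal, hedged sketch of one way to localize the first NaN inside a model using standard PyTorch forward hooks. It illustrates the general idea rather than Meta's tooling; install_nan_tracer and the toy model are invented for the example.

```python
# Illustrative NaN localization with PyTorch forward hooks: record the first module
# whose output contains a NaN so the offending computation can be traced.
import torch
import torch.nn as nn

def install_nan_tracer(model: nn.Module, first_offender: dict) -> None:
    """Register forward hooks that note the earliest module emitting a NaN."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                first_offender.setdefault("module", name)  # keep only the first hit
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Example usage on a toy model; on a healthy device this prints "no NaN observed".
offender: dict = {}
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
install_nan_tracer(model, offender)
_ = model(torch.randn(4, 8))
print(offender.get("module", "no NaN observed"))
```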
Corrupted gradient variance presents an even more subtle challenge. It occurs when SDCs affect gradient calculations, leading to gradient explosion, implosion, or trapping in local minima. The corruption remains within numeric bounds and is mistakenly treated as correct data, and because the corrupted values are exchanged as legitimate training data in synchronous training, the entire cluster is affected: training appears to progress without actual model improvement. Over time, the corruptions aggregate and cause major divergences in gradients. Detecting these SDCs is particularly challenging because of their subtlety and the extended time, often weeks or months, required to observe their effects.

For inference workloads, SDCs create different but equally serious challenges. Silent data corruptions in inference applications lead to incorrect results that, at Meta's scale, can affect thousands of inference consumers simultaneously. Persistent SDCs can directly impact critical systems such as recommendation engines or large language model outputs, and because they operate outside normal constraint boundaries they can bypass policies related to privacy, safety, or content integrity. The result is a significant reduction in the efficacy of models that were trained with substantial computational resources, making seemingly benign inference use cases highly problematic at scale.

Meta's mitigation strategies fall into two groups: infrastructure strategies and stack-level strategies. Infrastructure strategies focus on operational triage at the cluster level, managing and mitigating SDCs through physical and network infrastructure improvements. Reductive triage conducts a binary search with mini-training iterations on progressively smaller cluster sizes to isolate the source of NaN propagation; once a small cluster that replicates the issue is found, the offending nodes are quarantined for investigation while the reconstituted cluster resumes training from a saved checkpoint (a sketch of this search appears below). Deterministic training runs known-good models for short training iterations to verify computational correctness for specific value sets, helping identify failures that are not data-dependent. Hyper-checkpointing creates checkpoints at increasingly high frequency to speed up identification and isolation of corrupting nodes, maintaining training throughput while containing problems to specific accelerators or hosts.

Stack-level strategies require coordination with workloads and involve software-level adjustments. Gradient clipping enforces value limits within training workloads to mitigate NaN propagation: computations exceeding specified ranges are clipped, and NaNs are detected during this process. While effective for many NaN cases depending on the representation format, it may introduce partial errors in certain scenarios (a minimal example follows below). Algorithmic fault tolerance is a more robust approach that integrates fault tolerance directly into training algorithms, handling a range of data corruptions while reducing the need for detection and triage. It requires deep understanding of common defect modes and significant engineering investment across the entire stack, with some overhead to the overall training footprint but enhanced computational efficiency.
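As a rough sketch of the reductive-triage idea, the following code binary-searches a host set using a short reproduction probe. It assumes a caller-supplied run_mini_iteration function that returns True when a mini-training iteration reproduces the NaN failure on a given set of hosts; both names are hypothetical, and the real operational process is far richer.

```python
# Hypothetical reductive-triage sketch: binary-search the host set with a short
# "mini-training" probe to isolate a node that reproduces NaN propagation.
from typing import Callable, Optional, Sequence

def isolate_offender(
    hosts: Sequence[str],
    run_mini_iteration: Callable[[Sequence[str]], bool],  # True if NaNs reproduce
) -> Optional[str]:
    """Return a host that reproduces the failure, or None if it does not reproduce."""
    candidates = list(hosts)
    if not candidates or not run_mini_iteration(candidates):
        return None                      # not reproducible on this cluster slice
    while len(candidates) > 1:
        half = len(candidates) // 2
        left, right = candidates[:half], candidates[half:]
        # Assumes a single offending node, so exactly one half keeps reproducing.
        candidates = left if run_mini_iteration(left) else right
    return candidates[0]                 # quarantine this host; resume from checkpoint
```

Once the offending host is quarantined, the reconstituted cluster can resume training from the most recent checkpoint while the host goes through directed testing.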
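The gradient-clipping mitigation can be illustrated with a generic PyTorch training step (a hedged sketch, not Meta's production code): the gradient norm returned by clip_grad_norm_ doubles as a NaN/Inf trap, and a non-finite norm causes the update to be skipped so the host can be flagged for triage.

```python
# Sketch of gradient clipping with NaN trapping in a generic PyTorch training step.
import torch
from torch.nn.utils import clip_grad_norm_

def train_step(model, batch, loss_fn, optimizer, max_norm: float = 1.0) -> bool:
    """Run one step; return False (and skip the update) if gradients are non-finite."""
    optimizer.zero_grad(set_to_none=True)
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    total_norm = clip_grad_norm_(model.parameters(), max_norm)  # bounds exploding gradients
    if not torch.isfinite(total_norm):   # NaN/Inf here is a signal worth tracing to a device
        return False                     # skip the optimizer step; flag the host for triage
    optimizer.step()
    return True
```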
The tri-variate computational training architecture uses shadow nodes in synchronous training to mitigate SDCs: training steps are repeated across different nodes at random iterations, progress is verified between shadow and live nodes, and training halts if differences are detected. This approach offers robust training but demands significant algorithmic changes and increased data movement and infrastructure overhead (a simplified sketch appears at the end of this case study).

Parameter vulnerability factors identify vulnerable and resilient layers in machine learning architectures, enabling strategic mapping of vulnerable layers to resilient hardware and resilient layers to unprotected hardware. This approach requires dynamic evaluation that scales with architectural evolution and enables targeted resilient design, which is particularly valuable for inference workloads.

Divergence detection maintains distribution maps for each neuron to identify deviations from typical output distributions during inference operations. While computationally expensive, it can be applied at selected sampling rates for large-scale inference systems, preserving behavioral patterns for specific workloads to detect corruptions during execution.

Meta's journey toward industry leadership in SDC detection and mitigation began with identifying frequent fleet issues in 2016, progressed through scaling SDC detection capabilities in 2018, and reached comprehensive detection frameworks by 2019. By 2020, detection mechanisms were integrated into accelerator hardware, leading to influential research publications. The company has continued to advance the field through collaboration with industry leaders including Google, Microsoft, ARM, AMD, NVIDIA, and Intel to enhance server resilience standards.

Looking forward, Meta is applying these hard-learned lessons to its Meta Training and Inference Accelerator (MTIA) family, aiming to deliver industry-leading fleet-reliability practices. The factory-to-fleet approach emphasizes comprehensive silicon lifecycle management, from design through deployment, with innovation needed across all phases: revisiting RAS solutions for scale, implementing lifecycle debug hooks, and developing telemetry architectures that support tools like Hardware Sentinel, Fleetscanner, and Ripple.

The case study demonstrates that hardware reliability has become a first-order concern for LLMOps at scale, requiring sophisticated detection mechanisms, comprehensive mitigation strategies, and deep integration between hardware and software systems. Meta's experience shows that as cluster sizes and semiconductor complexity continue growing, fault complexity will increase exponentially, necessitating coordinated factory-to-fleet solutions and stack-level resiliency measures. For organizations operating AI systems at scale, treating reliability as a primary design consideration rather than an afterthought has proven essential for maintaining production effectiveness and avoiding costly training failures or inference degradation.
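Returning to the tri-variate shadow-node approach described above, here is a minimal, hedged sketch of the core verification idea. It assumes the shadow replica holds identical weights, the forward pass is deterministic (no dropout), and both replicas see the same batch; keeping the replica synchronized across nodes, which is the expensive part, is omitted. The function name verified_step is invented for the example.

```python
# Simplified shadow-node verification: at random iterations the shadow replica
# recomputes the forward pass and training halts if the two results diverge.
import random
import torch

def verified_step(live_model, shadow_model, batch, loss_fn, optimizer, check_prob=0.01):
    """One training step on the live model, occasionally cross-checked by the shadow."""
    inputs, targets = batch
    loss = loss_fn(live_model(inputs), targets)
    if random.random() < check_prob:
        with torch.no_grad():
            shadow_loss = loss_fn(shadow_model(inputs), targets)  # identical weights assumed
        if not torch.allclose(loss.detach(), shadow_loss):
            # In a real system this would halt the cluster and trigger triage.
            raise RuntimeError("live/shadow divergence detected: possible SDC")
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```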
