Scaling AI Image Animation System with Optimized Latency and Traffic Management

Meta 2024

Meta developed and deployed an AI-powered image animation feature that needed to serve billions of users efficiently. They tackled this challenge through a comprehensive optimization strategy including floating-point precision reduction, temporal-attention improvements, DPM-Solver implementation, and innovative distillation techniques. The system was further enhanced with sophisticated traffic management and load balancing solutions, resulting in a highly efficient, globally scalable service with minimal latency and failure rates.

Industry

Tech

Summary

Meta’s AI-generated image animation feature represents a significant production deployment challenge in the generative AI space. The feature, part of Meta AI, allows users to generate short animations from AI-generated images. The core problem was deploying this computationally intensive model to serve billions of users across Meta’s family of apps while maintaining fast generation times (a few seconds), minimizing errors, and ensuring efficient GPU utilization across a global infrastructure.

This case study is particularly valuable because it addresses real-world production concerns that go beyond model development: latency optimization, resource efficiency, global traffic routing, failure handling, and sustainable scaling. The work builds on Meta’s prior research in video diffusion (animated stickers), Imagine Flash (Emu diffusion acceleration), and block caching techniques.

Model Optimization Techniques

The first major focus was on making the animation model fast and efficient enough to deploy at scale. The team employed several sophisticated optimization techniques that collectively reduced computational requirements significantly.

Precision Reduction with bfloat16

The team converted the model from float32 to 16-bit precision, halving its memory footprint and enabling faster floating-point operations. bfloat16 was specifically chosen over float16 because it trades mantissa bits for a float32-sized exponent, preserving dynamic range and numerical stability during training and inference while still capturing the benefits of reduced precision. This is a common but effective technique in production ML systems where the marginal quality loss is acceptable given the significant performance gains.
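The conversion itself is a one-line cast in PyTorch. The sketch below uses a tiny stand-in network (the real animation U-Net is far larger) to show the halved parameter footprint:

```python
import torch
from torch import nn

# Toy stand-in for the animation model (hypothetical; illustrates the cast only).
model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))

fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
model = model.to(torch.bfloat16)  # cast all parameters to bfloat16
bf16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

# Memory footprint is exactly halved: 2 bytes per parameter instead of 4.
x = torch.randn(1, 64, dtype=torch.bfloat16)
out = model(x)  # faster on hardware with native bfloat16 support
```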

Temporal-Attention Optimization

A more sophisticated optimization targeted the temporal-attention layers, which attend between the time axis and text conditioning. The original implementation replicated context tensors to match the time dimension (number of frames) before passing to cross-attention layers. The optimized implementation recognized that these repeated tensors are identical and delayed the expansion until after passing through the cross-attention’s linear projection layers. This reduced both compute and memory requirements, which is a good example of algorithmic optimization that doesn’t sacrifice quality.
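Because a linear projection acts independently on each batch element, projecting once and then repeating produces the same result as repeating first. A minimal sketch (tensor shapes are hypothetical) demonstrates the equivalence:

```python
import torch
from torch import nn

torch.manual_seed(0)
T = 8                                   # number of frames (time axis)
ctx = torch.randn(1, 77, 768)           # text-conditioning context
to_k = nn.Linear(768, 320, bias=False)  # a cross-attention key projection

# Original: replicate the context across frames, then project
# (T times the compute and memory).
k_naive = to_k(ctx.repeat_interleave(T, dim=0))

# Optimized: project once, then expand -- the repeated tensors are identical,
# so the expansion can be delayed until after the linear projection.
k_fast = to_k(ctx).repeat_interleave(T, dim=0)
```

In practice the expansion can often be replaced by broadcasting, avoiding the copy entirely.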

DPM-Solver Integration

The team leveraged DPM-Solver (Diffusion Probabilistic Model Solver) combined with a linear-in-log-SNR time schedule to reduce the number of sampling steps to 15. This addressed the fundamental slowness of diffusion models while maintaining generation quality. The choice of DPM-Solver over alternatives like DDIM (denoising diffusion implicit models) or DDPM (denoising diffusion probabilistic models) was driven by the favorable tradeoff between quality and computational cost.
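A linear-in-log-SNR schedule spaces the sampling steps evenly in log signal-to-noise ratio rather than in time. The sketch below (the log-SNR bounds are illustrative, not Meta's values) shows the 15-step schedule a DPM-Solver sampler would walk through:

```python
import torch

def linear_logsnr_schedule(num_steps: int = 15,
                           logsnr_max: float = 10.0,
                           logsnr_min: float = -10.0) -> torch.Tensor:
    """Sampling schedule that is linear in log-SNR (bounds are hypothetical)."""
    return torch.linspace(logsnr_max, logsnr_min, num_steps)

logsnr = linear_logsnr_schedule()
# Each DPM-Solver step moves between adjacent log-SNR values; 15 such steps
# replace the hundreds of denoising steps a vanilla DDPM sampler would need.
```

Libraries such as Hugging Face diffusers ship a `DPMSolverMultistepScheduler` that implements the solver itself, so in practice only the step count and schedule need configuring.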

Combined Guidance and Step Distillation

Perhaps the most impactful optimization was the combination of guidance distillation and step distillation. The original model required three forward passes per solver step: unconditional, image-conditional, and full-conditional (text-and-image). Guidance distillation collapsed these three passes into one. Simultaneously, step distillation trained a student model (initialized with the same weights as the teacher) to match multiple teacher steps in a single step.

The combined approach distilled 32 teacher steps (each with multiple U-Net passes) into just 8 student steps, with only one forward pass through the U-Net per step. This represents a dramatic reduction in computational requirements—roughly a 12x improvement when considering both the step reduction and the pass consolidation.
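The training signal for step distillation can be sketched with a toy denoiser: the student, initialized from the teacher's weights, is trained to reproduce several consecutive teacher steps in one step. Everything below (the tiny linear "U-Net", the step sizes, the 4:1 ratio matching 32 teacher steps to 8 student steps) is illustrative, not Meta's implementation:

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)

# Toy denoiser standing in for the U-Net.
teacher = nn.Linear(16, 16)
student = copy.deepcopy(teacher)  # student starts from the teacher's weights
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(32, 16)  # a batch of noisy latents

# Teacher target: four consecutive solver steps. (In the real system each
# teacher step also bundled three guidance passes, which guidance distillation
# collapses into the single forward pass shown here.)
with torch.no_grad():
    target = x
    for _ in range(4):
        target = target - 0.1 * teacher(target)

# Student: match the 4-step teacher trajectory in a single step (32 -> 8 overall).
pred = x - 0.4 * student(x)
loss = nn.functional.mse_loss(pred, target)
loss.backward()
opt.step()
```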

PyTorch Deployment Optimizations

Initial TorchScript Approach

For the initial launch, the team converted the model to TorchScript with freezing. This enabled automatic optimizations including constant folding, operation fusion, and computational graph simplification. Freezing further optimized the model by converting dynamically computed values to constants, reducing the total number of operations required during inference.
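Scripting and freezing are both standard `torch.jit` calls; the module below is a hypothetical stand-in to show the sequence:

```python
import torch
from torch import nn

class TinyBlock(nn.Module):
    """Stand-in module (hypothetical) to demonstrate scripting and freezing."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(64, 64)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(x))

model = TinyBlock().eval()           # freezing requires eval mode
scripted = torch.jit.script(model)   # capture the computational graph
frozen = torch.jit.freeze(scripted)  # inline weights as constants, fold and fuse ops

out = frozen(torch.randn(2, 64))
```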

Migration to PyTorch 2.0

Post-launch, the team migrated from TorchScript to a PyTorch 2.0-based solution, which yielded several benefits.

This migration path is noteworthy as it reflects the broader industry trend of moving from TorchScript to PyTorch 2.0’s compilation stack, which offers better optimization opportunities and developer experience.
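In PyTorch 2.x the entry point to that compilation stack is `torch.compile`. The sketch below uses `backend="eager"` purely so it runs without a full Inductor toolchain; a production deployment would use the default backend:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(64, 64), nn.GELU()).eval()

# torch.compile is the PyTorch 2.x replacement for the TorchScript workflow;
# backend="eager" here is a debugging backend, chosen only to keep the sketch portable.
compiled = torch.compile(model, backend="eager")

x = torch.randn(4, 64)
with torch.no_grad():
    y_compiled = compiled(x)
    y_eager = model(x)
```

Unlike TorchScript, `torch.compile` works on unmodified Python code and falls back to eager execution for unsupported constructs.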

Global Traffic Management and Scaling

Once the model was optimized, the challenge shifted to running it at scale for global traffic. The team’s approach to this problem offers valuable lessons for production ML systems.

Capacity Planning

The team analyzed historical traffic data from previous AI-generated media launches to estimate request volumes. Combined with model speed benchmarks, they calculated GPU requirements. Load testing was then used to validate capacity and identify bottlenecks before launch.
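The core arithmetic is Little's law: concurrent requests equal arrival rate times service time, and with one request per GPU that number maps directly to GPUs. All figures below are hypothetical placeholders, not Meta's numbers:

```python
# Back-of-envelope capacity estimate (all numbers hypothetical).
peak_rps = 1000        # projected peak requests/sec from historical launch data
gen_latency_s = 3.0    # benchmarked per-request generation time
target_util = 0.7      # headroom for spikes, retries, and host failures

# Little's law: concurrent requests = arrival rate x service time.
concurrent = peak_rps * gen_latency_s            # 3000 requests in flight
gpus_needed = int(concurrent / target_util)      # one request per GPU at a time
```

Load testing then validates that the fleet sized this way actually sustains the projected peak.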

Regional Traffic Routing System

Initial testing revealed unexpectedly high end-to-end latency because requests were being routed across the globe, adding significant network overhead. The solution was a traffic management system that keeps each request as close to its origin region as available capacity allows.

The routing algorithm works iteratively: it identifies the region running closest to capacity, then attempts to offload a portion of that region’s traffic to nearby regions that can handle the additional load. The definition of “nearby” expands as the source region approaches maximum capacity—slight overload considers only close regions, while severe overload unlocks more distant regions.
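The iterative offload loop described above can be sketched in a few lines. The regions, capacities, distances, overload thresholds, and chunk size below are all hypothetical:

```python
# Iterative regional offload (simplified sketch; all numbers hypothetical).
capacity = {"us": 100, "eu": 100, "apac": 100}
load = {"us": 130, "eu": 60, "apac": 50}
distance = {("us", "eu"): 1, ("us", "apac"): 2,
            ("eu", "us"): 1, ("eu", "apac"): 2,
            ("apac", "us"): 2, ("apac", "eu"): 2}

def rebalance(load, capacity, chunk=10):
    load = dict(load)
    while True:
        # Identify the region running closest to (or over) capacity.
        src = max(load, key=lambda r: load[r] / capacity[r])
        util = load[src] / capacity[src]
        if util <= 1.0:
            break  # nothing is over capacity
        # "Nearby" widens as overload grows: slight overload -> close regions
        # only; severe overload -> more distant regions unlock.
        max_dist = 1 if util < 1.2 else 2
        candidates = [r for r in load if r != src
                      and distance[(src, r)] <= max_dist
                      and load[r] + chunk <= capacity[r]]
        if not candidates:
            break
        # Prefer the closest, least-loaded destination.
        dst = min(candidates, key=lambda r: (distance[(src, r)], load[r]))
        load[src] -= chunk
        load[dst] += chunk
    return load

balanced = rebalance(load, capacity)
```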

GPU Utilization and Failure Handling

A critical constraint was that each GPU can actively process only one request at a time (a single request fully saturates the GPU). To keep latency low, the team enforced that server load (queued plus in-flight requests) stay at most one, rejecting any additional requests. This policy, while necessary for latency, caused request failures when the system ran near capacity.
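The load-at-most-one policy amounts to a simple admission check, sketched below (class and method names are hypothetical): reject immediately rather than queue, so the client can probe another host.

```python
# Load-at-most-one admission control (sketch): a server accepts a request only
# if nothing else is queued or in flight; otherwise it rejects immediately.
class GpuServer:
    def __init__(self):
        self.in_flight = 0

    def try_accept(self) -> bool:
        if self.in_flight >= 1:   # queued + running must stay <= 1
            return False          # fast rejection instead of queueing
        self.in_flight += 1
        return True

    def finish(self):
        self.in_flight -= 1

server = GpuServer()
accepted_first = server.try_accept()    # GPU is free: accepted
accepted_second = server.try_accept()   # already busy: rejected to protect latency
server.finish()
```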

The initial solution used retries as a probing mechanism to quickly find free GPUs. However, the regional traffic management system reduced the number of available hosts per request, causing retry cascades during traffic spikes.

The final solution combined several refinements to the retry and routing behavior.

This combination smoothed out traffic spikes and significantly reduced cascading errors.

Critical Assessment

While this case study provides valuable insights into production ML engineering, it is a first-party account, and many of its quantitative claims about latency, failure rates, and efficiency are not independently verifiable.

Despite these limitations, the case study offers a comprehensive view of the engineering challenges in deploying generative AI at Meta’s scale and the practical solutions employed to address latency, scaling, and reliability concerns.
