ZenML

AWS Trainium & Metaflow: Democratizing Large-Scale ML Training Through Infrastructure Evolution

Outerbounds / AWS 2024

The key lesson from this meetup is that we're seeing a fundamental shift in how organizations can approach large-scale ML training and deployment. Through the combination of purpose-built hardware (AWS Trainium/Inferentia) and modern MLOps frameworks (Metaflow), teams can now achieve enterprise-grade ML infrastructure without requiring deep expertise in distributed systems. The traditional approach of having ML experts manually manage infrastructure is being replaced by more automated, standardized workflows that integrate with existing software delivery practices. This democratization is enabled by significant cost reductions (up to 50-80% compared to traditional GPU deployments), simplified deployment patterns through tools like Optimum Neuron, and the ability to scale from small experiments to massive distributed training with minimal code changes. Perhaps most importantly, the barrier to entry for sophisticated ML infrastructure has been lowered to the point where even small teams can leverage these tools effectively.

Overview

This case study documents a collaboration between Outerbounds (maintainers of the open-source Metaflow ML orchestration framework) and AWS’s Annapurna machine learning accelerator team to integrate Metaflow with AWS’s custom ML chips, Trainium and Inferentia. The integration addresses a key challenge in production LLM operations: enabling cost-efficient training and inference of large language models while maintaining the ease of use that data scientists expect from modern MLOps tooling.

The presentation was delivered as an MLOps community meetup featuring Eddie from Outerbounds and Scott Perry, a Solutions Architect on AWS’s custom ML accelerator team. Their joint presentation illustrates how infrastructure-level innovations (custom silicon) can be made accessible through high-level orchestration frameworks.

The Problem Space

Organizations looking to train and deploy large language models face several interconnected challenges. First, there’s the sheer cost of compute—training state-of-the-art models requires significant GPU resources, and GPU availability has been constrained in recent years. Second, there’s the complexity of setting up distributed training infrastructure that can scale from experimentation to production. Third, organizations want their MLOps investments to be robust against the rapid pace of change in the AI landscape, where new model architectures and training techniques emerge constantly.

Eddie framed this using the concept of “pace layering” from Stewart Brand—the idea that complex systems have layers that evolve at different speeds. The infrastructure layer should be stable and robust, while the modeling layer (where trends like new architectures emerge) changes rapidly. The goal is to build an MLOps stack that enables access to commercial AI opportunities while remaining stable as upper layers change.

AWS Custom Silicon: Trainium and Inferentia

Scott Perry provided deep technical context on AWS’s custom ML chips. These are not simply “AWS’s version of a GPU”—they are purpose-built accelerators with architecture specifically designed for deep learning and generative AI workloads.

The Inferentia 2 and Trainium chips share a similar architecture with several key components. Each chip contains two neuron cores, which are the basic addressable units of compute. Within each neuron core, there are specialized engines: a tensor engine powered by a systolic array for matrix operations (the core of deep learning compute), vector engines for operations like batch normalization, scalar engines for activation functions, and general-purpose SIMD processors that can run custom C code for operators that didn’t exist when the chips were designed.

The chips include 32GB of HBM memory per chip, a Collective Communications Engine that enables overlapping compute and collective operations (critical for distributed training efficiency), and NeuronLink for high-bandwidth, low-latency communication between chips within an instance.

AWS launched Inferentia 1 in 2019, targeting smaller deep learning models with up to 70% lower cost per inference. Inferentia 2 followed in 2023, targeting transformer and diffusion models with up to 40% better price-performance. Trainium launched in 2022 for large-scale distributed training workloads, claiming up to 50% savings on training costs compared to comparable EC2 instances.

The Metaflow Integration

Eddie detailed how Outerbounds integrated Metaflow with AWS Trainium, making these specialized accelerators accessible through familiar MLOps patterns. The integration leverages AWS Batch as the compute backend, with Metaflow handling orchestration, versioning, and workflow management.

The deployment process involves two CloudFormation stacks: one for Metaflow itself (which can be deployed on any cloud or on-premise) and one for the Trainium compute environment. Once deployed, these link together to provide a compute environment ready for distributed training on Trainium devices.

Metaflow’s decorator-based approach allows users to annotate Python functions with resource requirements. A user can specify how many CPUs, how many Trainium/Inferentia devices, and which Docker image to use for dependencies. This declarative paradigm means the same workflow code can dispatch jobs to Kubernetes, AWS Batch, or other compute providers simply by changing configuration.
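The declarative pattern can be pictured with a minimal toy sketch. This is not the real Metaflow API; the decorator name `resources` and parameters like `trainium` and `image` here are illustrative stand-ins for the integration's actual decorators:

```python
# Toy illustration of declarative resource annotation (not the real Metaflow API).
# A decorator attaches resource requirements as metadata; a dispatcher can then
# route the function to the right backend without changing the function body.
def resources(**requirements):
    def wrap(func):
        func.resource_requirements = requirements  # metadata, read at dispatch time
        return func
    return wrap

@resources(cpu=8, trainium=16, image="my-neuron-image:latest")
def train_step():
    # model training code would go here
    return "trained"

def dispatch(func, backend="aws-batch"):
    # a real orchestrator would translate this metadata into a Batch or
    # Kubernetes job specification before launching the task
    reqs = getattr(func, "resource_requirements", {})
    print(f"submitting to {backend} with {reqs}")
    return func()

result = dispatch(train_step)
```

The point of the pattern is that swapping `backend` (or the decorator's configuration) changes where the work runs without touching the training logic itself.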

The integration includes monitoring capabilities that wrap the neuron-monitor CLI tool. Users can add a decorator to their functions that runs neuron-monitor at specified intervals, with results displayed as plots in the Metaflow UI showing neuron core utilization over the function’s lifecycle. This addresses a key operational concern: ensuring that expensive accelerator hardware is actually being utilized efficiently.
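The interval-polling idea behind that wrapper can be sketched as follows, using a stand-in command in place of the real `neuron-monitor` CLI (whose actual flags and JSON output format are not shown here):

```python
# Sketch of periodic utilization sampling, analogous to wrapping neuron-monitor.
# "echo" stands in for the real CLI; a real wrapper would parse neuron-monitor's
# JSON output and plot per-core utilization over the task's lifecycle.
import subprocess
import time

def sample_monitor(command, interval_s, num_samples):
    samples = []
    for _ in range(num_samples):
        out = subprocess.run(command, capture_output=True, text=True)
        samples.append(out.stdout.strip())  # real code would json.loads(...) this
        time.sleep(interval_s)
    return samples

# Stand-in: emit a fake utilization reading instead of invoking neuron-monitor.
readings = sample_monitor(["echo", "core0: 97%"], interval_s=0.01, num_samples=3)
print(readings)
```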

The MLOps Stack Architecture

The presentation outlined a conceptual stack for ML infrastructure that remains consistent whether training scikit-learn models or state-of-the-art LLMs:

Data Layer: Foundation for storage including data lakes on S3, warehouse providers like Snowflake, storage for Metaflow metadata, training/evaluation datasets, and model checkpoints. When dealing with Trainium devices, models are stored in slightly different formats (compiled for the Neuron SDK), making robust storage infrastructure important.

Compute Layer: Metaflow connects to different runtimes where data is accessed, with dynamic configuration of resources. The integration allows users to start with smaller Trainium instances (trn1.2xlarge) for testing at lower cost, then scale to full 32-node instances for production training—a factor of roughly 20x cost difference between test and production configurations.

Orchestration Layer: Metaflow provides workflow composition, scheduling, and triggering capabilities. Workflows can be deployed to Argo Workflows, AWS Step Functions, or Airflow. The GitHub repository includes examples with configuration files for HuggingFace datasets, hyperparameters, and Neuron SDK parameters for caching and optimization.

Versioning Layer: Every workflow run is versioned by default, which is particularly valuable for expensive Trainium jobs where understanding historical run behavior is critical before relaunching. The integration with GitHub provides code versioning alongside execution versioning.

Deployment Layer: While the repository focuses on training workflows, AWS provides documentation for deploying models trained with Neuron SDK to Inferentia inf2 instances for inference. The same neuron cores (with some differences) power both training and inference chips.

Modeling Layer: This is where the Neuron SDK becomes central. The SDK includes an extensive library of examples covering different model types: encoders, decoders, vision models, multimodal models. The AWS team actively adds examples as new architectures emerge.

Neuron SDK Details

The Neuron SDK is the complete software stack for driving Trainium and Inferentia chips. It includes several components:

Neuron Runtime: A driver and runtime library for loading and executing models on the hardware.

Framework Integration: PyTorch and JAX integration that allows users to keep their existing model code, moving models and tensors onto XLA devices. AWS is a founding member of the Open XLA initiative, and the Neuron stack uses XLA under the hood.

Compiler: When models run through the framework integration, a compilation process extracts XLA graphs representing the model’s computations and optimizes them for the Trainium/Inferentia hardware. This can be just-in-time or ahead-of-time compilation.
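The just-in-time flavor of this flow can be pictured as compile-and-cache keyed on the traced graph. The following is a toy sketch only; the real Neuron compiler operates on XLA HLO graphs keyed by tensor shapes and dtypes, not on Python functions:

```python
# Toy sketch of JIT compilation with caching, keyed on input shape.
# The real Neuron flow extracts an XLA graph from the framework, compiles it
# for the accelerator, and caches the compiled artifact so later calls with
# the same graph/shape skip compilation entirely.
compiled_cache = {}

def compile_for_device(fn, shape):
    # stands in for an expensive hardware-specific compilation step
    return lambda xs: [fn(x) for x in xs]

def jit(fn):
    def wrapper(xs):
        shape = len(xs)  # real systems key on tensor shapes/dtypes
        key = (fn.__name__, shape)
        if key not in compiled_cache:       # first call with this shape: compile
            compiled_cache[key] = compile_for_device(fn, shape)
        return compiled_cache[key](xs)      # subsequent calls: reuse artifact
    return wrapper

@jit
def double(x):
    return 2 * x

print(double([1, 2, 3]))   # triggers compilation for shape 3, then runs
print(double([4, 5, 6]))   # cache hit: same shape, no recompilation
```

Ahead-of-time compilation is the same idea with the cache populated before the job runs, which avoids paying the compile latency on the first training step.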

User Land Tools: neuron-top provides a graphical view of core utilization and memory usage; neuron-ls shows available cores and which processes are using them; a profiler helps diagnose performance bottlenecks.

Neuron Kernel Interface (NKI): A recently launched capability, similar in spirit to OpenAI's Triton, that lets users write lower-level kernels in a Python-embedded domain-specific language that compiles to code executing directly on the hardware. This addresses a key concern for users migrating from GPUs who have custom CUDA kernels: NKI provides a path to implement equivalent functionality. Flash attention has been implemented using NKI as an example.

Ecosystem Integration

The solution emphasizes ecosystem compatibility, recognizing that customers have varied stacks and don’t want to change their tooling to adopt new technology. Key integrations include:

Optimum Neuron: A collaborative project with HuggingFace that adapts Transformers, the Trainer API, and SFTTrainer to work with Trainium/Inferentia. This is described as “probably the easiest way to get started” since users can continue using familiar HuggingFace patterns.

Model Hosting: Support for popular model servers including vLLM, HuggingFace TGI, DJL, TorchServe, and Ray Serve.

AWS Services: Trainium/Inferentia instances are available through EC2, ECS, EKS, SageMaker, Parallel Cluster, and AWS Batch.

Customer Results

Two customer testimonials were highlighted:

NinjaTech AI: Released AI personal assistants with models trained and deployed on Inferentia/Trainium. They reported up to 80% total cost savings and 50% more energy efficiency compared to previous GPU usage.

Leonardo AI: A visual asset design platform that moved models to Inferentia 2 and saw 80% cost reduction compared to previous GPU usage without sacrificing performance.

These are significant claims, though as with any vendor-provided testimonials, specific workload characteristics and baseline comparisons matter. The cost savings likely depend heavily on the specific model architectures and whether they’re well-optimized for the Neuron SDK.

Operational Considerations

The discussion touched on several practical operational aspects:

Availability: During the integration work, Eddie found that on-demand Trainium instances were consistently available within 10-15 minutes when targeting appropriate regions—a qualitatively different experience from the GPU availability challenges of recent years.

Instance Types: Trainium offers trn1.2xlarge (single chip) for fine-tuning and experimentation, and trn1.32xlarge/trn1n.32xlarge (16 chips, 32 neuron cores) for distributed training. The trn1n variant has twice the networking capability for distributed cases. Inferentia 2 offers four instance sizes from inf2.xlarge to inf2.48xlarge.

Regional Availability: 23+ regions with more planned.

Kubernetes vs. Serverless: The discussion noted increasing EKS adoption for ML workloads, with the Metaflow integration currently using AWS Batch but with plans for EKS support. The batch-like experience provides an almost serverless feel where compute appears when needed without manual cluster management.

Future Directions

The session concluded with discussion of continued integration work, including potential EKS support for Metaflow with Trainium, and enthusiasm for the NKI capability enabling community contributions of custom kernels. The ability to write custom kernels was identified as addressing one of the historical friction points in migrating GPU-based workloads.

The overall narrative is one of making specialized AI infrastructure accessible through familiar MLOps abstractions, enabling smaller teams to leverage large-scale training capabilities without deep infrastructure expertise.
