Company
DoorDash
Title
MLOps Platform Integration with Metaflow for Training and Serving Workflows
Industry
E-commerce
Year
2025
Summary (short)
DoorDash's ML platform team integrated Metaflow into their infrastructure to create a unified, scalable machine learning platform that addresses the nuanced boundary between development and production. The solution abstracts underlying compute infrastructure (Kubernetes, Argo Workflows), provides reproducibility through dependency management and metadata persistence, and enables ML engineers to deploy models from development to production using a single Python script interface. The platform uses namespace-based resource management with Kueue for multi-team orchestration, integrates with Triton for model serving, and supports self-service deployment workflows that significantly improve the ML engineer experience by eliminating the need to interact with multiple disparate systems.
## Overview

This case study documents DoorDash's implementation of Metaflow as the core orchestration framework for their machine learning platform. The presentation was delivered by Faras and Sunza from DoorDash's ML platform team during a Metaflow community office hours session. While this case study doesn't specifically focus on LLMs, it provides valuable insights into MLOps patterns that are directly applicable to LLMOps scenarios, particularly around model deployment, reproducibility, and production-scale infrastructure management.

DoorDash operates with a globally distributed ML platform team spanning multiple time zones, including the US and EU, supporting over a dozen different organizational units. Each team has distinct budget requirements, resource needs, and security contexts. The platform needed to support both traditional ML workloads and increasingly complex model architectures while maintaining developer velocity and operational reliability.

## Core Problem Statement

The DoorDash ML platform team faced several interconnected challenges that are particularly relevant to organizations deploying ML and LLM systems at scale.

First, they needed to address the nuanced boundary between development and production in ML workflows. Unlike traditional software, where clear promotion gates exist, ML model training often happens at production scale even during development phases. When a training run produces good metrics, teams want to promote that exact artifact to production rather than rerunning the entire training process, which would be both wasteful and potentially non-deterministic.

Second, the team needed to abstract underlying compute infrastructure from ML engineers while still providing flexibility for rapidly evolving ML techniques. The ML landscape has been moving at an accelerated pace, and platform teams risk becoming bottlenecks if they must explicitly bless every new technique or framework before practitioners can use it. This tension between standardization and innovation flexibility is particularly acute in the LLM space, where new techniques emerge frequently.

Third, reproducibility emerged as a critical requirement with multiple dimensions. ML engineers needed the ability to reliably rerun the same workflow across different execution environments, at different times, by different users, and still achieve comparable results. This encompasses dependency management, metadata persistence, code versioning, and execution isolation—all of which become even more critical when dealing with non-deterministic components inherent to ML systems.

## Solution Architecture

DoorDash selected Metaflow as their core orchestration framework based on three primary criteria: unified and extensible user experience, compute infrastructure abstraction, and comprehensive reproducibility mechanisms. The architecture leverages Metaflow's plugin-based extensibility model, which allows the platform team to remain "highly aligned but loosely coupled"—a principle borrowed from Netflix's organizational philosophy.

The technical stack centers on Kubernetes and Argo Workflows as the execution layer. Metaflow workflows compile down to Argo workflow definitions, with each step potentially running in isolated Kubernetes pods. The architecture supports multiple compute providers beyond Kubernetes, including specialized services for Spark jobs (such as Databricks or EMR) and dedicated GPU infrastructure.
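As a rough sketch of what this compile-to-Argo pattern looks like in stock Metaflow (not DoorDash's actual code; the flow name, resource sizes, and S3 path below are illustrative assumptions), a flow can annotate individual steps to run in Kubernetes pods:

```python
from metaflow import FlowSpec, kubernetes, step


class TrainFlow(FlowSpec):
    """Toy training flow; each decorated step can run in its own Kubernetes pod."""

    @step
    def start(self):
        # Hypothetical input location; lightweight steps can stay on small pods.
        self.training_data = "s3://example-bucket/training-data.parquet"
        self.next(self.train)

    @kubernetes(cpu=4, memory=16000, gpu=1)  # request an isolated pod with a GPU
    @step
    def train(self):
        # Placeholder for real training logic; artifacts like this are persisted
        # automatically by Metaflow's datastore.
        self.model_metrics = {"auc": 0.91}
        self.next(self.end)

    @step
    def end(self):
        print("metrics:", self.model_metrics)


if __name__ == "__main__":
    TrainFlow()
```

`python train_flow.py run` executes the flow directly, while `python train_flow.py argo-workflows create` compiles the same flow into an Argo workflow definition for scheduled, production-grade execution, which is the compile-to-Argo behavior described above.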
However, even when launching jobs on external compute providers, the orchestration typically initiates from a Kubernetes pod that serves as a jump-off point.

Resource management represents a sophisticated aspect of the implementation. DoorDash uses Kueue to implement namespace-based resource quotas and sharing policies. Each team at DoorDash receives its own Kubernetes namespace with designated resource allocations including different GPU types, memory, and CPU. Critically, these quotas support borrowing mechanisms—if Team A needs 150 A10 GPUs but only has a quota for 100, they can borrow the additional 50 from Team B's unused allocation. This approach provides both isolation and efficiency, preventing resource contention while avoiding stranded capacity.

Namespaces are configured with Okta group-based access control, customizable object limits, and resource constraints. The platform team provides base templates that individual teams can override for their specific needs, creating a self-service model that reduces platform team toil. Each namespace receives replicated event messages from Argo Workflows, enabling cross-namespace workflow orchestration—a pattern important when different teams' workflows depend on each other's completion.

## Reproducibility Implementation

DoorDash's reproducibility strategy addresses four key dimensions through Metaflow's native capabilities and custom extensions.

The first dimension is dependency management, implemented through a custom `@image` decorator. This decorator takes a base Docker image and specified Python packages, mints a new image with those dependencies installed, and caches it for reuse. Any workflow step across any namespace that requests the same dependency combination will reuse the cached image, ensuring consistency. This approach provides stronger isolation than virtual environments while maintaining reasonable build times through caching.

Metadata persistence leverages Metaflow's built-in metadata service, which tracks hyperparameters, metrics, and execution context. For ML workflows, this proves critical—tracking the tree depth in tree-based models, dropout ratios in neural networks, or learning rate schedules enables teams to understand exactly what configuration produced specific results. The metadata service stores this information in blob storage (S3 or Google Cloud Storage), making it queryable across workflow executions.

Code persistence ensures that the exact code used to produce a model remains available even after repository changes. While Metaflow natively backs up code to S3, DoorDash plans to enhance this by also pushing code artifacts to GitHub for discoverability. This proves particularly valuable during incident response when on-call engineers need to quickly understand what code is running in production without executing Metaflow-specific commands or navigating blob storage.

Execution isolation ensures that concurrent workflow runs don't interfere with each other. Each step runs in its own pod with dedicated resources, and artifacts are stored with execution-specific identifiers. This isolation extends to the model serving layer, where models deploy in their own pods even in the production environment, preventing one deployment from impacting others.
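The `@image` decorator itself is a DoorDash-internal extension (stock Metaflow's `@pypi`/`@conda` decorators play a similar dependency-pinning role), but the metadata and code persistence described above are visible through Metaflow's standard Client API. A minimal sketch, assuming a flow named `TrainFlow` with a `model_metrics` artifact:

```python
from metaflow import Flow, namespace

# Look across all user namespaces rather than only the current user's runs.
namespace(None)

# Fetch the most recent successful execution and its persisted state.
run = Flow("TrainFlow").latest_successful_run
print("run:", run.pathspec)
print("metrics:", run.data.model_metrics)  # artifacts stored in blob storage (e.g. S3)
print("code package:", run.code)           # code snapshot for the run (None for local runs)

# Walk the run's steps and tasks to inspect execution outcomes.
for flow_step in run:
    for task in flow_step:
        print(task.pathspec, "successful:", task.successful)
```

Because every run's hyperparameters, artifacts, and code package are addressable this way, an on-call engineer can reconstruct what produced a given model without rerunning anything.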
## Model Serving Integration

The serving infrastructure integration represents a significant component of DoorDash's Metaflow implementation, particularly relevant for organizations deploying LLMs. Sunza from the platform team explained that the serving workflow consolidates what previously required interacting with multiple systems into a single Python script with Metaflow and Argo workflow definitions.

The deployment process encompasses several steps, all defined in one unified interface. ML engineers specify model artifacts, define machine resources (CPU, GPU count, pod scaling, and autoscaling policies), declare dependencies including Python packages and any required auxiliary processes, configure networking and service mesh integration, and trigger CI/CD pipelines. Previously, engineers navigated multiple systems for these steps; with Metaflow, they execute a single script that can target either development or production environments from their local machines.

The serving stack uses Triton Inference Server with its model ensemble capabilities. Triton allows packaging preprocessing code, the model itself, and postprocessing logic together as a single artifact. The platform provides helper functions for model preparation that package everything according to required specifications and generate deployment specs. These artifacts get uploaded to S3-based model registries—separate buckets for development and production with different access controls. Development buckets allow any ML engineer to write, while production buckets maintain restricted access.

Model registration involves API calls to a registry service and caching configuration in a configuration management system. The caching proves necessary because online serving operates at millions of queries per second, making standard gRPC calls to a registry service impractical. The configuration system efficiently provides information about model dependencies, particularly the features needed for real-time feature fetching.

The deployment workflow supports testing before production promotion. Engineers can deploy to development environments for initial validation, run models locally in "box" mode for rapid iteration, or deploy to production in isolated pods that don't affect existing traffic. A routing layer sits in front of production deployments, directing traffic to appropriate model endpoints. Development deployments bypass the router, enabling faster feedback cycles. The platform also provides separate Argo workflow pipelines, built via Metaflow, for performance testing against production environments.

When promoting a model from development to production, the system doesn't just copy the model file—it duplicates all associated metadata, preprocessing and postprocessing code, and feature definitions. Promotion essentially flips flags and replicates the artifacts from the development S3 bucket to the production bucket, maintaining full provenance and reproducibility.
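The talk doesn't show DoorDash's client code, but calling a Triton ensemble of this kind through the open-source `tritonclient` HTTP API looks roughly like the sketch below; the endpoint, ensemble name, and tensor names are assumptions for illustration:

```python
import numpy as np
import tritonclient.http as httpclient

# Hypothetical development endpoint (dev deployments bypass the production router).
client = httpclient.InferenceServerClient(url="triton-dev.internal:8000")

# The ensemble bundles preprocessing, the model, and postprocessing server-side,
# so the caller only sends raw features and reads back a score.
features = np.array([[0.2, 1.3, 4.7, 0.0]], dtype=np.float32)

infer_input = httpclient.InferInput("RAW_FEATURES", list(features.shape), "FP32")
infer_input.set_data_from_numpy(features)
requested_output = httpclient.InferRequestedOutput("SCORE")

result = client.infer(
    model_name="eta_ensemble",  # assumed ensemble name
    inputs=[infer_input],
    outputs=[requested_output],
)
print("score:", result.as_numpy("SCORE"))
```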
## Platform Team Enablement

The Metaflow integration significantly impacted how DoorDash's platform team operates. The plugin-based architecture means that components built by one subteam (such as the `@image` decorator built by the training team) automatically become available to other subteams like serving, without requiring direct coordination. This loose coupling proves essential for a globally distributed team where synchronous communication across time zones creates bottlenecks.

The open-source foundation of Metaflow provides additional leverage for vendor integration. When evaluating new tools or services, DoorDash can point vendors to the Metaflow repository and ask them to build integrations conforming to Metaflow's plugin contracts. This self-service vendor onboarding reduces platform team involvement and accelerates proof-of-concept timelines.

The public API contract allows ML engineers to adopt emerging techniques without waiting for platform team approval. If someone publishes a Ray plugin for Metaflow or a torch distributed training plugin, engineers can pip install and use it immediately, assuming it conforms to the plugin interface. This pattern proves particularly valuable in fast-moving domains like LLM training and inference where new optimization techniques emerge frequently.

The standardization on Metaflow enables the platform team to focus on longer-term initiatives rather than constantly context-switching to unblock individual teams or projects. The unified interface reduces the cognitive overhead for ML engineers as they move between feature engineering, training, and serving, all using consistent patterns.

## Cost Management and Attribution

DoorDash discussed their approach to cost attribution and management, though with less detail than other topics. The namespace-based architecture provides natural cost boundaries—each team's Kubernetes namespace maps to their budget, making it straightforward to attribute compute costs. The Kueue-based resource quotas enable both hard limits and soft sharing policies, giving teams budget predictability while allowing opportunistic use of idle resources.

The team mentioned this as an active area of interest for community discussion, suggesting that more sophisticated approaches remain under development. Cost management for ML infrastructure, particularly GPU resources, represents an ongoing challenge that becomes even more acute with large language models requiring substantial compute for both training and inference.

## Feature Engineering Integration Challenges

During the community discussion, DoorDash raised feature engineering as a key area where they're seeking better integration patterns. Currently, their feature store operates somewhat independently from Metaflow. Engineers define features in the feature store system, specifying types, upstream dependencies, and other metadata. Metaflow workflows can then update these features, but must conform to the predefined schemas. For offline batch features, the feature store updates first, then a background process managed by the platform uploads data to online stores (Redis or RocksDB) on a set schedule. This creates a bifurcated experience where engineers must be aware of two systems.

DoorDash is exploring several potential approaches. One option involves a decorator-based pattern where steps in Metaflow workflows declare feature sets and dependencies, potentially including SQL queries that execute on Spark or other compute. Another approach focuses on tighter feature store integration, where Metaflow might only define final training tables that depend on upstream feature definitions, supporting basic transformations but deferring complex feature engineering to specialized systems.

Backfills emerged as a critical requirement for any solution. Feature definition changes frequently require recomputing historical values—potentially 30 days or six months of data. The feature engineering solution must support backfills as first-class operations. DoorDash currently uses Airflow integration for large-scale backfills rather than native Metaflow constructs, but this creates additional complexity.
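As one hedged illustration of treating backfills as a first-class operation in stock Metaflow (this is not DoorDash's current Airflow-based approach), a flow can take the backfill window as a parameter and fan out over daily partitions:

```python
from datetime import date, timedelta

from metaflow import FlowSpec, Parameter, step


class FeatureBackfillFlow(FlowSpec):
    """Sketch: recompute a feature over an arbitrary historical window."""

    backfill_days = Parameter("backfill_days", default=30, type=int,
                              help="How many days of history to recompute")

    @step
    def start(self):
        today = date.today()
        # One partition per day in the backfill window.
        self.days = [str(today - timedelta(days=i)) for i in range(self.backfill_days)]
        self.next(self.compute_features, foreach="days")

    @step
    def compute_features(self):
        self.day = self.input
        # Placeholder: run the feature transformation for this partition,
        # then write the result to the offline feature store.
        self.next(self.join)

    @step
    def join(self, inputs):
        self.processed = [i.day for i in inputs]
        self.next(self.end)

    @step
    def end(self):
        print(f"backfilled {len(self.processed)} daily partitions")


if __name__ == "__main__":
    FeatureBackfillFlow()
```

`python feature_backfill_flow.py run --backfill_days 180` would recompute roughly six months of partitions, and the same flow could be compiled to Argo Workflows for scheduled incremental updates.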
Lineage tracking represents another gap. While Metaflow provides some lineage capabilities, they're less mature than specialized data lineage tools. Understanding which features depend on which upstream data sources, and how changes propagate, proves essential for production ML systems, particularly when debugging unexpected model behavior or planning infrastructure changes.

The discussion revealed that Netflix faces similar challenges and is exploring both SQL-step annotations and direct feature store integrations. DoorDash and the Metaflow team agreed to continue this conversation in follow-up sessions, potentially leading to community-wide patterns or official integrations.

## Assessment and Considerations

This case study, while not specifically about LLMs, provides valuable patterns directly applicable to LLMOps. The challenges DoorDash addresses—reproducibility, scale, resource management, deployment automation—all apply to LLM workflows with even greater intensity due to model sizes and computational requirements.

The reproducibility approach deserves particular attention. The `@image` decorator pattern provides stronger dependency isolation than Python virtual environments, which is critical for LLMs where specific CUDA versions, transformer library versions, and other dependencies create complex compatibility matrices. Metadata tracking becomes even more important with LLMs, where subtle hyperparameter changes (learning rates, attention patterns, context windows) can dramatically impact results.

The serving integration demonstrates sophisticated patterns for deploying complex model pipelines. LLM serving often requires preprocessing (tokenization, prompt templating), the model itself, and postprocessing (detokenization, output formatting, safety filtering). The Triton ensemble approach that DoorDash uses maps well to these requirements.

The namespace-based resource management with borrowing policies offers a practical approach to GPU allocation for LLM workloads, where training runs might need bursty access to large GPU clusters while inference maintains steadier demand. The ability to borrow unused capacity while maintaining quota boundaries provides both flexibility and cost control.

However, several limitations and open questions emerge. The feature engineering integration remains incomplete, and for LLM applications that increasingly rely on retrieval-augmented generation or feature-based context injection, this gap could prove significant. The reliance on Argo Workflows and Kubernetes provides solid foundations but requires substantial infrastructure expertise to operate at scale—smaller organizations might find this barrier high.

The cost attribution approach, while functional, appears relatively basic. As LLM costs become a more significant portion of infrastructure budgets, more sophisticated attribution (by model, by application, by token volume) might become necessary. A brief mention of UI caching optimizations as a scaling concern suggests the system faces growing pains as usage increases.

The presentation focused heavily on infrastructure and orchestration, with less detail about monitoring, observability, and incident response for production models. For LLMs, understanding inference latency distributions, cache hit rates, prompt token distributions, and output quality metrics proves essential for production operations.

Overall, DoorDash's Metaflow implementation demonstrates mature MLOps practices that provide solid foundations for LLMOps.
The emphasis on reproducibility, unified interfaces, and self-service capabilities directly addresses challenges that LLM practitioners face. The open questions around feature engineering and cost management reflect areas where the broader MLOps and LLMOps communities continue to develop best practices.
