Migrating LLM Fine-tuning Workflows from Slurm to Kubernetes Using Metaflow and Argo

Adept.ai 2023

Adept.ai, building an AI model for computer interaction, faced challenges with complex fine-tuning pipelines running on Slurm. They implemented a migration strategy to Kubernetes using Metaflow and Argo for workflow orchestration, while maintaining existing Slurm workloads through a hybrid approach. This allowed them to improve pipeline management, enable self-service capabilities for data scientists, and establish robust monitoring infrastructure, though complete migration to Kubernetes remains a work in progress.

Industry: Tech

Overview

Adept.ai is a company building machine learning models that can interact with everything on a user’s computer—essentially creating AI agents capable of operating browsers, filling out spreadsheets, and performing complex computer-based tasks on behalf of users. This case study, presented by Rahul from Adept’s infrastructure team, details how the company migrated their LLM fine-tuning and training pipelines from a Slurm-based system to a more modern orchestration approach using Metaflow with Argo Workflows on Kubernetes.

The presentation provides valuable insights into the real-world challenges of managing large-scale LLM training infrastructure, particularly around the tension between maintaining velocity for data scientists while modernizing the underlying infrastructure. It’s worth noting that this is an ongoing migration rather than a completed project, which offers an honest look at the iterative nature of such infrastructure transitions.

The Problem: Complex and Fragmented Training Infrastructure

Adept.ai started with Slurm as their training and fine-tuning infrastructure, which is a natural choice given Slurm’s long history in high-performance computing (HPC) environments. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system that predates modern AI workloads but fits them reasonably well. Cloud vendors typically offer either Slurm or Kubernetes clusters when providing large-scale GPU infrastructure (the presenter mentions scenarios with 200-300 nodes, each with 8 GPUs).

However, as the team grew and different people joined at different times, the codebase became increasingly complex.

An illustrative example from the presentation: one workflow might load a config with hyperparameters and call Slurm's sbatch, while another workflow for nightly fine-tuning and evaluation had its own config, conditional logic for checking whether nightly runs succeeded before triggering evaluation, and coordination with annotation partners for evaluation—all executed differently.
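The conditional gating described above (trigger evaluation only when the nightly fine-tune succeeded) reduces to a small predicate. The sketch below uses hypothetical status values and run-record fields for illustration, not Adept's actual code:

```python
# Sketch of nightly gating: run evaluation only if the nightly
# fine-tuning run succeeded and produced a usable checkpoint.
# The run-record shape and status strings are illustrative assumptions.

def should_trigger_eval(nightly_run: dict) -> bool:
    """Return True when the nightly fine-tune finished successfully
    and left behind a checkpoint that evaluation can consume."""
    return (
        nightly_run.get("status") == "succeeded"
        and nightly_run.get("checkpoint_path") is not None
    )
```

Encoding this check once in the orchestration layer, rather than in per-workflow scripts, is exactly the kind of duplicated conditional logic the migration aimed to eliminate.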

The Solution: Metaflow + Argo on Kubernetes

The team investigated several workflow orchestration alternatives and ultimately chose Metaflow with Argo Workflows on Kubernetes for several key reasons.

The team’s goals extended beyond just workflow orchestration—they also wanted to support Dev boxes, CI/CD, and eventually gang-scheduled training directly on Kubernetes, making this a holistic containerization initiative.

Implementation Challenges

The migration was not straightforward, and the presenter candidly discusses several challenges that took months to resolve:

Challenge 1: Untangling Complex Code

The existing code was tightly coupled with specific config file locations and loading patterns. The team spent several sprints refactoring the code to identify logical steps that could be mapped to Metaflow’s step-based workflow approach. They also had to ensure consistent config loading once Metaflow containerized the Python files and associated code. Python path and environment variable manipulation were required to help Metaflow’s containerized code find the correct resources.
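The config-resolution problem can be sketched as follows. This is a hedged illustration, assuming a JSON config layout and an environment variable (`FLOW_CONFIG_DIR`) that the team's actual code may not use; it shows the general pattern of letting containerized code fall back across search locations:

```python
# Illustrative sketch: once Metaflow containerizes the code, config files
# no longer sit at the paths the original Slurm scripts assumed, so an
# environment variable points the containerized code at the right
# directory. The variable name and layout here are hypothetical.
import json
import os
from pathlib import Path

def load_flow_config(name: str, env_var: str = "FLOW_CONFIG_DIR") -> dict:
    """Load a named JSON config, preferring the directory set in the
    container's environment and falling back to the repo-relative
    location used on the Slurm cluster."""
    search_dirs = [
        os.environ.get(env_var),  # set inside the Metaflow container
        "configs",                # legacy repo-relative location
    ]
    for d in search_dirs:
        if d is None:
            continue
        path = Path(d) / f"{name}.json"
        if path.exists():
            return json.loads(path.read_text())
    raise FileNotFoundError(f"no config named {name!r} in {search_dirs}")
```

The same fallback idea applies to the Python-path manipulation the team mentions: the container and the Slurm cluster expose the same logical resources at different physical locations.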

Challenge 2: Maintaining Backward Compatibility with Slurm

A pragmatic decision was made to implement a hybrid approach rather than forcing a complete migration. The Metaflow flows serve as the orchestration layer, but each step SSHs into a bastion node on the Slurm cluster and runs the actual srun command for execution. This allows data scientists to continue working without interruption while the infrastructure team progressively migrates workloads to run natively on Kubernetes.
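The hybrid step amounts to building an `ssh` plus `srun` command line. In the sketch below, the host name, script, and helper names are illustrative assumptions, though `--nodes` and `--gpus-per-node` are real srun flags:

```python
# Minimal sketch of the hybrid pattern: a Metaflow step does no training
# itself; it shells out over SSH to a bastion node on the Slurm cluster
# and launches the real job with srun.
import shlex
import subprocess

def build_srun_over_ssh(bastion: str, script: str,
                        nodes: int, gpus_per_node: int) -> list:
    """Build the argv a Metaflow step would execute: ssh into the
    bastion, then srun the training script on the Slurm cluster."""
    remote = " ".join(
        shlex.quote(part)
        for part in [
            "srun",
            f"--nodes={nodes}",
            f"--gpus-per-node={gpus_per_node}",
            "bash", script,
        ]
    )
    return ["ssh", bastion, remote]

def run_on_slurm(bastion: str, script: str,
                 nodes: int = 2, gpus_per_node: int = 8) -> None:
    """Execute the remote srun command, failing the Metaflow step if
    the Slurm job submission fails."""
    subprocess.run(build_srun_over_ssh(bastion, script, nodes, gpus_per_node),
                   check=True)
```

Because `check=True` surfaces a non-zero srun exit code as an exception, Metaflow's retry and failure handling apply to the Slurm job even though execution happens off-cluster.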

This hybrid approach is explicitly described as a “stop-gap measure”—the ultimate goal remains running fully containerized workloads on Kubernetes. This honest acknowledgment of technical debt and incremental migration is a valuable lesson for other teams facing similar challenges.

Challenge 3: Containerization with Large Repositories

The code repository had grown to approximately 1 GB due to historical use of Git LFS for storing model versions and data. This made containerization slow and cumbersome, especially when data scientists wanted to run fine-tuning jobs on their local code changes (requiring commit, container build, and execution at that commit hash).

The team took several steps to address this.

Challenge 4: Multi-User Access and Discoverability

To make the system accessible to multiple users, the team made a number of changes.

Production Workflows in Use

The presentation showcases several production workflows that demonstrate the practical application of this infrastructure:

Fine-Tune and Eval Workflow

This is a primary use case where Metaflow's DAG-based approach shines: the workflow chains fine-tuning with a downstream evaluation run.

This seemingly simple DAG masks significant complexity in providing observability to data scientists.

Cron Jobs and Nightly Workflows

The system supports various scheduled jobs, including nightly fine-tuning and evaluation runs.

Power User Patterns

Some data scientists discovered they could trigger new workflows programmatically from within their code. For example, when a training job completes, it can automatically launch an evaluation job by making requests to trigger Metaflow commands. This emergent use pattern demonstrates the flexibility of the system.
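This pattern can be sketched with Metaflow's `argo-workflows trigger` CLI subcommand, which launches a run of a flow already deployed to Argo. The flow file name and the `--checkpoint` parameter below are hypothetical:

```python
# Sketch of the "power user" pattern: when training finishes, the job
# programmatically triggers a deployed evaluation flow via Metaflow's
# Argo integration. `argo-workflows trigger` is a real Metaflow
# subcommand; the flow file and parameter name are assumptions.
import subprocess
import sys

def build_trigger_command(checkpoint_path: str,
                          flow_file: str = "eval_flow.py") -> list:
    """Build the command line that triggers the deployed eval flow,
    passing the fresh checkpoint as a flow parameter."""
    return [
        sys.executable, flow_file,
        "argo-workflows", "trigger",
        "--checkpoint", checkpoint_path,  # hypothetical flow parameter
    ]

def trigger_eval_flow(checkpoint_path: str) -> None:
    subprocess.run(build_trigger_command(checkpoint_path), check=True)
```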

CI/CD Integration

The team implemented CI/CD through CircleCI.

This automation ensures that workflow updates are deployed consistently without manual intervention.
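A minimal sketch of such a deploy step, assuming one flow per `*_flow.py` module (the directory layout is an assumption) and using Metaflow's `argo-workflows create` subcommand, which (re)deploys a flow as an Argo Workflow template:

```python
# Hedged sketch of a CI deploy step: for each flow module in the repo,
# run Metaflow's `argo-workflows create` to (re)deploy it to Argo.
# The flows/ directory and *_flow.py naming convention are assumptions.
import subprocess
import sys
from pathlib import Path

def deploy_all_flows(flow_dir: str = "flows") -> list:
    """Deploy every Metaflow flow under flow_dir to Argo Workflows,
    returning the list of commands that were run."""
    commands = []
    for flow_file in sorted(Path(flow_dir).glob("*_flow.py")):
        cmd = [sys.executable, str(flow_file), "argo-workflows", "create"]
        commands.append(cmd)
        subprocess.run(cmd, check=True)
    return commands
```

Running this on every merge keeps the Argo deployments in lockstep with the repository, which is the consistency guarantee described above.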

Infrastructure Monitoring

Even while still on Slurm, the team uses Metaflow to orchestrate monitoring tasks.
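One such task might parse Slurm's `sinfo` output to flag unhealthy GPU nodes. The sketch below assumes the `sinfo -h -N -o "%n %t"` output format (real sinfo syntax: `%n` is the hostname, `%t` the compact node state), while the surrounding logic is illustrative:

```python
# Illustrative sketch of a Slurm health check a Metaflow monitoring task
# might run: parse `sinfo -h -N -o "%n %t"` output (one "hostname state"
# pair per line) and report nodes that are down or drained.

BAD_STATES = {"down", "drain", "drng", "fail"}

def unhealthy_nodes(sinfo_output: str) -> list:
    """Return hostnames whose Slurm state indicates they need attention."""
    bad = []
    for line in sinfo_output.strip().splitlines():
        hostname, state = line.split()
        # sinfo suffixes states with '*' when the node is unreachable
        if state.rstrip("*").lower() in BAD_STATES:
            bad.append(hostname)
    return bad
```

Wrapping checks like this in a scheduled Metaflow flow gives the team alerting and run history for cluster health even before any training moves off Slurm.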

Results and Current State

The migration has already delivered wins in pipeline management, self-service capabilities for data scientists, and monitoring infrastructure, while the complete move of training workloads onto Kubernetes remains a work in progress.

Desired Features and Future Work

The presenter mentions several features that would improve their setup.

Key Takeaways for LLMOps Practitioners

This case study offers several lessons for teams building LLM training infrastructure.

The honest discussion of challenges, compromises, and work in progress makes this a particularly valuable case study for teams considering similar migrations.
