Adept.ai, building an AI model for computer interaction, faced challenges with complex fine-tuning pipelines running on Slurm. They implemented a migration strategy to Kubernetes using Metaflow and Argo for workflow orchestration, while maintaining existing Slurm workloads through a hybrid approach. This allowed them to improve pipeline management, enable self-service capabilities for data scientists, and establish robust monitoring infrastructure, though complete migration to Kubernetes remains a work in progress.
Adept.ai is a company building machine learning models that can interact with everything on a user’s computer—essentially creating AI agents capable of operating browsers, filling out spreadsheets, and performing complex computer-based tasks on behalf of users. This case study, presented by Rahul from Adept’s infrastructure team, details how the company migrated their LLM fine-tuning and training pipelines from a Slurm-based system to a more modern orchestration approach using Metaflow with Argo Workflows on Kubernetes.
The presentation provides valuable insights into the real-world challenges of managing large-scale LLM training infrastructure, particularly around the tension between maintaining velocity for data scientists while modernizing the underlying infrastructure. It’s worth noting that this is an ongoing migration rather than a completed project, which offers an honest look at the iterative nature of such infrastructure transitions.
Adept.ai started with Slurm as their training and fine-tuning infrastructure, which is a natural choice given Slurm’s long history in high-performance computing (HPC) environments. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system that predates modern AI workloads but fits them reasonably well. Cloud vendors typically offer either Slurm or Kubernetes clusters when providing large-scale GPU infrastructure (the presenter mentions scenarios with 200-300 nodes, each with 8 GPUs).
However, as the team grew and different people joined at different times, the codebase became increasingly complex. An illustrative example from the presentation: one workflow might load a config with hyperparameters and submit the job through Slurm's sbatch, while another workflow for nightly fine-tuning and evaluation had its own config, conditional logic for checking whether nightly runs succeeded before triggering evaluation, and coordination with annotation partners for evaluation; each of these was executed differently.
The team investigated several workflow orchestration alternatives and ultimately chose Metaflow with Argo Workflows on Kubernetes for several key reasons:
The team’s goals extended beyond just workflow orchestration—they also wanted to support Dev boxes, CI/CD, and eventually gang-scheduled training directly on Kubernetes, making this a holistic containerization initiative.
The migration was not straightforward, and the presenter candidly discusses several challenges that took months to resolve:
The existing code was tightly coupled with specific config file locations and loading patterns. The team spent several sprints refactoring the code to identify logical steps that could be mapped to Metaflow’s step-based workflow approach. They also had to ensure consistent config loading once Metaflow containerized the Python files and associated code. Python path and environment variable manipulation were required to help Metaflow’s containerized code find the correct resources.
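A minimal sketch of that kind of environment-variable-based config resolution (the CONFIG_ROOT variable and JSON format are assumptions for illustration, not Adept's actual layout): the same call works whether the code runs on the Slurm login node or inside a Metaflow container with a different working directory.

```python
import json
import os
from pathlib import Path


def load_config(name: str) -> dict:
    """Resolve a config file against CONFIG_ROOT so the same code finds
    its configs both on the Slurm login node and inside a Metaflow
    container, where files are re-rooted under a different directory."""
    root = Path(os.environ.get("CONFIG_ROOT", "."))
    return json.loads((root / name).read_text())
```

Setting CONFIG_ROOT in the container image (or in the flow's environment decorators) then removes the dependence on hard-coded file locations.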
A pragmatic decision was made to implement a hybrid approach rather than forcing a complete migration. The Metaflow flows serve as the orchestration layer, but each step SSHs into a bastion node on the Slurm cluster and runs the actual srun command for execution. This allows data scientists to continue working without interruption while the infrastructure team progressively migrates workloads to run natively on Kubernetes.
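A sketch of that stop-gap pattern, with a hypothetical bastion hostname and srun flags (the talk does not give the exact command): each Metaflow step composes an srun invocation and hands it to SSH, so orchestration happens on Kubernetes while execution stays on Slurm.

```python
import shlex
import subprocess

BASTION = "slurm-bastion.internal"  # hypothetical bastion hostname


def slurm_step_cmd(script: str, nodes: int, gpus_per_node: int = 8) -> list[str]:
    """Build the ssh + srun argv a Metaflow step would pass to subprocess:
    the step runs in a container on Kubernetes, but the actual training
    work is launched on the Slurm cluster through the bastion node."""
    srun = f"srun --nodes={nodes} --gpus-per-node={gpus_per_node} {shlex.quote(script)}"
    return ["ssh", BASTION, srun]


# Inside a Metaflow @step body, a step might then do:
#     subprocess.run(slurm_step_cmd("train.sh", nodes=4), check=True)
```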
This hybrid approach is explicitly described as a “stop-gap measure”—the ultimate goal remains running fully containerized workloads on Kubernetes. This honest acknowledgment of technical debt and incremental migration is a valuable lesson for other teams facing similar challenges.
The code repository had grown to approximately 1 GB due to historical use of Git LFS for storing model versions and data. This made containerization slow and cumbersome, especially when data scientists wanted to run fine-tuning jobs on their local code changes (requiring commit, container build, and execution at that commit hash).
The team addressed this by:
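One common mitigation for oversized LFS-backed repos, shown here as a general technique rather than Adept's documented fix, is to skip downloading LFS objects when cloning for a container build; Git LFS honors the GIT_LFS_SKIP_SMUDGE environment variable for this.

```python
import os


def lfs_skip_clone(repo_url: str, dest: str) -> tuple[list[str], dict]:
    """Return the git command and environment for a shallow clone that
    skips Git LFS smudging, keeping the container build context small.
    Pass both to subprocess.run(cmd, env=env, check=True) to execute."""
    env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
    cmd = ["git", "clone", "--depth", "1", repo_url, dest]
    return cmd, env
```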
To make the system accessible to multiple users, the team:
The presentation showcases several production workflows that demonstrate the practical application of this infrastructure:
This is a primary use case where the DAG-based approach of Metaflow shines. The workflow:
This seemingly simple DAG masks significant complexity in providing observability to data scientists.
The system supports various scheduled jobs:
Some data scientists discovered they could trigger new workflows programmatically from within their code. For example, when a training job completes, it can automatically launch an evaluation job by making requests to trigger Metaflow commands. This emergent use pattern demonstrates the flexibility of the system.
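That chaining pattern can be sketched as follows, with a hypothetical eval_flow.py and checkpoint parameter: Metaflow's CLI can trigger a flow that has been deployed to Argo Workflows, so a training run can launch evaluation on completion.

```python
import subprocess


def trigger_eval(checkpoint: str) -> list[str]:
    """Build the Metaflow CLI call that submits the deployed eval flow to
    Argo Workflows; eval_flow.py and --checkpoint are hypothetical names."""
    return ["python", "eval_flow.py", "argo-workflows", "trigger",
            "--checkpoint", checkpoint]


# At the end of a training run:
#     subprocess.run(trigger_eval("s3://ckpts/run-123"), check=True)
```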
The team implemented CI/CD through CircleCI that runs Metaflow's argo-workflows create command to automatically update workflow definitions. This automation ensures that workflow updates are deployed consistently without manual intervention.
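A CircleCI job along these lines could perform that deploy step; all job names, images, and paths here are illustrative assumptions, not Adept's actual configuration.

```yaml
# Hypothetical CircleCI config fragment for deploying flow definitions.
jobs:
  deploy-workflows:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - run:
          name: Push updated flow definitions to Argo Workflows
          command: |
            pip install metaflow
            python flows/finetune_flow.py argo-workflows create
workflows:
  main:
    jobs:
      - deploy-workflows:
          filters:
            branches:
              only: main
```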
Even while still on Slurm, the team uses Metaflow to orchestrate monitoring tasks:
The wins from this migration include:
The work still in progress includes:
The presenter mentions several features that would improve their setup:
This case study offers several lessons for teams building LLM training infrastructure:
The honest discussion of challenges, compromises, and work in progress makes this a particularly valuable case study for teams considering similar migrations.