Software Engineering

Scaling ML Workflows Across Multiple AWS Accounts (and Beyond): Best Practices for Enterprise MLOps

Hamza Tahir
Apr 28, 2025
12 mins

As enterprise machine learning initiatives scale, managing the lifecycle of models across separate AWS accounts for development, staging, and production environments becomes increasingly complex. This separation of environments—while essential for security, compliance, and resource isolation—creates significant operational challenges in the model promotion process.

Industry leaders like Aviva have implemented serverless MLOps platforms using Amazon SageMaker and the AWS Enterprise MLOps Framework, achieving remarkable cost reductions compared to on-premises solutions. However, the manual promotion process between AWS accounts often creates bottlenecks, delays model deployment, and introduces unnecessary operational overhead.

This article explores best practices and modern solutions for streamlining model promotion across multiple AWS accounts, ensuring governance without sacrificing agility.

The Cross-Account Challenge

Organizations with mature ML practices typically implement a multi-account strategy with distinct AWS accounts for:

  1. Development/Experimentation: Where data scientists build and test models
  2. Staging/QA: Where models undergo validation and compliance checks
  3. Production: Where approved models serve business-critical applications
Source: “How Aviva built a scalable, secure, and reliable MLOps platform using Amazon SageMaker.”

While this separation provides essential guardrails, it introduces several challenges:

  • Manual Model Promotion: Teams must manually export models from one account, validate them, and re-import them into the target account
  • Inconsistent Environments: Configuration drift between accounts leads to "works in development, fails in production" scenarios
  • Governance Complexity: Tracking model lineage and ensuring compliance across account boundaries
  • Limited Visibility: No unified view of model performance across environments
  • Resource Duplication: Redundant infrastructure and pipeline definitions across accounts

These challenges are especially acute for organizations managing dozens of models in production, where manual processes quickly become unsustainable.

10 Critical MLOps Challenges and How ZenML Solves Them

1 Complex Pipeline Definition Languages

The Problem:

AWS MLOps implementations often span multiple contexts, libraries, and SDKs, from YAML to Python, which becomes difficult to manage efficiently at scale. For example, creating a single training workflow might involve writing local scripts, pushing them to CodeCommit, and then writing a CI workflow that compiles everything into a SageMaker Pipelines definition.

Many different touchpoints are needed to deploy an ML pipeline on AWS
ZenML's Solution:

ZenML provides a simple Python SDK that feels natural to data scientists:

Python
from zenml import step, pipeline

@step
def preprocess_data(data_path: str):
    processed_data = ...  # plain-Python preprocessing logic
    return processed_data

@pipeline
def training_pipeline(data_path: str):
    processed_data = preprocess_data(data_path)
    # Rest of the pipeline in plain Python...

# This can now run locally or on SageMaker

2 Lack of Centralized Visibility

The Problem:

In traditional multi-account AWS setups, teams have no unified view of models across environments. Each AWS account has its own separate dashboards, metrics, and logs, making it difficult to track models through their entire lifecycle without custom solutions.

ZenML's Solution:

ZenML provides a central dashboard that shows all pipelines, models, and metrics across every environment. This unified view allows teams to:

  • Track model lineage from development to production
  • Compare model performance across environments
  • Monitor deployment status across all accounts
  • Manage approvals from a central location
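The payoff of a single view is easiest to see with a toy example: given run records from three environments, a few lines suffice to surface a model's lineage. This is a hypothetical sketch using plain dictionaries to stand in for what a central tracking server (such as the ZenML dashboard or client API) would expose; the model names and fields are invented for illustration:

```python
# Hypothetical run records, standing in for what a central server exposes
runs = [
    {"model": "claims-v3", "env": "dev",     "accuracy": 0.91, "status": "completed"},
    {"model": "claims-v3", "env": "staging", "accuracy": 0.90, "status": "completed"},
    {"model": "claims-v3", "env": "prod",    "accuracy": 0.89, "status": "running"},
    {"model": "fraud-v1",  "env": "dev",     "accuracy": 0.84, "status": "failed"},
]

def lineage(model_name, records):
    """Return a model's runs across environments, in promotion order."""
    order = {"dev": 0, "staging": 1, "prod": 2}
    hits = [r for r in records if r["model"] == model_name]
    return sorted(hits, key=lambda r: order[r["env"]])

for r in lineage("claims-v3", runs):
    print(f"{r['env']:8} acc={r['accuracy']} ({r['status']})")
```

With siloed AWS accounts, even this simple cross-environment query requires stitching together three separate consoles; with a central store it is a single lookup.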

3 Configuration Management for Different Personas

The Problem:

Different team members (data scientists, ML engineers, DevOps) need different configurations across AWS accounts, creating significant overhead and often leading to environment inconsistencies.

ZenML's Solution:

ZenML's stack concept allows each team member to work with configurations tailored to their role:

Shell
# Step 1: Test in development environment
zenml stack set dev-stack  # Configuration for data scientists
python run.py train --insurance_claim_pipeline --data sample_claims.csv

# Step 2: Validate in staging environment
zenml stack set staging-stack  # Switch environments seamlessly
python run.py train --insurance_claim_pipeline --data validation_claims.csv

# Step 3: Deploy to production
zenml stack set prod-stack  # Works across AWS accounts
python run.py deploy --pipeline insurance_claim_pipeline --endpoint claims-api

4 Dependency Management Across Environments

The Problem:

Slight variations in package versions between AWS accounts can cause models to behave differently in production than in development, leading to unexpected failures and inconsistent results.

ZenML's Solution:

ZenML containerizes pipeline steps to ensure identical dependencies:

Python
from zenml import step
from zenml.config import DockerSettings

@step(settings={"docker": DockerSettings(parent_image="custom-ml-image:1.2.3")})
def train_insurance_model(claim_data):
    model = ...  # training logic runs inside the pinned parent image
    return model

@step(settings={"docker": DockerSettings(requirements=["torch"])})
def deploy_model(model):
    ...  # deployment logic with torch guaranteed to be installed
    return model

5 Pipeline Orchestration Complexity

The Problem:

While AWS has many amazing tools, authoring AWS Sagemaker pipelines can be cumbersome for data scientists to create and maintain, especially across multiple AWS accounts. The business and infrastructure logic can easily get mixed up.

Python
# Infrastructure configuration - VPC settings
vpc_id = "vpc-1234abcd"
subnet_ids = ["subnet-1234abcd", "subnet-5678efgh"]
security_group_ids = ["sg-1234abcd"]
network_config = NetworkConfig(
    vpc_id=vpc_id,
    subnet_ids=subnet_ids,
    security_group_ids=security_group_ids,
    enable_network_isolation=True
)

# Infrastructure configuration - IAM roles and S3 buckets
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
bucket = "insurance-claims-ml-123456789012"
kms_key = "arn:aws:kms:us-west-2:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"

# Infrastructure configuration - SageMaker session
sagemaker_session = sagemaker.Session(
    boto_session=boto3.Session(region_name="us-west-2"),
    default_bucket=bucket
)

# Create the pipeline - pulling everything together
pipeline = Pipeline(
    name="ClaimsProcessingPipeline",
    steps=[preprocessing_step, training_step, model_step],
    sagemaker_session=sagemaker_session,
)

# Submit the pipeline
pipeline.upsert(role_arn=role)
execution = pipeline.start()
ZenML's Solution:

ZenML automatically generates proper DAGs from simple Python functions and separates the infrastructure-level configuration with the Stack concept.

Shell
zenml orchestrator register dev_sagemaker_orchestrator \
    --flavor=sagemaker \
    --execution_role=arn:aws:iam::123456789012:role/SageMakerExecutionRole

zenml artifact-store register dev_artifact_store \
    --flavor=s3 \
    --path=s3://insurance-claims-ml-123456789012

zenml stack register dev_stack -o dev_sagemaker_orchestrator -a dev_artifact_store

zenml stack set dev_stack
Python
@pipeline
def claims_processing_pipeline():
    data = ingest_claims_data()
    features = extract_claim_features(data)
    model = predict_repair_cost(features)
    # ZenML automatically figures out the execution order

claims_processing_pipeline()  # Will run on dev_stack

6 Local Development vs. Cloud Execution Gap

The Problem:

Data scientists develop locally but deploy to AWS cloud environments, creating a significant gap between development and production workflows that leads to "works on my machine" problems.

ZenML's Solution:

ZenML allows the exact same code to run locally or in any AWS account; only the active stack changes between environments.
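A minimal sketch of what this looks like day to day (the `run.py` entrypoint and stack names here are assumptions, mirroring the earlier examples):

```shell
# Develop on your laptop against a local stack
zenml stack set local-stack
python run.py   # executes locally for fast iteration

# Point the same entrypoint at the production AWS account
zenml stack set prod-stack
python run.py   # identical code now runs on SageMaker
```

Because the pipeline code never references infrastructure directly, there is no separate "cloud version" to maintain, which closes the "works on my machine" gap.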

7 Hybrid and Multi-Cloud Deployments

The Problem:

Many organizations operate across multiple cloud providers or maintain hybrid environments, which creates complexity when trying to maintain consistent ML operations.

ZenML's Solution:

ZenML's stack abstraction layer allows the same pipelines to run seamlessly across different cloud providers using detailed stack configurations. Stack components for different providers and regions can be configured with a few clicks in the dashboard (or via Terraform or the API).
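As a hypothetical sketch, registering a second stack for another provider mirrors the AWS case; the component names, region, and bucket below are invented for illustration:

```shell
# Assumed GCP components, mirroring the earlier AWS stack
zenml orchestrator register gcp_vertex_orchestrator \
    --flavor=vertex \
    --location=europe-west1

zenml artifact-store register gcp_artifact_store \
    --flavor=gcp \
    --path=gs://insurance-claims-ml-eu

zenml stack register gcp_stack -o gcp_vertex_orchestrator -a gcp_artifact_store

# The same pipeline code now targets GCP
zenml stack set gcp_stack
python run.py train --insurance_claim_pipeline --data sample_claims.csv
```

The pipeline definition is untouched; only the stack selection decides whether execution lands on SageMaker, Vertex AI, or an on-premises orchestrator.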

8 Model Governance and Compliance

The Problem:

Regulated industries like financial services and insurance companies face strict regulatory requirements that are difficult to enforce consistently across separate AWS accounts. This creates challenges for model reproducibility, audit trails, and compliance verification.

ZenML's Solution:

ZenML enables model governance through specialized pipeline steps that can be appended to any pipeline, generating compliance reports and visualizations automatically. Furthermore, integrations with data validation tools like Evidently make this even easier:

Python
# Illustrative compliance-visualization step (RegulatoryDashboard is a placeholder class)
@step
def regulatory_visualization_step(compliance_report, drift_report):
    """Generates visualization dashboards for regulatory review"""
    # Create interactive dashboard
    dashboard = RegulatoryDashboard()
    dashboard.add_bias_metrics(compliance_report.bias_metrics)
    dashboard.add_drift_analysis(drift_report)
    dashboard.add_model_explainability(compliance_report.shap_values)
    
    # Export to standard formats required by regulators
    dashboard.export_pdf("regulatory_compliance_report.pdf")
    dashboard.export_html("interactive_compliance_dashboard.html")
    
    return dashboard

This approach allows organizations to implement "data-quality gates & alerting easily," as one customer case study puts it, while Brevo has leveraged similar techniques to become "a safer platform, fighting against fraudsters and scammers." The visualizations and reports generated serve as crucial documentation for audit purposes while ensuring consistent governance across all AWS environments.

9 Cost Management and Resource Optimization

The Problem:

ML operations across multiple AWS accounts often result in inefficient resource allocation and unnecessary cloud costs.

ZenML's Solution:

ZenML enables fine-grained control over resource allocation for each step:

Python
@step(settings={"resources": ResourceSettings(cpu=2, memory="4Gi")})
def preprocess_data(raw_data):
    # Uses minimal resources for preprocessing
    return processed_data

@step(settings={"resources": ResourceSettings(gpu=1, memory="16Gi")})
def train_model(processed_data):
    # Uses GPU only for the training step
    return model

10 Experiment Tracking and Reproducibility

The Problem:

Tracking experiments across multiple AWS accounts and ensuring reproducibility is challenging without centralized systems.

ZenML's Solution:

ZenML provides consistent experiment tracking regardless of where experiments run:

Python
from typing import Any, Dict

from zenml import log_metadata, step

@step
def train_model(params: Dict[str, Any]):
    # Parameters, metrics, and artifacts are automatically tracked
    model = train_with_params(params)
    log_metadata(metadata={"accuracy": model.accuracy})
    return model

You can then compare this metadata easily in one interface, across stacks and models.

Transforming MLOps Across AWS Accounts

Organizations implementing modern MLOps solutions across multiple AWS accounts have achieved remarkable improvements, as evidenced by real-world case studies. Brevo (formerly Sendinblue) reduced its ML deployment time by 80%, taking what was once a month-long development-to-deployment process and dramatically shortening it. Similarly, ADEO Leroy Merlin decreased their time-to-market from 2 months to just 2 weeks, a 75% reduction. Both companies have successfully deployed multiple models into production (5 models for Brevo and 5 for ADEO, with the latter targeting 20 by the end of 2024).

These implementations have yielded significant operational benefits: ADEO's data scientists gained autonomy in pipeline creation and deployment, eliminating bottlenecks between teams, while Brevo achieved enhanced team productivity with just 3-4 data scientists independently handling end-to-end ML use cases. Additionally, both organizations reported improved business outcomes, including better fraud targeting for Brevo and seamless cross-country model deployment for ADEO. These real-world results demonstrate how proper MLOps implementation can transform operations across multiple environments while maintaining necessary governance and security controls.

Modern MLOps for Multi-Account AWS

As organizations scale their ML operations across AWS accounts, they need solutions that address these 10 critical challenges. ZenML provides a comprehensive approach that allows data scientists to focus on ML innovation rather than infrastructure complexity.

ZenML's Python-first approach, stack-based architecture, and unified dashboard enable teams to maintain all the security and governance benefits of multi-account AWS setups while dramatically reducing operational overhead and accelerating time-to-value.

To learn more about how ZenML can transform your cross-account MLOps strategy, make a free account on ZenML Pro or try our open-source project on GitHub.

Looking to Get Ahead in MLOps & LLMOps?

Subscribe to the ZenML newsletter and receive regular product updates, tutorials, examples, and more articles like this one.