Build an agent factory with Kitaru — our durable runtime for production AI agents. Read the docs →

The unified layer for ML and AI

Orchestrate training pipelines and durable AI agents on the tools, clouds, and environments you already use — without rewriting your stack.

Integrate your MLOps stack
For compute-intensive, distributed ML pipelines.
simple_pipeline · run #7
quickstart.py ZENML
from typing import Annotated

from zenml import pipeline, step

@step
def simple_step(name: str = "World") -> Annotated[str, "greeting"]:
    return f"Hello, {name}! Welcome to ZenML!"

@pipeline
def simple_pipeline(name: str = "World"):
    return simple_step(name=name)
name
name str
default World
simple_step
2s
greeting str · v7
local · default stack healthy
training · run #32
pipelines/training.py ZENML
import pandas as pd
from sklearn.base import ClassifierMixin
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from zenml import pipeline, step

@step
def data_loader(random_state: int) -> pd.DataFrame:
    return load_breast_cancer(as_frame=True).frame

@step
def model_trainer(dataset_trn: pd.DataFrame) -> ClassifierMixin:
    return SGDClassifier().fit(dataset_trn.drop("target", axis=1), dataset_trn.target)

@pipeline
def training(model_type: str = "sgd"):
    model_trainer(dataset_trn=data_loader(random_state=17))
data_loader
7s
dataset_trn DataFrame
dataset_tst DataFrame
model_trainer
4m 5s
sklearn_classifier sklearn · v32
kubernetes-prod · 12 pods healthy
llm_peft_full_finetune · run #18
pipelines/train.py ZENML
from zenml import pipeline
from steps import prepare_data, finetune, evaluate_model, promote

@pipeline
def llm_peft_full_finetune(
    base_model_name: str = "microsoft/phi-2",
    dataset_name: str = "gem/viggo",
):
    datasets_dir = prepare_data(base_model_name, dataset_name)
    ft_model_dir = finetune(base_model_name, datasets_dir)
    evaluate_model(base_model_name, ft_model_dir, datasets_dir)
    promote(ft_model_dir)
prepare_data
3m 41s
datasets_dir Path
tokenizer PreTrainedTokenizer
finetune
58m 22s
ft_model_dir phi-2 · v18
vertex-gcp · 1× A100 80GB healthy
churn_inference_pipeline · run #44
pipelines/inference_pipeline.py ZENML
from typing import Dict

from zenml import pipeline
from zenml.config import DeploymentSettings
from steps.inference import predict_churn

# init_model is assumed to be defined or imported elsewhere in the project.
@pipeline(
    on_init=init_model,
    settings={"deployment": DeploymentSettings(
        app_title="Churn Prediction API",
        dashboard_files_path="ui",
    )},
)
def churn_inference_pipeline(customer_features: Dict) -> Dict:
    return predict_churn(customer_features=customer_features)
init_model
warm
model RandomForest
customer Dict
predict_churn
87ms
prediction Dict · v44
sagemaker-aws · 2 replicas live
object_detection_training · run #12
pipelines/training_pipeline.py ZENML
from zenml import pipeline
from steps import load_coco_dataset, train_yolo, fiftyone_analysis

@pipeline
def object_detection_training_pipeline(
    max_samples: int = 50,
    epochs: int = 1,
    model_name: str = "yolov8n.pt",
):
    dataset = load_coco_dataset(max_samples=max_samples)
    model = train_yolo(dataset=dataset, epochs=epochs, model_name=model_name)
    fiftyone_analysis(dataset=dataset, model=model)
load_coco_dataset
1m 02s
dataset FiftyOneDataset
labels COCO80
train_yolo
17m 14s
yolo-model ultralytics · v12
airflow · 4× T4 GPU healthy
Stack composer Swap orchestrator, store, tracker — same pipeline code.
5 components
Stack            Orchestrator  Artifact store  Container reg  Tracker
local-dev        local         local           default        mlflow
kubernetes-prod  kubeflow      s3://prod       ecr            mlflow
vertex-gcp       vertex        gcs://prod      gcr            neptune
sagemaker-aws    sagemaker     s3://eu-west    ecr            w&b
airflow-staging  airflow       gcs://staging   gcr            mlflow
azureml-eu       azureml       azure://eu      acr            comet
Register the stack 1 command
$ zenml stack register local-dev -o default -a default --set
Register the stack 3 commands
$ zenml service-connector register aws-prod --type aws -i
$ zenml orchestrator register kubeflow-orch --flavor kubeflow --connector aws-prod
$ zenml stack register kubernetes-prod -o kubeflow-orch -a s3-prod --set
Register the stack 3 commands
$ zenml service-connector register gcp-prod --type gcp -i
$ zenml orchestrator register vertex-orch --flavor vertex --connector gcp-prod
$ zenml stack register vertex-gcp -o vertex-orch -a gcs-prod --set
Register the stack 3 commands
$ zenml service-connector register aws-eu --type aws -i
$ zenml orchestrator register sagemaker-orch --flavor sagemaker --connector aws-eu
$ zenml stack register sagemaker-aws -o sagemaker-orch -a s3-eu --set
Register the stack 3 commands
$ zenml service-connector register gcp-staging --type gcp -i
$ zenml orchestrator register airflow-orch --flavor airflow --connector gcp-staging
$ zenml stack register airflow-staging -o airflow-orch -a gcs-staging --set
Register the stack 3 commands
$ zenml service-connector register azure-eu --type azure -i
$ zenml orchestrator register azureml-orch --flavor azureml --connector azure-eu
$ zenml stack register azureml-eu -o azureml-orch -a blob-eu --set
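The idea behind the stack composer, where the pipeline code stays fixed while the stack underneath it changes, can be sketched in plain Python. This is a conceptual illustration only and not ZenML's API: the `Stack` dataclass and `run_pipeline` function are hypothetical names, and the component values are copied from the stack list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    """Hypothetical stand-in for a registered stack (not a ZenML class)."""
    name: str
    orchestrator: str
    artifact_store: str


def simple_pipeline(name: str = "World") -> str:
    # The pipeline logic knows nothing about where it runs.
    return f"Hello, {name}! Welcome to ZenML!"


def run_pipeline(stack: Stack, **params: str) -> dict:
    # A real orchestrator would schedule the work remotely; this sketch only
    # records which stack the unchanged pipeline function ran against.
    return {
        "stack": stack.name,
        "orchestrator": stack.orchestrator,
        "artifact_store": stack.artifact_store,
        "output": simple_pipeline(**params),
    }


local_dev = Stack("local-dev", "local", "local")
k8s_prod = Stack("kubernetes-prod", "kubeflow", "s3://prod")

local_run = run_pipeline(local_dev, name="World")
prod_run = run_pipeline(k8s_prod, name="World")
```

Both runs produce the same output; only the recorded stack components differ, which is the property the registration commands above are meant to preserve.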
churn_predictor v18 of 18
6c5e0a14 · sklearn 1.4 · 2.4 MB
Version Evolution 18 versions · +4.2pt since v1
v1 v8 · DEV v12 · STAGE v18 · PROD
Metadata 4 fields
accuracy 0.926
f1 score 0.911
train rows 47,392
promoted v18 → PROD
Lineage Produced by step predict_on_endpoint · used by model churn_predictor·v18 · deployed to kubernetes-prod
Where it runs 3 envs · 4 endpoints
PROD
v18 vertex · 2 endpoints
99.97% uptime
STAGE
SHADOW
v18 k8s · shadow traffic
promoted 3h ago
CANARY
v18·rc2 vertex · 5% rollout
accuracy +0.4pt vs v18
Registry 42 models · 248 artifacts
last promoted 3h ago
INSIDE THE @STEP Use PyTorch in any @step. Bring your own. ZenML wraps it — you don't change your training loop.
torch 2.4.1 · CUDA 12.1
training_pipeline.py
@step · zenml
import torch
from zenml import step

@step(enable_cache=False)
def train_model(
    X: torch.Tensor, y: torch.Tensor
) -> torch.nn.Module:
    model = torch.nn.Sequential(
        torch.nn.Linear(784, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    )
    return model  # auto-versioned by ZenML
What ZenML gives you automatic versioning GPU pinning any torch version
Good fit Custom training loops and research code that changes often.
Trade-off Large CUDA images mean slower cold starts on remote stacks.
INSIDE THE @STEP Train Keras models in any @step. ZenML snapshots your SavedModel automatically — no boilerplate.
tensorflow 2.16.1 · Keras 3
train_classifier.py
@step · zenml
import tensorflow as tf
from zenml import step

@step(enable_cache=False)
def train_classifier(
    X_train: tf.Tensor, y_train: tf.Tensor
) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    return model  # saved as SavedModel artifact
What ZenML gives you SavedModel artifact Keras 3 support auto caching
Good fit Production Keras models with a stable SavedModel format.
Trade-off Heavier dependency — version pinning matters across environments.
INSIDE THE @STEP Fit any sklearn estimator in a @step. ZenML auto-pickles your model and registers it in the model registry.
scikit-learn 1.5.2
pipelines/training.py
@step · zenml
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from typing_extensions import Annotated
from zenml import ArtifactConfig, step

@step
def model_trainer(
    dataset_trn: pd.DataFrame,
) -> Annotated[RandomForestClassifier,
               ArtifactConfig(is_model_artifact=True)]:
    model = RandomForestClassifier()
    model.fit(dataset_trn.drop('target', axis=1), dataset_trn['target'])
    return model
What ZenML gives you pickle materializer model registry ArtifactConfig
Good fit Tabular models where fast iteration beats raw scale.
Trade-off Pickled estimators are Python-version sensitive across envs.
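One way to surface the Python-version sensitivity mentioned above is to record the interpreter version next to the pickle and check it before loading. The sketch below uses only the standard library; `dump_with_env` and `load_checked` are hypothetical helpers, not part of ZenML or scikit-learn, and a plain dict stands in for a fitted estimator.

```python
import json
import pickle
import sys


def dump_with_env(obj) -> bytes:
    # Store a small JSON header (readable without unpickling) plus the pickle.
    header = json.dumps({"python": list(sys.version_info[:2])}).encode()
    return header + b"\n" + pickle.dumps(obj)


def load_checked(blob: bytes, running=None):
    # Only unpickle data you trust; pickle can execute arbitrary code.
    header, body = blob.split(b"\n", 1)
    saved = tuple(json.loads(header)["python"])
    current = tuple(running or sys.version_info[:2])
    if saved != current:
        raise RuntimeError(f"pickled under Python {saved}, running {current}")
    return pickle.loads(body)


model = {"coef": [0.1, 0.2]}  # stand-in for a fitted estimator
blob = dump_with_env(model)
restored = load_checked(blob)

try:
    load_checked(blob, running=(2, 7))  # simulate a mismatched environment
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True
```

Keeping the header outside the pickle means the version check can run even when the payload itself would fail to load.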
INSIDE THE @STEP Return DataFrames from any @step. ZenML materializes your DataFrame as a versioned artifact — no manual saving.
pandas 2.2.2
steps/data_loader.py
@step · zenml
import pandas as pd
from sklearn.datasets import load_breast_cancer
from typing_extensions import Annotated
from zenml import step

@step
def data_loader(
    random_state: int,
) -> Annotated[pd.DataFrame, 'dataset']:
    df = load_breast_cancer(as_frame=True).frame
    df.reset_index(drop=True, inplace=True)
    return df  # versioned DataFrame artifact
What ZenML gives you DataFrame artifact versioned by run lazy loading
Good fit Feature prep and ETL when the dataset fits in memory.
Trade-off In-memory DataFrames strain on very large datasets.
INSIDE THE @STEP Fine-tune any HF model in a @step. ZenML saves your model checkpoint as a versioned artifact on any cloud.
transformers 4.44.2 · PEFT 0.12
steps/finetune.py
@step · zenml
from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig
from zenml import step

@step(enable_cache=False)
def finetune_step(
    base_model_name: str, datasets_dir: str
) -> str:
    model = AutoModelForCausalLM.from_pretrained(base_model_name)
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32))
    # trainer.train() runs here; ZenML tracks the checkpoint
    return datasets_dir
What ZenML gives you LoRA / PEFT checkpoint artifact remote GPU stack
Good fit Fine-tuning transformers and LLMs with PEFT or LoRA.
Trade-off Checkpoints are large — budget artifact-store space and transfer.
INSIDE THE @STEP Train XGBoost models in a @step. ZenML registers your Booster as a model artifact with full lineage.
xgboost 2.1.1
steps/train_xgb.py
@step · zenml
import pandas as pd
import xgboost as xgb
from zenml import step

@step
def train_xgb_model(
    df_train: pd.DataFrame, label_col: str = 'target'
) -> xgb.Booster:
    dtrain = xgb.DMatrix(
        df_train.drop(columns=[label_col]), df_train[label_col]
    )
    return xgb.train({'max_depth': 6}, dtrain, num_boost_round=100)
What ZenML gives you Booster artifact full lineage cache-aware
Good fit Strong tabular baselines with minimal tuning.
Trade-off Booster objects need the matching XGBoost version to reload.
INSIDE THE @STEP Run LightGBM training in a @step. ZenML saves your LGBMModel as an artifact and links it to the run.
lightgbm 4.5.0
steps/train_lgbm.py
@step · zenml
import lightgbm as lgb
import pandas as pd
from zenml import step

@step
def train_lgbm(
    df_train: pd.DataFrame, label: str = 'target'
) -> lgb.LGBMClassifier:
    clf = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
    clf.fit(df_train.drop(columns=[label]), df_train[label])
    return clf
What ZenML gives you sklearn API GPU trees versioned model
Good fit Fast gradient boosting on wide tabular data.
Trade-off GPU builds need extra setup in the orchestrator image.
INSIDE THE @STEP Pass ndarrays between @steps. ZenML serializes NumPy arrays automatically — share them across steps.
numpy 2.1.0
steps/preprocess.py
@step · zenml
import numpy as np
from typing_extensions import Annotated
from zenml import step

@step
def normalize_features(
    X_raw: np.ndarray,
) -> tuple[
    Annotated[np.ndarray, 'X_norm'],
    Annotated[np.ndarray, 'mean'],
]:
    mean = X_raw.mean(axis=0)
    return (X_raw - mean) / X_raw.std(axis=0), mean
What ZenML gives you ndarray artifact tuple outputs content-hashed
Good fit Passing numerical arrays cleanly between steps.
Trade-off Raw ndarrays carry no schema — annotate outputs for clarity.
INSIDE THE @STEP Use Polars DataFrames in a @step. ZenML materializes Polars DataFrames — fast ETL without Spark overhead.
polars 1.9.0
steps/feature_eng.py
@step · zenml
import polars as pl
from typing_extensions import Annotated
from zenml import step

@step
def build_features(
    raw_path: str,
) -> Annotated[pl.DataFrame, 'features']:
    return (
        pl.scan_parquet(raw_path)
        .filter(pl.col('value') > 0)
        .collect()
    )
What ZenML gives you Parquet artifact lazy execution fast ETL
Good fit Large ETL that's too big for pandas, too small for Spark.
Trade-off Newer ecosystem — fewer integrations than pandas.
INSIDE THE @STEP Log experiments to W&B from a @step. ZenML connects your stack's experiment tracker — one decorator, full lineage.
wandb 0.18.3
steps/train_with_tracking.py
@step · zenml
import wandb
from zenml import step
from zenml.integrations.wandb.flavors import WandbExperimentTrackerSettings

@step(
    experiment_tracker='wandb_tracker',
    settings={'experiment_tracker.wandb':
        WandbExperimentTrackerSettings(tags=['training', 'v2'])},
)
def train_and_log(X_train, y_train) -> float:
    wandb.log({'loss': 0.42, 'accuracy': 0.91})
    return 0.91
What ZenML gives you experiment tracker sweep support run linking
Good fit Rich experiment tracking and sweep visualization.
Trade-off Adds an external service and API key to manage.
1f42d62d production-gpu-pool HIGH LOAD
GPUs 75%
6 / 8
CPU (Cores) 75%
18 / 24
Memory (GB) 75%
48 / 64
Parallel Pipelines 100%
21 / 21
Parallel Steps 68%
68 / 100
A100 50%
4 / 8
Active Jobs
step_032 1f42d62d
pipeline_032 #0045
2 GPUs · 4 CPU · +1
19s
CRITICAL
step_033 4e7a34bc
pipeline_032
4 GPUs · 8 CPU · +1
45s
HIGH
step_034 8c5e1abc
pipeline_032
1 GPU · 16 GB · +1
59s
MEDIUM
CONNECTED kubernetes docker aws_ec2 google_cloud azure
5 active
Wrap your harness with one line
For deploying agents at scale.
research_agent · run #4127 llm_writer · run #8341 audit_company · run #2014 wait_for_approval · run #1198 news_scout · run #6720
first_working_flow.py KITARU
from kitaru import checkpoint, flow

@checkpoint
def gather_sources(topic: str) -> str:
    return f"Source notes on {topic}."

@checkpoint
def summarize(notes: str) -> str:
    return f"Summary: {notes.split(':')[0].lower()}."

@flow
def research_agent(topic: str) -> str:
    notes = gather_sources(topic)
    return summarize(notes)
Execution timeline
flow llm tool checkpoint
Span
0s 30s 1m 1m30s 2m
research_agent
gather_sources
chkpt.notes
summarize
runtime: kubernetes · 1 checkpoint persisted · resumable LIVE · 1m 47s
flow_with_llm.py KITARU
import kitaru
from kitaru import checkpoint, flow

@checkpoint
def write_draft(topic, outline):
    return kitaru.llm(
        f"Write a paragraph about {topic} from {outline}.",
        model="fast", name="draft_call",
    )

@flow
def llm_writer(topic: str) -> str:
    outline = kitaru.llm(
        f"Create a 3-bullet outline about {topic}.",
        model="fast", name="outline_call",
    )
    return write_draft(topic, outline)
Execution timeline
flow llm tool checkpoint
Span
0s 3s 6s 9s 12s
llm_writer
outline_call
chkpt.outline
draft_call
chkpt.draft
runtime: kubernetes · 2 checkpoints persisted · resumable COMPLETED · 11s
stage_2_multi_domain.py KITARU
from kitaru import checkpoint, flow

@checkpoint
def check_hr_compliance(prompt=HR_PROMPT):
    return _run_domain_turn(prompt, domain="hr")

@checkpoint
def check_it_security(prompt=IT_PROMPT):
    return _run_domain_turn(prompt, domain="it_security")

@checkpoint
def synthesize_report(hr, it, vendors, ins):
    return _run_agent(SYNTHESIS_PROMPT.format(...))

@flow
def audit_company():
    hr, it, v, ins = run_domain_checks()
    return synthesize_report(hr, it, v, ins)
Execution timeline
flow llm tool checkpoint
Span
0s 2m 4m 6m 8m
audit_company
check_hr_compliance
check_it_security
check_vendor_contracts
check_insurance
chkpt.findings
synthesize_report
runtime: kubernetes · 4 checkpoints persisted · resumable LIVE · 6m 18s
wait_and_resume.py KITARU
import kitaru
from kitaru import checkpoint, flow

@checkpoint
def draft_release_note(topic: str) -> str:
    return f"Draft about {topic}."

@flow
def wait_for_approval_flow(topic: str) -> str:
    draft = draft_release_note(topic)
    approved = kitaru.wait(
        name="approve_release",
        schema=bool,
        question=f"Approve {topic}?",
        timeout=3600,
    )
    if approved is False:
        return f"REJECTED: {topic}"
    # publish_release_note is another @checkpoint, not shown in this excerpt.
    return publish_release_note(draft)
Execution timeline
flow llm tool checkpoint
Span
0s 30s 1m 1m30s 2m
wait_for_approval_flow
draft_release_note
chkpt.draft
wait.approve_release
publish_release_note
runtime: kubernetes · compute released · resumable via CLI PAUSED · awaiting input
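The pause-and-resume behaviour shown above can be modelled in a few lines of plain Python. This is a sketch of the semantics only, not Kitaru's implementation: `STORE`, `checkpoint`, `wait`, and `WaitingForInput` are hypothetical local helpers, and an in-memory dict stands in for durable storage.

```python
class WaitingForInput(Exception):
    """Raised to stop the flow until a human answers (hypothetical)."""


STORE: dict = {}  # stands in for durable storage
CALLS: list = []  # records which step bodies actually executed


def checkpoint(name, fn, *args):
    # Run fn once; on later replays return the persisted result instead.
    if name not in STORE:
        CALLS.append(name)
        STORE[name] = fn(*args)
    return STORE[name]


def wait(name):
    # If no answer has been recorded yet, stop the run here.
    key = f"wait.{name}"
    if key not in STORE:
        raise WaitingForInput(name)
    return STORE[key]


def approval_flow(topic):
    draft = checkpoint("draft", lambda t: f"Draft about {t}.", topic)
    approved = wait("approve_release")
    if approved is False:
        return f"REJECTED: {topic}"
    return checkpoint("publish", lambda d: f"PUBLISHED: {d}", draft)


# First attempt pauses at the wait; nothing keeps running afterwards.
try:
    approval_flow("v1.2")
    paused = False
except WaitingForInput:
    paused = True

# A reviewer answers later; replaying the flow skips the saved draft step.
STORE["wait.approve_release"] = True
result = approval_flow("v1.2")
```

Because the draft is read back from the store on the second pass, the only new work after approval is the publish step.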
scout.py KITARU
from kitaru import checkpoint, flow
from kitaru.adapters.pydantic_ai import KitaruAgent
from pydantic_ai import Agent

scout_agent = KitaruAgent(
    Agent(MODEL, name="news_scout",
          tools=[search_news, search_twitter,
                 investigate, fetch_url]),
    granular_checkpoints=True,
)

@checkpoint
def publish_report(text: str) -> str:
    return text

@flow
def news_scout(interests: list[str]) -> str:
    result = scout_agent.run_sync(
        build_user_prompt(interests),
    )
    return publish_report(result.output)
Execution timeline
flow llm tool checkpoint
Span
0s 1m 2m 3m 3m30s
news_scout
agent.plan
search_news
search_twitter
investigate
fetch_url
agent.summarize
publish_report
runtime: kubernetes · 12 checkpoints persisted · replayable COMPLETED · 3m 22s
EXECUTION CHECKPOINTS Resume from any checkpoint after a crash, rate-limit, or eviction.
report_agent · ex_8a2f
EXECUTION TIMELINE
checkpoint failure resume
gather 12s
llm.outline 33s
chkpt.notes 0.4s
llm.write 1m 14s
resumed +11s
draft 42s
chkpt.draft 0.3s
persist queued
429 RATE-LIMIT ↗ at 1m 14s
openai api · llm.write
RESUMED FROM CHECKPOINT 3 14:02:59
$ kitaru flow resume ex_8a2f --from chkpt.notes
Saved 47s · 2 LLM calls not re-issued
ex_8a2f · 6 checkpoints · 38m 12s
resumed
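Resuming after a rate limit can be sketched as follows, assuming each step's result is persisted as soon as it completes. This is not Kitaru's code: `run`, `RateLimitError`, and the JSON file are hypothetical stand-ins, with step names borrowed from the timeline above.

```python
import json
import tempfile
from pathlib import Path


class RateLimitError(Exception):
    """Simulated 429 from an LLM provider (hypothetical)."""


def run(steps, store_path, executed, fail_on=None):
    # Load whatever earlier attempts already persisted.
    done = json.loads(store_path.read_text()) if store_path.exists() else {}
    value = None
    for name, fn in steps:
        if name in done:
            value = done[name]  # reuse the saved result, do not re-run
            continue
        if name == fail_on:
            raise RateLimitError(name)
        value = fn(value)
        executed.append(name)
        done[name] = value
        store_path.write_text(json.dumps(done))  # persist after every step
    return value


STEPS = [
    ("gather", lambda _: "notes"),
    ("llm.outline", lambda notes: f"outline({notes})"),
    ("llm.write", lambda outline: f"draft({outline})"),
]

store = Path(tempfile.mkdtemp()) / "checkpoints.json"

first_attempt: list = []
try:
    run(STEPS, store, first_attempt, fail_on="llm.write")
except RateLimitError:
    pass  # the run stops here; the first two results are already on disk

second_attempt: list = []
final = run(STEPS, store, second_attempt)
```

The second attempt executes only `llm.write`; the earlier steps are served from the persisted file rather than re-issued.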
BASELINE VS CANDIDATE Compare v18·rc2 against v17 baseline.
400 runs · 48m total
BASELINE v17 — prompt:v2 + gpt-4o
RUNS 400
DURATION 7m 21s
COST $0.39
LATENCY P95 1.9s
CANDIDATE v18·rc2 — prompt:tone-direct
RUNS
400 same
DURATION
4m 38s −37%
COST
$0.26 −$0.13
LATENCY P95
1.2s −37%
Ship v18·rc2. 400 / 400 runs better on cost and latency · no quality regression.
BASELINE VS CANDIDATE Compare v18·rc1 against v17 baseline.
400 runs · 52m total
BASELINE v17 — prompt:v2 + gpt-4o
RUNS 400
DURATION 7m 21s
COST $0.39
LATENCY P95 1.9s
CANDIDATE v18·rc1 — model:claude-3-opus
RUNS
400 same
DURATION
8m 02s +9%
COST
$0.81 +$0.42
LATENCY P95
2.3s +21%
Hold v18·rc1. Higher quality ceiling but 2× cost increase · not worth promoting.
BASELINE VS CANDIDATE v17 is the current production baseline.
400 runs · 49m total
BASELINE v17 — prompt:v2 + gpt-4o
RUNS 400
DURATION 7m 21s
COST $0.39
LATENCY P95 1.9s
CANDIDATE v17 — same as baseline
RUNS
400 same
DURATION
7m 21s same
COST
$0.39 same
LATENCY P95
1.9s same
Currently in production. v17 is the active baseline · select a candidate to compare.
BASELINE VS CANDIDATE Compare v18·rc3 against v17 baseline.
400 runs · 31m total
BASELINE v17 — prompt:v2 + gpt-4o
RUNS 400
DURATION 7m 21s
COST $0.39
LATENCY P95 1.9s
CANDIDATE v18·rc3 — prompt:v3 + gpt-4o-mini
RUNS
400 same
DURATION
3m 14s −56%
COST
$0.11 −$0.28
LATENCY P95
0.8s −58%
Block v18·rc3. 12 / 400 runs show quality regression · cost savings do not offset.
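The deltas in these comparison cards follow from simple arithmetic against the v17 baseline. The sketch below recomputes them from the figures shown above; the helper names (`seconds`, `pct_change`, `compare`) are illustrative and not part of any product API.

```python
def seconds(minutes: int, secs: int) -> int:
    return minutes * 60 + secs


def pct_change(baseline: float, candidate: float) -> int:
    """Relative change versus the baseline, rounded to a whole percent."""
    return round((candidate - baseline) / baseline * 100)


# Figures copied from the cards above (v17 is the baseline).
BASELINE = {"duration_s": seconds(7, 21), "cost": 0.39, "p95_s": 1.9}
CANDIDATES = {
    "rc2": {"duration_s": seconds(4, 38), "cost": 0.26, "p95_s": 1.2},
    "rc1": {"duration_s": seconds(8, 2), "cost": 0.81, "p95_s": 2.3},
    "rc3": {"duration_s": seconds(3, 14), "cost": 0.11, "p95_s": 0.8},
}


def compare(candidate: dict) -> dict:
    return {
        "duration_pct": pct_change(BASELINE["duration_s"], candidate["duration_s"]),
        "cost_delta": round(candidate["cost"] - BASELINE["cost"], 2),
        "p95_pct": pct_change(BASELINE["p95_s"], candidate["p95_s"]),
    }


report = {name: compare(c) for name, c in CANDIDATES.items()}
```

Cost and latency deltas alone do not decide promotion: rc3 has the largest savings but is blocked on quality, so a real gate would also need a quality signal that this sketch does not model.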
DURABLE BY DEFAULT Versioned deployments. Promote, shadow, or roll back.
support_agent · v3.2.1
checkpoint replay wait / resume fan-out
VERSION HISTORY
v1
v2
v3 rolled back
v4
PROD
100% traffic
v3.2.1 Hamza · deployed 3h ago
HISTORY
v3.2.0 · 2d
v3.1.5 · 5d
CANARY
5% shadow
v3.2.2-rc Priya · deployed 32m ago
HISTORY
v3.2.1-rc · 6h
v3.2.0-rc · 3d
DEV
dev only
v3.3.0-dev Adam · deployed 8m ago
HISTORY
v3.2.9-dev · 1h
v3.2.8-dev · 5h
5 deployments · 3 envs · last promotion 3h ago healthy
HARNESS typed agent logic
Harness stays. Kitaru wraps around it.
KITARU ADDS durable run layer
KitaruAgent(agent)
Kitaru runtime outer layer
flow one durable run
checkpoint saved between steps
wait pause & resume
replay from a boundary
PydanticAI
typed deps tools output
Model + tools
OpenAI MCP db
Good fit Typed agents where you want schema validation on every step.
Trade-off Adds a Pydantic dependency and some per-call overhead.
KITARU ADDS durable run layer
KitaruAgent(runner)
Kitaru runtime outer layer
flow one durable run
checkpoint saved between steps
wait pause & resume
replay from a boundary
OpenAI Agents
tools handoffs output
Model + tools
OpenAI MCP db
Good fit Multi-agent runs with handoffs via the Agents SDK Runner.
Trade-off Tied to OpenAI-hosted models and their rate limits.
KITARU ADDS durable run layer
KitaruAgent(graph)
Kitaru runtime outer layer
flow one durable run
checkpoint saved between steps
wait pause & resume
replay from a boundary
LangGraph
state tools edges
Model + tools
OpenAI MCP db
Good fit Branching multi-step graphs that need explicit state.
Trade-off Graph state must stay JSON-serializable to checkpoint cleanly.
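A quick way to check the JSON-serializability constraint noted above is to round-trip the state through `json` before relying on it in a checkpoint. The helper name `is_checkpointable` and the state fields are hypothetical, chosen only for illustration.

```python
import json
from datetime import datetime, timezone


def is_checkpointable(state: dict) -> bool:
    """Return True if the state survives a JSON round trip unchanged."""
    try:
        return json.loads(json.dumps(state)) == state
    except (TypeError, ValueError):
        return False


good_state = {"intent": "refund", "visited": ["route_intent"], "retries": 0}

# Sets and datetime objects are not JSON-serializable.
bad_state = {
    "intent": "refund",
    "visited": {"route_intent"},
    "started": datetime.now(timezone.utc),
}

# One possible fix: sets become sorted lists, datetimes become ISO strings.
fixed_state = {
    "intent": bad_state["intent"],
    "visited": sorted(bad_state["visited"]),
    "started": bad_state["started"].isoformat(),
}
```

The equality check also catches values that serialize but come back changed, such as tuples (returned as lists) or integer dict keys (returned as strings).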
KITARU ADDS durable run layer
KitaruAgent(session)
Kitaru runtime outer layer
flow one durable run
checkpoint saved between steps
wait pause & resume
replay from a boundary
Anthropic
messages tools stream
Model + tools
OpenAI MCP db
Good fit Long Claude tool-use sessions with expensive context to rebuild.
Trade-off Checkpoints land between turns — mid-stream tokens aren't saved.
KITARU ADDS durable run layer
KitaruAgent(fn)
Kitaru runtime outer layer
flow one durable run
checkpoint saved between steps
wait pause & resume
replay from a boundary
Custom loop
any callable sync / async
Model + tools
OpenAI MCP db
Good fit Any Python callable — no framework lock-in at all.
Trade-off You define the checkpoint boundaries; Kitaru can't infer them.
AFTER A CRASH run #4127 · resumed
resumed
SPAN
0s 15s 30s 45s 60s
classify 13s
checkpoint saved
tool.lookup crashed
resumed +8s
tool.lookup 35s
artifact saved
with kitaru resumed in 8s · classify & model call preserved
without restart from zero · repeat the model call · ~2m lost
AFTER A CRASH run #7834 · resumed
resumed
SPAN
0s 12s 24s 36s 48s
triage 9s
checkpoint saved
web_search crashed
resumed +6s
web_search 29s
response saved
with kitaru resumed in 6s · triage & first tool call preserved
without restart from zero · repeat the model call · ~2m lost
AFTER A CRASH run #2291 · resumed
resumed
SPAN
0s 18s 36s 54s 72s
route_intent 17s
checkpoint saved
tool_node crashed
resumed +11s
tool_node 40s
state saved
with kitaru resumed in 11s · graph state & routing preserved
without restart from zero · re-invoke full graph · ~2m lost
AFTER A CRASH run #3019 · resumed
resumed
SPAN
0s 13s 26s 39s 52s
classify_intent 10s
checkpoint saved
tool_use_block crashed
resumed +6s
tool_use_block 31s
message saved
with kitaru resumed in 6s · context window & tool call preserved
without restart from zero · rebuild context · ~2m lost
AFTER A CRASH run #1102 · resumed
resumed
SPAN
0s 16s 32s 48s 64s
step_one 16s
checkpoint saved
tool_call crashed
resumed +9s
tool_call 34s
output saved
with kitaru resumed in 9s · step progress & tool call preserved
without restart from zero · repeat all steps · ~2m lost

Trusted by teams shipping ML pipelines and AI agents

AXA
JetBrains
ADEO
Leroy Merlin
Brevo
Safran
Airbus Defence & Space
Rohlik
Knuspr
Maven Robotics
CrossScreen Media
GEMA
Homa Games
Koble
IKEA
Sciemo
Vodafone
Stepstone
Neara
Rivian
Happening XYZ
Veridas
Two products · one team

ML systems today. Agent systems tomorrow. One engineering team underneath both.

ZenML — ML/AI Orchestration

The open-source platform for production ML systems.

Orchestrate workflows across your existing tools, clouds, and environments. Modular, agnostic, no lock-in.

  • Pipelines and stacks across any cloud
  • Model registry, lineage, and reproducibility built in
  • Open source — your stack, your data, your governance
Explore ZenML

Kitaru — Agent Runtime

The runtime layer underneath your agent stack.

Durable execution for autonomous agents. Pause, resume, replay — without lock-in. Self-hosted, framework-agnostic.

  • Crash recovery — flows survive pod evictions and timeouts
  • Pause and resume with kitaru.wait() — minutes, hours, or days
  • Your cloud, your model, your SDK — framework-agnostic
Explore Kitaru

The platform advantage

One foundation. ML pipelines and AI agents.

78%

faster time‑to‑market

65%

reduced engineering overhead

3x

more workflows in production

5x

faster time to production

Unified workflow orchestration dashboard showing ML and agent runs
Artifact and checkpoint versioning view
Infrastructure abstraction across clouds
Smart caching and deduplication across runs
Governance and security dashboard

Your stack, not ours

Run in your VPC, point at your object store, train on your clusters. The platform is a metadata layer — your artifacts, prompts, and code stay inside your infrastructure end to end. No lock-in on either side.

From local prototype to production

Stop rewriting code to move between environments. The same pipeline step or agent flow runs locally for debugging and on Kubernetes for production — without changing your logic. The platform handles the wiring.

Lineage and replay across both workspaces

Every artifact version and every agent checkpoint is tracked in the same metadata store. When something breaks, you have the full execution lineage — from raw input to model output or agent response — to debug and reproduce it.

Open source, enterprise ready

Apache 2.0 from day one, with thousands of teams running it in production. Self-host forever, or adopt the managed control plane when you need governance, SSO, and an SLA. SOC2 and ISO 27001 certified.

Pick your workspace and start shipping.

Open source at the core. ML pipelines, agent flows, or both — same plans, same control plane.

Works with the tools you already use

60+ integrations across the AI ecosystem — from scikit-learn to LangGraph, PyTorch to OpenAI Agents SDK.

Whitepaper

ZenML as your Enterprise-Grade AI Platform

We have distilled our expertise in building production-ready, scalable AI platforms, drawing on insights from our top customers.

Customer Stories

How engineering teams cut time-to-production and simplify their AI infrastructure.

Track production ML and AI deployments across the industry

See the LLMOps database →

HashiCorp
ZenML offers the capability to build end-to-end ML workflows that seamlessly integrate with various components of the ML stack. This enables teams to accelerate their time to market by bridging the gap between data scientists and engineers.
Harold Gimenez

SVP R&D at HashiCorp

Salesforce
ZenML allows orchestrating ML pipelines independent of any infrastructure or tooling choices. ML teams can free their minds of tooling FOMO from the fast-moving MLOps space, with the simple and extensible ZenML interface.
Richard Socher

Former Chief Scientist Salesforce and Founder of You.com

ADEO
ZenML allowed us a fast transition between dev to prod. It's no longer the big fish eating the small fish – it's the fast fish eating the slow fish.
François Serra

ML Engineer / ML Ops / ML Solution architect at ADEO Services

Stanford University
Many teams still struggle with managing models, datasets, code, and monitoring as they deploy ML models into production. ZenML provides a solid toolkit for making that easy in the Python ML world.
Chris Manning

Professor of Linguistics and CS at Stanford

WiseTech Global
Thanks to ZenML we've set up a pipeline where before we had only Jupyter notebooks. It helped us tremendously with data and model versioning.
Francesco Pudda

Machine Learning Engineer at WiseTech Global

MadeWithML
ZenML allows you to quickly and responsibly go from POC to production ML systems while enabling reproducibility, flexibility, and above all, sanity.
Goku Mohandas

Founder of MadeWithML

No compliance headaches

Your VPC, your data

ZenML is a metadata layer on top of your existing infrastructure, meaning all data and compute stays on your side.

ZenML architecture — metadata layer on top of your infrastructure
SOC2 Type II certified ISO 27001 certified

ZenML is SOC2 and ISO 27001 Compliant

We Take Security Seriously

ZenML is SOC2 and ISO 27001 compliant, validating our adherence to industry-leading standards for data security, availability, and confidentiality in our ongoing commitment to protecting your ML workflows and data.


Support

Frequently asked questions

Everything you need to know about the product.

What is the difference between ZenML and other machine learning orchestrators?
ZenML doesn't take an opinion on the orchestration layer. Start writing locally, deploy on any orchestrator. We support many orchestrators natively and can be extended to work with custom orchestrators. Read more about how ZenML compares to orchestrators.
Does ZenML integrate with my MLOps stack?
Yes! ZenML supports Kubernetes, AWS, GCP Vertex AI, Kubeflow, Apache Airflow, and many more. It also integrates with artifact, secrets, and container storage on all major cloud providers.
Does ZenML help in GenAI / LLMOps use-cases?
Yes, ZenML is fully compatible with LLM applications and is intended for productionizing them. We have examples with LlamaIndex, OpenAI, LangChain, and more. Check out our projects for real-world examples.
How can I build my MLOps/LLMOps platform using ZenML?
Start simple with our user guides, then extend with experiment trackers, model deployers, model registries and more from the stack components library.
What is the difference between the open source and Pro product?
The core framework is Apache 2.0 on GitHub. Pro offers a managed version plus Pro-only features for scaling teams. Learn more on the comparison page.

Unify Your ML and LLM Workflows

  • Open-source foundation, no vendor lock-in
  • Works with any infrastructure
  • Upgrade to managed Pro features
Dashboard displaying machine learning models with version tracking