Last updated: November 21, 2022.
If you are of a more visual disposition, please check out this blog’s accompanying video tutorial.
What is a step operator?
The step operator defers the execution of individual steps in a pipeline to specialized runtime environments that are optimized for Machine Learning workloads. This is helpful when different steps require specialized cloud backends ✨ — for example, powerful GPU instances for training jobs or distributed compute for ingestion streams.
I’m confused 🤔. How is it different from an orchestrator?
An orchestrator is a higher-level entity than a step operator. It is what executes the entire ZenML pipeline code and decides what specifications and backends to use for each step.
The orchestrator runs the code which launches your step in a backend of your choice. If you don’t specify a step operator, then the step code runs on the same compute instance as your orchestrator.
While an orchestrator defines how and where your entire pipeline runs, a step operator defines how and where an individual step runs. This can be useful in a variety of scenarios. An example could be if one step within a pipeline needed to run on a separate environment equipped with a GPU (like a trainer step).
How do I use it?
A step operator is a stack component, and is therefore part of a ZenML stack.
An operator can be registered as part of the stack as follows:
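A rough sketch of such a registration — the name `my_step_operator` and all parameter values here are illustrative, and the exact flags depend on the flavor and your ZenML version:

```shell
# Register a step operator as a stack component. The flavor ("sagemaker"
# here) and its flags vary by backend -- run
# `zenml step-operator register --help` for the options in your version.
zenml step-operator register my_step_operator \
    --type=sagemaker \
    --role=<SAGEMAKER_ROLE_ARN> \
    --instance_type=ml.t3.medium
```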
And then a step can be decorated with the custom_step_operator parameter to run it with that operator backend:
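For instance, assuming a step operator was registered under the illustrative name `my_step_operator`:

```python
from zenml.steps import step

# "my_step_operator" is an assumed name -- use whatever name you
# registered your step operator under.
@step(custom_step_operator="my_step_operator")
def trainer() -> None:
    """This step's code runs on the step operator's backend."""
    ...
```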
Run on AWS Sagemaker, GCP Vertex AI, and Microsoft Azure ML
ZenML’s cloud integrations have now been extended with step operators that can run an individual step on the hosted ML platform offerings of all the major public cloud providers. The ZenML GitHub repository gives a great example of how to use these integrations. Let’s walk through one example, with AWS Sagemaker, in this blog. The other two clouds are quite similar and follow the same pattern.
Introduction to AWS Sagemaker
AWS Sagemaker is a hosted ML platform offered by Amazon Web Services. It manages the full lifecycle of building, training, and deploying machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. It offers specialized compute instances to run your training jobs and has a beautiful UI to track and manage your models and logs.
You can now use the new SagemakerStepOperator class to submit individual steps to be run on compute instances managed by Amazon Sagemaker.
Set up a stack with the AWS Sagemaker StepOperator
As we are working in the cloud, we first need to do some preparatory work regarding permissions and resource creation. In the future, ZenML will automate much of this away. For now, follow these manual steps:
- Create or choose an S3 bucket to which Sagemaker should output any artifacts from your training run. Then register it as an artifact store:
- A container registry has to be configured in the stack. This registry will be used by ZenML to push your job images that Sagemaker will run. Register this as well:
- Set up the AWS CLI with the right credentials. Make sure you have the permissions to create and manage Sagemaker runs.
- Create a role in the IAM console that you want the jobs running in Sagemaker to assume. This role should at least have the AmazonS3FullAccess and AmazonSageMakerFullAccess policies applied. Check this link to learn how to create a role.
- Choose what instance type needs to be used to run your jobs. You can get the list here.
- Optionally, choose an experiment name if you have one created already. Check this guide to learn how. If not provided, the job runs will be independent of any experiment.
- Optionally, select a custom Docker image that you want ZenML to use as a base image for the environment that runs your jobs in Sagemaker.
- Once you have all these values handy, you can proceed to setting up the components required for your stack.
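As a sketch of the first two items, registering the artifact store and container registry could look like this — all names and URIs are placeholders, and flags may differ across ZenML versions:

```shell
# Register the S3 bucket as an artifact store (placeholder path).
zenml artifact-store register s3_store \
    --type=s3 \
    --path=s3://<YOUR_BUCKET_NAME>

# Register the container registry that ZenML pushes the job images to
# (placeholder ECR URI).
zenml container-registry register ecr_registry \
    --uri=<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com
```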
The command to register the stack component would look like the following. More details about the parameters that you can configure can be found in the class definition of Sagemaker Step Operator in the API docs.
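A hedged example of what such a registration could look like — every value is a placeholder, and the API docs remain the authoritative parameter list:

```shell
# Register the Sagemaker step operator; replace all <...> placeholders
# with the values gathered in the preparatory steps.
zenml step-operator register sagemaker \
    --type=sagemaker \
    --role=<SAGEMAKER_ROLE_ARN> \
    --instance_type=<INSTANCE_TYPE> \
    --base_image=<OPTIONAL_CUSTOM_BASE_IMAGE> \
    --bucket_name=<S3_BUCKET_NAME> \
    --experiment_name=<OPTIONAL_EXPERIMENT_NAME>
```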
- Register the sagemaker stack with the same pattern as always:
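A sketch, assuming component names like `s3_store`, `ecr_registry`, and `sagemaker` — substitute whatever names you registered your own components under:

```shell
# Assemble the components into a stack and activate it. Depending on
# your ZenML version, additional components (e.g. a metadata store
# via -m) may also be required.
zenml stack register sagemaker_stack \
    -o default_orchestrator \
    -a s3_store \
    -c ecr_registry \
    -s sagemaker
zenml stack set sagemaker_stack
```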
And now you have the stack up and running! Note that similar steps can be undertaken with Vertex AI and Azure ML. See the docs for more information.
Create a pipeline with the step operator decorator
Once the above is out of the way, any step of any pipeline we create can be decorated with the following decorator:
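Assuming the step operator was registered under the name `sagemaker`, the decorated step might look like this (the empty trainer body is illustrative):

```python
from zenml.steps import step

# "sagemaker" must match the name the step operator was registered under.
@step(custom_step_operator="sagemaker")
def trainer() -> None:
    """Runs on a Sagemaker-managed instance instead of the orchestrator."""
    ...
```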
ZenML will take care of packaging the step into a Docker image, pushing the image, provisioning the resources for the custom job, and monitoring it as it progresses. Once the step completes, the pipeline continues as always.
You can also swap the “sagemaker” operator for any other operator of your choosing, and it will work with the same step code you have always had. Modularity at its best!