Last updated: April 3, 2023.
Note: This example does not work for any ZenML versions > 0.36.1.
We’re really proud of our Kubeflow integration. It gives you a ton of power and flexibility and is a production-ready tool. But we also know that for many of you it’s one step too many. Setting up a Kubernetes cluster is probably nobody’s ideal way to spend their time, and it certainly requires some time investment to maintain.
We thought this was a concern worth addressing, so I worked to build an alternative during the ZenHack Day we recently ran. GitHub Actions is a platform that allows you to execute arbitrary software development workflows right in your GitHub repository. It is most commonly used for CI/CD pipelines, but with the GitHub Actions orchestrator, ZenML now enables you to easily run and schedule your machine learning pipelines as GitHub Actions workflows.
GitHub Actions: best in class for what?
Most technical decisions come with various kinds of tradeoffs, and it’s worth taking a moment to assess why you might want to use the GitHub Actions orchestrator in the first place.
Let’s start with the downsides:
- You don’t have as much flexibility as with a tool like Kubeflow in terms of specifying exactly what kinds of hardware are used to run your steps.
- The orchestrator itself runs on the hardware that GitHub Actions provides (generously and for free). This isn’t the fastest or most performant infrastructure setup, and it generally is much slower than even your local CPU machine. There are also memory and storage constraints to the machines they provide as GitHub Actions runners.
- GitHub offers no guarantees about when your actions will be executed; at peak times you might be waiting a while before the hardware is allocated and provisioned to run. If you are planning on running ZenML pipelines on a schedule (every ten minutes, for example) then this might not work as expected.
So what’s the point, then? These are indeed some serious downsides. First and foremost, there’s the cost: running your pipelines on GitHub Actions is free. If you’re interested in running your pipelines in the cloud on serverless infrastructure, there’s probably no easier way to get started than to try out this orchestrator.
You are also spared the pain of maintaining a Kubernetes cluster. Once you’ve configured it (see below for instructions) there’s basically nothing you have to do on an ongoing basis. I hope you’re sold on trying it out and want to get started, so let’s not hold off any more.
(Note that some of the commands in this tutorial rely on environment variables or a specific working directory from previous commands, so be sure to run them in the same shell.) In this tutorial we’re going to use Microsoft’s Azure platform for cloud storage and our MySQL database, but it works just as well on AWS or GCP.
This tutorial assumes that you have:
- Python installed (version 3.7-3.9)
- Git installed
- a GitHub account
- Docker installed and running
- A remote ZenML server: a remote deployment of the ZenML HTTP server and database
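If you’d like to double-check the local prerequisites from a script, here is a minimal sketch using only the Python standard library. The helper names `python_supported` and `missing_tools` are made up for illustration. Note that finding `docker` on your PATH doesn’t prove the Docker daemon is running, and the remote ZenML server isn’t checked at all.

```python
import shutil
import sys

# Supported range as stated in the prerequisites above (3.7 to 3.9 inclusive).
MIN_PYTHON = (3, 7)
MAX_PYTHON = (3, 9)


def python_supported(version=None):
    """Return True if (major, minor) lies within the tutorial's supported range."""
    major, minor = tuple(version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) <= MAX_PYTHON


def missing_tools(tools=("git", "docker")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print("Tools not found on PATH:", missing_tools())
```

Passing an explicit tuple such as `python_supported((3, 8))` lets you check a version other than the one running the script.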
Create an account
If you don’t have an Azure account yet, go to https://azure.microsoft.com/en-gb/free/ and create one.
Create a resource group
Resource groups are a concept in Azure that allows us to bundle different resources that share a similar lifecycle. We’ll create a new resource group for this tutorial so we’ll be able to tell its resources apart from the others in our account and easily delete them at the end.
Go to the Azure portal and click the hamburger button in the top left to open up the portal menu. Then, hover over the Resource groups section until a popup appears and click on the + Create button:
Select a region and enter a name for your resource group before clicking on Review + create:
Verify that all the information is correct and click on Create:
Create a storage account
An Azure storage account is a grouping of Azure data storage objects which also provides a namespace and authentication options to access them. We’ll need a storage account to hold the blob storage container we’ll create in the next step.
Open up the portal menu again, but this time hover over the Storage accounts section and click on the + Create button in the popup once it appears:
Select your previously created resource group, a region and a globally unique name and then click on Review + create:
Make sure that all the values are correct and click on Create:
Wait until the deployment is finished and click on Go to resource to open up your newly created storage account:
In the left menu, select Access keys:
Click on Show keys, and once the keys are visible, note down the storage account name and the value of the Key field of either key1 or key2. We’re going to use them for the <STORAGE_ACCOUNT_NAME> and <STORAGE_ACCOUNT_KEY> placeholders later.
Create an Azure Blob Storage Container
Next, we’re going to create an Azure Blob Storage Container. It will be used by ZenML to store the output artifacts of all our pipeline steps. To do so, select Containers in the Data storage section of the storage account:
Then click the + Container button at the top to create a new container:
Choose a name for the container and note it down. We’re going to use it later for the <BLOB_STORAGE_CONTAINER_NAME> placeholder. Then create the container by clicking the Create button.
Create a GitHub Personal Access Token
Next up, we’ll need to create a GitHub Personal Access Token that ZenML will use to authenticate with the GitHub API in order to store secrets and upload Docker images.
- Go to https://github.com, click on your profile image in the top right corner and select Settings:
- Scroll to the bottom and click on Developer Settings on the left side:
- Select Personal access tokens and click on Generate new token:
- Give your token a descriptive name for future reference and select the repo and write:packages scopes:
- Scroll to the bottom and click on Generate token. This will bring you to a page that allows you to copy your newly generated token:
Now that we’ve got our token, let’s store it in an environment variable for future steps. We’ll also store our GitHub username that this token was created for. Replace the <PLACEHOLDERS> in the following command and run it:
Log in to the container registry
When we run our pipeline later, ZenML will build a Docker image for us which will be used to execute the steps of the pipeline. In order to access this image inside the GitHub Actions workflow, we’ll push it to the GitHub container registry. Running the following command will use the personal access token created in the previous step to authenticate our local Docker client with this container registry:
Note: If you run into issues during this step, make sure you’ve set the environment variables in the previous step and Docker is running on your machine.
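To rule out the first of these causes, a small check like the following can help. It’s only a sketch: the variable names `GITHUB_USERNAME` and `GITHUB_AUTHENTICATION_TOKEN` are assumptions, so substitute whatever names you exported in the previous step, and `unset_variables` is a helper invented for this example.

```python
import os

# Assumed variable names; replace them with the names you actually exported.
REQUIRED_VARS = ("GITHUB_USERNAME", "GITHUB_AUTHENTICATION_TOKEN")


def unset_variables(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are missing or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = unset_variables()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Run it in the same shell you used for the export, since environment variables don’t carry over to a new terminal session.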
Fork and clone the tutorial repository
If you’re new to ZenML, let’s quickly go over some basic concepts that help you understand what the code in this repository is doing:
- A pipeline in ZenML allows you to group a series of steps in whatever order makes sense for your particular use case. The example pipeline consists of three steps which import data, train a model and evaluate the model.
- A step is very similar to a Python function and contains arbitrary business logic. The three steps in our example do the following:
- The data loader step loads the digits dataset and splits it into a training set and a test set.
- The trainer step trains a SKLearn SVC classifier on the training set returned by the data loader step.
- The evaluator step evaluates the model returned by the trainer step on the test set.
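To make the three steps more concrete, here is a rough sketch of the same logic as plain Python functions, assuming scikit-learn is installed. It is not the code from the tutorial repository: the 80/20 split, the `random_state` and the `gamma` value are our own illustrative choices, and in the repository these functions would be wrapped as ZenML steps rather than called directly.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def data_loader():
    """Load the digits dataset and split it into a training set and a test set."""
    digits = load_digits()
    # The 80/20 split and the seed are illustrative assumptions.
    return train_test_split(
        digits.data, digits.target, test_size=0.2, random_state=42
    )


def trainer(X_train, y_train):
    """Train an SVC classifier on the training set."""
    model = SVC(gamma=0.001)  # gamma chosen for illustration only
    model.fit(X_train, y_train)
    return model


def evaluator(model, X_test, y_test):
    """Return the accuracy of the model on the test set."""
    return model.score(X_test, y_test)


def run_pipeline():
    """Chain the three steps in the order the pipeline describes."""
    X_train, X_test, y_train, y_test = data_loader()
    model = trainer(X_train, y_train)
    return evaluator(model, X_test, y_test)


if __name__ == "__main__":
    print(f"Test accuracy: {run_pipeline():.3f}")
```

The point to notice is the data flow: each step only consumes what an earlier step returned, which is what allows an orchestrator to run the steps in separate environments and pass the artifacts between them.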
Let’s get going:
- Go to https://github.com/zenml-io/github-actions-orchestrator-tutorial
- Click on Fork in the top right:
- Click on Create fork:
- Clone the repository to your local machine:
Now that we’re done setting up and configuring all our infrastructure and external dependencies, it’s time to install ZenML and configure a ZenML stack that connects all these elements together.
Remote ZenML Server
For advanced use cases, such as using a remote orchestrator like Vertex AI or sharing stacks and pipeline information with a team, we need a separate, non-local ZenML server that is accessible from your machine as well as from all stack components that may need access to it. Read more information about the use case here.
There are two ways to get access to a remote ZenML server:
- Deploy and manage the server manually on your own cloud.
- Sign up for ZenML Enterprise and get access to a hosted version of the ZenML Server with no setup required.
Let’s install ZenML and all the additional packages that we’re going to need to run our pipeline:
We’re also going to initialize a ZenML repository to indicate which directories and files ZenML should include when building Docker images:
Connect to ZenML Server
Once the deployment is finished, let’s connect to it by running the following command and logging in with the username and password you set during the deployment phase:
Registering the stack
A ZenML stack consists of many components which all play a role in making your ML pipeline run in a smooth and reproducible manner. Let’s register all the components that we’re going to need for this tutorial!
- The orchestrator is responsible for running all the steps in your machine learning pipeline. In this tutorial we’ll use the new GitHub Actions orchestrator which, as the name already indicates, uses GitHub Actions workflows to orchestrate your ZenML pipeline. Registering the orchestrator is as simple as running the following command:
- We’ll also need to configure a container registry which will point ZenML to a Docker registry to store the images that ZenML builds in order to run your pipeline. Luckily, your GitHub account already comes with a free container registry! To register it simply run:
- The secrets manager is used to securely store all your credentials so ZenML can use them to authenticate with other components like your artifact store. We’re going to use a secrets manager implementation that stores these credentials as encrypted GitHub secrets:
- The artifact store stores all the artifacts that get passed as inputs and outputs of your pipeline steps. To register our blob storage container, replace the <BLOB_STORAGE_CONTAINER_PATH> placeholder in the following command with the path to the container whose name we noted down when creating the blob storage container, and run it:
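Two of the values used in the list above are derived from things we noted down earlier, so here is a small sketch of how they might be assembled. Treat the formats as assumptions to verify against the docs for your ZenML version: we assume the registry URI has the form `ghcr.io/<username>` and that the artifact store path uses an `az://` prefix followed by the container name. The function names are invented for this example.

```python
import re


def ghcr_uri(github_username):
    """Build the assumed registry URI; image references must be lowercase."""
    return f"ghcr.io/{github_username.lower()}"


# Azure container naming rules: 3-63 characters, lowercase letters, digits and
# hyphens, starting and ending with a letter or digit, no consecutive hyphens.
_CONTAINER_NAME = re.compile(r"^(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def blob_container_path(container_name, scheme="az"):
    """Build the assumed artifact store path for a blob storage container."""
    if not _CONTAINER_NAME.match(container_name):
        raise ValueError(f"Invalid Azure container name: {container_name!r}")
    return f"{scheme}://{container_name}"


if __name__ == "__main__":
    # Example values; replace them with your own username and container name.
    print(ghcr_uri("Octocat"))
    print(blob_container_path("zenml-artifacts"))
```

Validating the container name up front is optional, but it catches typos before they turn into a failed pipeline run.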
These are all the components that we’re going to use for this tutorial, but ZenML offers additional components like:
- Step operators to run individual steps of your pipeline in specialized environments.
- Model deployers to deploy your trained machine learning model in production.
- And many more. Check out our docs for a full list of available components.
With all components registered, we can now create and activate our ZenML stack. This makes sure ZenML knows which components to use when we run our pipeline later.
Registering the secrets
Once the stack is active, we can register the secret that ZenML needs to authenticate with our artifact store. We’re going to need the storage account name and key that we saved when we created our storage account earlier. Replace the <PLACEHOLDERS> in the following command with those concrete values and run it:
Run the pipeline
That was quite a lot of setup, but luckily we’re (almost) done now. Let’s execute the Python script that “runs” our pipeline and quickly discuss what it is doing:
This script runs a ZenML pipeline using our active GitHub stack. The orchestrator will now build a Docker image with our pipeline code and all the requirements installed and push it to the GitHub container registry. Once the image is pushed, the orchestrator will write a GitHub Actions workflow file to the directory .github/workflows. Pushing this workflow file will trigger the actual execution of our ZenML pipeline. We’ll explain later how to automate this step, but for our first pipeline run there is one last configuration step we need to do: we need to make sure our GitHub Actions are allowed to pull the Docker image that ZenML just pushed.
- Wait until the Python script has finished running so the Docker image is pushed to GitHub.
- Head to https://github.com/users/<GITHUB_USERNAME>/packages/container/package/zenml-github-actions (replace <GITHUB_USERNAME> with your GitHub username) and select Package settings on the right side:
- In the Manage Actions access section, click on Add Repository:
- Search for your forked repository github-actions-orchestrator-tutorial and give it read permissions. Your package settings should then look like this:
Done! Now all that’s left to do is commit and push the workflow file:
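Before committing, it can be worth confirming that the orchestrator really wrote a workflow file. The following sketch lists the YAML files under .github/workflows. The helper name `workflow_files` is made up, and we’re assuming the generated file has a .yml or .yaml extension.

```python
from pathlib import Path


def workflow_files(repo_root="."):
    """Return the YAML files under .github/workflows, sorted by path."""
    workflow_dir = Path(repo_root) / ".github" / "workflows"
    if not workflow_dir.is_dir():
        return []
    return sorted(
        path for path in workflow_dir.iterdir()
        if path.suffix in {".yml", ".yaml"}
    )


if __name__ == "__main__":
    # Run this from the root of your cloned fork.
    for path in workflow_files():
        print(path)
```

Once the file shows up, stage it with `git add`, then commit and push it as you would any other change.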
If we now check out the GitHub Actions for our repository here https://github.com/<GITHUB_USERNAME>/github-actions-orchestrator-tutorial/actions we should see our pipeline running! 🎉
Automate the committing and pushing
If we want the orchestrator to automatically commit and push the workflow file for us, we can enable it with the following command:
After this update, calling python run.py should automatically build and push a Docker image and then commit and push the workflow file, which will in turn run our pipeline on GitHub Actions.
Delete Azure Resources
Once we’re done experimenting, let’s delete all the resources we created on Azure so we don’t waste any compute/money. As we’ve bundled it all in one resource group, this step is very easy. Go to the Azure portal and select your resource group in the list of resources:
Next, click on Delete resource group at the top:
In the popup on the right side, type the resource group name and click Delete:
This will take a few minutes, but after it’s finished all the resources we created should be gone.
Where to go from here?
If you have any questions or feedback regarding this tutorial, let us know here or join our weekly community hour. If you want to know more about ZenML or see more examples, check out our docs and examples, or join our Slack.