How We Reduced Our ML Training Costs by 78%!!

Setting up Kubeflow with Multitenancy, Google Auth, Spot Instances and Node Group Scale Down to Zero

Arjun Sunil
Building Fynd


Before we start, let’s get to know each other. I am a Data Scientist turned MLOps Engineer, currently working at Fynd in the AML domain. I love writing code, being around animals, traveling, and cooking. To connect or learn more about me, check out my website.

I was working on reducing the cost of running training experiments at Fynd when I came across Kubeflow. Kubeflow runs on Kubernetes, and with my previous experience with Kubernetes and spot instances, it seemed like the way to go.

It had beautiful documentation. I thought that since all the steps are out there, this would be a piece of cake.

Wrong.

I faced issues while trying to set up Kubeflow even though the documentation is amazing. To ensure no one else has to go through the hassle of piecing together information from multiple sources, I decided to write a one-shot deployment tutorial to make it easier for everyone.

What is Kubeflow?

Probably the best overview of what Kubeflow is and what it could do for you is the video linked below. But long story short: it is a machine learning toolkit that helps you build the whole pipeline end to end, from data processing and training to deployment in production, in a highly scalable manner. Read More

Why would you want to run Kubeflow on spot instances?

First, let’s talk about what spot instances are.

In any data center, there is a limit to how many instances can be set up, constrained by power, network bandwidth, or simply the space available. Most data centers are stocked to the maximum capacity available, but all of these instances are rarely in use by customers at the same time.

AWS provides this unutilized capacity as spot instances at a 60–80% discount on the On-Demand price.

Money saved is money earned

Considering that an 80% discount reflects immediately on the AWS bill, this is something any individual or business would appreciate, especially in a COVID-19-struck economy.

Things you could possibly do with that saved money

Easy there, cowboy, you’re gonna have to wait for a while until the Coronavirus situation subsides.

So, what’s the catch? Is this too good to be true? Why can’t we use spot instances for all workloads?

Well, since AWS is providing us these instances at such affordable rates, they also reserve the right to reclaim them on short notice.

How short is the notice?

2 minutes. Yes, that’s all we get to take all our belongings and head out of the instance.

Wait, WHAAAAT?

But do not fear, Kubernetes is here to the rescue!

Imagine being a tenant and your landlord kicking you out with 2 minutes’ notice. #nightmares

Kubernetes takes care of the difficult part: evicting workloads and rescheduling them onto new instances if such an event does arise. Interruptions aren’t usually too frequent. If you would like to study the interruption rate, you can check out the AWS Spot Instance Advisor and select instances with the least interruptions.

Hint: Ensure that you select the right region when viewing interruption rates, since they vary significantly across regions.

Kubernetes scheduling works in a way that lets us run on spot instances with minimal effort. Sure, there could be downtime, but the most you will ever see is about 15 minutes after a spot instance interruption.

Interruptions in spot instances can be due to a number of factors, which are listed in the link below:

The icing on the cake is that with node groups scaling down to zero, you will never pay a penny if your users don’t have any notebooks or workloads running in Kubeflow.

The reduction that we noticed at Fynd was 78.58% on DAILY TRAINING COSTS.

Enough with the why, let’s start with the how-to.

This tutorial describes in detail how to set up Kubeflow with the following specifications:

  • AWS Cloud environment
  • EKS Kubernetes Cluster
  • Multitenancy
  • Google OAuth
  • Spot GPU instances
  • Scaling GPU nodes down to zero when no notebooks/workloads are running

Scaling down to zero was the most significant factor for us; the cost of a GPU node is staggering when it is left running unused. Another important requirement was to run the whole setup on spot instances to significantly reduce notebook instance costs (which has come in handy, especially during the COVID-19 pandemic).

Getting Started

Prerequisites

The only thing you will need is an AWS account with all the necessary permissions. No experience with Kubernetes or AWS is required, but it would definitely be good to have.

Let’s start by generating AWS keys for building our Kubernetes cluster. Kubeflow will run on this cluster to leverage the full power of Kubernetes.

Step 1: Create AWS Programmatic Access Keys

Generating AWS Keys for CLI Programmatic Access
  1. Log in to the AWS console with your root account. Use the following link:
    https://aws.amazon.com/console/
  2. Head over to IAM Console
    https://console.aws.amazon.com/iam/home?region=us-east-1#/home
  3. Click on Users > Add user
  4. Enter a username and check Programmatic access
    Click [Next: Permissions]
  5. Click on Attach existing policies directly
  6. Enable AdministratorAccess and Click [Next: Tags]
  7. Click [Next: Review]
  8. Review and ensure you have AdministratorAccess
  9. Click [Create User]
  10. Download the CSV and store it securely. Ensure it doesn’t fall into anyone else’s hands, as it grants full administrator access.
    DO NOT SHARE THIS WITH ANYONE.

Step 2: Installing AWS CLI

Head over to the link below and install AWS CLI on your operating system.

When done, verify the AWS command works as expected.

aws --version
Check aws-cli

Step 3: Set Keys in AWS Profile

We will be using a named profile for this whole tutorial so it will not affect any existing environments.

aws configure --profile kf-admin
Create AWS Named Profile
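For reference, the prompts look roughly like this; fill them in with the keys from the CSV you downloaded (us-east-1 and json output are the values this tutorial assumes):

AWS Access Key ID [None]: <access key from the downloaded CSV>
AWS Secret Access Key [None]: <secret key from the downloaded CSV>
Default region name [None]: us-east-1
Default output format [None]: json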

We’re using us-east-1 as the default region since it’s among the cheapest regions with access to most features. Also, at the time of building the cluster, we needed access to AWS’s largest GPU instance, the p3dn.24xlarge, which was only available in us-east-1 at the time of writing this post.

Some factors to consider while choosing your region:

  • Cost
  • Feature/Instance Availability
  • Latency

I’ve linked a few resources that could come in handy here:

With the prerequisites in place, we’re now ready to set up Kubeflow.
But to run Kubeflow, we need a Kubernetes cluster.

Building the Kubernetes Cluster

Step 1: Installing prerequisites

You will need to install the eksctl and kubectl CLIs to create and manage an EKS cluster.

Check the link below to install the prerequisites. Do not spin up a cluster just yet; use the config files provided with this blog to make sure everything works as expected.

Step 2: Check prerequisites for EKS

Prerequisites Check

Once all of these are installed on your machine, we’re good to go!
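As a quick sanity check, each of these should print a version without errors:

# verify the CLIs are installed and on your PATH
eksctl version
kubectl version --client
aws --version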

Set AWS Profile to the one you created earlier and ensure the keys work as expected.

export AWS_PROFILE=kf-admin
aws s3 ls

The output may vary slightly, but it should be enough to tell whether your keys work.

Step 3: Clone Config Repository

Fork the repository on GitHub and clone it:

git clone https://github.com/arjun921/aws-spot-instances-kubeflow.git

Hint: Make sure to set your forked repository to private before proceeding forward.

We will be storing the .kubeconfig file in this repository.
The benefit of doing this is that you can clone the repository and connect to the cluster from any machine as needed.

In production, this is highly discouraged.

If anyone gets their hands on your .kubeconfig file, they will have full rights to manage and delete all resources inside the cluster.
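As a sketch of what this looks like in practice (once the cluster from the next step exists), you could write the kubeconfig into the repo and point kubectl at it. The file name .kubeconfig and the cluster name kubeflow-us-east-1 are assumptions here; use whatever your cluster-spec.yml actually defines.

# write the cluster's kubeconfig into the repo (cluster name is an assumption)
eksctl utils write-kubeconfig --cluster kubeflow-us-east-1 --kubeconfig ./.kubeconfig
# use it from any machine that has the repo cloned
export KUBECONFIG=$PWD/.kubeconfig
kubectl get nodes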

Step 4: Create the EKS Cluster

export ENVIRONMENT=staging
source envs/$ENVIRONMENT/variables.sh
eksctl create cluster -f envs/$ENVIRONMENT/cluster-spec.yml

Now, if all goes well, after a long wait (about 15–20 minutes), you should have an EKS cluster running.
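To sanity-check the cluster, list the nodes and node groups. I’m assuming the cluster name kubeflow-us-east-1 here (it is whatever metadata.name is set to in your cluster-spec.yml), and the GPU node groups will show zero nodes, since they start scaled down to zero.

# the always-on CPU nodes should show up as Ready
kubectl get nodes
# list all node groups, including the GPU ones sitting at zero (cluster name is an assumption)
eksctl get nodegroup --cluster kubeflow-us-east-1 --region us-east-1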

EKS makes spinning up a cluster very easy. The YAML creates the cluster with all the necessary prerequisites to run Kubeflow with maximum cost efficiency. However, if you do run into errors like this:

Error creating cluster

To fix the error, just delete the cluster, wait about 30 minutes for all resources to get cleared, update cluster-spec.yml to fix the error, and recreate. Most often, the error is that an instance type is not available in a given AZ; just switch the AZ in the YAML to the one suggested and you should be good to go.

To delete the cluster, run the following command

eksctl delete cluster -f envs/$ENVIRONMENT/cluster-spec.yml

Step 5: Setting up Node Autoscaling

Kubernetes nodes are scaled by the Cluster Autoscaler. To understand how autoscaling works, check my other blog post.

We’re including a stripped-down version of the autoscaling setup today. It automatically discovers node groups for autoscaling and is also responsible for scaling our GPU nodes down to zero when there are no Jupyter notebooks or GPU-requesting workloads running on the cluster.

# Install the cluster autoscaler (auto-discovery example manifest)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
# Edit the deployment
kubectl -n kube-system edit deployment.apps/cluster-autoscaler
# In the editor, make sure the cluster-autoscaler container's flags include the following
# (the auto-discovery tag must match your cluster name, here kubeflow-us-east-1):
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/kubeflow-us-east-1
- --balance-similar-node-groups
- --skip-nodes-with-system-pods=false
Deploying Autodiscovery enabled Cluster Autoscaler
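For context, here is a rough sketch of what the cluster-autoscaler container spec in that deployment ends up looking like after the edit. The image tag is an assumption (pick one that matches your cluster’s Kubernetes version), and kubeflow-us-east-1 is assumed to be your cluster name.

containers:
  - name: cluster-autoscaler
    # assumption: choose a tag matching your EKS Kubernetes version
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.17.3
    command:
      - ./cluster-autoscaler
      - --v=4
      - --stderrthreshold=info
      - --cloud-provider=aws
      - --skip-nodes-with-local-storage=false
      - --expander=least-waste
      - --balance-similar-node-groups
      - --skip-nodes-with-system-pods=false
      # auto-discover ASGs tagged for this cluster, so new node groups need no autoscaler changes
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/kubeflow-us-east-1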

Step 6: Install the Spot Instance Termination Handler

One last thing before we move on: install the spot instance termination handler on your cluster using Helm.

Install helm with the following command

curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash

And install the spot instance termination handler with these commands:

helm repo add eks https://aws.github.io/eks-charts 
helm repo update
helm install aws-node-termination-handler eks/aws-node-termination-handler \
--namespace kube-system \
--set enableSpotInterruptionDraining="true" \
--set enableScheduledEventDraining="true" \
--set nodeSelector.lifecycle=Ec2Spot

If you would like Slack alerts for spot instance termination notices, use this instead:

helm repo add eks https://aws.github.io/eks-charts 
helm repo update
helm install aws-node-termination-handler eks/aws-node-termination-handler \
--namespace kube-system \
--set enableSpotInterruptionDraining="true" \
--set enableScheduledEventDraining="true" \
--set nodeSelector.lifecycle=Ec2Spot \
--set webhookURL=https://hooks.slack.com/services/WEBHOOK/INTEGRATION/URL

The benefit of using this handler is that it drains the instance being terminated, so all of its pods get rescheduled onto a new instance with the requested specs.

Step 7: Install Nvidia Plugin

This plugin is a DaemonSet that gives your pods access to the actual GPUs on the instances in the cluster.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml
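Once a GPU node actually joins the cluster (remember, the GPU node groups start at zero), you can confirm the plugin is advertising GPUs with something like:

# the device plugin daemonset should be listed
kubectl -n kube-system get daemonset | grep nvidia
# each GPU node should report an allocatable nvidia.com/gpu count
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"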

Summary

Here’s what we’ve achieved so far: we deployed a cluster with autoscaling enabled and the following node groups

  • 10x m5a.2xlarge
    ng: ng-1
  • 10x m5a.2xlarge
    ng: ng-2
  • 10x p2.xlarge
    ng: 1-gpu-spot-p2-xlarge
  • 10x p3.2xlarge
    ng: 1-gpu-spot-p3-2xlarge
  • 4x p3.8xlarge
    ng: 4-gpu-spot-p3-8xlarge

But all of these instances won’t be running unless you have enough workload. I have also included the goliath p3dn.24xlarge in the YAML, which you can enable by uncommenting the spec at the bottom:

- name: 8-gpu-spot-p3dn-24xlarge
  minSize: 0
  maxSize: 1
  instancesDistribution:
    # Set your own max price. AWS spot instance prices no longer exceed the On-Demand price.
    # Comment out this field to default to the On-Demand price as the max price.
    maxPrice: 11
    instanceTypes: ["p3dn.24xlarge"]
    onDemandBaseCapacity: 0
    onDemandPercentageAboveBaseCapacity: 0
    spotAllocationStrategy: capacity-optimized
  labels:
    lifecycle: Ec2Spot
    aws.amazon.com/spot: "true"
    gpu-count: "8"
  availabilityZones: ["us-east-1f"]
  taints:
    spotInstance: "true:PreferNoSchedule"
  tags:
    k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
    k8s.io/cluster-autoscaler/node-template/label/aws.amazon.com/spot: "true"
    k8s.io/cluster-autoscaler/node-template/label/gpu-count: "8"
    k8s.io/cluster-autoscaler/node-template/taint/spotInstance: "true:PreferNoSchedule"
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/kubeflow-us-east-1: "owned"
  iam:
    withAddonPolicies:
      autoScaler: true
      cloudWatch: true
      albIngress: true

Deploying Kubeflow

Now that we’ve got that out of the way, we move on to the actual task: getting Kubeflow running on the cluster.

At the time of writing this, 1.0.2 was out, but for some reason I just couldn’t get auth working with it. So we’ll be deploying Kubeflow 1.0.1 with our scripts.

Here are the prerequisites:

Step 1: Install kfctl

Download the kfctl release for your OS from the link and run the following commands.

tar -xvf kfctl_v1.0.2_<platform>.tar.gz
mv ./kfctl /usr/local/bin
# check if it works
kfctl version
Install kfctl

Step 2: Install AWS IAM Authenticator

Install AWS IAM Authenticator from the link below and check if it works as expected.

Installing AWS IAM Authenticator

Step 3: Create Auth0 App

Create an account/log in to Auth0 and save the Client ID and Client Secret. We will be using these in the following steps. Also, make a note of your Auth0 app endpoint.

Step 4: Issue Wildcard Certificate

Follow this tutorial to get an ACM certificate.

To deploy the toolkit:

Step 1: Run script

Run the deploy_kubeflow.sh script from the cloned repository.
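Something along these lines should do it; the exact path and invocation are assumptions, so check the repository if the script lives elsewhere:

# from the root of the cloned repository, with the same profile and environment still set
export AWS_PROFILE=kf-admin
export ENVIRONMENT=staging
source envs/$ENVIRONMENT/variables.sh
# kicks off the kfctl-based deployment; it will drop you into an editor to tweak the config
bash deploy_kubeflow.sh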

When the editor opens up, replace the plugins key with the following.

Hint: Make sure to update the Client ID, Client Secret, Auth0 URL, and certificate ARN with the ones you generated earlier.

Close the editor and let the setup begin.

Deploying Kubeflow

Step 2: Set sub-domain

Once you have an ELB address, head over to AWS Route 53 and add a subdomain record pointing to the ELB. Here’s an example of what your record could look like.
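If you need to look up that address, it is usually exposed on the ingress in the istio-system namespace (assuming the standard AWS Kubeflow setup). The record itself is just a CNAME from your chosen subdomain to the load balancer hostname; the values below are purely illustrative.

# fetch the load balancer hostname created for Kubeflow (assumes the default istio-system ingress)
kubectl get ingress -n istio-system

# Route 53 record (illustrative values only)
# Name:  kubeflow.yourdomain.com
# Type:  CNAME
# Value: a1b2c3d4e5-istiosystem-istio-123456789.us-east-1.elb.amazonaws.com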

Step 3: Configure Auth0 callback

Head over to your Auth0 app and update your application’s Allowed Callback URLs to the callback URL for your Kubeflow domain.

Make sure to scroll to the bottom and click Save! Once this is done, you should be able to access Kubeflow at your specified domain.

Final Kubeflow Dashboard
Invincible: exactly how I felt when I got to the dashboard 😬

And we’re good to go! Just fire up a Jupyter notebook with a GPU image, request 1–2 GPUs, and you should have a notebook running in about 10 minutes. The time taken for the node to get attached might vary, but it shouldn’t exceed 15 minutes.

If a node with 4 GPUs is spun up and a notebook uses only 2 of them, the other 2 remain uncommitted. When a new notebook server is spun up, Kubeflow (or rather, Kubernetes) will check whether any node has uncommitted resources and schedule the notebook onto that same instance.
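Under the hood, the GPUs you request in the notebook UI end up as a standard Kubernetes resource limit on the notebook pod, which is what lets the scheduler bin-pack notebooks onto nodes with free GPUs. A minimal sketch of what that looks like (values are illustrative):

# what the notebook pod's container spec effectively requests when you ask for 2 GPUs
resources:
  limits:
    nvidia.com/gpu: "2"  # extended resource advertised by the Nvidia device plugin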

Wrapping up

The hard part of the whole deployment flow was getting the cluster config right. If you notice, about three-quarters of this blog is just building the cluster, and one-quarter is the Kubeflow deployment.

That being said, it’s finally done and dusted.

If you face any issues setting up Kubeflow on AWS with the provided spec, or have questions or comments, feel free to leave feedback in the comments section or reach out to me directly at https://arjunsunil.com/

And don’t forget to 👏 if you enjoyed this article 🙂.

References
