How We Reduced Our ML Training Costs by 78%!!

Setting up Kubeflow with Multitenancy, Google Auth, Spot Instances and Node Group Scale Down to Zero

Arjun Sunil
Building Fynd


Before we start, let’s get to know each other. I am a Data Scientist turned MLOps Engineer, currently working at Fynd in the AML domain. I love writing code, being around animals, traveling, and cooking. To connect or learn more about me, check out my website.

I was working on reducing the cost of running training experiments at Fynd when I came across Kubeflow. Kubeflow runs on Kubernetes, and with my previous experience with Kubernetes and spot instances, it seemed like the way to go.

It had beautiful documentation. I thought that since all the steps are out there, this would be a piece of cake.

Wrong.

I faced issues while trying to set up Kubeflow even though the documentation is amazing. To ensure no one else has to go through the hassle of piecing together information from multiple sources, I decided to write a one-shot deployment tutorial to make it easier for everyone.

What is Kubeflow?

Probably the best overview of what Kubeflow is and what it could do for you is the video linked below. But long story short: it is a machine learning toolkit that helps you build the whole pipeline end to end, from data processing and training to deployment in production, in a highly scalable manner. Read More

Why would you want to run Kubeflow on spot instances?

First, let’s talk about what spot instances are.

In any data center, there is a limit to how many instances can be set up, constrained by power, network bandwidth, or simply the space available. Most data centers are stocked to the maximum capacity available, but all of these instances are rarely in use by customers at the same time.

AWS provides this unutilized capacity as spot instances at a 60–80% discount on the On-Demand price.

Money saved is money earned

Considering that an 80% discount reflects immediately on the AWS bill, this is something any individual or business would appreciate, especially in a COVID-19-struck economy.

Things you could possibly do with that saved money

Easy there, cowboy, you’re gonna have to wait for a while until the Coronavirus situation subsides.

So, what’s the catch? Is this too good to be true? Why can’t we use spot instances for all workloads?

Well, since AWS is providing us these instances at such affordable rates, they also reserve the right to reclaim them on short notice.

How short is the notice?

2 minutes. Yes, that’s all we get to take all our belongings and head out of the instance.

Wait, WHAAAAT?

But do not fear, Kubernetes is here to the rescue!

Imagine being a tenant and your landlord kicking you out with 2 minutes’ notice. #nightmares

Kubernetes takes care of the difficult part: evicting workloads and rescheduling them onto new instances if such an event does arise. Interruptions aren’t usually too frequent. If you would like to study the interruption rate, you can check out the AWS Spot Instance Advisor and select instances with the least interruptions.

Hint: Ensure that you select the right region when viewing interruption rates, since they vary significantly across regions.

Kubernetes scheduling works in a way that lets us run on spot instances with minimal effort. Sure, there could be downtime, but the most you will ever see is about 15 minutes after a spot instance interruption.

Interruptions in spot instances can be due to a number of factors, which are listed in the link below:

The icing on the cake is that with node groups scaling down to zero, you will never pay a penny if your users don’t have any notebooks or workloads running in Kubeflow.

The reduction that we noticed at Fynd was 78.58% on DAILY TRAINING COSTS.

Enough with the why, let’s start with the how-to.

This tutorial describes in detail how to set up Kubeflow with the following specifications:

  • AWS Cloud environment
  • EKS Kubernetes Cluster
  • Multitenancy
  • Google OAuth
  • Spot GPU instances
  • Scaling GPU nodes down to zero when no notebooks/workloads are running

Scaling down to zero was the most significant factor for us; the cost of a GPU node is staggering when it is left running unused. Another important requirement was to run the whole setup on spot instances to significantly reduce notebook instance costs (which has come in handy, especially during the COVID-19 pandemic).

Getting Started

Prerequisites

The only thing you will need is an AWS account with all the necessary permissions. No experience with Kubernetes or AWS is required, but it would definitely be good to have.

Let’s start by generating AWS keys for building our Kubernetes cluster. Kubeflow will run on this cluster to leverage the full power of Kubernetes.

Step 1: Create AWS Programmatic Access Keys

Generating AWS Keys for CLI Programmatic Access
  1. Log in to the AWS console with your root account. Use the following link:
    https://aws.amazon.com/console/
  2. Head over to IAM Console
    https://console.aws.amazon.com/iam/home?region=us-east-1#/home
  3. Click on Users > Add user
  4. Enter a username and check Programmatic access
    Click [Next: Permissions]
  5. Click on Attach existing policies directly
  6. Enable AdministratorAccess and Click [Next: Tags]
  7. Click [Next: Review]
  8. Review and ensure you have AdministratorAccess
  9. Click [Create User]
  10. Download the CSV and store it securely. Ensure it doesn’t fall into anyone else’s hands, as it grants full administrator access.
    DO NOT SHARE THIS WITH ANYONE.

Step 2: Installing AWS CLI

Head over to the link below and install AWS CLI on your operating system.

When done, verify the AWS command works as expected.

aws --version
Check aws-cli

Step 3: Set Keys in AWS Profile

We will be using a named profile for this whole tutorial so it will not affect any existing environments.

aws configure --profile kf-admin
Create AWS Named Profile
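For reference, the prompts look roughly like this; fill them in with the keys from the CSV you downloaded (us-east-1 and json output are the values this tutorial assumes):

AWS Access Key ID [None]: <access key from the downloaded CSV>
AWS Secret Access Key [None]: <secret key from the downloaded CSV>
Default region name [None]: us-east-1
Default output format [None]: json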

We’re using us-east-1 as the default region since it’s among the cheapest regions with access to most features. Also, at the time of building the cluster, we needed access to AWS’s largest GPU instance, the p3dn.24xlarge, which was only available in us-east-1 at the time of writing this post.

Some factors to consider while choosing your region:

  • Cost
  • Feature/Instance Availability
  • Latency

I’ve linked a few resources that could come in handy here:

With the prerequisites in place, we’re now ready to set up Kubeflow.
But to run Kubeflow, we need a Kubernetes cluster.

Building the Kubernetes Cluster

Step 1: Installing prerequisites

You will need to install the eksctl and kubectl CLIs to create and manage an EKS cluster.

Check the link below to install the prerequisites. Do not spin up a cluster just yet; use the config files provided with this blog to make sure everything works as expected.

Step 2: Check prerequisites for EKS

Prerequisites Check

Once all of these are installed on your machine, we’re good to go!
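As a quick sanity check, each of these should print a version without errors:

# verify the CLIs are installed and on your PATH
eksctl version
kubectl version --client
aws --version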

Set AWS Profile to the one you created earlier and ensure the keys work as expected.

export AWS_PROFILE=kf-admin
aws s3 ls

The output may vary slightly, but it should be enough to tell whether your keys work.

Step 3: Clone Config Repository

Fork the repository on GitHub and clone it:

git clone https://github.com/arjun921/aws-spot-instances-kubeflow.git

Hint: Make sure to set your forked repository to private before proceeding forward.

We will be storing the .kubeconfig file in this repository.
The benefit of doing this is that you can clone the repository and connect to the cluster from any machine as needed.

In production, this is highly discouraged.

If anyone gets their hands on your .kubeconfig file, they will have full rights to manage and delete all resources inside the cluster.
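As a sketch of what this looks like in practice (once the cluster from the next step exists), you could write the kubeconfig into the repo and point kubectl at it. The file name .kubeconfig and the cluster name kubeflow-us-east-1 are assumptions here; use whatever your cluster-spec.yml actually defines.

# write the cluster's kubeconfig into the repo (cluster name is an assumption)
eksctl utils write-kubeconfig --cluster kubeflow-us-east-1 --kubeconfig ./.kubeconfig
# use it from any machine that has the repo cloned
export KUBECONFIG=$PWD/.kubeconfig
kubectl get nodes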

Step 4: Create the EKS Cluster

export ENVIRONMENT=staging
source envs/$ENVIRONMENT/variables.sh
eksctl create cluster -f envs/$ENVIRONMENT/cluster-spec.yml

Now, if all goes well, after a long wait (about 15–20 minutes), you should have an EKS cluster running.
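To sanity-check the cluster, list the nodes and node groups. I’m assuming the cluster name kubeflow-us-east-1 here (it is whatever metadata.name is set to in your cluster-spec.yml), and the GPU node groups will show zero nodes, since they start scaled down to zero.

# the always-on CPU nodes should show up as Ready
kubectl get nodes
# list all node groups, including the GPU ones sitting at zero (cluster name is an assumption)
eksctl get nodegroup --cluster kubeflow-us-east-1 --region us-east-1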

EKS makes spinning up a cluster very easy. The YAML creates the cluster with all the necessary prerequisites to run Kubeflow with maximum cost efficiency. However, if you do run into errors like this:

Error creating cluster

To fix the error, just delete the cluster, wait about 30 minutes for all resources to get cleared, update cluster-spec.yml to fix the error, and recreate. Most often, the error is that an instance type is not available in a given AZ; just switch the AZ in the YAML to the one suggested and you should be good to go.

To delete the cluster, run the following command

eksctl delete cluster -f envs/$ENVIRONMENT/cluster-spec.yml

Step 5: Setting up Node Autoscaling

Kubernetes nodes are scaled by the Cluster Autoscaler. To understand how autoscaling works, check my other blog post.

We’re including a stripped-down version of the autoscaling setup today. It automatically discovers node groups for autoscaling and is also responsible for scaling our GPU nodes down to zero when there are no Jupyter notebooks or GPU-requesting workloads running on the cluster.

# Install the cluster autoscaler (auto-discovery example manifest)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
# Edit the deployment
kubectl -n kube-system edit deployment.apps/cluster-autoscaler
# In the editor, make sure the cluster-autoscaler container's flags include the following
# (the auto-discovery tag must match your cluster name, here kubeflow-us-east-1):
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/kubeflow-us-east-1
- --balance-similar-node-groups
- --skip-nodes-with-system-pods=false
Deploying Autodiscovery enabled Cluster Autoscaler
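For context, here is a rough sketch of what the cluster-autoscaler container spec in that deployment ends up looking like after the edit. The image tag is an assumption (pick one that matches your cluster’s Kubernetes version), and kubeflow-us-east-1 is assumed to be your cluster name.

containers:
  - name: cluster-autoscaler
    # assumption: choose a tag matching your EKS Kubernetes version
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.17.3
    command:
      - ./cluster-autoscaler
      - --v=4
      - --stderrthreshold=info
      - --cloud-provider=aws
      - --skip-nodes-with-local-storage=false
      - --expander=least-waste
      - --balance-similar-node-groups
      - --skip-nodes-with-system-pods=false
      # auto-discover ASGs tagged for this cluster, so new node groups need no autoscaler changes
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/kubeflow-us-east-1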

Step 6: Install the Spot Instance Termination Handler

One last thing before we move on: install the spot instance termination handler on your cluster using Helm.

Install helm with the following command

curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash

And install the spot instance termination handler with these commands:

helm repo add eks https://aws.github.io/eks-charts 
helm repo update
helm install aws-node-termination-handler eks/aws-node-termination-handler \
--namespace kube-system \
--set enableSpotInterruptionDraining="true" \
--set enableScheduledEventDraining="true" \
--set nodeSelector.lifecycle=Ec2Spot

If you would like Slack alerts for spot instance termination notices, use this instead:

helm repo add eks https://aws.github.io/eks-charts 
helm repo update
helm install aws-node-termination-handler eks/aws-node-termination-handler \
--namespace kube-system \
--set enableSpotInterruptionDraining="true" \
--set enableScheduledEventDraining="true" \
--set nodeSelector.lifecycle=Ec2Spot \
--set webhookURL=https://hooks.slack.com/services/WEBHOOK/INTEGRATION/URL

The benefit of using this handler is that it drains the instance being terminated, so all of its pods get rescheduled onto a new instance with the requested specs.

Step 7: Install Nvidia Plugin

This plugin is a DaemonSet that gives your pods access to the actual GPUs on the instances in the cluster.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml
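Once a GPU node actually joins the cluster (remember, the GPU node groups start at zero), you can confirm the plugin is advertising GPUs with something like:

# the device plugin daemonset should be listed
kubectl -n kube-system get daemonset | grep nvidia
# each GPU node should report an allocatable nvidia.com/gpu count
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"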

Summary

Here’s what we’ve achieved so far: we deployed a cluster with autoscaling enabled and the following node groups

  • 10x m5a.2xlarge
    ng: ng-1
  • 10x m5a.2xlarge
    ng: ng-2
  • 10x p2.xlarge
    ng: 1-gpu-spot-p2-xlarge
  • 10x p3.2xlarge
    ng: 1-gpu-spot-p3-2xlarge
  • 4x p3.8xlarge
    ng: 4-gpu-spot-p3-8xlarge

But all of these instances won’t be running unless you have enough workload. I have also included the goliath p3dn.24xlarge in the YAML, which you can enable by uncommenting the spec at the bottom:

- name: 8-gpu-spot-p3dn-24xlarge
  minSize: 0
  maxSize: 1
  instancesDistribution:
    # Set your own max price. AWS spot instance prices no longer exceed the On-Demand price.
    # Comment out this field to default to the On-Demand price as the max price.
    maxPrice: 11
    instanceTypes: ["p3dn.24xlarge"]
    onDemandBaseCapacity: 0
    onDemandPercentageAboveBaseCapacity: 0
    spotAllocationStrategy: capacity-optimized
  labels:
    lifecycle: Ec2Spot
    aws.amazon.com/spot: "true"
    gpu-count: "8"
  availabilityZones: ["us-east-1f"]
  taints:
    spotInstance: "true:PreferNoSchedule"
  tags:
    k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
    k8s.io/cluster-autoscaler/node-template/label/aws.amazon.com/spot: "true"
    k8s.io/cluster-autoscaler/node-template/label/gpu-count: "8"
    k8s.io/cluster-autoscaler/node-template/taint/spotInstance: "true:PreferNoSchedule"
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/kubeflow-us-east-1: "owned"
  iam:
    withAddonPolicies:
      autoScaler: true
      cloudWatch: true
      albIngress: true

Deploying Kubeflow

Now that we’ve got that out of the way, we move on to the actual task: getting Kubeflow running on the cluster.

At the time of writing this, 1.0.2 was out, but for some reason I just couldn’t get auth working with it. So we’ll be deploying Kubeflow 1.0.1 with our scripts.

Here are the prerequisites:

Step 1: Install kfctl

Download the kfctl release for your OS from the link and run the following commands.

tar -xvf kfctl_v1.0.2_<platform>.tar.gz
mv ./kfctl /usr/local/bin
# check if it works
kfctl version
Install kfctl

Step 2: Install AWS IAM Authenticator

Install AWS IAM Authenticator from the link below and check if it works as expected.

Installing AWS IAM Authenticator

Step 3: Create Auth0 App

Create an account/log in to Auth0 and save the Client ID and Client Secret. We will be using these in the following steps. Also, make a note of your Auth0 app endpoint.

Step 4: Issue Wildcard Certificate

Follow this tutorial to get an ACM certificate.

To deploy the toolkit:

Step 1: Run script

Run the deploy_kubeflow.sh script from the cloned repository.
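Something along these lines should do it; the exact path and invocation are assumptions, so check the repository if the script lives elsewhere:

# from the root of the cloned repository, with the same profile and environment still set
export AWS_PROFILE=kf-admin
export ENVIRONMENT=staging
source envs/$ENVIRONMENT/variables.sh
# kicks off the kfctl-based deployment; it will drop you into an editor to tweak the config
bash deploy_kubeflow.sh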

When the editor opens up, replace the plugins key with the following.

Hint: Make sure to update the Client ID, Client Secret, Auth0 URL, and certificate ARN with the ones you generated earlier.

Close the editor and let the setup begin.

Deploying Kubeflow

Step 2: Set sub-domain

Once you have an ELB address, head over to AWS Route 53 and add a subdomain record pointing to the ELB. Here’s an example of what your record could look like.
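If you need to look up that address, it is usually exposed on the ingress in the istio-system namespace (assuming the standard AWS Kubeflow setup). The record itself is just a CNAME from your chosen subdomain to the load balancer hostname; the values below are purely illustrative.

# fetch the load balancer hostname created for Kubeflow (assumes the default istio-system ingress)
kubectl get ingress -n istio-system

# Route 53 record (illustrative values only)
# Name:  kubeflow.yourdomain.com
# Type:  CNAME
# Value: a1b2c3d4e5-istiosystem-istio-123456789.us-east-1.elb.amazonaws.com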

Step 3: Configure Auth0 callback

Head over to your Auth0 app and update your application’s Allowed Callback URLs to the callback URL for your Kubeflow domain.

Make sure to scroll to the bottom and click Save! Once this is done, you should be able to access Kubeflow at your specified domain.

Final Kubeflow Dashboard
Invincible: exactly how I felt when I got to the dashboard 😬

And we’re good to go! Just fire up a Jupyter notebook with a GPU image, request 1–2 GPUs, and you should have a notebook running in about 10 minutes. The time taken for the node to get attached might vary, but it shouldn’t exceed 15 minutes.

If a node with 4 GPUs is spun up and a notebook uses only 2 of them, the other 2 remain uncommitted. When a new notebook server is spun up, Kubeflow (or rather, Kubernetes) will check whether any node has uncommitted resources and schedule the notebook onto that same instance.
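Under the hood, the GPUs you request in the notebook UI end up as a standard Kubernetes resource limit on the notebook pod, which is what lets the scheduler bin-pack notebooks onto nodes with free GPUs. A minimal sketch of what that looks like (values are illustrative):

# what the notebook pod's container spec effectively requests when you ask for 2 GPUs
resources:
  limits:
    nvidia.com/gpu: "2"  # extended resource advertised by the Nvidia device plugin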

Wrapping up

The hard part of the whole deployment flow was getting the cluster config right. If you notice, about three-quarters of this blog is just building the cluster, and one-quarter is the Kubeflow deployment.

That being said, it’s finally done and dusted.

If you face any issues setting up Kubeflow on AWS with the provided spec, or have questions or comments, feel free to leave feedback in the comments section or reach out to me directly at https://arjunsunil.com/

And don’t forget to 👏 if you enjoyed this article 🙂.

References
