What is Kubeflow?
Kubeflow is an open-source project built on top of Kubernetes that provides a set of tools and frameworks for developing, deploying, and managing Machine Learning models, workflows, and services in a portable and scalable manner. Kubeflow aims to create an industry standard for end-to-end management of Machine Learning infrastructure.
Using Machine Learning in an organization is not a straightforward process. It is very different from deploying the generic services a company runs in its production environment. The life cycle of an ML service is diverse and spans many stages. If these stages are not carefully considered during development, the service can run into scalability issues once it is in production. This is where Kubeflow comes to the rescue: it provides an industry-standard approach that considers the ML service from the initial phase of development through to deployment at scale.
Use Cases of Kubeflow
The end goal of most ML work is to serve a model in production and generate value for the business. However, ML models go through a multi-stage process to reach this point:
- Data Loading
- Verification
- Splitting
- Processing
- Feature Engineering
- Model Training and Validation
- Hyperparameter Tuning and Optimization
- Model Serving
It is this multi-stage process that Kubeflow simplifies and standardizes, as running and maintaining these stages is a challenge even for the most experienced Data Scientists and Machine Learning Engineers.
Core use cases of Kubeflow
Deploying and Managing complex ML Systems at Scale
Kubeflow can be used to manage the entire Machine Learning workflow of a company at scale while maintaining a consistent level of quality. Because it runs on Kubernetes, every Kubeflow user gets all the capabilities that Kubernetes offers, which provides excellent scalability.
Research and Development with various Machine Learning Models
Any ML workflow requires a large amount of experimentation and research. This includes testing various models, comparing them, tuning hyperparameters, and validating the results. Kubeflow provides Jupyter Notebooks, support for various ML frameworks, and capabilities such as end-to-end pipelines built around CUJs (Critical User Journeys) that speed up development.
Hybrid and Multi-Cloud Machine Learning Workloads
Kubeflow is supported by all major cloud providers. It provides a standardized environment that abstracts away the underlying configuration, so researchers and developers can focus on development, with ML workflows capable of running on cloud resources, laptops, and on-prem servers.
Hyperparameters tuning and optimization
In the development phase, hyperparameter optimization is often a critical task, and results can be skewed by very minor variations. Manual hyperparameter tuning is tedious and time-consuming. Kubeflow provides Katib, a tool that tunes hyperparameters in an automated way, which can reduce development time considerably.
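To give a feel for how Katib is driven, a minimal Experiment manifest might look roughly like the following sketch. The experiment name, metric name, parameter range, and training image are placeholders, not values from this article; consult the Katib documentation for the full schema.

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-demo        # placeholder name
  namespace: kubeflow
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy # metric your training code reports
  algorithm:
    algorithmName: random         # random search over the space below
  parallelTrialCount: 3
  maxTrialCount: 12
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        reference: lr
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/your-org/train:latest   # placeholder image
                command:
                  - python
                  - train.py
                  - "--lr=${trialParameters.learningRate}"
            restartPolicy: Never
```

Katib then launches up to 12 trials (3 at a time), each as a Kubernetes Job with a sampled learning rate, and tracks the reported accuracy to find the best configuration.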
The Core Principles of Kubeflow
Kubeflow is built on three core principles:
Composability
A Machine Learning service or model often varies according to the use case and the data provided to it. Composability means the ability to choose what is right for your project. ML model generation is a multi-stage process, and we need to carefully choose the stages our project requires. Kubeflow handles version switching and the coordination of various frameworks and libraries by treating each of them as an independent system, and then lets us easily build a pipeline across these systems.
Portability
Portability in Kubeflow means that it creates an abstraction layer between your system and the ML project. The ML project can run anywhere you use Kubeflow, whether that is your laptop, a training rig, or the cloud. Kubeflow handles all the platform-specific configuration, so we only need to worry about our ML models and not the underlying configs.
Scalability
Scalability is the ability to increase and decrease resource consumption according to the project's requirements or the request load it needs to handle. Because Kubeflow is built on top of Kubernetes, it is ideally positioned to manage all the resources it needs through the underlying capabilities of the Kubernetes engine. Toggling between computing resources, sharing between multiple teams, and region allocation lie at the very foundation of Kubeflow thanks to Kubernetes underneath.
The Components of Kubeflow
The components that collectively make up Kubeflow are:
Dashboard
Kubeflow provides a central dashboard that helps you keep track of all the pipelines, services, and other resources deployed via Kubeflow.
Jupyter Notebook/Servers
Jupyter notebooks are one of the most used tools in the fields of Data Science and Machine Learning; you can spin up a Jupyter Notebook quickly and begin your research and development. It abstracts away the excess details you would need to handle in an IDE. Jupyter Notebooks contain cells in which code runs in an interpreted manner, which is great for visualization and research work.
Machine Learning Frameworks
Kubeflow comes with support for various state-of-the-art Machine Learning frameworks such as TensorFlow, PyTorch, MXNet, MPI, and Chainer, which are widely used in the ML industry.
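Framework support is exposed through training operators with their own custom resources. As a sketch, a distributed TensorFlow training job is described with a TFJob manifest along these lines (the job name and training image are placeholders, and the replica counts are illustrative only):

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-train-demo              # placeholder name
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                  # two worker pods for distributed training
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow     # container name expected by the TF operator
              image: docker.io/your-org/tf-train:latest  # placeholder image
              command: ["python", "train.py"]
```

The corresponding operators for PyTorch, MXNet, and MPI use analogous custom resources (PyTorchJob, MXJob, MPIJob).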
Machine Learning Pipelines
Kubeflow comes with built-in ML pipelines for end-to-end orchestration of ML workflows. These pipelines are reusable and easy to set up for experimentation and development.
Serving Tools
Serving the ML model as a production service is the end goal of Machine Learning research work in a company. Kubeflow comes with a wide range of serving tools for your ML models, such as TensorFlow Serving, NVIDIA Triton, Seldon Serving, and KFServing.
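With KFServing, for example, a trained model is exposed through an InferenceService custom resource. A minimal sketch for a scikit-learn model might look like the following (the service name and storage URI are placeholders; note that KFServing has since been renamed to KServe, where the API group differs):

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: sklearn-demo                              # placeholder name
spec:
  predictor:
    sklearn:
      # placeholder path to a serialized model in object storage
      storageUri: gs://your-bucket/models/sklearn/iris
```

Applying this manifest stands up an HTTP prediction endpoint backed by the model artifact, with scaling handled by the underlying Kubernetes infrastructure.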
Metadata and Base ML Components
Kubeflow includes a facility for storing metadata for your ML workflows, which helps you maintain and manage them. The metadata covers execution configs, models, datasets, and deployment artifacts for Kubernetes.
Feature Storage
Feature storage concerns the production deployment side of an ML service and is often the part that Machine Learning teams find most challenging. It covers supplying features to Machine Learning models for both training and inference in production. A feature store handles various issues such as:
- Feature Sharing and Reuse
- Serving features at Scale
- Consistency between Training and Serving
- Point-in-time correctness
- Data Quality and Validation
To address these issues, Kubeflow uses Feast, an open-source feature store that helps teams working on a Machine Learning system define, manage, discover, validate, and serve features to ML models during the training and inference phases.
The functionality Feast provides includes:
- Standardization: it acts as the ground truth for the various teams working in a company, improving communication between teams since everyone uses the same ground truth for their development.
- Load Streaming and Data Batching: it simplifies data ingestion by providing the facility to ingest data from various sources such as data streams, object stores, databases, or notebooks. The data can then be used to generate datasets, training data, validation data, etc.
- Online and Historical Serving: Feast supports online and historical serving by exposing low-latency APIs over the ingested data. It ensures point-in-time correctness, which guarantees the quality and consistency of features in models.
- Consistency between training and serving in production: using Feast, teams can abstract away the underlying data infrastructure and thus gain the ability to migrate models from the training phase to the serving phase easily.
- Validation: Feast has built-in compatibility with TensorFlow Data Validation (TFDV); it can capture TFDV schemas, which can then be used to validate features at the ingestion, training, or serving stage.
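For orientation, Feast is driven by a small configuration file. Its format has changed across Feast versions, but in recent versions a minimal local setup is declared in a feature_store.yaml roughly like this (the project name and file paths are placeholders):

```yaml
project: my_ml_project             # placeholder project name
registry: data/registry.db         # where feature definitions are registered
provider: local                    # local provider for development
online_store:
  type: sqlite                     # low-latency store for online serving
  path: data/online_store.db
```

In production deployments the provider and online store are typically swapped for cloud-backed equivalents without changing the feature definitions themselves.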
Installation Process
The process of setting up Kubeflow is explained below.
Prerequisites
Before moving to the Kubeflow setup, basic prior knowledge of Kubernetes and Kustomize is required. Kubernetes is the underlying container-orchestration service on which Kubeflow is built, and Kustomize is a template-free way to customize application configuration.
Reference Link: Kubernetes Basics
Reference Link: Kustomize
Note: even while using Kubernetes, you need to comply with the minimum system requirements for deploying Kubeflow on your Kubernetes cluster. The reference link for the minimum system requirements is given below:
Reference Link: Minimum System Requirements
Installing Kubeflow on a Kubernetes Cluster
You can use a pre-built Kubernetes cluster or follow the process below to create a quick cluster using minikube. Make sure you have all the Kubernetes helper tools installed, such as kubectl.
1. Install Minikube using the binary for amd64/x86_64
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
Note: For other Linux distros and operating systems refer to the link below:
Reference Link: minikube download
2. Run Minikube
minikube start
3. Check for proper kubectl installation using
kubectl version
4. Check Cluster Information
kubectl cluster-info
Other kubectl commands can be used to get more detailed information about the cluster, including its nodes, deployments, services, etc. Refer to the reference link below for these commands.
Reference Link: kubectl Cheat Sheet
Kubeflow Operator
Kubeflow Operator helps deploy, monitor, and manage the Kubeflow lifecycle. It is built using the Operator Framework, an open-source toolkit for building, testing, and packaging operators and managing their lifecycle. The Kubeflow Operator uses KfDef as its custom resource and kfctl as the underlying tool for running the operator. It can be installed from operatorhub.io.
5. Install Operator Lifecycle Manager
curl -sL https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.17.0/install.sh | bash -s v0.17.0
6. Install Operator
kubectl create -f https://operatorhub.io/install/kubeflow.yaml
7. Watch Operators Come Up
kubectl get csv -n operators
8. Check and Verify Operator Installation
kubectl get pod -n operators
NAME READY STATUS RESTARTS AGE
kubeflow-operator-55876578df-25mq5 1/1 Running 0 17h
9. Prepare KfDef configuration
The metadata.name field must be set in the KfDef manifest, whether it is downloaded from the Kubeflow manifests repo or written from scratch. The following example shows how to prepare the KfDef manifest:
# download a default KfDef configuration from remote repo
export KFDEF_URL=https://raw.githubusercontent.com/kubeflow/manifests/v1.1-branch/kfdef/kfctl_ibm.yaml
export KFDEF=$(echo "${KFDEF_URL}" | rev | cut -d/ -f1 | rev)
curl -L ${KFDEF_URL} > ${KFDEF}
# add metadata.name field
# Note: yq can be installed from https://github.com/mikefarah/yq
export KUBEFLOW_DEPLOYMENT_NAME=kubeflow
yq w ${KFDEF} 'metadata.name' ${KUBEFLOW_DEPLOYMENT_NAME} > ${KFDEF}.tmp && mv ${KFDEF}.tmp ${KFDEF}
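As an aside, the rev | cut pipeline above only extracts the filename from the URL; basename does the same thing more directly. The small sketch below contrasts the two, reusing the example URL from this section:

```shell
#!/bin/sh
KFDEF_URL=https://raw.githubusercontent.com/kubeflow/manifests/v1.1-branch/kfdef/kfctl_ibm.yaml

# rev reverses the string, cut takes the first /-separated field,
# and the second rev restores the original order of that field
KFDEF=$(echo "${KFDEF_URL}" | rev | cut -d/ -f1 | rev)

# basename yields the same result with less ceremony
KFDEF_ALT=$(basename "${KFDEF_URL}")

echo "${KFDEF}"       # kfctl_ibm.yaml
echo "${KFDEF_ALT}"   # kfctl_ibm.yaml
```

Either form works; the original pipeline is kept in the steps above to match the upstream instructions.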
10. Deploy Kubeflow using Kubeflow Operator
# create the namespace for Kubeflow deployment
KUBEFLOW_NAMESPACE=kubeflow
kubectl create ns ${KUBEFLOW_NAMESPACE}
# create the KfDef custom resource
kubectl create -f ${KFDEF} -n ${KUBEFLOW_NAMESPACE}
11. Watch the deployment progress
kubectl logs deployment/kubeflow-operator -n operators -f
12. Monitor pods and verify Kubeflow Deployment
kubectl get pod -n ${KUBEFLOW_NAMESPACE}
Kubeflow deployment for specific cloud providers
As mentioned above, Kubeflow is supported by all major cloud providers. Although the underlying process is quite similar, reference docs are available for installation on the various providers:
- Google Cloud Platform (GCP): Reference Link
- Amazon Web Services (AWS): Reference Link
- Microsoft Azure Kubernetes Service (AKS): Reference Link
- IBM Cloud (IKS): Reference Link
- OpenShift: Reference Link