How Can You Do Deep Learning in the Cloud?
Deep learning is at the center of most artificial intelligence initiatives. It is based on the concept of a deep neural network, which passes inputs through multiple layers of connections. Neural networks can perform many complex cognitive tasks, improving performance dramatically compared to classical machine learning algorithms. However, they often require huge data volumes to train, and can be very computationally intensive.
Cloud computing services make deep learning more accessible by simplifying the management of large datasets and the training of algorithms on distributed hardware.
Cloud services are an enabler for deep learning in four respects:
- Provide access to large-scale computing capacity on demand, making it possible to distribute model training across multiple machines.
- Provide access to special hardware configurations, including GPUs, FPGAs, and massively parallel high performance computing (HPC) systems.
- Do not require an upfront investment — you can get advanced hardware, or large quantities of hardware, without having to purchase it. Pay only for the time you use.
- Assist with management of deep learning workflows — cloud services provide advanced features for managing datasets and algorithms, training models and deploying them efficiently to production.
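To make the pay-as-you-go point concrete, the following is a minimal sketch of a break-even calculation. The server price and hourly rate are hypothetical placeholders, not quotes from any provider, and the comparison deliberately ignores power, staffing, and depreciation.

```python
def break_even_hours(purchase_cost: float, hourly_rate: float) -> float:
    """Hours of cloud usage at which renting costs as much as buying."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return purchase_cost / hourly_rate


def cheaper_option(purchase_cost: float, hourly_rate: float, expected_hours: float) -> str:
    """Return 'cloud' or 'purchase' for the expected number of usage hours."""
    cloud_cost = hourly_rate * expected_hours
    return "cloud" if cloud_cost < purchase_cost else "purchase"


# Hypothetical numbers, for illustration only.
GPU_SERVER_PRICE = 30_000.0  # assumed upfront cost of a GPU server
CLOUD_RATE = 3.0             # assumed on-demand price per hour

print(break_even_hours(GPU_SERVER_PRICE, CLOUD_RATE))       # 10000.0
print(cheaper_option(GPU_SERVER_PRICE, CLOUD_RATE, 2_000))  # cloud
```

Under these assumed prices, a team that expects to train for far fewer than 10,000 hours would spend less by renting.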
What Are the Most Popular Deep Learning Services in the Cloud?
Let’s briefly review the deep learning offerings of major cloud providers — Amazon, Google Cloud, and Microsoft Azure.
IaaS vs. PaaS
In each of these clouds, it is possible to run deep learning workloads in a “do it yourself” model. This involves selecting machine images that come pre-installed with deep learning infrastructure, and running them in an IaaS model, for example as Amazon EC2 instances or Google Compute Engine VMs.
All the cloud providers we review below offer compute instances suitable for deep learning models, with specialized hardware such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and, on Google Cloud, Tensor Processing Units (TPUs). To learn about the compute options offered by each cloud provider, refer to our articles about:
- Google TPU
- AWS GPU
- Azure GPU
Below, we focus on the platform as a service (PaaS) offering each cloud provides for deep learning users. These PaaS offerings provide the hardware needed for deep learning workloads, as well as software services for managing deep learning pipelines, from data ingestion to production deployment and real-world inference.
Deep Learning on AWS with SageMaker
Amazon Web Services provides the SageMaker service, which lets you build and manage machine learning models on the cloud, with a focus on deep learning.
- SageMaker services include:
  - Ground Truth — lets you create and manage training data sets
  - Studio — cloud-based development environment for machine learning models
  - Autopilot — builds and trains models automatically
  - Tuning — helps tune hyperparameters for a model
- Supports Jupyter notebooks — allowing users to share and collaborate on their own models and code.
- AWS Marketplace — provides pre-built algorithms and models created by third parties, which can be purchased on a pay-per-use basis.
- Framework support — supports all popular deep learning frameworks including TensorFlow, PyTorch, MXNet, Keras, Gluon, Scikit-learn, Horovod, and Deep Graph Library.
Google Cloud Machine Learning Services
Google’s set of machine learning services, together called Cloud AI, includes general purpose and dedicated services for specific use cases:
- Cloud AutoML suite — lets you build, train, and deploy models to production using cloud infrastructure
- AI Hub — provides a repository of components and algorithms that can be used to build models. Unlike AWS Marketplace, AI Hub focuses on free knowledge sharing rather than commercial offerings of AI components.
- Data labeling service — lets you prepare and identify data for machine learning models.
- Visual AI and Video AI — these are two purpose-built services that provide preconfigured deep learning pipelines for processing image and video data.
Microsoft Azure Machine Learning
Azure Machine Learning is a complete environment for training, deploying, and managing machine learning models.
Key features of Azure Machine Learning:
- Drag-and-drop model designer — used to build machine learning models with no code. The designer supports several neural network components, covering two-class classification, multi-class classification, and regression, as well as the DenseNet and ResNet architectures.
- MLOps — supports a DevOps-style method for building and managing machine learning pipelines and workflows.
- Security and governance — integrated into the service, letting you verify compliance of machine learning processes, and perform identity and privacy management according to your organization’s governance policies.
- Frameworks support — supports PyTorch, TensorFlow, Keras, MXNet, scikit-learn, and Chainer.
How Should You Choose a Cloud Deep Learning Platform?
Here are a few key considerations when selecting your cloud-based deep learning service.
Data Preparation
Data preparation can be one of the heaviest and most sensitive parts of a deep learning project. There are two common ways to prepare large volumes of data for analytics, which are also used to create deep learning datasets from raw data:
- Extract, transform, load (ETL) — transforms data as it is pulled from the source and creates a ready-made dataset that can be used for analytics purposes.
- Extract, load, transform (ELT) — provides greater flexibility by letting you store raw data in a data lake and then transform it into the required format on demand.
Check which data services are provided by your cloud vendor and whether they support ETL, ELT, or both. Understand which data storage, database or data warehouse services you will use, and how they can make data preparation easier.
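The difference between the two approaches comes down to where the transform step runs. The sketch below illustrates this with plain Python lists standing in for a data warehouse and a data lake; the record fields and function names are hypothetical and do not correspond to any vendor's service.

```python
import json

# Hypothetical raw records as they might arrive from a source system.
RAW_EVENTS = [
    {"user": " Alice ", "clicks": "3"},
    {"user": "bob", "clicks": "5"},
]


def transform(record: dict) -> dict:
    """Clean one raw record: normalize the name and cast clicks to int."""
    return {"user": record["user"].strip().lower(), "clicks": int(record["clicks"])}


def etl(source: list) -> list:
    """ETL: transform during extraction, so only cleaned data is stored."""
    return [transform(r) for r in source]


def elt_load(source: list) -> list:
    """ELT, load step: store raw records untouched (JSON lines stand in for a data lake)."""
    return [json.dumps(r) for r in source]


def elt_transform(lake: list) -> list:
    """ELT, transform step: build the dataset on demand from the raw copy."""
    return [transform(json.loads(line)) for line in lake]


warehouse = etl(RAW_EVENTS)
lake = elt_load(RAW_EVENTS)

# Both routes yield the same cleaned dataset; ELT also keeps the raw data,
# so a different transform can be applied later without re-extracting.
print(elt_transform(lake) == warehouse)
```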
Scale-Up and Scale-Out Training
Data scientists typically start by developing a model on a local notebook, but it is not feasible to train most deep learning models on a local workstation. A key capability of a cloud deep learning service is the ability to integrate with notebooks and push training jobs seamlessly to cloud-based compute instances.
Evaluate how easy it is to run training jobs on hardware like GPUs, TPUs, and FPGAs, to manage these jobs across data science teams, and to visualize and interpret their results.
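Scale-out training is commonly implemented as data parallelism: each worker computes gradients on its own shard of the data, and the gradients are averaged before the model is updated. The sketch below simulates that loop in a single process for a one-parameter model. It is an illustration of the idea under simplifying assumptions (equal-size shards, sequential "workers"), not how any particular cloud service implements it.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def split(data, workers):
    """Split data into equal shards, one per worker (any remainder is dropped)."""
    size = len(data) // workers
    return [data[i * size:(i + 1) * size] for i in range(workers)]


def train(data, workers, steps=200, lr=0.01):
    """Synchronous data-parallel gradient descent, simulated sequentially."""
    w = 0.0
    shards = split(data, workers)
    for _ in range(steps):
        # On a real cluster each worker would compute this on its own machine.
        grads = [gradient(w, shard) for shard in shards]
        # Averaging stands in for the all-reduce step across workers.
        w -= lr * sum(grads) / len(grads)
    return w


# Synthetic data generated from y = 3x, so the true weight is 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

print(round(train(data, workers=1), 6))
print(round(train(data, workers=4), 6))
```

With equal-size shards, the averaged gradient matches the full-batch gradient, so one worker and four workers converge to the same weight. The benefit of more workers is that each step touches less data per machine.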
Deep Learning Frameworks Support
Each cloud machine learning service supports different frameworks. You can typically get the broadest framework support in an IaaS model, when deploying deep learning directly on compute instances. However, if you use a full MLOps platform, you will be limited to the frameworks it supports.
Look for support of the following frameworks, which your data science team may need to use now or in the future:
- Deep learning frameworks — TensorFlow, PyTorch, Keras, MXNet, Deep Java Library
- Classical machine learning — Scikit-learn, R, Spark MLlib, H2O.ai, Java-ML
- Job scheduling and distribution — Horovod, Kubernetes, Slurm, LSF
Also evaluate the ability to integrate your own code and algorithms with the platform’s library of built-in algorithms. This can improve productivity, because you can draw on existing building blocks and only develop unique aspects of your model.
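A simple way to compare platforms on this criterion is to check each platform's supported frameworks against the list your team needs. The sketch below uses the SageMaker and Azure Machine Learning framework lists given earlier in this article, which may be out of date; the team requirements are hypothetical.

```python
# Framework lists as stated earlier in this article (may be out of date).
PLATFORM_FRAMEWORKS = {
    "SageMaker": {
        "tensorflow", "pytorch", "mxnet", "keras", "gluon",
        "scikit-learn", "horovod", "deep graph library",
    },
    "Azure Machine Learning": {
        "pytorch", "tensorflow", "keras", "mxnet", "scikit-learn", "chainer",
    },
}


def missing_frameworks(required, supported):
    """Return the required frameworks a platform does not list, sorted by name."""
    return sorted({name.lower() for name in required} - supported)


def rank_platforms(required):
    """Return (platform, missing frameworks) pairs, fewest gaps first."""
    report = [
        (platform, missing_frameworks(required, supported))
        for platform, supported in PLATFORM_FRAMEWORKS.items()
    ]
    return sorted(report, key=lambda item: (len(item[1]), item[0]))


# Hypothetical requirements for a data science team.
team_needs = ["PyTorch", "Scikit-learn", "Horovod"]

for platform, missing in rank_platforms(team_needs):
    print(platform, "is missing:", missing or "nothing")
```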
Pre-Tuned AI Services
Most cloud platforms provide pre-trained, pre-optimized AI services for many applications including:
- Image classification
- Object recognition
- Video data extraction
- Language translation
- Speech synthesis
- Recommendation engines
The advantage of these types of services is that they have been trained on massive data volumes that are not available to individual companies. They can provide very high accuracy for general use cases, and provide excellent performance and low latency in production. Best of all, they are ready to use out of the box.
Monitor Prediction Performance
Deploying a model is only the start, not the end point, of your AI journey. Data changes and user requirements change, and it is essential to monitor a model’s performance over time, tune it, augment it, and if necessary, replace it. Evaluate the tools a cloud service provides for monitoring model performance when it is already in production, and how easy it is to release updates and improvements to live deep learning models.
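As an illustration of what production monitoring involves, here is a minimal sketch of a rolling accuracy check that flags a model once recent performance drops below a threshold. The class name, window size, and threshold are assumptions for the example; managed cloud services offer their own monitoring tools with different interfaces.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions with known outcomes."""

    def __init__(self, window=100, threshold=0.9):
        self.outcomes = deque(maxlen=window)  # oldest results fall off automatically
        self.threshold = threshold

    def record(self, prediction, actual):
        """Store whether one prediction matched the observed label."""
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        """Return accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True once the window is full and accuracy is below the threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold


# Small hypothetical window so the behavior is easy to follow.
monitor = AccuracyMonitor(window=4, threshold=0.75)
for prediction, actual in [("cat", "cat"), ("dog", "dog"), ("cat", "cat"), ("dog", "dog")]:
    monitor.record(prediction, actual)
print(monitor.accuracy(), monitor.needs_attention())

# Simulate data drift: the two most recent predictions are wrong.
monitor.record("cat", "dog")
monitor.record("dog", "cat")
print(monitor.accuracy(), monitor.needs_attention())
```

A flag from a check like this would be the trigger to retrain, tune, or replace the model.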
Deep Learning in the Cloud with MLOps Innovation
Some innovative MLOps solutions automate resource management and orchestration for machine learning infrastructure. You can automatically run as many compute intensive experiments as needed.
For example, an AI orchestration platform for GPU-based computers running AI/ML workloads can provide:
- Advanced queueing and fair scheduling to allow users to easily and automatically share clusters of GPUs
- Distributed training on multiple GPU nodes to accelerate model training times
- Fractional GPUs to seamlessly run multiple workloads on a single GPU of any type
- Visibility into workloads and resource utilization to improve user productivity
While this solution is not meant for cost reduction (it won't necessarily decrease the number of on-prem GPUs), it helps utilize the full capacity of existing GPUs. This simplifies machine learning infrastructure orchestration, helping data scientists improve their productivity and the quality of their models.
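To give a sense of what fair scheduling means in practice, the sketch below hands out a fixed number of GPUs by taking one pending job per user in turn, so a user with a long queue cannot starve the others. It is a simplified round-robin stand-in with hypothetical user and job names, not a description of how any specific orchestration product works.

```python
from collections import deque


def fair_schedule(jobs_by_user, gpus):
    """Assign up to `gpus` jobs, taking one job per user in rotation."""
    queues = {user: deque(jobs) for user, jobs in jobs_by_user.items() if jobs}
    order = deque(sorted(queues))  # deterministic starting order
    scheduled = []
    while order and len(scheduled) < gpus:
        user = order.popleft()
        scheduled.append((user, queues[user].popleft()))
        if queues[user]:
            # User still has pending jobs, so they rejoin the back of the line.
            order.append(user)
    return scheduled


# Hypothetical pending jobs: alice has submitted far more than the others.
pending = {
    "alice": ["a1", "a2", "a3", "a4"],
    "bob": ["b1"],
    "carol": ["c1", "c2"],
}

print(fair_schedule(pending, gpus=4))
```

With four GPUs, every user gets at least one job running before anyone gets a second.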
Originally published by our friends at Run:AI