Cloud Deep Learning: AWS, Azure & GCP Compared

Miguel Fersen Director for Iberia and LATAM, GlobalDots

23rd February, 2022 6 Min read

How Can You Do Deep Learning in the Cloud?

Deep learning is at the center of most artificial intelligence initiatives. It is based on the concept of a deep neural network, which passes inputs through multiple layers of connections. Neural networks can perform many complex cognitive tasks, improving performance dramatically compared to classical machine learning algorithms. However, they often require huge data volumes to train, and can be very computationally intensive.
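To make "passing inputs through multiple layers of connections" concrete, here is a minimal sketch of a forward pass in plain Python. The layer sizes, weights, and the choice of ReLU as the activation function are illustrative assumptions, not something the article specifies; real projects would use a framework such as TensorFlow or PyTorch.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Activation function (assumed here): negative values become zero."""
    return [max(0.0, v) for v in values]


def forward(inputs, layers):
    """Pass inputs through each (weights, biases) layer in turn.

    ReLU is applied between layers but not after the last one.
    """
    activations = inputs
    for i, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if i < len(layers) - 1:
            activations = relu(activations)
    return activations


# Hypothetical hand-picked weights: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 2.0]], [0.5]),
]
output = forward([2.0, 1.0], layers)  # [4.5]
```

Training consists of adjusting the weights and biases so that outputs match known examples, which is the part that demands large datasets and heavy computation.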

Cloud computing services are helping make deep learning more accessible, making it easier to manage large datasets and train algorithms on distributed hardware.

Cloud services enable deep learning in four ways:

They provide access to large-scale computing capacity on demand, making it possible to distribute model training across multiple machines.
They provide access to special hardware configurations, including GPUs, FPGAs, and massively parallel high performance computing (HPC) systems.
They require no upfront investment: you can get advanced hardware, or large quantities of hardware, without having to purchase it, and pay only for the time you use.
They assist with management of deep learning workflows: cloud services provide features for managing datasets and algorithms, training models, and deploying them efficiently to production.

What Are the Most Popular Deep Learning Services in the Cloud?

Let’s briefly review the deep learning offerings of the major cloud providers: Amazon Web Services (AWS), Google Cloud, and Microsoft Azure.

IaaS vs. PaaS

In each of these clouds, it is possible to run deep learning workloads in a “do it yourself” model. This involves selecting machine images that come pre-installed with deep learning infrastructure, and running them in an IaaS model, for example as Amazon EC2 instances or Google Compute Engine VMs.

All the cloud providers we review below offer compute instances suitable for deep learning models, which provide specialized hardware such as graphics processing units (GPU), field-programmable gate arrays (FPGA) and Tensor Processing Units (TPU). To learn about the compute options offered by each cloud provider, refer to our articles about:

Google TPU
AWS GPU
Azure GPU

Below, we focus on the platform as a service (PaaS) offering each cloud provides for deep learning users. These PaaS offerings provide the hardware needed for deep learning workloads, as well as software services for managing deep learning pipelines, from data ingestion to production deployment and real-world inference.

Deep Learning on AWS with SageMaker

Amazon Web Services provides the SageMaker service, which lets you build and manage machine learning models on the cloud, with a focus on deep learning.

SageMaker services include:
Ground Truth — lets you create and manage training data sets
Studio — cloud-based development environment for machine learning models
Autopilot — builds and trains models automatically
Tuning — helps tune hyperparameters for a model
Supports Jupyter notebooks — allowing users to share and collaborate on their own models and code.
AWS Marketplace — provides pre-built algorithms and models created by third parties, which can be purchased on a pay-per-use basis.
Framework support — supports all popular deep learning frameworks including TensorFlow, PyTorch, MXNet, Keras, Gluon, Scikit-learn, Horovod, and Deep Graph Library.

Google Cloud Machine Learning Services

Google’s set of machine learning services, together called Cloud AI, includes general purpose and dedicated services for specific use cases:

Cloud AutoML suite — lets you build, train, and deploy models to production using cloud infrastructure
AI Hub — provides a repository of components and algorithms that can be used to build models. Unlike the AWS model, AI Hub is focused on free knowledge sharing, not on commercial offerings of AI components.
Data labeling service — lets you prepare and identify data for machine learning models.
Vision AI and Video AI — these are two purpose-built services that provide preconfigured deep learning pipelines for processing image and video data.

Microsoft Azure Machine Learning

Azure Machine Learning is a complete environment for training, deploying, and managing machine learning models.

Key features of Azure Machine Learning:

Drag-and-drop model designer — used to build machine learning models with no code. The designer includes neural network components for two-class classification, multi-class classification, and regression, as well as the DenseNet and ResNet architectures.
MLOps — supports a DevOps-style method for building and managing machine learning pipelines and workflows.
Security and governance — integrated into the service, letting you verify compliance of machine learning processes, and perform identity and privacy management according to your organization’s governance policies.
Frameworks support — supports PyTorch, TensorFlow, Keras, MXNet, scikit-learn, and Chainer.

How Should You Choose a Cloud Deep Learning Platform?

Here are a few key considerations when selecting your cloud-based deep learning service.

Data Preparation

Data preparation can be one of the heaviest and most sensitive parts of a deep learning project. There are two common ways to prepare large volumes of data for analytics, which are also used to create deep learning datasets from raw data:

Extract, transform, load (ETL) — transforms data as it is pulled from the source and creates a ready-made dataset that can be used for analytics purposes.
Extract, load, transform (ELT) — provides greater flexibility by letting you store raw data in a data lake and then transform it into the required format on demand.

Check which data services are provided by your cloud vendor and whether they support ETL, ELT, or both. Understand which data storage, database or data warehouse services you will use, and how they can make data preparation easier.
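The difference between the two approaches can be sketched in a few lines of Python. This is a toy illustration using in-memory lists; the record layout, the `transform` rule, and the function names are hypothetical, and a real pipeline would use the storage and data warehouse services of your cloud vendor.

```python
# Hypothetical raw records as they might arrive from a source system.
RAW_EVENTS = [
    {"user": "a", "amount": "12.50"},
    {"user": "b", "amount": "n/a"},  # malformed record
    {"user": "a", "amount": "7.50"},
]


def transform(record):
    """Parse the amount into a float; return None for records that cannot be parsed."""
    try:
        return {"user": record["user"], "amount": float(record["amount"])}
    except ValueError:
        return None


def etl(source):
    """ETL: transform while extracting, so only cleaned rows are ever stored."""
    warehouse = []
    for record in source:
        row = transform(record)
        if row is not None:
            warehouse.append(row)
    return warehouse


def elt(source):
    """ELT: store raw records unchanged, and transform them only when queried."""
    data_lake = list(source)

    def view():
        return [row for row in map(transform, data_lake) if row is not None]

    return data_lake, view


dataset = etl(RAW_EVENTS)
data_lake, view = elt(RAW_EVENTS)
```

With ETL the malformed record is gone once loading finishes. With ELT it remains in the data lake, so a different transformation can be applied later without going back to the source.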

Scale-Up and Scale-Out Training

Data scientists typically start by developing a model on a local notebook, but it is not feasible to train most deep learning models on a local workstation. A key capability of a cloud deep learning service is the ability to integrate with notebooks and push training jobs seamlessly to cloud-based compute instances.

Evaluate how easy it is to run training jobs on hardware like GPUs, TPUs, and FPGAs, to manage these jobs across data science teams, and to visualize and interpret their results.
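As a rough illustration of what scale-out training means, the sketch below simulates synchronous data-parallel training in plain Python: each worker computes a gradient on its own shard of the data, and the averaged gradient updates a shared weight. Data parallelism is only one common strategy and is an assumption here; the one-parameter model, the dataset, and the learning rate are hypothetical.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def split(data, workers):
    """Split data into equal-size shards (assumes len(data) divides evenly)."""
    size = len(data) // workers
    return [data[i * size:(i + 1) * size] for i in range(workers)]


def data_parallel_step(w, data, workers, lr):
    """One synchronous step: per-shard gradients are averaged, then applied."""
    shards = split(data, workers)
    # In a real cluster each gradient would be computed on a separate machine.
    grads = [gradient(w, shard) for shard in shards]
    return w - lr * sum(grads) / len(grads)


# Hypothetical dataset generated by y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, data, workers=2, lr=0.05)
```

With equal-size shards, the averaged gradient equals the gradient over the full dataset, so the result matches single-machine training while the per-worker computation shrinks.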

Deep Learning Frameworks Support

Each cloud machine learning service supports different frameworks. You can typically get the broadest framework support in an IaaS model, when deploying deep learning directly on compute instances. However, if you use a full MLOps platform, you will be limited to the frameworks it supports.

Look for support of the following frameworks, which your data science team may need to use now or in the future:

Deep learning frameworks — TensorFlow, PyTorch, Keras, MXNet, Deep Java Library
Classical machine learning — Scikit-learn, R, Spark MLlib, H2O.ai, Java-ML
Job scheduling and distribution — Horovod, Kubernetes, Slurm, LSF

Also evaluate the ability to integrate your own code and algorithms with the platform’s library of built-in algorithms. This can improve productivity, because you can draw on existing building blocks and only develop unique aspects of your model.

Pre-Tuned AI Services

Most cloud platforms provide pre-trained, pre-optimized AI services for many applications including:

Image classification
Object recognition
Video data extraction
Language translation
Speech synthesis
Recommendation engines

The advantage of these types of services is that they have been trained on massive data volumes that are not available to individual companies. They can provide very high accuracy for general use cases, and provide excellent performance and low latency in production. Best of all, they are ready to use out of the box.

Monitor Prediction Performance

Deploying a model is only the start, not the end point, of your AI journey. Data changes and user requirements change, and it is essential to monitor a model’s performance over time, tune it, augment it, and if necessary, replace it. Evaluate the tools a cloud service provides for monitoring model performance when it is already in production, and how easy it is to release updates and improvements to live deep learning models.
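One simple way to watch a production model is to track its accuracy over a sliding window of recent predictions whose true labels have since become known. The class below is a minimal sketch of that idea; the class name, window size, and threshold are hypothetical, and managed cloud services offer richer monitoring than this.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window, threshold):
        self.outcomes = deque(maxlen=window)  # old outcomes drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether a prediction matched the label that arrived later."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Fraction of correct predictions in the window, or None if empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True once the window is full and accuracy is below the threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [("cat", "cat"), ("dog", "dog"), ("cat", "cat"), ("dog", "dog")]:
    monitor.record(predicted, actual)
healthy = monitor.needs_attention()   # window full, accuracy 1.0

for predicted, actual in [("cat", "dog"), ("dog", "cat")]:
    monitor.record(predicted, actual)
degraded = monitor.needs_attention()  # window now holds 2 correct, 2 wrong
```

A flag like `needs_attention()` would typically trigger retraining or a rollback, which is why the ease of releasing updates matters as much as the monitoring itself.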

Deep Learning in the Cloud with MLOps Innovation

Some innovative MLOps solutions automate resource management and orchestration for machine learning infrastructure. You can automatically run as many compute intensive experiments as needed.

For example, an AI orchestration platform for GPU-based computers running AI/ML workloads provides:

Advanced queueing and fair scheduling to allow users to easily and automatically share clusters of GPUs.
Distributed training on multiple GPU nodes to accelerate model training times.
Fractional GPUs to seamlessly run multiple workloads on a single GPU of any type.
Visibility into workloads and resource utilization to improve user productivity.

While this solution is not meant for cost reduction (as it won’t necessarily result in decreasing the number of on-prem GPUs) it helps utilize the full capacity of existing GPUs. This simplifies machine learning infrastructure orchestration, helping data scientists accelerate their productivity and the quality of their models.
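The fair scheduling idea in the list above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular product: the policy (always serve the user who has been granted the fewest jobs so far), the function name, and the job names are all assumptions.

```python
from collections import deque


def fair_schedule(queues, slots):
    """Pick up to `slots` jobs, always serving the user with the fewest jobs granted.

    queues: dict mapping user name -> list of job names in submission order.
    Ties are broken by user name so the result is deterministic.
    """
    pending = {user: deque(jobs) for user, jobs in queues.items()}
    granted = {user: 0 for user in queues}
    order = []
    while len(order) < slots:
        candidates = [user for user in pending if pending[user]]
        if not candidates:
            break  # every queue is empty
        user = min(candidates, key=lambda u: (granted[u], u))
        order.append((user, pending[user].popleft()))
        granted[user] += 1
    return order


# Hypothetical users and jobs competing for five GPU slots.
queues = {
    "alice": ["a1", "a2", "a3", "a4"],
    "bob": ["b1"],
    "carol": ["c1", "c2"],
}
schedule = fair_schedule(queues, slots=5)
```

Although alice submitted the most jobs, she receives only two of the five slots, so bob and carol are not starved by a single heavy user.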

Contact GlobalDots today to get the most out of your resources with today’s latest MLOps solutions.

Originally published by our friends at Run:AI
