Choosing the right cloud instance for training Deep Learning models. Part 1


We at RocketCompute regularly optimize cloud HPC configurations for different workloads (e.g. this case for weather prediction). It is quite common for us to see price or performance vary by as much as 4x-5x across different HPC cluster configurations. This variation usually depends heavily on network latency within the cluster and on the unique requirements of a workload (algorithm, software stack, RAM and storage requirements, etc.).

When we started looking at deep learning workloads, we discovered that, most of the time, our colleagues use one of the following rules of thumb when picking cloud configurations for training DL models:

We decided not to take these rules for granted and instead followed the routine we usually apply when looking for the optimal configuration for a computationally intensive workload. (Spoiler: it turns out that DL training jobs show the same 4x variation in price/performance across cloud configurations, and the variation is task-specific.)

If you prefer fewer words and more numbers, you can check out the formal white paper our team put together on the topic here.


The standard approach the RocketCompute team employs for these tasks is to put together a benchmark that approximates the target workload and can be run quickly on different configurations. The resulting set of benchmark durations and costs shows the most efficient and fastest setups, while key hardware performance metrics (e.g. CPU/GPU/RAM utilization) can shed some light on why certain instances are better than others.

For benchmarks, we started with reference implementations from the MLPerf training benchmark set (you can find detailed info here). Out of the 8 different neural network implementations available, we picked one model for each of the most widely used domains. The resulting set is:

We then fixed the number of training steps to reduce benchmark time to 15–45 minutes on the instance with one V100 GPU. For the BERT model, we also reduced the dimensionality of the neural net from BERT Large to BERT Base to fit the model into GPUs with less than 8Gb of available video memory.
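Capping the step count rather than training to convergence keeps runs comparable across instances. A minimal sketch of such a guard in a training loop (the `train_step` stub and `MAX_STEPS` value are illustrative, not the actual MLPerf harness):

```python
MAX_STEPS = 1000  # illustrative cap; in our case chosen so a single-V100 run takes 15-45 min

def train_step(batch):
    """Placeholder for the real forward/backward/optimizer update."""
    pass

def train(batches, max_steps=MAX_STEPS):
    """Run training for at most max_steps batches, regardless of epoch boundaries."""
    step = 0
    for batch in batches:
        if step >= max_steps:
            break
        train_step(batch)
        step += 1
    return step
```

The same cap is applied on every instance type, so wall-clock differences reflect hardware, not workload size.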

The experiment was run on all single-GPU AWS instance types.

Key Results

The table below shows, in relative terms, how long each benchmark ran and how much each run cost. All numbers for a particular network architecture are normalized to the fastest (or cheapest) instance for that benchmark. For example, for BERT the cheapest run was on the g4dn.xlarge instance, while running the same benchmark on g4dn.8xlarge cost four times as much. Similarly, the fastest BERT run was on the p3.2xlarge instance, while p2.xlarge took 4.6x longer to finish the same job.
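The normalization described above is straightforward to reproduce; a small sketch (the durations and costs below are made-up placeholders, not our measured values):

```python
# Illustrative per-instance results; metric values are placeholders.
runs = {
    "g4dn.xlarge":  {"duration": 120, "cost": 1.00},
    "g4dn.8xlarge": {"duration": 60,  "cost": 4.00},
    "p3.2xlarge":   {"duration": 26,  "cost": 2.50},
}

def normalize(runs, metric):
    """Express each instance's metric relative to the best (smallest) value."""
    best = min(r[metric] for r in runs.values())
    return {name: r[metric] / best for name, r in runs.items()}

rel_cost = normalize(runs, "cost")      # cheapest instance -> 1.0
rel_time = normalize(runs, "duration")  # fastest instance -> 1.0
```

Normalizing per architecture (rather than globally) is what lets the table compare instances within each benchmark.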

(*) estimated duration. The actual test was stopped after 300 mins (all other runs took 15–120 mins).

Now let’s look carefully at these numbers:

Looks odd, doesn't it? Let us drill down into the hardware metrics to understand what makes the DLRM benchmark so special.

Let us start with memory usage. Below are graphs showing how much RAM is utilized over the whole model-training process for three instances with the same GPU but different RAM and vCPU counts.
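Charts like these can be produced by periodically sampling `/proc/meminfo`. A minimal parsing sketch; the split into "used" and "cache" below is our assumption about what the green and yellow areas represent:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields

def used_and_cached_kb(fields):
    """Rough split: application-used RAM vs kernel page cache, in kB."""
    cached = fields["Cached"] + fields["Buffers"]
    used = fields["MemTotal"] - fields["MemFree"] - cached
    return used, cached
```

Sampling these two numbers once a second over a training run gives exactly the kind of stacked area chart shown here.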

g4dn.2xlarge (Nvidia T4, 32 Gb of RAM, 8 vCPUs)
g4dn.4xlarge (Nvidia T4, 64 Gb of RAM, 16 vCPUs)
g4dn.8xlarge (Nvidia T4, 128 Gb of RAM, 32 vCPUs)

Please ignore the yellow area for now; it shows how much RAM the Linux kernel is using to cache IO operations. Focus on the green area, which shows how much memory is used by the model. As you can see, the DLRM benchmark requires a minimum of 42Gb of RAM to hold all required data in memory. According to the original DLRM paper (link), all this space is used to convert categorical features into vector representations that are fed into the neural net. When there is not enough memory, the kernel has to dump part of the data to disk (using memory swap). Given that disk is at least an order of magnitude slower than RAM, it is no wonder that low-RAM instances perform so badly.
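A back-of-the-envelope estimate of the embedding-table footprint shows why the requirement is so large. The cardinalities and embedding dimension below are hypothetical round numbers, not the actual MLPerf/DLRM dataset values:

```python
def embedding_bytes(cardinalities, dim, dtype_bytes=4):
    """Memory needed to hold one fp32 embedding row per category value."""
    return sum(cardinalities) * dim * dtype_bytes

# Hypothetical DLRM-like setup: 26 categorical features with ~300M total
# category values and 32-dim embeddings already lands in the tens of Gb,
# before counting activations, gradients, or the dataset itself.
total = embedding_bytes([300_000_000 // 26] * 26, dim=32)
total_gib = total / 2**30
```

The tables are looked up on the CPU side in the reference implementation, so this memory has to live in RAM, not on the GPU.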

Now let us look at GPU utilization charts for DLRM benchmark across the whole process of model training.

You can clearly see that the more powerful the GPU, the less it is utilized (see max gpu_utilisation at the bottom of each chart). This means that the system (or potentially the training algorithm) is not capable of providing the GPU with enough work to keep it busy, and the problem gets worse the beefier the GPU.
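Utilization figures like these can be collected by polling `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`, which emits one bare percentage per line. A sketch summarizing such samples (the sample values are invented):

```python
def summarize_utilization(samples):
    """Max and mean GPU utilization (%) from a list of sampled readings."""
    values = [int(s) for s in samples]
    return max(values), sum(values) / len(values)

# Illustrative polled readings, one per sampling interval.
raw = ["12", "34", "7", "90", "55"]
peak, mean = summarize_utilization(raw)
```

A low peak utilization on a big GPU, as in these charts, is the signature of an input pipeline that cannot keep up.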

Another interesting thing here is the chart for the M60. It is a perfect illustration of what a lack of RAM looks like on a GPU utilization chart. The effect is even clearer if we compare the GPU utilization charts for g4dn.2xlarge, where swap space is actively used during the DLRM benchmark, with g4dn.4xlarge, where all the data fits into memory.

g4dn.2xlarge (32Gb of RAM, not enough RAM, swap is actively used)
g4dn.4xlarge (64Gb of RAM, swap is not used)

The systems have exactly the same GPUs and the same CPU family (though a different number of vCPUs, which is discussed later), and are most probably sliced out of the same physical server. A shortfall of 10Gb of RAM added 40 minutes to the benchmark and reduced max GPU utilization.

Finally, let us check CPU utilization charts.

p2.xlarge (Nvidia K80, 4 vCPU, 61Gb of RAM)
p3.2xlarge (Nvidia V100, 8 vCPU, 61Gb of RAM)

You can see that only one core is used during training (25% of 4 vCPUs and 13% of 8 vCPUs), i.e. all data pre-processing before sending data to the GPU is done by a single core. It looks like we have found another bottleneck.
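A common fix for a single-core preprocessing bottleneck is to spread the per-sample work across processes (in PyTorch, this is what `DataLoader(num_workers=N)` does under the hood). A framework-free sketch with the standard library; `preprocess` here is a trivial stand-in for real per-sample work:

```python
from multiprocessing import Pool

def preprocess(record):
    """Stand-in for CPU-heavy per-sample work (decoding, feature hashing, ...)."""
    return record * 2

def preprocess_all(records, workers=4):
    # A single process pins one core; a worker pool can use all available vCPUs.
    with Pool(processes=workers) as pool:
        return pool.map(preprocess, records)
```

Whether this helps depends on the implementation: the MLPerf reference code we benchmarked evidently did not parallelize this stage, which is why extra vCPUs went unused.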

Now we have a clear picture of what happened with the DLRM benchmark and why it performed so differently from the other two:


As you can see, even the simplest experiment shows that choosing an instance for training a Deep Learning model is not an obvious task, even when you vary only a handful of instance features (such as the number of vCPUs, the amount of RAM, or the GPU architecture). The best instance will vary depending on the model architecture, the code implementation, and the amount of data you need to process.

We have barely scratched the surface of the different aspects of training infrastructure that can drive performance. We haven't mentioned the software stack at all (especially the effects of different drivers and CUDA libraries), nor distributed training on multi-GPU nodes and multi-node GPU clusters. Each of these topics deserves a separate post and, hopefully, will be covered in future parts of this series.

Egor Bykov