Choosing the right cloud instance for training Deep Learning models. Part 1
We at RocketCompute regularly optimize cloud HPC configurations for different workloads (e.g. this case for weather prediction). It is quite common for us to see the price or performance variation as wide as 4x-5x for different HPC cluster configurations. This variation usually heavily depends on network latency within the cluster and unique requirements of a workload (algorithm, software stack, RAM and storage requirements, etc.)
When we started looking at deep learning workloads we discovered that our colleagues most of the time use one of the following rules of thumb while picking cloud configurations for training DL models:
- The cheapest instance in case of the tight budget
- Instance with the most modern GPU available in case training time should be as low as possible
We decided not to take it for granted and do our common routine that we usually do when we are looking for optimal configuration for computationally intensive workloads. (Spoiler: it appears that DL training jobs show the same 4x variation in terms of price/performance on different cloud configurations and variation is task-specific)
If you prefer fewer words and more numbers you can check out a formal white paper that our team put together on the topic here.
The standard approach that the RocketCompute team employs for these tasks is to put together a benchmark that approximates the target workload and can be quickly run on different configurations. The resulting set of benchmarks duration/cost shows the most efficient and fastest set-ups while key hardware performance metrics (e.g. CPU/GPU/RAM utilization, etc.) can shed some light on why certain instances are better than the others.
For benchmarks, we started with reference implementations of the MLPerf training benchmarks set (you can find detailed info here). Out of 8 different neural network implementations available, we picked up 1 model for the most widely used domains. The resulting set is:
- For image processing — object detection Mask R-CNN model trained on COCO dataset
- For natural language processing — BERT model trained on Wikipedia dump
- For recommendations — DLRM model trained on 1Tb Kaggle AdDisplay Challenge dataset
We then fixed the number of training steps to reduce benchmark time to 15–45 minutes on the instance with one V100 GPU. For the BERT model, we also reduced the dimensionality of the neural net from BERT Large to BERT Base to fit the model into GPUs with less than 8Gb of available video memory.
The experiment was done using all AWS single GPU instances.
Below table shows (in relative terms) how long each benchmark run and how much each run costs. All numbers for a particular network architecture are normalized to the fastest (or cheapest) instance for this benchmark. For example, for BERT the cheapest run was on g4dn.xlarge instance while it costs four times more money to run the same benchmark on g4dn.8xlarge. Similarly, the fastest run for BERT was on p3.2xlarge instance while it took p2.xlarge 4.6x more time to finish the same job.
Now let’s look carefully at these numbers:
- The cheapest instance for BERT and Mask R-CNN appears the most expensive for DLRM (g4dn.xlarge with T4)
- The most expensive instance for Mask R-CNN appears the cheapest for DLRM (p2.xlarge with a half (!) of a dual GPU K80, the oldest GPU available on AWS)
- It appears that for DLRM you can choose a very well-balanced instance both in terms of price and training time (g4dn.4xlarge) and it will be very close to fastest (only 4% longer) and cheapest (only 2% more expensive). And the same instance would be around 2.5x slower and pricier for other architectures
- Generally, the instance with the best training time never gave the best training price and vice versa
Looks odd, isn’t it? Let us drill down into hardware metrics to understand what makes the DLRM benchmark so special.
Let us start with memory usage. Below are the graphs showing how much RAM is utilized during the whole model training process for three instances with the same hardware but different RAM and vCPU counts.
Please ignore the yellow area for now, it shows how much RAM Linux kernel is using to cash IO operations. Focus on the green area, this is how much memory is used by the model. As you can see DLRM benchmark requires 42Gb of RAM minimum to put all required data into memory. According to the original DLRM paper (link) all this space is used to convert categorical features into vector representations to feed them into the neural net. In case when memory is not enough the kernel has to dump part of the data on disk (using memory swap). Given that disk is at least by an order of magnitude slower than RAM, no wonder low RAM instances perform so badly.
Now let us look at GPU utilization charts for DLRM benchmark across the whole process of model training.
You can clearly see that the more powerful GPU we use the less it is being utilized (see max gpu_utilisation at the bottom of each chart). This means that the system (or potentially the training algorithm) is not capable of providing GPU with enough work to keep it busy and it becomes even worse the beefier GPU we use.
Another interesting thing here is the chart for the M60. This is a perfect illustration of how lack of RAM looks like on GPU utilization charts. It can be illustrated even better if we compare GPU utilization charts for g4dn.2xlarge where SWAP space is actively used for DLRM benchmark with g4dn.4xlarge where all the data fits into memory.
Systems have exactly the same GPUs, same CPU family (though different number of vCPUs, but this is discussed later), most probably sliced out of the same physical server. Lack of 10Gb of RAM contributed additional 40 minutes to the benchmark and reduced max GPU utilization.
Finally, let us check CPU utilization charts.
You can see that only one core is used during the training (25% out of 4 vCPUs and 13% out of 8 vCPUs), i.e. all data pre-processing before sending it to GPU is done by a single core. Looks like we found another bottleneck.
Now we have a clear picture of what had happened with DLRM benchmark and why it performed so differently from the other two:
- Significant RAM requirements imposed a harsh handicap for instances with available RAM of less than required. If you look at instances with T4 GPU, the performance is higher for the higher amount of RAM (for the same GPU in the system)
- Single-threaded training script implementation caused an additional bottleneck in data throughput making the number of vCPU in the system irrelevant and giving performance advantage to instances with CPUs showing the best single-core performance (in our case it is g4dn instances with more modern Xeon CPU than others)
As you can see, even the simplest experiment can show that choosing an instance for training a Deep Learning model is not an obvious task even when you vary only a handful of instance features (like the number of vCPU, RAM, GPU architecture). The best instance will vary depending on the model architecture, code implementation, and amount of data you need to process.
We just barely scratched the surface of different aspects of the training infrastructure and how it can drive performance. We haven’t mentioned software stack at all (especially effects from different drivers and CUDA libraries) as well as distributed training on multi-GPU nodes and multi-node GPU clusters. Each of these topics deserves a separate post and, hopefully, will be covered in future parts of this series.