Choosing the right cloud instance for training Deep Learning models. Part 1

Intro

We at RocketCompute regularly optimize cloud HPC configurations for different workloads (e.g. this case for weather prediction). It is quite common for us to see price or performance vary as much as 4x–5x across different HPC cluster configurations. This variation usually depends heavily on network latency within the cluster and on the unique requirements of the workload (algorithm, software stack, RAM and storage requirements, etc.).

  • An instance with the most modern GPU available, for cases where training time has to be as low as possible

Approach

The standard approach the RocketCompute team employs for these tasks is to put together a benchmark that approximates the target workload and can be run quickly on different configurations. The resulting set of benchmark durations and costs shows the most efficient and the fastest set-ups, while key hardware performance metrics (e.g. CPU/GPU/RAM utilization) can shed some light on why certain instances perform better than others.
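To make the idea concrete, below is a minimal sketch of the kind of metrics collection we mean. It assumes a Linux instance with psutil installed and the nvidia-smi CLI available; the sampling interval and CSV output are illustrative choices, not part of our actual harness.

```python
# Minimal sketch: sample CPU/RAM/swap/GPU utilization while a benchmark runs.
# Assumes psutil is installed and nvidia-smi is present on the instance;
# this is an illustration, not a production monitoring harness.
import csv
import subprocess
import time

import psutil


def gpu_utilization() -> float:
    """Average GPU utilization in percent, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    values = [float(line) for line in out.splitlines() if line.strip()]
    return sum(values) / len(values) if values else 0.0


def sample_metrics(duration_s: int = 3600, interval_s: float = 5.0,
                   out_path: str = "metrics.csv") -> None:
    """Poll utilization every interval_s seconds and write rows to a CSV."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["t_sec", "cpu_pct", "ram_pct", "swap_pct", "gpu_pct"])
        start = time.time()
        while time.time() - start < duration_s:
            writer.writerow([
                round(time.time() - start, 1),
                psutil.cpu_percent(interval=None),
                psutil.virtual_memory().percent,
                psutil.swap_memory().percent,
                gpu_utilization(),
            ])
            f.flush()
            time.sleep(interval_s)


if __name__ == "__main__":
    sample_metrics()
```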

For this experiment we built the benchmarks around well-known reference models:
  • For natural language processing — a BERT model trained on a Wikipedia dump
  • For recommendations — a DLRM model trained on the 1 TB Kaggle AdDisplay Challenge dataset
  • For computer vision — a Mask R-CNN model

Key Results

The table below shows (in relative terms) how long each benchmark run took and how much each run cost. All numbers for a particular network architecture are normalized to the fastest (or cheapest) instance for that benchmark. For example, for BERT the cheapest run was on the g4dn.xlarge instance, while running the same benchmark on g4dn.8xlarge cost four times as much. Similarly, the fastest BERT run was on the p3.2xlarge instance, while p2.xlarge took 4.6x longer to finish the same job.

(*) estimated duration. Actual test was stopped after 300 mins (all other runs took 15–120 mins)
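The normalization itself is straightforward to reproduce. The sketch below assumes you already have per-instance wall-clock durations and on-demand hourly prices; the duration numbers are placeholders for illustration, not our measured results.

```python
# Sketch of the normalization used in the table above: duration and cost for
# each instance are divided by the best (smallest) value for that benchmark.
# Durations below are placeholders; prices are illustrative on-demand rates.
hourly_price_usd = {
    "g4dn.xlarge": 0.526,
    "g4dn.8xlarge": 2.176,
    "p2.xlarge": 0.900,
    "p3.2xlarge": 3.060,
}

duration_min = {  # placeholder wall-clock times for a single benchmark
    "g4dn.xlarge": 95.0,
    "g4dn.8xlarge": 80.0,
    "p2.xlarge": 300.0,
    "p3.2xlarge": 25.0,
}

cost_usd = {n: t / 60.0 * hourly_price_usd[n] for n, t in duration_min.items()}
fastest = min(duration_min.values())
cheapest = min(cost_usd.values())

for name in duration_min:
    print(f"{name:>13}: duration {duration_min[name] / fastest:.1f}x of fastest, "
          f"cost {cost_usd[name] / cheapest:.1f}x of cheapest")
```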
Three observations stand out:
  1. The most expensive instance for Mask R-CNN turns out to be the cheapest for DLRM (p2.xlarge, with half (!) of a dual-GPU K80, the oldest GPU available on AWS)
  2. For DLRM you can choose an instance that is well balanced in both price and training time (g4dn.4xlarge): it is very close to the fastest (only 4% longer) and to the cheapest (only 2% more expensive). The same instance would be around 2.5x slower and pricier for the other architectures
  3. Generally, the instance with the best training time never gave the best training price, and vice versa
Instance configurations referenced in the hardware metrics charts:
  • g4dn.2xlarge (Nvidia T4, 32 GB of RAM, 8 vCPUs): not enough RAM, swap is actively used
  • g4dn.4xlarge (Nvidia T4, 64 GB of RAM, 16 vCPUs): swap is not used
  • g4dn.8xlarge (Nvidia T4, 128 GB of RAM, 32 vCPUs)
  • p2.xlarge (Nvidia K80, 61 GB of RAM, 4 vCPUs)
  • p3.2xlarge (Nvidia V100, 61 GB of RAM, 8 vCPUs)
  • The single-threaded training script implementation caused an additional bottleneck in data throughput, making the number of vCPUs in the system irrelevant and giving a performance advantage to instances whose CPUs have the best single-core performance (in our case the g4dn instances, which carry a more modern Xeon CPU than the others); see the sketch below
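To make that last point concrete, the usual remedy is to move data loading off the single training thread. Below is a hedged sketch using PyTorch's DataLoader with worker processes; the dataset and batch size are made up for illustration and are not taken from our benchmark code.

```python
# Sketch: parallel data loading with PyTorch so the GPU is not starved by a
# single-threaded input pipeline. The dataset and batch size are illustrative.
import torch
from torch.utils.data import DataLoader, Dataset


class PlaceholderDataset(Dataset):
    """Stands in for real CPU-heavy decoding/preprocessing work."""

    def __init__(self, n_samples: int = 10_000, dim: int = 512):
        self.n_samples, self.dim = n_samples, dim

    def __len__(self) -> int:
        return self.n_samples

    def __getitem__(self, idx: int) -> torch.Tensor:
        # In a real pipeline this is where per-sample preprocessing runs.
        return torch.randn(self.dim)


if __name__ == "__main__":
    # num_workers > 0 spreads __getitem__ calls across CPU processes, so extra
    # vCPUs actually help; num_workers=0 reproduces the single-threaded
    # bottleneck described above.
    loader = DataLoader(
        PlaceholderDataset(),
        batch_size=256,
        num_workers=8,     # tune to the number of vCPUs on the instance
        pin_memory=True,   # speeds up host-to-GPU copies
    )
    for batch in loader:
        pass  # the training step would go here
```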

Conclusion

As you can see, even the simplest experiment shows that choosing an instance for training a Deep Learning model is not an obvious task, even when you vary only a handful of instance features (the number of vCPUs, the amount of RAM, the GPU architecture). The best instance will vary depending on the model architecture, the code implementation, and the amount of data you need to process.