Choosing the right cloud instance for training Deep Learning models. Part 1


  • The cheapest instance, for a tight budget
  • The instance with the most modern GPU available, for when training time must be as short as possible


  • For image processing: a Mask R-CNN object detection model trained on the COCO dataset
  • For natural language processing: a BERT model trained on a Wikipedia dump
  • For recommendations: a DLRM model trained on the 1 TB Kaggle AdDisplay Challenge dataset

Key Results

(*) Estimated duration. The actual test was stopped after 300 minutes (all other runs took 15–120 minutes).
  1. The cheapest instance for BERT and Mask R-CNN (g4dn.xlarge with a T4) turns out to be the most expensive for DLRM.
  2. The most expensive instance for Mask R-CNN turns out to be the cheapest for DLRM: p2.xlarge, with half (!) of a dual-GPU K80, the oldest GPU available on AWS.
  3. For DLRM, you can choose an instance that is well balanced in both price and training time (g4dn.4xlarge): it comes very close to the fastest (only 4% longer) and to the cheapest (only 2% more expensive). For the other architectures, the same instance would be around 2.5x slower and pricier.
  4. In general, the instance with the best training time never had the lowest training cost, and vice versa.
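
The comparisons above come down to simple arithmetic: training cost is the hourly price multiplied by the training time, and each instance can be expressed as a percentage overhead against the fastest and the cheapest run. Below is a minimal sketch of that calculation. The `Run` class, the `overhead` and `summarize` helpers, the instance names, and all numbers are hypothetical placeholders for illustration; they are not the measured results from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    instance: str          # instance type name (placeholder)
    price_per_hour: float  # on-demand price in USD per hour (placeholder)
    minutes: float         # training time in minutes (placeholder)

    @property
    def cost(self) -> float:
        """Total training cost: hourly price times duration in hours."""
        return self.price_per_hour * self.minutes / 60.0


def overhead(value: float, best: float) -> float:
    """Relative overhead versus the best value; 0.04 means 4% worse."""
    return value / best - 1.0


def summarize(runs):
    """Return the fastest run, the cheapest run, and per-instance overheads."""
    fastest = min(runs, key=lambda r: r.minutes)
    cheapest = min(runs, key=lambda r: r.cost)
    rows = [
        (r.instance, r.cost,
         overhead(r.minutes, fastest.minutes),
         overhead(r.cost, cheapest.cost))
        for r in runs
    ]
    return fastest, cheapest, rows


# Placeholder numbers for illustration only; NOT the results of this article.
runs = [
    Run("instance-a", price_per_hour=1.00, minutes=60),
    Run("instance-b", price_per_hour=3.00, minutes=30),
    Run("instance-c", price_per_hour=1.50, minutes=45),
]

fastest, cheapest, rows = summarize(runs)
for name, cost, time_over, cost_over in rows:
    print(f"{name}: ${cost:.2f}, {time_over:+.0%} time, {cost_over:+.0%} cost")
```

With these placeholder values the fastest run and the cheapest run are different instances, which is the same pattern described in point 4.
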
g4dn.2xlarge (Nvidia T4, 32 GB of RAM, 8 vCPUs)
g4dn.4xlarge (Nvidia T4, 64 GB of RAM, 16 vCPUs)
g4dn.8xlarge (Nvidia T4, 128 GB of RAM, 32 vCPUs)
g4dn.2xlarge (32 GB of RAM, not enough RAM, swap is actively used)
g4dn.4xlarge (64 GB of RAM, swap is not used)
p2.xlarge (Nvidia K80, 4 vCPUs, 61 GB of RAM)
p3.2xlarge (Nvidia V100, 8 vCPUs, 61 GB of RAM)
  • Significant RAM requirements put instances with less RAM than required at a severe disadvantage. Among the instances with a T4 GPU, performance increases with the amount of RAM, even though the GPU is the same.
  • The single-threaded implementation of the training script created an additional bottleneck in data throughput. This made the number of vCPUs irrelevant and gave a performance advantage to instances whose CPUs have the best single-core performance (in our case, the g4dn instances, which have a more modern Xeon CPU than the others).
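
The article does not say how swap usage was observed. One possible way to check it on a Linux instance is to read the `SwapTotal` and `SwapFree` fields of `/proc/meminfo` while training is running. The sketch below assumes a Linux host; the helper names (`parse_meminfo`, `swap_used_kib`, `read_meminfo`) and the sample snapshot are made up for illustration.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def swap_used_kib(info: dict) -> int:
    """Swap currently in use: SwapTotal minus SwapFree."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and parse the live memory counters (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


# Made-up snapshot for illustration; on a real instance call read_meminfo().
SAMPLE = """MemTotal:       32000000 kB
MemAvailable:     500000 kB
SwapTotal:       8000000 kB
SwapFree:        6000000 kB
"""

info = parse_meminfo(SAMPLE)
used = swap_used_kib(info)
print(f"swap in use: {used} kB, available RAM: {info['MemAvailable']} kB")
if used > 0:
    print("swap is in use; the workload may not fit in RAM")
```

Polling `read_meminfo()` periodically during a training run would show whether swap usage grows, which is the situation described for the 32 GB instance.
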




