Choosing the right cloud instance for training Deep Learning models. Part 1

Intro

Approach

Key Results

(*) Estimated duration. The actual test was stopped after 300 min (all other runs took 15–120 min).
g4dn.2xlarge (Nvidia T4, 32 GB of RAM, 8 vCPUs)
g4dn.4xlarge (Nvidia T4, 64 GB of RAM, 16 vCPUs)
g4dn.8xlarge (Nvidia T4, 128 GB of RAM, 32 vCPUs)
g4dn.2xlarge (32 GB of RAM is not enough: swap is actively used)
g4dn.4xlarge (64 GB of RAM is enough: swap is not used)
p2.xlarge (Nvidia K80, 4 vCPUs, 61 GB of RAM)
p3.2xlarge (Nvidia V100, 8 vCPUs, 61 GB of RAM)
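
The g4dn.2xlarge and g4dn.4xlarge captions above distinguish a run that spills into swap from one that does not. On a Linux instance, one way to check this during training is to read `/proc/meminfo`. The sketch below is only an illustration, not the monitoring method used for the tests in this article: the function names and the sample values are hypothetical and are not measurements from these runs.

```python
def swap_usage_from_meminfo(meminfo_text: str) -> dict:
    """Parse /proc/meminfo-style text and return swap figures in kB."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    used = total - free
    return {
        "total_kb": total,
        "used_kb": used,
        # Guard against instances with no swap configured at all.
        "used_fraction": used / total if total else 0.0,
    }


def read_swap_usage(path: str = "/proc/meminfo") -> dict:
    """Read swap usage from the live system (Linux only)."""
    with open(path) as fh:
        return swap_usage_from_meminfo(fh.read())


# Made-up sample in the /proc/meminfo format, for illustration only.
SAMPLE = """MemTotal:       32000000 kB
MemFree:          500000 kB
SwapTotal:       8000000 kB
SwapFree:        6000000 kB
"""

if __name__ == "__main__":
    print(swap_usage_from_meminfo(SAMPLE))
```

Calling `read_swap_usage()` periodically while a training job runs would show whether `used_kb` keeps growing, which is the symptom described for the 32 GB instance.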

Conclusion