Optimal NVIDIA Stack for training Mask R-CNN

RocketCompute Dev Team
Feb 26, 2021



Containerization technology has greatly simplified GPU computing for machine learning by wrapping the software stack above the kernel level in containers, making it easy to juggle different combinations of frameworks, low-level libraries, and hardware drivers. Technologies like nvidia-docker have even unlocked new stack combinations (driver plus CUDA toolkit) that were not feasible before. This, however, adds another dimension to the performance optimization problem: you now need not only to choose the optimal hardware for your machine learning task but also to vary the driver/CUDA combination to fine-tune performance further.

The goal of this short article is to illustrate that such optimization is not only feasible but also makes sense.


The exercise is very similar to what has been done in my first article:

Workload: Mask R-CNN model trained on the COCO dataset. I used the reference implementation from the MLCommons set. It is built on PyTorch 1.0.1, which supports both CUDA v.9.0 and v.10.0, so you can switch between the two versions without changing a line of code. I tweaked the number of training steps to reduce the benchmark time to 20–90 minutes, depending on the GPU.

Infrastructure: I took the cheapest single-GPU instance for each GPU family available on AWS. There were four of them: p2.xlarge, g3s.xlarge, p3.2xlarge and g4dn.xlarge.

Stack: To install different driver versions, I used the approach described in the previous article. The framework and the corresponding CUDA toolkit were packaged in a Docker image together with the benchmark code.

For each set of {GPU instance, driver, CUDA} I did at least three runs.
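The resulting experiment grid can be sketched in a few lines of Python. This is only an illustration, not the tooling used for the benchmark: the driver list contains just the versions named in this article (the actual sweep may have covered more), the names `build_run_plan` and `INCOMPATIBLE` are hypothetical, and the only exclusion encoded is the T4/v.384 incompatibility noted further down.

```python
from itertools import product

# Instance types and GPUs come from the article. The driver list holds only
# the versions the article names explicitly (an assumption about the sweep).
INSTANCES = {
    "p2.xlarge": "K80",
    "g3s.xlarge": "M60",
    "p3.2xlarge": "V100",
    "g4dn.xlarge": "T4",
}
DRIVERS = ["384", "410", "418"]
CUDA_VERSIONS = ["9.0", "10.0"]
RUNS_PER_CONFIG = 3  # "at least three runs" per combination

# GPU/driver pairs that cannot run at all. Only the case mentioned in the
# article is listed; other compatibility rules are not modeled here.
INCOMPATIBLE = {("T4", "384")}


def build_run_plan():
    """Return one dict per benchmark run, skipping incompatible pairs."""
    plan = []
    for (instance, gpu), driver, cuda in product(
        INSTANCES.items(), DRIVERS, CUDA_VERSIONS
    ):
        if (gpu, driver) in INCOMPATIBLE:
            continue
        for run in range(1, RUNS_PER_CONFIG + 1):
            plan.append(
                {
                    "instance": instance,
                    "gpu": gpu,
                    "driver": driver,
                    "cuda": cuda,
                    "run": run,
                }
            )
    return plan


plan = build_run_plan()
print(len(plan), "runs planned")
```

Even with this reduced grid the number of runs adds up quickly, which is why the training steps were shortened.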


The figures below show how the benchmark runtime in seconds (Y axis) depends on the device driver version (X axis) for each GPU. Each curve corresponds to one CUDA toolkit version (red for v9.0 and blue for v10.0). Each dot on a graph is the average runtime, and the vertical interval shows the standard error around this mean value. Each point aggregates at least 3 independent benchmark runs.
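For reference, the mean and standard error behind each point can be computed as in the minimal sketch below. The runtimes are hypothetical, the function name `summarize_runs` is illustrative, and the sketch assumes the standard error is the sample standard deviation divided by the square root of the number of runs.

```python
import math
import statistics


def summarize_runs(runtimes_s):
    """Return (mean, standard error of the mean) for one configuration."""
    n = len(runtimes_s)
    if n < 2:
        raise ValueError("need at least two runs to estimate the standard error")
    mean = statistics.mean(runtimes_s)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(runtimes_s) / math.sqrt(n)
    return mean, sem


# Hypothetical runtimes in seconds for one {GPU, driver, CUDA} combination.
example_runs = [1500.0, 1530.0, 1560.0]
mean, sem = summarize_runs(example_runs)
print(f"mean = {mean:.1f} s, standard error = {sem:.1f} s")
```

With only three runs per point the standard error is itself a rough estimate, so small differences between neighboring points should be read with caution.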

Figure 1. NVIDIA K80 with 11 GB VRAM, 4 vCPU, 61 GB RAM
Figure 2. NVIDIA M60 with 7 GB VRAM, 4 vCPU, 31 GB RAM

Instances with the K80 (Figure 1) and M60 (Figure 2) show the smallest variation across driver versions (8% and 4%, respectively). For the M60, the overall variation in performance is comparable to the run-to-run variation for a single driver, so no conclusion can be drawn about the best-performing combination. For the K80, however, CUDA v.10.0 with driver v.418 is on average better than any other combination.

Figure 3. NVIDIA V100 with 16 GB VRAM, 8 vCPU, 61 GB RAM

Instances with the V100 produced the most interesting results (Figure 3). This GPU shows the highest variation in benchmark score (15%), which is significantly higher than the variation within most individual software stacks. On average there is a clear optimum (CUDA v.10.0 and driver v.418) and a clear worst-performing stack (CUDA v.9.0 and driver v.384). The difference between these two combinations can exceed 15% (the largest gap we saw between individual data points was about 20%).

Figure 4. NVIDIA T4 with 15 GB VRAM, 4 vCPU, 16 GB RAM

For the T4 (Figure 4), the most modern GPU in the set, it is hard to say which combination is best, despite a significant variation in performance (14%). However, there is a clear loser (CUDA v.9.0 and driver v.410). It is safe to say that recent drivers show comparable performance. This graph is missing the result for driver v.384 because that driver version is not compatible with this GPU family.


As you can see, even a quick, straightforward check shows that the right driver/CUDA combination can save up to 15–20% of time and cost, adding another layer of optimization after choosing the optimal GPU/instance for the job.
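To make the arithmetic concrete, here is a small sketch of how a runtime gap translates into time and cost savings. All numbers are hypothetical (they are not measurements from the figures, and the hourly price is not an actual AWS rate), and the sketch assumes cost scales linearly with runtime at a fixed hourly price.

```python
def relative_saving(baseline_s, optimized_s):
    """Fraction of runtime saved by switching from the baseline stack."""
    if baseline_s <= 0:
        raise ValueError("baseline runtime must be positive")
    return (baseline_s - optimized_s) / baseline_s


def cost_usd(runtime_s, hourly_price_usd):
    """Cost of a run, assuming linear billing at a fixed hourly price."""
    return runtime_s / 3600.0 * hourly_price_usd


# Hypothetical runtimes for the worst and best stack on the same instance.
slow_s, fast_s = 2400.0, 2040.0
price = 3.0  # hypothetical hourly price in USD

saving = relative_saving(slow_s, fast_s)
cost_delta = cost_usd(slow_s, price) - cost_usd(fast_s, price)
print(f"time saved: {saving:.0%}")
print(f"cost saved per run: ${cost_delta:.2f}")
```

Because the instance price is fixed, the relative cost saving equals the relative time saving; the absolute amount grows with the length of the job and the number of runs.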

This could have an even bigger impact on inference workloads by reducing both the amount of compute required and the latency of each request.

This exercise was just a quick check of the idea; more thorough research will follow, covering the latest versions of CUDA, drivers, and neural net architectures.

Egor Bykov