There are plenty of guides on installing Nvidia drivers in the wild, but when I needed to set up a system where I could easily switch between driver versions for our research, I still spent several days trying to figure out the most seamless way to achieve that.
Here I want to summarize those days of googling, trials, and errors, and the final solution I came up with, which turned out to be far simpler than I expected.
I am using Ubuntu as a reference; the process should be the same for any version above 16.04. For other distros, the main part of the installation should be similar, with some differences in obtaining the required system packages and handling already installed graphics drivers.
Below is a summary of my findings that helped me better understand the issue and why certain combinations of Nvidia low level libraries work and others do not.
Each driver version was developed against a certain version of the Linux kernel (which, I believe, was mainstream at the time of development). This means you have almost no chance of installing old drivers on distros with a more modern kernel, no matter which installation method you choose; most probably such drivers simply do not exist (e.g. I was not able to install drivers older than version 440 on Ubuntu 18.04 with kernel 4.15 or higher).
However, thanks to the backward compatibility of Linux kernels, you can install newer drivers on older kernels (e.g. version 450 or even higher on Ubuntu 16.04).
If you plan to install the CUDA libraries on your host as well, you do not need to install the driver separately: it ships together with the CUDA package due to certain compatibility requirements. CUDA installation deserves a separate article but can be done similarly.
If you plan to run CUDA-dependent code inside a container, you do not need to install the CUDA toolkit at all. The only requirement is that the driver package contains the library required to run a CUDA-enabled container (later I will show how to check that).
Download installer from Nvidia site.
According to my observations, it makes no difference which product or product type you choose: you will get the same installation file. The only difference is that if you choose “Data Center / Tesla” instead of e.g. “GeForce”, you will get a drop-down for the CUDA Toolkit version. Don’t be confused: you will not get the CUDA Toolkit together with the driver. By varying the CUDA version you will get the minimal driver version compatible with the chosen CUDA Toolkit version (e.g. if you pick 10.0 as the CUDA version you will get driver v.410; for CUDA v.11.0 you will get driver v.450).
Be aware that your GPU generation should be compatible with the chosen CUDA Toolkit version. The older your GPU, the more options you have in terms of driver/CUDA versions.
After setting up all the parameters you will download a self-extracting installer. This archive contains pre-compiled libraries and the source files necessary to build the kernel modules. To build the driver you will need the following packages:
- gcc (C compiler)
- make (utility used to automate compilation process)
- linux-headers (kernel source files necessary to build kernel modules)
One line installation:
$ sudo apt install gcc make linux-headers-$(uname -r)
“$(uname -r)” will add the kernel version of your system to the package name to grab exactly what is required in your case.
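You can preview the substitution on its own before installing anything; the exact kernel string will differ from machine to machine:

```shell
# Print the package name the install command above would request;
# the version shown is just an example of the form it takes.
echo linux-headers-$(uname -r)
# e.g. linux-headers-4.15.0-112-generic (varies per system)
```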
The next steps will depend on your particular set-up.
Headless Server without pre-installed drivers
This is the simplest case: you have a VM server with no window manager (X server) running, booted from a pre-built image without drivers (like an AWS AMI without pre-installed drivers). Then you can simply run:
$ sudo ./NVIDIA-Linux-x86_64-450.80.02.run --silent
The “--silent” option turns off the interactive UI and uses the default answers to all the questions that are usually asked. No output except error messages will be printed.
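If a silent install fails, the installer writes a log you can inspect; by default it goes to /var/log/nvidia-installer.log:

```shell
# Show the tail of the installer log after a failed run (needs root):
sudo tail -n 30 /var/log/nvidia-installer.log
```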
After that you are good to go; no reboot is required. You can verify that the installation went well by listing the loaded kernel modules with “nvidia” in their names:
$ lsmod | grep nvidia
You should get 3 modules: nvidia, nvidia_drm, and nvidia_modeset. Or you can run:
$ nvidia-smi
which will show the current driver version and the list of all GPUs present in the system.
Note the CUDA version in the output. It is the highest version of the CUDA toolkit compatible with this driver; it does not mean that you have ANY version of the CUDA toolkit installed on your system. If you do not see this in the nvidia-smi output (I have seen this several times), it means that the driver was installed without the library responsible for processing requests from CUDA libraries, and your CUDA-enabled containers will not work. In this case, you need to reinstall the driver (or even try a different version of it).
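Another way to check for that library (libcuda.so, which CUDA applications and containers talk to) is to ask the dynamic loader whether it is registered; this is a sketch assuming `ldconfig` is available, which it is on any stock Ubuntu:

```shell
# If the driver installed correctly, libcuda.so should be known to the
# loader; otherwise the fallback message prints.
ldconfig -p 2>/dev/null | grep -q libcuda.so && echo "libcuda found" || echo "libcuda missing"
```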
If you want to investigate the content of the package you can unpack it without installation by running:
$ ./NVIDIA-Linux-x86_64-450.80.02.run --extract-only
$ ./NVIDIA-Linux-x86_64-450.80.02.run --target [NewDir]
You can scrutinize other available installation options by running:
$ ./NVIDIA-Linux-x86_64-450.80.02.run --advanced-options
Desktop or Server running a window manager
For a bare-metal or VM installation with the GPU present during installation, you need to disable the default nouveau driver first. To do this, create the file /etc/modprobe.d/blacklist-nouveau.conf (you can use any name you like) and put the following lines into it:
blacklist nouveau
options nouveau modeset=0
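This can be done in one command; a sketch, assuming you keep the file name used above:

```shell
# Write the nouveau blacklist file (needs root):
sudo tee /etc/modprobe.d/blacklist-nouveau.conf >/dev/null <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
```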
Usually, after that, you need to rebuild the initramfs used to boot the system (this step might not be required for Ubuntu 20.04, but if you still see nouveau in the output of lsmod after a reboot, you do need it):
$ sudo update-initramfs -u
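After rebooting, you can confirm that nouveau stayed out of the kernel:

```shell
# If blacklisting worked, grep finds nothing and the fallback message prints.
lsmod | grep nouveau || echo "nouveau not loaded"
```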
After a reboot, you can follow the steps described for the headless server above to install the driver.
Uninstalling Nvidia driver
If you already have an Nvidia driver installed on your system and want to install a different version of it (or simply get rid of it), you can uninstall the driver by running:
$ ./NVIDIA-Linux-x86_64-450.80.02.run --uninstall
If you want to switch back to the nouveau driver, don’t forget to delete /etc/modprobe.d/blacklist-nouveau.conf (and rebuild the initramfs if necessary) after removing the Nvidia driver.
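Restated as commands (root required; the file path is the one used earlier in this article):

```shell
# Remove the blacklist file and rebuild the initramfs so nouveau
# loads again on the next boot.
sudo rm /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
```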
Sometimes a kernel update will break compatibility with the installed driver. In this case, just uninstall the driver and install it again against the updated kernel (the process is the same).
I hope this sheds some light on the driver installation process and that it will never trouble you again. Let me know if you would like to see a similar article on installing the CUDA toolkit, or on how nvidia-docker enables containers with the CUDA toolkit to run on a host with an incompatible driver.