MJUN Tech Note

Building a Machine Learning Server: Installing NVIDIA Driver, CUDA, and Docker on Ubuntu

Hello. This time I’ll write about installing the NVIDIA driver and CUDA to build an environment for running PyTorch, TensorFlow, and other frameworks on NVIDIA GPUs.

While there are many articles on this topic, this one differs in that it assumes the machine will be used as a shared server in a laboratory or company.

I’ve long relied on “What’s the current status of NVIDIA Docker? (20.09 version)” written by Mr. Sasaki of NVIDIA, but since the situation has changed in recent years, I’ll summarize the current procedure here.

This article aims to build an environment with the following:

  • Ubuntu 22.04 LTS Server (Desktop is also fine)
  • Docker
  • NVIDIA Driver, CUDA, NVIDIA Docker

Ubuntu Installation and Configuration

First, install Ubuntu 22.04 LTS Server on the machine. At the time of writing, 24.04 LTS is about a month away from release, but 22.04 should be fine until the various libraries catch up with it.

Please install Ubuntu using your preferred method. I usually install with the following settings:

  • Select the minimized server installation
  • Don’t install third-party libraries
  • Partitioning: mark the disk as the boot device, allocate 1GB for the boot partition, and mount the rest at / (root)
  • Install OpenSSH-Server

Even if you plan to bond multiple NICs, it’s easier to edit the netplan configuration after installation.

After installing Ubuntu, log into the console and configure settings for remote work.

Package Installation

First, install your favorite libraries:

sudo apt update
sudo apt upgrade

sudo apt install -y \
    avahi-daemon git vim emacs build-essential \
    wget curl jq ffmpeg htop tmux screen parallel \
    imagemagick geeqie iputils-ping net-tools zsh

# Install packages for Python building
sudo apt install -y \
    build-essential libssl-dev zlib1g-dev \
    libbz2-dev libreadline-dev libsqlite3-dev curl \
    libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev \
    libffi-dev liblzma-dev

Brief explanation of the installed packages:

  • git, wget, curl Essential
  • avahi-daemon mDNS daemon; with this installed, machines on the local LAN (Windows, macOS, Linux) can reach the server via e.g. ssh server.local without setting up DNS
  • emacs, vim Installing both avoids editor wars
  • build-essential Compilers and related tools needed for building software
  • jq JSON parser. ML dataset information is sometimes provided in JSON format
  • ffmpeg Process videos from command line
  • imagemagick Handle images from command line
  • tmux, screen Keep shell sessions alive after logout, useful for long-running training scripts. nohup works too
  • parallel Server machines usually have many cores, so use parallel processing for efficiency. xargs works too
  • htop View system usage. top works but htop is more readable
  • geeqie Image viewer. Light enough to use over X11 forwarding
  • iputils-ping, net-tools Use ping to check other servers and internet connectivity
  • zsh Install your favorite shell

The second apt install covers the packages required to build Python from source. Some users of the server will likely use pyenv, so install these as well.
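
For reference, here is a minimal sketch of how a user might set up pyenv once these build dependencies are in place. It assumes the official pyenv installer at https://pyenv.run and uses Python 3.11.8 purely as an example version:

# Install pyenv for the current user (no sudo required)
curl -fsSL https://pyenv.run | bash

# Add pyenv to the shell (these lines go in ~/.bashrc)
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"

# Build and select a Python version; this is the step that needs the apt packages above
pyenv install 3.11.8
pyenv global 3.11.8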

Network Configuration

Next, configure the network. Ubuntu Server uses netplan, so edit the YAML files under /etc/netplan (Ubuntu Desktop uses NetworkManager instead). To assign a static IP address, configure as follows:

Note that as of Ubuntu 22.04, the gateway4 key is deprecated in favor of routes.

# This is the network config written by 'subiquity'
network:
  ethernets:
    enp42s0:
      dhcp4: false
      addresses:
        - 192.168.0.2/24
      nameservers:
        addresses: [192.168.0.1]
      # The following is deprecated
      # gateway4: 192.168.0.1
      # Write with routes
      routes:
        - to: default
          via: 192.168.0.1
  version: 2

After configuration, enable settings with sudo netplan apply.
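If you are working on the machine remotely, it can be safer to test the configuration first. Here is a small sanity-check sketch using the interface name and addresses from the example above; netplan try rolls the change back automatically unless you confirm it:

# Apply the configuration with automatic rollback on lost connectivity
sudo netplan try

# Confirm the address, then check that the gateway is reachable
ip -4 addr show enp42s0
ping -c 3 192.168.0.1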

SSH Configuration

Next, configure the SSH server. This will vary by environment, so adjust as needed. For a machine that sits on the local LAN, setting PermitRootLogin and PasswordAuthentication to no should suffice.
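
As a minimal sketch (assuming key-based login already works for at least one user, otherwise you will lock yourself out):

# Disable root login and password authentication in /etc/ssh/sshd_config
sudo sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config

# Check the config syntax, then restart sshd
sudo sshd -t && sudo systemctl restart ssh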

Once this is done, you can work remotely, so put the server in a rack or wherever. If configuring NFS or LDAP, do it here or after racking.

Docker Installation

Next, install Docker. Docker Desktop is now available for Linux as well, but Docker Engine is sufficient for a server.

The official site below has an “Install using the convenience script” section at the bottom, which lets you install easily by downloading and running an installation script.

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

For shared servers, using rootless Docker is also an option. Note that commands for introducing NVIDIA Docker differ slightly in that case.

Also, add users to the docker group so they can use docker without sudo:

sudo gpasswd -a [username] docker
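
The group change takes effect the next time the user logs in (or via newgrp docker in the current shell). A quick check that Docker works without sudo:

# After logging back in, this should run without sudo
docker run --rm hello-world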

Installing NVIDIA Driver, CUDA, and NVIDIA Docker

Next, install drivers and CUDA for handling NVIDIA GPUs, essential for machine learning.

Registering apt Repository

From the following site, select your environment and install cuda-keyring:

For Ubuntu 22.04, select:

  • Linux
  • x86_64 (arm64-sbsa for ARM)
  • Ubuntu
  • 22.04
  • deb (network)

Then execute the displayed script up to sudo apt-get update:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
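
To confirm that apt now sees the NVIDIA repository, an optional check:

# The candidate versions should come from developer.download.nvidia.com
apt-cache policy cuda-toolkit cuda-drivers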

Installing NVIDIA Driver, CUDA

Next, install CUDA, but first check the meta-package table in NVIDIA’s installation guide:

According to that table, sudo apt install cuda would install both the driver and the CUDA toolkit, but the dependencies would then force you to upgrade both even when you only want to upgrade one of them. Therefore, we install cuda-toolkit and cuda-drivers separately to keep the dependencies apart:

sudo apt install -y cuda-toolkit
sudo apt install -y cuda-drivers

As for the CUDA version, users will mostly work inside Docker containers, so it’s fine for the host machine to simply track the latest version; encourage users to do their work through Docker.
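
If you ever do need a specific toolkit version on the host (for example for users who compile CUDA code outside Docker), versioned meta-packages are available; a sketch using 12.3 as an example:

# Install a pinned toolkit version instead of the unversioned meta-package
sudo apt install -y cuda-toolkit-12-3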

Installing cuDNN

cuDNN is the NVIDIA CUDA Deep Neural Network library; when it is available, frameworks can use GPU-optimized implementations of common DNN operations.

As usual, looking at the documentation below, installing the cudnn package now suffices (it used to be libcudnn8 and friends):

sudo apt install cudnn
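
To see which cuDNN packages the meta-package actually pulled in, a quick check:

# List the installed cuDNN packages and their versions
dpkg -l | grep -i cudnn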

NVIDIA Docker

What used to be distributed as NVIDIA Docker is now part of the NVIDIA Container Toolkit. To access NVIDIA GPUs from Docker containers, install nvidia-container-toolkit:

sudo apt-get install -y nvidia-container-toolkit

Next, configure Docker. Note that commands differ for rootless Docker (refer to Docker official documentation):

sudo nvidia-ctk runtime configure --runtime=docker

Next, modify the Docker configuration. As reported in a Qiita article and in GitHub issues, there is a known phenomenon where GPUs that were usable inside a running Docker container become unusable after a while. This appears to happen because Docker manages cgroups via systemd, so containers lose access to the GPUs when systemctl daemon-reload is executed.

When Docker is installed by default, docker info shows:

docker info

# (omitted)
 Cgroup Driver: systemd
 Cgroup Version: 2

The file to edit is /etc/docker/daemon.json:

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

The line to add here is "exec-opts": ["native.cgroupdriver=cgroupfs"]. After writing, restart Docker and confirm the change:

sudo systemctl restart docker
docker info

# (omitted)
 Cgroup Driver: cgroupfs
 Cgroup Version: 2

If it shows cgroupfs, you’re good.

Reboot & Verification

Reboot the server to enable NVIDIA Driver:

sudo reboot now

After the reboot, add the CUDA paths to .bashrc and reload it:

export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export CUDA_HOME="/usr/local/cuda"
source ~/.bashrc
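
On a shared server, you may prefer to set these paths system-wide instead of per user. One possible approach is to drop a script under /etc/profile.d (the file name cuda.sh is just an example):

# Create a system-wide profile script so every login shell picks up the CUDA paths
sudo tee /etc/profile.d/cuda.sh > /dev/null <<'EOF'
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export CUDA_HOME="/usr/local/cuda"
EOF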

Then verify GPU recognition on both host machine and inside containers:

  • Host Machine

nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:03:00.0 Off |                  Off |
|  0%   46C    P8              34W / 450W |    112MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:81:00.0 Off |                  Off |
|  0%   50C    P8              28W / 450W |     15MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

nvidia-smi -L

GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-0b...)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-0e...)

  • Container

docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:12.3.2-base-ubuntu22.04' locally
12.3.2-base-ubuntu22.04: Pulling from nvidia/cuda
01007420e9b0: Pull complete
bfc08b17629d: Pull complete
86fc789646b5: Pull complete
6b62141c2a21: Pull complete
e0e30e504698: Pull complete
Digest: sha256:8cecfe099315f73127d6d5cc43fce32c7ffff4ea0460eefac48f2b7d811ce857
Status: Downloaded newer image for nvidia/cuda:12.3.2-base-ubuntu22.04
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:03:00.0 Off |                  Off |
|  0%   46C    P8              33W / 450W |    112MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:81:00.0 Off |                  Off |
|  0%   50C    P8              28W / 450W |     15MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

If recognized as above, you’re good.

Next, verify CUDA version. If displayed as follows, you’re OK:

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

Stopping Automatic Updates for NVIDIA Driver and CUDA

Ubuntu’s package manager apt can update packages automatically, but when the NVIDIA driver is auto-updated, the driver loaded in the kernel no longer matches the installed libraries. Running nvidia-smi then fails with Failed to initialize NVML: Driver/library version mismatch, so you can no longer see the GPU status or start new jobs until the machine is rebooted. For long-term operation, I therefore recommend excluding the NVIDIA packages from automatic updates.

To stop automatic updates, edit /etc/apt/apt.conf.d/50unattended-upgrades. Add .*nvidia, .*libnvidia, .*cuda to Unattended-Upgrade::Package-Blacklist at the top:

// Python regular expressions, matching packages to exclude from upgrading
Unattended-Upgrade::Package-Blacklist {
    // The following matches all packages starting with linux-
//  "linux-";

    // Use $ to explicitly define the end of a package name. Without
    // the $, "libc6" would match all of them.
//  "libc6$";
//  "libc6-dev$";
//  "libc6-i686$";

    // Special characters need escaping
//  "libstdc\+\+6$";

    // The following matches packages like xen-system-amd64, xen-utils-4.1,
    // xenstore-utils and libxenstore3.0
//  "(lib)?xen(store)?";

    // For more information about Python regular expressions, see
    // https://docs.python.org/3/howto/regex.html
    ".*nvidia";
    ".*libnvidia";
    ".*cuda";
};

This stops the automatic updates. When you do want to upgrade, run sudo apt upgrade cuda-drivers cuda-toolkit manually.
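
To confirm that the blacklist is actually picked up, you can do a dry run; the NVIDIA and CUDA packages should show up as excluded in the debug output:

# Simulate an unattended upgrade without changing anything
sudo unattended-upgrade --dry-run --debug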

Incidentally, datacenter GPUs (Tesla, RTX A6000, RTX 6000 Ada, A100, H100, etc.) support CUDA forward compatibility, which allows newer CUDA versions to run on top of older driver versions. DGX A100 systems, for example, are configured not to auto-update, so disabling automatic updates here should be fine as well.

This completes the minimum setup for a machine learning server. Now add users or configure LDAP and NFS as needed. Good work!