GPU Sharing with LXC

Sources

Troubleshooting

Check Device ids

  • Run ls -ls /dev/nvidia*
🚀 ls /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 May 27 14:49 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 May 27 14:49 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 May 27 14:49 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 May 27 14:50 /dev/nvidia-modeset
crw-rw-rw- 1 root root 234,   0 May 27 14:50 /dev/nvidia-uvm
crw-rw-rw- 1 root root 234,   1 May 27 14:50 /dev/nvidia-uvm-tools
 
/dev/nvidia-caps:
total 0
cr-------- 1 root root 238, 1 May 27 14:50 nvidia-cap1
cr--r--r-- 1 root root 238, 2 May 27 14:50 nvidia-cap2
  • Check lxc config at /etc/pve/lxc/<id>.conf, match ids with the numbers from above
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 235:* rwm
lxc.cgroup2.devices.allow: c 510:* rwm
lxc.mount.entry: /dev/nvidia0 /dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidia1 /dev/nvidia1 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl /dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset /dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm /dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools /dev/nvidia-uvm-tools none bind,optional,create=file

Confirm nvidia-smi in the LXC

When /usr/bin/nvidia-smi is a folder instead of an executable, the error permission denied: nvidia-smi occurs. To fix:

  • Remove the current folder (or back it up)
sudo rm -r /usr/bin/nvidia-smi
  • Create a new symlink
ln -sfn /etc/alternatives/nvidia--nvidia-smi /usr/bin/nvidia-smi
  • Try nvidia-smi
🚀 ❯ nvidia-smi
Mon May 27 19:37:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2070 ...    On  |   00000000:01:00.0 Off |                  N/A |
| 41%   36C    P8             17W /  215W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1060 3GB    On  |   00000000:02:00.0 Off |                  N/A |
| 42%   43C    P8              6W /  120W |       2MiB /   3072MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
 
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Check host modules

  • View /etc/modules.load.d/modules.conf
nvidia
nvidia_uvm
  • Update initramfs and reboot
update-initramfs -u -k all
reboot

Modprobe

sudo modprobe --remove nvidia_uvm
sudo modprobe nvidia_uvm