Created attachment 1699875
oc logs nvidia-driver-daemonset-vh8cg

Description of problem:
After deploying NVIDIA GPU operator on OCP 4.4.10, the pods are crashing:

NAME                                            READY   STATUS                 RESTARTS   AGE
gpu-operator-669d8b6959-xkcwr                   1/1     Running                0          2m55s
ip-10-0-207-226us-west-2computeinternal-debug   1/1     Running                0          8m51s
nvidia-container-toolkit-daemonset-26wpr        1/1     Running                0          154m
nvidia-container-toolkit-daemonset-2v6fr        1/1     Running                0          154m
nvidia-container-toolkit-daemonset-vhg77        1/1     Running                0          154m
nvidia-dcgm-exporter-842tp                      0/1     CreateContainerError   1          154m
nvidia-dcgm-exporter-fv6w4                      0/1     CreateContainerError   1          154m
nvidia-dcgm-exporter-mxb8v                      0/1     CrashLoopBackOff       1          154m
nvidia-device-plugin-daemonset-42r4p            1/1     Running                0          128m
nvidia-device-plugin-daemonset-w8ks7            1/1     Running                0          154m
nvidia-device-plugin-daemonset-zd7qt            0/1     CreateContainerError   0          154m
nvidia-device-plugin-validation                 0/1     Pending                0          154m
nvidia-driver-daemonset-rndlp                   0/1     CrashLoopBackOff       5          24m
nvidia-driver-daemonset-vh8cg                   0/1     CrashLoopBackOff       6          24m
nvidia-driver-daemonset-wd8zl                   1/1     Running                2          24m
nvidia-driver-validation                        0/1     CreateContainerError   0          2m46s

Version-Release number of selected component (if applicable):
OCP version 4.4.10

CoreOS kernel version on OCP 4.4.10:
$ uname -a
Linux ip-10-0-207-226 4.18.0-147.20.1.el8_1.x86_64 #1 SMP Wed Jun 10 19:19:16 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

NVIDIA driver container image: nvidia/driver:440.64.00-rhcos4.4

How reproducible:
Deploy NVIDIA GPU operator using OLM on OCP 4.4.10

Actual results:
NVIDIA GPU operator pods are crashing

Expected results:
NVIDIA GPU operator pods are running

Additional info:
OCP 4.4.10 uses CoreOS kernel version 4.18.0-147.20.1.el8_1.x86_64. The RPM packages for this kernel version are not available in the repos enabled in the nvidia/driver:440.64.00-rhcos4.4 image (default UBI repos).
Instead, I had to:

$ subscription-manager repos --enable rhel-8-for-x86_64-baseos-eus-rpms
$ subscription-manager release --set=8.1

to get access to the kernel packages for kernel version 4.18.0-147.20.1.el8_1.x86_64:

$ dnf info kernel-devel
Updating Subscription Management repositories.
Red Hat Enterprise Linux 8 for x86_64 - BaseOS (RPMs)                             15 kB/s | 4.1 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - BaseOS - Extended Update Support (RPMs)  9.5 kB/s | 4.1 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - AppStream (RPMs)                          13 kB/s | 4.5 kB     00:00
Available Packages
Name         : kernel-devel
Version      : 4.18.0
Release      : 147.20.1.el8_1
Architecture : x86_64
Size         : 13 M
Source       : kernel-4.18.0-147.20.1.el8_1.src.rpm
Repository   : rhel-8-for-x86_64-baseos-eus-rpms
Summary      : Development package for building kernel modules to match the kernel
URL          : http://www.kernel.org/
License      : GPLv2 and Redistributable, no modification permitted
Description  : This package provides kernel headers and makefiles sufficient to build modules
             : against the kernel package.
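The workaround above can be sketched as two small helpers that derive, from a kernel release string, the kernel-devel package to look for and the minor release to pin. This is only an illustrative sketch and not part of the driver image: the function names kernel_devel_pkg and rhel_release_from_kernel are hypothetical, and the mapping assumes the kernel release carries an elN_M tag (as in el8_1 for 8.1). A release string without that tag yields empty output.

```shell
# Hypothetical helpers, not part of the nvidia/driver image.

# Print the kernel-devel package spec matching a kernel release string.
# Defaults to the running kernel when no argument is given.
kernel_devel_pkg() {
    printf 'kernel-devel-%s\n' "${1:-$(uname -r)}"
}

# Print the minor release encoded in the kernel release string,
# e.g. "...el8_1.x86_64" -> "8.1". Assumes an elN_M tag is present;
# prints nothing otherwise.
rhel_release_from_kernel() {
    printf '%s\n' "${1:-$(uname -r)}" |
        sed -n 's/.*\.el\([0-9][0-9]*\)_\([0-9][0-9]*\)\..*/\1.\2/p'
}

# Possible usage on an entitled host (commented out, shown for context only):
#   subscription-manager repos --enable rhel-8-for-x86_64-baseos-eus-rpms
#   subscription-manager release --set="$(rhel_release_from_kernel)"
#   dnf info "$(kernel_devel_pkg)"
```

For the kernel reported here, the helpers would produce kernel-devel-4.18.0-147.20.1.el8_1.x86_64 and 8.1, which matches the release set manually above.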
This bug relates to https://github.com/openshift/enhancements/pull/357
The NV GPU operator is not tied to an OpenShift release. It works for "any" release and depends only on the kernel. Setting this to 4.7 for tracking. Once the bug is resolved, we can install the GPU operator on 4.3, 4.4, 4.5, and so on.
*** Bug 1862229 has been marked as a duplicate of this bug. ***
The newest version of the GPU operator supports 4.4, 4.5 and 4.6. Please update to the latest z-stream of OpenShift 4.4, 4.5 or 4.6. Closing this, as we have verified on both sides (NV, RH) that the new GPU operator works.