Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1853726

Summary: NVIDIA GPU operator fails to compile kernel modules on OCP 4.4.10
Product: OpenShift Container Platform
Component: Special Resource Operator
Version: 4.4
Status: CLOSED UPSTREAM
Severity: urgent
Priority: unspecified
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Reporter: Ales Nosek <anosek>
Assignee: Zvonko Kosic <zkosic>
QA Contact: Walid A. <wabouham>
CC: antgarci, aos-bugs, btomlins, dfeddema, kpouget, xtian
Last Closed: 2020-11-05 12:04:28 UTC
Type: Bug
Attachments:
Description Flags
oc logs nvidia-driver-daemonset-vh8cg none

Description Ales Nosek 2020-07-03 17:01:05 UTC
Created attachment 1699875
oc logs nvidia-driver-daemonset-vh8cg

Description of problem:

After deploying NVIDIA GPU operator on OCP 4.4.10, the pods are crashing:

NAME                                            READY   STATUS                 RESTARTS   AGE
gpu-operator-669d8b6959-xkcwr                   1/1     Running                0          2m55s
ip-10-0-207-226us-west-2computeinternal-debug   1/1     Running                0          8m51s
nvidia-container-toolkit-daemonset-26wpr        1/1     Running                0          154m
nvidia-container-toolkit-daemonset-2v6fr        1/1     Running                0          154m
nvidia-container-toolkit-daemonset-vhg77        1/1     Running                0          154m
nvidia-dcgm-exporter-842tp                      0/1     CreateContainerError   1          154m
nvidia-dcgm-exporter-fv6w4                      0/1     CreateContainerError   1          154m
nvidia-dcgm-exporter-mxb8v                      0/1     CrashLoopBackOff       1          154m
nvidia-device-plugin-daemonset-42r4p            1/1     Running                0          128m
nvidia-device-plugin-daemonset-w8ks7            1/1     Running                0          154m
nvidia-device-plugin-daemonset-zd7qt            0/1     CreateContainerError   0          154m
nvidia-device-plugin-validation                 0/1     Pending                0          154m
nvidia-driver-daemonset-rndlp                   0/1     CrashLoopBackOff       5          24m
nvidia-driver-daemonset-vh8cg                   0/1     CrashLoopBackOff       6          24m
nvidia-driver-daemonset-wd8zl                   1/1     Running                2          24m
nvidia-driver-validation                        0/1     CreateContainerError   0          2m46s
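
For triage, a listing like the one above can be narrowed to the pods that are not healthy. The following is a minimal sketch, not part of the original report: the helper name failing_pods is hypothetical, and it assumes the five-column NAME/READY/STATUS/RESTARTS/AGE layout shown above with the header row already stripped.

```shell
#!/bin/sh
# Hypothetical helper: read `oc get pods --no-headers` style lines on stdin
# (NAME READY STATUS RESTARTS AGE) and print the name and status of every
# pod whose STATUS is neither Running nor Completed.
failing_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Sample input copied from the listing in this report.
failing_pods <<'EOF'
gpu-operator-669d8b6959-xkcwr                   1/1     Running                0          2m55s
nvidia-dcgm-exporter-mxb8v                      0/1     CrashLoopBackOff       1          154m
nvidia-driver-daemonset-vh8cg                   0/1     CrashLoopBackOff       6          24m
nvidia-driver-validation                        0/1     CreateContainerError   0          2m46s
EOF
```

Against a live cluster the same filter could be fed with `oc get pods --no-headers | failing_pods`, run in whichever project the operator was deployed to, and `oc logs --previous <pod>` then shows the output of the last crashed container for each pod it reports.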

Version-Release number of selected component (if applicable):

OCP version 4.4.10

CoreOS kernel version on OCP 4.4.10:
$ uname -a
Linux ip-10-0-207-226 4.18.0-147.20.1.el8_1.x86_64 #1 SMP Wed Jun 10 19:19:16 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

NVIDIA driver container image: nvidia/driver:440.64.00-rhcos4.4

How reproducible:

Deploy NVIDIA GPU operator using OLM on OCP 4.4.10

Actual results:

NVIDIA GPU operator pods are crashing

Expected results:

NVIDIA GPU operator pods are running

Additional info:

OCP 4.4.10 uses CoreOS kernel version 4.18.0-147.20.1.el8_1.x86_64. The RPM packages for this kernel version are not available in the repos enabled in the nvidia/driver:440.64.00-rhcos4.4 image (default UBI repos). Instead, I had to:

$ subscription-manager repos --enable rhel-8-for-x86_64-baseos-eus-rpms
$ subscription-manager release --set=8.1

to get access to the kernel packages for kernel version 4.18.0-147.20.1.el8_1.x86_64:

$ dnf info kernel-devel
Updating Subscription Management repositories.
Red Hat Enterprise Linux 8 for x86_64 - BaseOS (RPMs)                             15 kB/s | 4.1 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - BaseOS - Extended Update Support (RPMs)  9.5 kB/s | 4.1 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - AppStream (RPMs)                          13 kB/s | 4.5 kB     00:00
Available Packages
Name         : kernel-devel
Version      : 4.18.0
Release      : 147.20.1.el8_1
Architecture : x86_64
Size         : 13 M
Source       : kernel-4.18.0-147.20.1.el8_1.src.rpm
Repository   : rhel-8-for-x86_64-baseos-eus-rpms
Summary      : Development package for building kernel modules to match the kernel
URL          : http://www.kernel.org/
License      : GPLv2 and Redistributable, no modification permitted
Description  : This package provides kernel headers and makefiles sufficient to build modules
             : against the kernel package.
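
The workaround above pins the release to 8.1 because the node kernel carries an el8_1 tag. As a rough sketch only, the release and the matching kernel-devel package name can be derived from the `uname -r` string. The helper names rhel_release_from_kernel and kernel_devel_pkg are hypothetical, and the parsing assumes the kernel release contains an `.el<major>_<minor>.` tag, which is not true of every RHEL kernel build. The script only prints the commands from this report; it does not run them.

```shell
#!/bin/sh
# Hypothetical helper: extract the RHEL release (e.g. 8.1) from a kernel
# release string such as 4.18.0-147.20.1.el8_1.x86_64.
# Prints nothing when the string has no ".el<major>_<minor>." tag.
rhel_release_from_kernel() {
  printf '%s\n' "$1" | sed -n 's/.*\.el\([0-9][0-9]*\)_\([0-9][0-9]*\)\..*/\1.\2/p'
}

# Hypothetical helper: build the kernel-devel package spec for that kernel.
kernel_devel_pkg() {
  printf 'kernel-devel-%s\n' "$1"
}

# Defaults to the kernel version from this report; pass "$(uname -r)" on a node.
kver="${1:-4.18.0-147.20.1.el8_1.x86_64}"

# Print (do not execute) the commands used in the workaround above.
echo "subscription-manager repos --enable rhel-8-for-x86_64-baseos-eus-rpms"
echo "subscription-manager release --set=$(rhel_release_from_kernel "$kver")"
echo "dnf info $(kernel_devel_pkg "$kver")"
```

Printing rather than executing keeps the sketch free of side effects; the subscription-manager calls require an entitled system and change its repository configuration.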

Comment 1 Zvonko Kosic 2020-07-23 16:30:37 UTC
This bug relates to https://github.com/openshift/enhancements/pull/357

Comment 6 Zvonko Kosic 2020-09-29 13:03:12 UTC
The NV GPU operator is not tied to an OpenShift release. It works for "any" release and depends only on the kernel.
Setting this to 4.7 for tracking. Once the bug is resolved, we can install the GPU operator on 4.3, 4.4, 4.5, and so on.

Comment 8 Zvonko Kosic 2020-10-20 11:12:08 UTC
*** Bug 1862229 has been marked as a duplicate of this bug. ***

Comment 9 Zvonko Kosic 2020-11-05 12:04:28 UTC
The newest version of the GPU operator supports 4.4, 4.5 and 4.6. Please update to the latest z-stream of OpenShift 4.4, 4.5 or 4.6. Closing this, as we have verified on both sides (NV, RH) that the new GPU operator works.