1853726 – NVIDIA GPU operator fails to compile kernel modules on OCP 4.4.10

Summary: NVIDIA GPU operator fails to compile kernel modules on OCP 4.4.10

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Special Resource Operator
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Zvonko Kosic
QA Contact:	Walid A.
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1862229
Depends On:
Blocks:

Reported:	2020-07-03 17:01 UTC by Ales Nosek
Modified:	2020-11-05 12:04 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-11-05 12:04:28 UTC
Target Upstream Version:
Embargoed:

Attachments
oc logs nvidia-driver-daemonset-vh8cg (3.65 KB, text/plain) 2020-07-03 17:01 UTC, Ales Nosek

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift-psap special-resource-operator pull 27	0	None	closed	Bug 1853726: Out-of-tree driver with SRO implementation.	2020-12-08 17:49:32 UTC
Red Hat Knowledge Base (Solution)	5232901	0	None	None	None	2020-10-20 11:12:07 UTC

Description Ales Nosek 2020-07-03 17:01:05 UTC

Created attachment 1699875
oc logs nvidia-driver-daemonset-vh8cg

Description of problem:

After deploying NVIDIA GPU operator on OCP 4.4.10, the pods are crashing:

NAME                                            READY   STATUS                 RESTARTS   AGE
gpu-operator-669d8b6959-xkcwr                   1/1     Running                0          2m55s
ip-10-0-207-226us-west-2computeinternal-debug   1/1     Running                0          8m51s
nvidia-container-toolkit-daemonset-26wpr        1/1     Running                0          154m
nvidia-container-toolkit-daemonset-2v6fr        1/1     Running                0          154m
nvidia-container-toolkit-daemonset-vhg77        1/1     Running                0          154m
nvidia-dcgm-exporter-842tp                      0/1     CreateContainerError   1          154m
nvidia-dcgm-exporter-fv6w4                      0/1     CreateContainerError   1          154m
nvidia-dcgm-exporter-mxb8v                      0/1     CrashLoopBackOff       1          154m
nvidia-device-plugin-daemonset-42r4p            1/1     Running                0          128m
nvidia-device-plugin-daemonset-w8ks7            1/1     Running                0          154m
nvidia-device-plugin-daemonset-zd7qt            0/1     CreateContainerError   0          154m
nvidia-device-plugin-validation                 0/1     Pending                0          154m
nvidia-driver-daemonset-rndlp                   0/1     CrashLoopBackOff       5          24m
nvidia-driver-daemonset-vh8cg                   0/1     CrashLoopBackOff       6          24m
nvidia-driver-daemonset-wd8zl                   1/1     Running                2          24m
nvidia-driver-validation                        0/1     CreateContainerError   0          2m46s
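
To pick the failing pods out of a listing like the one above, a small filter on the STATUS column can help. This is only a sketch: the function name not_running_pods is made up for illustration, and it assumes the default five-column `oc get pods` layout shown here.

```shell
# Hypothetical helper (not part of oc): print the name and status of every
# pod whose STATUS column is not "Running".
# Reads `oc get pods`-style output on stdin and skips the header row.
not_running_pods() {
    awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}
```

Typical use would be `oc get pods | not_running_pods`. A pod can report Running while its containers are not ready, so the READY column is still worth checking separately.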

Version-Release number of selected component (if applicable):

OCP version 4.4.10

CoreOS kernel version on OCP 4.4.10:
$ uname -a
Linux ip-10-0-207-226 4.18.0-147.20.1.el8_1.x86_64 #1 SMP Wed Jun 10 19:19:16 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

NVIDIA driver container image: nvidia/driver:440.64.00-rhcos4.4

How reproducible:

Deploy NVIDIA GPU operator using OLM on OCP 4.4.10

Actual results:

NVIDIA GPU operator pods are crashing

Expected results:

NVIDIA GPU operator pods are running

Additional info:

OCP 4.4.10 uses CoreOS kernel version 4.18.0-147.20.1.el8_1.x86_64. The RPM packages for this kernel version are not available in the repos enabled in the nvidia/driver:440.64.00-rhcos4.4 image (default UBI repos). Instead, I had to:

$ subscription-manager repos --enable rhel-8-for-x86_64-baseos-eus-rpms
$ subscription-manager release --set=8.1

to get access to the kernel packages for kernel version 4.18.0-147.20.1.el8_1.x86_64:

$ dnf info kernel-devel
Updating Subscription Management repositories.
Red Hat Enterprise Linux 8 for x86_64 - BaseOS (RPMs)                            15 kB/s | 4.1 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - BaseOS - Extended Update Support (RPMs) 9.5 kB/s | 4.1 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - AppStream (RPMs)                         13 kB/s | 4.5 kB     00:00
Available Packages
Name         : kernel-devel
Version      : 4.18.0
Release      : 147.20.1.el8_1
Architecture : x86_64
Size         : 13 M
Source       : kernel-4.18.0-147.20.1.el8_1.src.rpm
Repository   : rhel-8-for-x86_64-baseos-eus-rpms
Summary      : Development package for building kernel modules to match the kernel
URL          : http://www.kernel.org/
License      : GPLv2 and Redistributable, no modification permitted
Description  : This package provides kernel headers and makefiles sufficient to build modules
             : against the kernel package.
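
The release value passed to `subscription-manager release --set` above (8.1) matches the `el8_1` dist tag in the kernel release string. A minimal sketch of deriving it automatically follows. The function name rhel_release_from_kernel is hypothetical, and the sketch assumes the kernel release carries a tag of the form `elX_Y`; for a tag without a minor part (for example a plain `el8`) it prints nothing, and the release would have to be set by hand.

```shell
# Hypothetical helper: derive the RHEL minor release (e.g. "8.1") from a
# kernel release string such as "4.18.0-147.20.1.el8_1.x86_64".
# Prints an empty line when the string has no "elX_Y" dist tag.
rhel_release_from_kernel() {
    kernel="$1"
    # Capture X and Y from ".elX_Y." and join them as "X.Y".
    release=$(printf '%s\n' "$kernel" |
        sed -n 's/.*\.el\([0-9][0-9]*\)_\([0-9][0-9]*\)\..*/\1.\2/p')
    printf '%s\n' "$release"
}
```

On a node or in the driver container, the result could be passed along as `subscription-manager release --set="$(rhel_release_from_kernel "$(uname -r)")"`, after checking that the output is not empty.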

Comment 1 Zvonko Kosic 2020-07-23 16:30:37 UTC

This bug relates to https://github.com/openshift/enhancements/pull/357

Comment 6 Zvonko Kosic 2020-09-29 13:03:12 UTC

The NV GPU operator is not tied to an OpenShift release. It works for "any" release. It depends only on the kernel.
Setting this to 4.7 for tracking. Once the bug is resolved, we can install the GPU operator in 4.3, 4.4, 4.5, and so on.

Comment 8 Zvonko Kosic 2020-10-20 11:12:08 UTC

*** Bug 1862229 has been marked as a duplicate of this bug. ***

Comment 9 Zvonko Kosic 2020-11-05 12:04:28 UTC

The newest version of the GPU operator supports 4.4, 4.5 and 4.6. Please update to the latest z-stream of OpenShift 4.4, 4.5 or 4.6. Closing this, as we have verified on both sides (NV, RH) that the new GPU operator works.
