Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1867854

Summary: NVIDIA GPU Operator 1.1.7 v2 fails to install on OpenShift 4.3.1 from OperatorHub
Product: OpenShift Container Platform
Reporter: Diane Feddema <dfeddema>
Component: Special Resource Operator
Assignee: Zvonko Kosic <zkosic>
Status: CLOSED DUPLICATE
QA Contact: Walid A. <wabouham>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.3.0
CC: aos-bugs, carangog
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-10 01:43:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Openshift Console view of failed Nvidia GPU Operator installation from Operator Hub (flags: none)

Description Diane Feddema 2020-08-11 02:44:04 UTC
Created attachment 1711026 [details]
Openshift Console view of failed Nvidia GPU Operator installation from Operator Hub

Description of problem: The NVIDIA GPU Operator version 1.1.7 v2 fails to install from OperatorHub on a bare-metal OpenShift 4.3.1 cluster.
The installation fails with the error "Failed: Unknown" (see attached screenshot).

When the OperatorHub installation failed, I moved on to the next installation method (Helm) in the NVIDIA installation instructions here:
https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html
The Helm installation method worked correctly.

The output below shows that the Helm install (of NVIDIA GPU Operator 1.1.7 v2) succeeded on the same OCP 4.3.1 bare-metal cluster.

[core@r2bcsah diane]$ oc get pods
NAME                                       READY   STATUS              RESTARTS   AGE
nvidia-container-toolkit-daemonset-24t2h   1/1     Running             0          3m41s
nvidia-container-toolkit-daemonset-7qmzl   1/1     Running             0          3m41s
nvidia-dcgm-exporter-cvxjx                 1/1     Running             0          3m41s
nvidia-dcgm-exporter-dxz5v                 1/1     Running             0          3m41s
nvidia-device-plugin-daemonset-8zrkt       1/1     Running             0          3m41s
nvidia-device-plugin-daemonset-dt2jl       1/1     Running             0          3m41s
nvidia-device-plugin-validation            0/1     ContainerCreating   0          3m42s
nvidia-driver-daemonset-k5hhz              1/1     Running             0          3m42s
nvidia-driver-daemonset-vx6jg              1/1     Running             0          3m42s
nvidia-driver-validation                   0/1     Completed           2          3m42s
[core@r2bcsah diane]$ oc get logs nvidia-driver-validation
error: the server doesn't have a resource type "logs"
[core@r2bcsah diane]$ oc logs -f nvidia-driver-validation
> Using CUDA Device [0]: Tesla T4
> GPU Device has SM 7.5 compute capability
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
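
For completeness (not captured in the output above), one way to confirm that the Helm-installed device plugin is actually advertising GPUs to the scheduler is to check a node's allocatable resources; a minimal sketch, with <gpu-node> standing in for one of the GPU worker nodes:

# Look for the nvidia.com/gpu extended resource under Capacity/Allocatable
oc describe node <gpu-node> | grep -i 'nvidia.com/gpu'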


Version-Release number of selected component (if applicable): 1.1.7 v2


How reproducible:
100%

Steps to Reproduce:
1. Follow the NVIDIA instructions to install the NVIDIA GPU Operator from OperatorHub:
   https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html
   a. Create a new project named "gpu-operator-resources".
2. a. From the side menu, select Operators > OperatorHub, then search for the NVIDIA GPU Operator.
   b. Click Install.
3. The operator's upgrade status shows "Upgrading, 0 installed, 1 failed":
   !Failed: Unknown
   (A rough CLI equivalent of these console steps is sketched below.)
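
For reference, a rough CLI equivalent of the console steps above. This is only a sketch: the OperatorGroup name is my own choice, and the Subscription mirrors the one the console created (shown in full further down):

# 1. Create the target project
oc new-project gpu-operator-resources

# 2. Create an OperatorGroup so OLM can install operators into this namespace
#    (the name "gpu-operator-group" is arbitrary)
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: gpu-operator-group
  namespace: gpu-operator-resources
EOF

# 3. Subscribe to the certified operator (installPlanApproval defaults to Automatic)
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: gpu-operator-resources
spec:
  channel: stable
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: gpu-operator-certified.v1.1.7-r2
EOF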

Actual results:
!Failed: Unknown 
See attached image
_______________________________________________
YAML for the gpu-operator-certified Subscription (below) shows "state: UpgradePending" and
    - lastTransitionTime: '2020-08-10T23:05:42Z'
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: 'False'
      type: CatalogSourcesUnhealthy
______________________________________________
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: '2020-08-10T23:05:42Z'
  generation: 1
  name: gpu-operator-certified
  namespace: gpu-operator-resources
  resourceVersion: '76284146'
  selfLink: >-
    /apis/operators.coreos.com/v1alpha1/namespaces/gpu-operator-resources/subscriptions/gpu-operator-certified
  uid: b2444727-6229-4fa1-a8f0-bc0848d5b4f3
spec:
  channel: stable
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: gpu-operator-certified.v1.1.7-r2
status:
  catalogHealth:
    - catalogSourceRef:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        name: certified-operators
        namespace: openshift-marketplace
        resourceVersion: '76282021'
        uid: b20d8e59-ad1a-4246-a471-e637f25f8409
      healthy: true
      lastUpdated: '2020-08-10T23:05:42Z'
    - catalogSourceRef:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        name: community-operators
        namespace: openshift-marketplace
        resourceVersion: '76282022'
        uid: b1c00afe-006e-46ce-bb64-cbd002d91c12
      healthy: true
      lastUpdated: '2020-08-10T23:05:42Z'
    - catalogSourceRef:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        name: redhat-operators
        namespace: openshift-marketplace
        resourceVersion: '76282030'
        uid: 0f185a7e-7432-4c9c-bd9b-46814a2f570a
      healthy: true
      lastUpdated: '2020-08-10T23:05:42Z'
  conditions:
    - lastTransitionTime: '2020-08-10T23:05:42Z'
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: 'False'
      type: CatalogSourcesUnhealthy
    - lastTransitionTime: '2020-08-10T23:06:08Z'
      reason: InstallComponentFailed
      status: 'True'
      type: InstallPlanFailed
  currentCSV: gpu-operator-certified.v1.1.7-r2
  installPlanRef:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-lml4s
    namespace: gpu-operator-resources
    resourceVersion: '76284025'
    uid: b90f9ce7-f65d-4875-9ec5-008bc8abaad3
  installplan:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-lml4s
    uuid: b90f9ce7-f65d-4875-9ec5-008bc8abaad3
  lastUpdated: '2020-08-10T23:06:08Z'
  state: UpgradePending
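
The InstallPlanFailed condition above only gives the reason "InstallComponentFailed", so a natural follow-up is to look at the InstallPlan and CSV themselves; a sketch of the commands, using the object names from the status above:

# Inspect the failed InstallPlan referenced by the Subscription status
oc -n gpu-operator-resources get installplan install-lml4s -o yaml

# Check whether a ClusterServiceVersion was created and what phase/message it reports
oc -n gpu-operator-resources get csv
oc -n gpu-operator-resources describe csv gpu-operator-certified.v1.1.7-r2

# OLM's own logs sometimes contain the underlying error
oc -n openshift-operator-lifecycle-manager logs deploy/olm-operator
oc -n openshift-operator-lifecycle-manager logs deploy/catalog-operator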


Expected results:
I expected the NVIDIA GPU Operator to install from OperatorHub as documented here:
https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html


Additional info:
I attempted to install the NFD Operator from OperatorHub and could only get it to succeed by setting "Approval" to "Manual" and then clicking the "Install Plan" link to approve it manually.
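
For completeness, the same manual approval can be done from the CLI; a minimal sketch, with <namespace> and <installplan-name> as placeholders:

# Find the pending InstallPlan in the operator's namespace
oc -n <namespace> get installplan

# Approve it; setting spec.approved to true is what the console's "Approve" button does
oc -n <namespace> patch installplan <installplan-name> --type merge -p '{"spec":{"approved":true}}'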

Comment 1 Diane Feddema 2020-08-21 17:11:17 UTC
The nvidia-driver-daemonset pod log contained this error:
[core@r2bcsah ~]$ oc logs nvidia-driver-daemonset-ctr4h | grep -i error
Error: Unable to find a match: kernel-headers-4.18.0-147.20.1.el8_1.x86_64 kernel-devel-4.18.0-147.20.1.el8_1.x86_64

This is a known problem:
https://access.redhat.com/solutions/5232481
https://bugzilla.redhat.com/show_bug.cgi?id=1862229
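
For anyone hitting the same error, a quick way to check whether matching kernel-devel/kernel-headers builds are actually published for the node's kernel. This is only a sketch; repo availability depends on the cluster's RHEL 8 entitlement setup:

# Kernel version running on the GPU node
oc debug node/<gpu-node> -- chroot /host uname -r

# From a RHEL 8 host (or inside the driver container) with the matching repos enabled,
# check whether the exact kernel-devel/kernel-headers builds exist
dnf list --showduplicates kernel-devel kernel-headers | grep 4.18.0-147.20.1.el8_1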

Comment 2 Carlos Eduardo Arango Gutierrez 2020-09-10 01:40:54 UTC
Is this issue still ongoing, or does the solution provided at
https://access.redhat.com/solutions/5232481
address the problem?
Thanks

Comment 3 Carlos Eduardo Arango Gutierrez 2020-09-10 01:43:06 UTC

*** This bug has been marked as a duplicate of bug 1862229 ***

Comment 4 Diane Feddema 2020-09-15 02:50:20 UTC
(In reply to Carlos Eduardo Arango Gutierrez from comment #2)
> Is this issue still ongoing, or does the solution provided at
> https://access.redhat.com/solutions/5232481
> address the problem?
> Thanks

As far as I can tell, this issue has not yet been resolved by NVIDIA.