Bug 1867854 - NVIDIA GPU Operator 1.1.7 v2 fails to install on Openshift version 4.3.1 from Operator Hub
Summary: NVIDIA GPU Operator 1.1.7 v2 fails to install on Openshift version 4.3.1 from...
Keywords:
Status: CLOSED DUPLICATE of bug 1862229
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Special Resource Operator
Version: 4.3.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Zvonko Kosic
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-11 02:44 UTC by Diane Feddema
Modified: 2020-09-15 02:50 UTC
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-10 01:43:06 UTC
Target Upstream Version:
Embargoed:


Attachments
Openshift Console view of failed Nvidia GPU Operator installation from Operator Hub (228.73 KB, image/png)
2020-08-11 02:44 UTC, Diane Feddema

Description Diane Feddema 2020-08-11 02:44:04 UTC
Created attachment 1711026 [details]
Openshift Console view of failed Nvidia GPU Operator installation from Operator Hub


Description of problem: NVIDIA GPU Operator version 1.1.7 v2 fails to install from OperatorHub on a bare-metal OpenShift 4.3.1 cluster.
The install fails with the error "Failed: Unknown" (see attached screenshot).

After the OperatorHub installation failed, I moved on to the next installation method (the Helm installation) in the NVIDIA installation instructions here:
https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html
The Helm installation method worked correctly.
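
For reference, the Helm-based install was roughly the following (a minimal sketch, assuming the chart repository and flags from the NVIDIA guide linked above; the exact values used for 1.1.7 v2 are not recorded in this bug):

# Sketch of the Helm fallback described above. The repository URL and install
# flags are assumptions based on the NVIDIA guide, not values copied from this bug.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

# Install the GPU Operator chart into the namespace used in this report.
helm install --wait --generate-name nvidia/gpu-operator \
  --namespace gpu-operator-resources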

The output below shows that the Helm install method (for NVIDIA GPU Operator 1.1.7 v2) was successful on this bare-metal OCP 4.3.1 cluster.

[core@r2bcsah diane]$ oc get pods
NAME                                       READY   STATUS              RESTARTS   AGE
nvidia-container-toolkit-daemonset-24t2h   1/1     Running             0          3m41s
nvidia-container-toolkit-daemonset-7qmzl   1/1     Running             0          3m41s
nvidia-dcgm-exporter-cvxjx                 1/1     Running             0          3m41s
nvidia-dcgm-exporter-dxz5v                 1/1     Running             0          3m41s
nvidia-device-plugin-daemonset-8zrkt       1/1     Running             0          3m41s
nvidia-device-plugin-daemonset-dt2jl       1/1     Running             0          3m41s
nvidia-device-plugin-validation            0/1     ContainerCreating   0          3m42s
nvidia-driver-daemonset-k5hhz              1/1     Running             0          3m42s
nvidia-driver-daemonset-vx6jg              1/1     Running             0          3m42s
nvidia-driver-validation                   0/1     Completed           2          3m42s
[core@r2bcsah diane]$ oc get logs nvidia-driver-validation
error: the server doesn't have a resource type "logs"
[core@r2bcsah diane]$ oc logs -f nvidia-driver-validation
> Using CUDA Device [0]: Tesla T4
> GPU Device has SM 7.5 compute capability
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED


Version-Release number of selected component (if applicable): 1.1.7 v2


How reproducible:
100%

Steps to Reproduce:
1. Follow the NVIDIA instructions to install the NVIDIA GPU Operator from OperatorHub:
https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html
    a. Create a new project named "gpu-operator-resources".

2. Install the operator from the console (a CLI-equivalent sketch follows these steps):
    a. From the side menu, select Operators > OperatorHub, then search for the NVIDIA GPU Operator.
    b. Click Install.

3. The operator's upgrade status shows "Upgrading, 0 installed, 1 failed" with the error
!Failed: Unknown
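
For completeness, the console steps above correspond roughly to the following CLI sequence (a sketch: the OperatorGroup is an assumption based on standard OLM usage, while the Subscription mirrors the one captured under "Actual results" below):

# Create the project the NVIDIA guide asks for.
oc new-project gpu-operator-resources

# OLM needs an OperatorGroup in the namespace before it acts on the
# Subscription (this spec is an assumption, not copied from this bug).
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: gpu-operator-resources
  namespace: gpu-operator-resources
spec:
  targetNamespaces:
  - gpu-operator-resources
EOF

# Subscribe to the certified operator; this mirrors the Subscription shown
# further down in this report.
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: gpu-operator-resources
spec:
  channel: stable
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: gpu-operator-certified.v1.1.7-r2
EOF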

Actual results:
!Failed: Unknown 
See attached image
_______________________________________________
YAML for the gpu-operator-certified Subscription (below) shows "state: UpgradePending" and
    - lastTransitionTime: '2020-08-10T23:05:42Z'
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: 'False'
      type: CatalogSourcesUnhealthy
______________________________________________
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: '2020-08-10T23:05:42Z'
  generation: 1
  name: gpu-operator-certified
  namespace: gpu-operator-resources
  resourceVersion: '76284146'
  selfLink: >-
    /apis/operators.coreos.com/v1alpha1/namespaces/gpu-operator-resources/subscriptions/gpu-operator-certified
  uid: b2444727-6229-4fa1-a8f0-bc0848d5b4f3
spec:
  channel: stable
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: gpu-operator-certified.v1.1.7-r2
status:
  catalogHealth:
    - catalogSourceRef:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        name: certified-operators
        namespace: openshift-marketplace
        resourceVersion: '76282021'
        uid: b20d8e59-ad1a-4246-a471-e637f25f8409
      healthy: true
      lastUpdated: '2020-08-10T23:05:42Z'
    - catalogSourceRef:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        name: community-operators
        namespace: openshift-marketplace
        resourceVersion: '76282022'
        uid: b1c00afe-006e-46ce-bb64-cbd002d91c12
      healthy: true
      lastUpdated: '2020-08-10T23:05:42Z'
    - catalogSourceRef:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        name: redhat-operators
        namespace: openshift-marketplace
        resourceVersion: '76282030'
        uid: 0f185a7e-7432-4c9c-bd9b-46814a2f570a
      healthy: true
      lastUpdated: '2020-08-10T23:05:42Z'
  conditions:
    - lastTransitionTime: '2020-08-10T23:05:42Z'
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: 'False'
      type: CatalogSourcesUnhealthy
    - lastTransitionTime: '2020-08-10T23:06:08Z'
      reason: InstallComponentFailed
      status: 'True'
      type: InstallPlanFailed
  currentCSV: gpu-operator-certified.v1.1.7-r2
  installPlanRef:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-lml4s
    namespace: gpu-operator-resources
    resourceVersion: '76284025'
    uid: b90f9ce7-f65d-4875-9ec5-008bc8abaad3
  installplan:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-lml4s
    uuid: b90f9ce7-f65d-4875-9ec5-008bc8abaad3
  lastUpdated: '2020-08-10T23:06:08Z'
  state: UpgradePending
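
To see why the install itself failed, the referenced InstallPlan can be inspected directly (a sketch; install-lml4s is the InstallPlan name from the status above):

# Show the failed InstallPlan and its conditions.
oc get installplan -n gpu-operator-resources
oc describe installplan install-lml4s -n gpu-operator-resources

# Or dump the full object to read the failure message in status.conditions.
oc get installplan install-lml4s -n gpu-operator-resources -o yaml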


Expected results:
I expected the Nvidia GPU Operator to install via Operator Hub as documented here: 
https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html


Additional info:
I attempted to install the NFD Operator from Operator Hub and could only get it to succeed by setting "Approval" to "manual" and then clicking on the "Install Plan" to approve it manually.
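
The same manual approval can also be done from the CLI (a sketch, using the namespace from this report; replace <name> with whatever the first command lists):

# List pending InstallPlans in the namespace.
oc get installplan -n gpu-operator-resources

# Approve a specific InstallPlan.
oc patch installplan <name> -n gpu-operator-resources \
  --type merge --patch '{"spec":{"approved":true}}'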

Comment 1 Diane Feddema 2020-08-21 17:11:17 UTC
The nvidia-driver log contained this error:
[core@r2bcsah ~]$ oc logs nvidia-driver-daemonset-ctr4h | grep -i error
Error: Unable to find a match: kernel-headers-4.18.0-147.20.1.el8_1.x86_64 kernel-devel-4.18.0-147.20.1.el8_1.x86_64

This is a known problem:
https://access.redhat.com/solutions/5232481
https://bugzilla.redhat.com/show_bug.cgi?id=1862229
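
One way to confirm the kernel mismatch described in that solution is to compare the running node kernel with the packages the driver container failed to find (a sketch; node and pod names are placeholders):

# Kernel version actually running on a GPU node.
oc debug node/<gpu-node> -- chroot /host uname -r

# Packages the driver container could not resolve (matches the error above).
oc logs nvidia-driver-daemonset-<id> | grep -i "unable to find"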

Comment 2 Carlos Eduardo Arango Gutierrez 2020-09-10 01:40:54 UTC
Is this issue still ongoing, or does the solution provided on
https://access.redhat.com/solutions/5232481
address the problem?
Thanks

Comment 3 Carlos Eduardo Arango Gutierrez 2020-09-10 01:43:06 UTC

*** This bug has been marked as a duplicate of bug 1862229 ***

Comment 4 Diane Feddema 2020-09-15 02:50:20 UTC
(In reply to Carlos Eduardo Arango Gutierrez from comment #2)
> Is this issue still ongoing, or does the solution provided on
> https://access.redhat.com/solutions/5232481
> address the problem?
> Thanks

As far as I can tell this issue has not yet been resolved by Nvidia.

