Created attachment 1711026 [details]
Openshift Console view of failed Nvidia GPU Operator installation from Operator Hub

Description of problem:

Nvidia GPU Operator version 1.1.7 r2 fails to install from Operator Hub on a baremetal Openshift 4.3.1 cluster. The install fails with the error "Failed: Unknown" (see attached screenshot).

When the Operator Hub installation failed, I moved on to the next installation method (the helm installation) in the Nvidia installation instructions here:
https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html

The helm installation method worked correctly. The output below shows that the helm install method (for Nvidia GPU Operator 1.1.7 r2) was successful on this OCP 4.3.1 baremetal cluster.

[core@r2bcsah diane]$ oc get pods
NAME                                       READY   STATUS              RESTARTS   AGE
nvidia-container-toolkit-daemonset-24t2h   1/1     Running             0          3m41s
nvidia-container-toolkit-daemonset-7qmzl   1/1     Running             0          3m41s
nvidia-dcgm-exporter-cvxjx                 1/1     Running             0          3m41s
nvidia-dcgm-exporter-dxz5v                 1/1     Running             0          3m41s
nvidia-device-plugin-daemonset-8zrkt       1/1     Running             0          3m41s
nvidia-device-plugin-daemonset-dt2jl       1/1     Running             0          3m41s
nvidia-device-plugin-validation            0/1     ContainerCreating   0          3m42s
nvidia-driver-daemonset-k5hhz              1/1     Running             0          3m42s
nvidia-driver-daemonset-vx6jg              1/1     Running             0          3m42s
nvidia-driver-validation                   0/1     Completed           2          3m42s

[core@r2bcsah diane]$ oc get logs nvidia-driver-validation
error: the server doesn't have a resource type "logs"

[core@r2bcsah diane]$ oc logs -f nvidia-driver-validation
> Using CUDA Device [0]: Tesla T4
> GPU Device has SM 7.5 compute capability
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED

Version-Release number of selected component (if applicable):
1.1.7 r2

How reproducible:
100%

Steps to Reproduce:
1. Follow the Nvidia instructions to install the Nvidia GPU Operator from Operator Hub:
   https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html
   a. Create a new project with the name "gpu-operator-resources".
2. a. From the side menu, select Operators > OperatorHub, then search for the NVIDIA GPU Operator.
   b. Click Install.
3. The Upgrade status shows "Upgrading, 0 installed, 1 failed" with the error "Failed: Unknown".

Actual results:

"Failed: Unknown". See attached image.

_______________________________________________

Yaml for the gpu-operator-certified subscription (below) shows "state: UpgradePending" and:

- lastTransitionTime: '2020-08-10T23:05:42Z'
  message: all available catalogsources are healthy
  reason: AllCatalogSourcesHealthy
  status: 'False'
  type: CatalogSourcesUnhealthy

______________________________________________

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: '2020-08-10T23:05:42Z'
  generation: 1
  name: gpu-operator-certified
  namespace: gpu-operator-resources
  resourceVersion: '76284146'
  selfLink: >-
    /apis/operators.coreos.com/v1alpha1/namespaces/gpu-operator-resources/subscriptions/gpu-operator-certified
  uid: b2444727-6229-4fa1-a8f0-bc0848d5b4f3
spec:
  channel: stable
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: gpu-operator-certified.v1.1.7-r2
status:
  catalogHealth:
    - catalogSourceRef:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        name: certified-operators
        namespace: openshift-marketplace
        resourceVersion: '76282021'
        uid: b20d8e59-ad1a-4246-a471-e637f25f8409
      healthy: true
      lastUpdated: '2020-08-10T23:05:42Z'
    - catalogSourceRef:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        name: community-operators
        namespace: openshift-marketplace
        resourceVersion: '76282022'
        uid: b1c00afe-006e-46ce-bb64-cbd002d91c12
      healthy: true
      lastUpdated: '2020-08-10T23:05:42Z'
    - catalogSourceRef:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        name: redhat-operators
        namespace: openshift-marketplace
        resourceVersion: '76282030'
        uid: 0f185a7e-7432-4c9c-bd9b-46814a2f570a
      healthy: true
      lastUpdated: '2020-08-10T23:05:42Z'
  conditions:
    - lastTransitionTime: '2020-08-10T23:05:42Z'
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: 'False'
      type: CatalogSourcesUnhealthy
    - lastTransitionTime: '2020-08-10T23:06:08Z'
      reason: InstallComponentFailed
      status: 'True'
      type: InstallPlanFailed
  currentCSV: gpu-operator-certified.v1.1.7-r2
  installPlanRef:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-lml4s
    namespace: gpu-operator-resources
    resourceVersion: '76284025'
    uid: b90f9ce7-f65d-4875-9ec5-008bc8abaad3
  installplan:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-lml4s
    uuid: b90f9ce7-f65d-4875-9ec5-008bc8abaad3
  lastUpdated: '2020-08-10T23:06:08Z'
  state: UpgradePending

Expected results:

I expected the Nvidia GPU Operator to install via Operator Hub as documented here:
https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html

Additional info:

I attempted to install the NFD Operator from Operator Hub and could only get it to succeed by setting "Approval" to "manual" and then clicking on the "Install Plan" to approve it manually.
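For reference, the same inspect-and-approve workaround can be done from the CLI instead of the console. This is a minimal sketch using standard oc/OLM commands; the InstallPlan name (install-lml4s) is taken from the YAML above and will differ on another cluster:

# List InstallPlans in the namespace and check the APPROVED column
oc get installplan -n gpu-operator-resources

# Show why the plan failed (component status and events)
oc describe installplan install-lml4s -n gpu-operator-resources

# Approve a manually-gated InstallPlan
oc patch installplan install-lml4s -n gpu-operator-resources \
  --type merge --patch '{"spec":{"approved":true}}'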
The nvidia-driver log contained this error:

[core@r2bcsah ~]$ oc logs nvidia-driver-daemonset-ctr4h | grep -i error
Error: Unable to find a match: kernel-headers-4.18.0-147.20.1.el8_1.x86_64 kernel-devel-4.18.0-147.20.1.el8_1.x86_64

This is a known problem:
https://access.redhat.com/solutions/5232481
https://bugzilla.redhat.com/show_bug.cgi?id=1862229
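A quick way to confirm the mismatch (a rough sketch; <node-name> is a placeholder):

# Kernel version each node is actually running (KERNEL-VERSION column)
oc get nodes -o wide

# Or query a single node directly
oc debug node/<node-name> -- chroot /host uname -r

The driver daemonset tries to install kernel-devel/kernel-headers matching exactly that running kernel version, so the "Unable to find a match" error above appears to mean the entitled RHEL 8 repos did not yet carry packages for that kernel NVR.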
Is this issue still ongoing, or does the solution provided at
https://access.redhat.com/solutions/5232481 address the problem?
Thanks
*** This bug has been marked as a duplicate of bug 1862229 ***
(In reply to Carlos Eduardo Arango Gutierrez from comment #2)
> Is this issue still ongoing, or does the solution provided at
> https://access.redhat.com/solutions/5232481 address the problem?
> Thanks

As far as I can tell, this issue has not yet been resolved by Nvidia.