Created attachment 1688185 [details]
oc logs output for the NVIDIA GPU driver container

Description of problem:
The GPU driver container pod hits an error; it appears to be unable to find the elfutils-libelf-devel.x86_64 package.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
100%

Steps to Reproduce:
1. Deploy an IPI cluster on RHCOS
2. Deploy SRO (Special Resource Operator) from GitHub master

Actual results:

Events:
  Type     Reason          Age                    From                                                 Message
  ----     ------          ----                   ----                                                 -------
  Normal   Scheduled       7m1s                   default-scheduler                                    Successfully assigned nvidia-gpu/nvidia-gpu-driver-container-rhel8-tzbr9 to ip-10-0-148-114.us-east-2.compute.internal
  Normal   AddedInterface  7m                     multus                                               Add eth0 [10.129.4.22/23]
  Normal   Started         6m8s (x4 over 6m59s)   kubelet, ip-10-0-148-114.us-east-2.compute.internal  Started container nvidia-gpu-driver-container-rhel8
  Normal   Pulling         5m18s (x5 over 6m59s)  kubelet, ip-10-0-148-114.us-east-2.compute.internal  Pulling image "image-registry.openshift-image-registry.svc:5000/nvidia-gpu/nvidia-gpu-driver-container:v4.18.0-147.8.1.el8_1.x86_64"
  Normal   Pulled          5m18s (x5 over 6m59s)  kubelet, ip-10-0-148-114.us-east-2.compute.internal  Successfully pulled image "image-registry.openshift-image-registry.svc:5000/nvidia-gpu/nvidia-gpu-driver-container:v4.18.0-147.8.1.el8_1.x86_64"
  Normal   Created         5m18s (x5 over 6m59s)  kubelet, ip-10-0-148-114.us-east-2.compute.internal  Created container nvidia-gpu-driver-container-rhel8
  Warning  BackOff         111s (x23 over 6m54s)  kubelet, ip-10-0-148-114.us-east-2.compute.internal  Back-off restarting failed container

$ oc get pods
NAME                                         READY   STATUS             RESTARTS   AGE
nvidia-gpu-driver-build-1-build              0/1     Completed          0          14m
nvidia-gpu-driver-container-rhel8-tzbr9      0/1     CrashLoopBackOff   6          6m43s
special-resource-operator-76b658c584-lxzwr   1/1     Running            0          14m

Expected results:
Container running successfully

Additional info:
Running "oc logs nvidia-gpu-driver-container-rhel8-tzbr9" shows the following error message:

  Error: Unable to find a match: elfutils-libelf-devel.x86_64
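For what it's worth, "Unable to find a match" from dnf inside the driver container usually means the RHEL 8 repositories were not reachable, which on OCP 4.x typically points to missing entitlement certificates on the node. A minimal check, assuming the standard /etc/pki/entitlement path used by entitled builds (node name taken from the events above):

$ oc debug node/ip-10-0-148-114.us-east-2.compute.internal -- \
    chroot /host ls -l /etc/pki/entitlement
# If this directory is missing or empty, the node carries no entitlement
# certificates, and dnf inside the driver container cannot resolve
# RHEL packages such as elfutils-libelf-devel.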
The cluster is not entitled. Please entitle the cluster and try again; this is not a bug.
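For reference, one way to entitle the worker nodes is the entitled-builds approach documented for OCP 4.x: obtain a valid RHEL entitlement certificate pair from a subscribed system or access.redhat.com and deliver it to the nodes via a MachineConfig. The sketch below assumes an entitlement.pem in the current directory; the resource name (50-entitlement-pem) follows the upstream cluster-wide-machineconfigs template for this release, and a matching MachineConfig is also needed for the key file (entitlement-key.pem).

# Sketch, assuming the Ignition 2.2.0 MachineConfig layout used by OCP 4.5.
$ cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-entitlement-pem
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,$(base64 -w0 entitlement.pem)
        filesystem: root
        mode: 0644
        path: /etc/pki/entitlement/entitlement.pem
EOF

After the worker MachineConfigPool finishes rolling out, a quick smoke test is to run a UBI 8 pod and try the same package the driver container failed on (depending on the SCC in the namespace, the pod may need to run as root):

$ oc run entitlement-test --rm -it --restart=Never \
    --image=registry.access.redhat.com/ubi8/ubi:latest \
    -- dnf install -y --downloadonly elfutils-libelf-devel
# On an entitled node this resolves from the RHEL repos; on an
# unentitled node it reproduces "Unable to find a match".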