Description of problem:
On an OCP 4.5 IPI-installed cluster on AWS, after setting up cluster-wide entitlement, the Special Resource Operator (SRO) deployed from GitHub fails to come up successfully, with the nvidia-gpu-device-plugin pod stuck in Init:CrashLoopBackOff.

Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.5.0-0.nightly-2020-06-01-081609
Server Version: 4.5.0-0.nightly-2020-06-01-165039
Kubernetes Version: v1.18.3+a637491

How reproducible:
Every time

Steps to Reproduce:
1. IPI install of OCP 4.5 on AWS.
2. Set up cluster-wide entitlement according to the following doc, especially the section on cluster-wide entitlements (see the MachineConfig sketch below):
   https://www.openshift.com/blog/how-to-use-entitled-image-builds-to-build-drivercontainers-with-ubi-on-openshift
3. Deploy the Node Feature Discovery (NFD 4.5) Operator from OperatorHub.
4. Create a new machineset to add a GPU node (g3.4xlarge); see the machineset sketch below.
5. Deploy the SRO operator:
   cd $GOPATH/src/github.com/openshift-psap
   git clone https://github.com/openshift-psap/special-resource-operator.git
   cd special-resource-operator
   PULLPOLICY=Always make deploy

# oc describe pod nvidia-gpu-device-plugin-2q2nh -n nvidia-gpu
Name:         nvidia-gpu-device-plugin-2q2nh
Namespace:    nvidia-gpu
Priority:     0
Node:         ip-10-0-148-145.us-west-2.compute.internal/10.0.148.145
Start Time:   Tue, 16 Jun 2020 20:51:42 +0000
Labels:       app=nvidia-gpu-device-plugin
              controller-revision-hash=6f66d85f7b
              pod-template-generation=1
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.2.29"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.2.29"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: privileged
              scheduler.alpha.kubernetes.io/critical-pod:
Status:       Pending
IP:           10.131.2.29
IPs:
  IP:           10.131.2.29
Controlled By:  DaemonSet/nvidia-gpu-device-plugin
Init Containers:
  specialresource-runtime-validation-nvidia-gpu:
    Container ID:  cri-o://9f527b6ba3868f6f0cda3b06b579c1da7dfb94c31117cf45145ea292df65c716
    Image:         nvidia/samples:cuda10.2-vectorAdd
    Image ID:      docker.io/nvidia/samples@sha256:19f202eecbfa5211be8ce0f0dc02deb44a25a7e3584c993b22b1654f3569ffaf
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/entrypoint.sh
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Jun 2020 05:36:19 +0000
      Finished:     Wed, 17 Jun 2020 05:36:19 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Jun 2020 05:31:07 +0000
      Finished:     Wed, 17 Jun 2020 05:31:07 +0000
    Ready:          False
    Restart Count:  107
    Environment:    <none>
    Mounts:
      /bin/entrypoint.sh from init-entrypoint (ro,path="entrypoint.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-gpu-device-plugin-token-gcxlz (ro)
Containers:
  nvidia-gpu-device-plugin-ctr:
    Container ID:
    Image:          nvidia/k8s-device-plugin:1.11
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin-nvidia-gpu (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-gpu-device-plugin-token-gcxlz (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin-nvidia-gpu:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  init-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      specialresource-driver-validation-entrypoint-nvidia-gpu
    Optional:  false
  nvidia-gpu-device-plugin-token-gcxlz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nvidia-gpu-device-plugin-token-gcxlz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  feature.node.kubernetes.io/pci-10de.present=true
                 node-role.kubernetes.io/worker=
Tolerations:     CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason   Age                    From                                                 Message
  ----     ------   ----                   ----                                                 -------
  Normal   Pulling  164m (x76 over 8h)     kubelet, ip-10-0-148-145.us-west-2.compute.internal  Pulling image "nvidia/samples:cuda10.2-vectorAdd"
  Warning  BackOff  4m37s (x2382 over 8h)  kubelet, ip-10-0-148-145.us-west-2.compute.internal  Back-off restarting failed container

Actual results:
# oc get pods -n nvidia-gpu
NAME                                         READY   STATUS                  RESTARTS   AGE
nvidia-gpu-device-plugin-2q2nh               0/1     Init:CrashLoopBackOff   6          8m44s
nvidia-gpu-driver-build-1-build              0/1     Completed               0          16m
nvidia-gpu-driver-container-rhel8-gd4f6      1/1     Running                 0          14m
nvidia-gpu-runtime-enablement-rzddc          1/1     Running                 0          12m
special-resource-operator-76b658c584-q25fs   1/1     Running                 0          17m

Expected results:
The NVIDIA GPU driver stack deploys successfully, with the nvidia-gpu-device-plugin and nvidia-gpu-device-plugin-validation pods Running.

Additional info:
Links to events, pod logs, operator logs, and build logs from oc commands are provided in a subsequent comment.
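For step 2, the cluster-wide entitlement described in the linked blog post amounts to MachineConfigs that place a valid RHEL entitlement certificate on every worker node. A minimal sketch is below; the object name (50-entitlement-pem) and the base64 placeholder are illustrative, and the blog additionally creates a matching entitlement-key.pem MachineConfig and one for /etc/rhsm/rhsm.conf:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-entitlement-pem                 # illustrative name
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          # base64 of a valid entitlement certificate exported from an entitled RHEL system
          source: data:text/plain;charset=utf-8;base64,<BASE64_ENCODED_PEM_FILE>
        filesystem: root
        mode: 0644
        path: /etc/pki/entitlement/entitlement.pem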
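For step 4, one way to add the g3.4xlarge node is to copy an existing worker machineset and change the instance type; the file name and edited fields here are only a sketch:

oc get machineset -n openshift-machine-api
oc get machineset <existing-worker-machineset> -n openshift-machine-api -o yaml > gpu-machineset.yaml
# edit gpu-machineset.yaml: pick a new metadata.name (and matching selector/labels), set spec.replicas to 1,
# and set spec.template.spec.providerSpec.value.instanceType to g3.4xlarge
oc create -f gpu-machineset.yaml
oc get machines -n openshift-machine-api -w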
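For triage, the failure is in the runtime-validation init container, so its output and the generated entrypoint script are the most useful artifacts. A minimal set of commands, using the pod, container, and ConfigMap names from the output above:

oc logs nvidia-gpu-device-plugin-2q2nh -c specialresource-runtime-validation-nvidia-gpu -n nvidia-gpu
oc get configmap specialresource-driver-validation-entrypoint-nvidia-gpu -n nvidia-gpu -o yaml
oc logs nvidia-gpu-driver-container-rhel8-gd4f6 -n nvidia-gpu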
Now that the hooks dir is fixed and verified, we need to see why the NVIDIA containers are failing. The drivers work on G4 and M60 GPUs, so the drivers are not the issue.
Yes, this is fixed in SRO. We are in sync with upstream.
Walid, please use the master branch, not simple-kmod-v2; that one is old.
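In case it helps, switching the existing checkout from the reproduction steps to master before redeploying would look roughly like this:

cd $GOPATH/src/github.com/openshift-psap/special-resource-operator
git checkout master
git pull
PULLPOLICY=Always make deploy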
Do not use SRO to deploy the NVIDIA stack; please use the official NVIDIA GPU Operator.
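For reference, the official NVIDIA GPU Operator can be installed from OperatorHub (it depends on NFD, which is already deployed here). A rough OLM Subscription sketch is below; the namespace, channel, package name, and catalog source are assumptions and should be verified against the OperatorHub entry on your cluster:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified            # assumed package name; verify in OperatorHub
  namespace: openshift-operators          # assumed target namespace
spec:
  channel: stable                         # assumed channel; verify in OperatorHub
  name: gpu-operator-certified
  source: certified-operators             # assumed catalog source
  sourceNamespace: openshift-marketplace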