Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1847805

Summary: OCP 4.5: Failed to deploy SRO on entitled cluster - nvidia-gpu-device-plugin pod in Init:CrashLoopBackOff
Product: OpenShift Container Platform
Reporter: Walid A. <wabouham>
Component: Special Resource Operator
Assignee: Zvonko Kosic <zkosic>
Status: CLOSED DEFERRED
QA Contact: Walid A. <wabouham>
Severity: high
Docs Contact:
Priority: urgent
Version: 4.5
CC: akamra, anosek, aos-bugs, apjagtap, btomlins, carangog, dagray, ddharwar, jniu, lserot, mifiedle
Target Milestone: ---
Keywords: Reopened, TestBlocker
Target Release: 4.5.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-09 12:02:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1854200
Bug Blocks:

Description Walid A. 2020-06-17 05:53:56 UTC
Description of problem:
On an OCP 4.5 IPI-installed cluster on AWS, after setting up cluster-wide entitlement, the Special Resource Operator (SRO) failed to deploy successfully from GitHub; the nvidia-gpu-device-plugin pod is stuck in Init:CrashLoopBackOff.

Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.5.0-0.nightly-2020-06-01-081609
Server Version: 4.5.0-0.nightly-2020-06-01-165039
Kubernetes Version: v1.18.3+a637491


How reproducible:
Every time

Steps to Reproduce:
1.  IPI install of OCP 4.5 on AWS
2.  Set up cluster-wide entitlement according to https://www.openshift.com/blog/how-to-use-entitled-image-builds-to-build-drivercontainers-with-ubi-on-openshift, especially the section on cluster-wide entitlements
3.  Deploy Node Feature Discovery (NFD 4.5) Operator from OperatorHub
4.  Create a new machineset to add a GPU node (g3.4xlarge)
5.  Deploy SRO operator:
      cd $GOPATH/src/github.com/openshift-psap
      git clone https://github.com/openshift-psap/special-resource-operator.git
      cd special-resource-operator
      PULLPOLICY=Always make deploy
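
Before step 5, it can help to confirm the cluster-wide entitlement from step 2 actually works. The following is a hedged sketch, not part of the original report: per the linked blog post, once the entitlement MachineConfig is applied, an ordinary UBI 8 pod should be able to resolve entitled packages such as kernel-devel (the pod name entitlement-check is illustrative).

```shell
# Hypothetical entitlement sanity check (names are illustrative).
# An entitled UBI 8 pod should find kernel-devel in the RHEL repos;
# "No matches found" in the logs means entitlement is not working.
oc run entitlement-check --restart=Never \
  --image=registry.access.redhat.com/ubi8:latest \
  -- dnf search kernel-devel
oc logs -f pod/entitlement-check
oc delete pod entitlement-check
```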

# oc describe pod nvidia-gpu-device-plugin-2q2nh -n nvidia-gpu
Name:         nvidia-gpu-device-plugin-2q2nh
Namespace:    nvidia-gpu
Priority:     0
Node:         ip-10-0-148-145.us-west-2.compute.internal/10.0.148.145
Start Time:   Tue, 16 Jun 2020 20:51:42 +0000
Labels:       app=nvidia-gpu-device-plugin
              controller-revision-hash=6f66d85f7b
              pod-template-generation=1
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.2.29"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.2.29"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: privileged
              scheduler.alpha.kubernetes.io/critical-pod: 
Status:       Pending
IP:           10.131.2.29
IPs:
  IP:           10.131.2.29
Controlled By:  DaemonSet/nvidia-gpu-device-plugin
Init Containers:
  specialresource-runtime-validation-nvidia-gpu:
    Container ID:  cri-o://9f527b6ba3868f6f0cda3b06b579c1da7dfb94c31117cf45145ea292df65c716
    Image:         nvidia/samples:cuda10.2-vectorAdd
    Image ID:      docker.io/nvidia/samples@sha256:19f202eecbfa5211be8ce0f0dc02deb44a25a7e3584c993b22b1654f3569ffaf
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/entrypoint.sh
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Jun 2020 05:36:19 +0000
      Finished:     Wed, 17 Jun 2020 05:36:19 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Jun 2020 05:31:07 +0000
      Finished:     Wed, 17 Jun 2020 05:31:07 +0000
    Ready:          False
    Restart Count:  107
    Environment:    <none>
    Mounts:
      /bin/entrypoint.sh from init-entrypoint (ro,path="entrypoint.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-gpu-device-plugin-token-gcxlz (ro)
Containers:
  nvidia-gpu-device-plugin-ctr:
    Container ID:   
    Image:          nvidia/k8s-device-plugin:1.11
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin-nvidia-gpu (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-gpu-device-plugin-token-gcxlz (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  device-plugin-nvidia-gpu:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  init-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      specialresource-driver-validation-entrypoint-nvidia-gpu
    Optional:  false
  nvidia-gpu-device-plugin-token-gcxlz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nvidia-gpu-device-plugin-token-gcxlz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  feature.node.kubernetes.io/pci-10de.present=true
                 node-role.kubernetes.io/worker=
Tolerations:     
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason   Age                    From                                                 Message
  ----     ------   ----                   ----                                                 -------
  Normal   Pulling  164m (x76 over 8h)     kubelet, ip-10-0-148-145.us-west-2.compute.internal  Pulling image "nvidia/samples:cuda10.2-vectorAdd"
  Warning  BackOff  4m37s (x2382 over 8h)  kubelet, ip-10-0-148-145.us-west-2.compute.internal  Back-off restarting failed container


Actual results:
# oc get pods -n nvidia-gpu
NAME                                         READY   STATUS                  RESTARTS   AGE
nvidia-gpu-device-plugin-2q2nh               0/1     Init:CrashLoopBackOff   6          8m44s
nvidia-gpu-driver-build-1-build              0/1     Completed               0          16m
nvidia-gpu-driver-container-rhel8-gd4f6      1/1     Running                 0          14m
nvidia-gpu-runtime-enablement-rzddc          1/1     Running                 0          12m
special-resource-operator-76b658c584-q25fs   1/1     Running                 0          17m


Expected results:
The NVIDIA GPU driver stack deploys successfully, with the nvidia-gpu-device-plugin and nvidia-gpu-device-plugin-validation pods Running.


Additional info:
Link to events, pod logs, operator logs, and build logs from oc commands is provided in subsequent comment.
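
As a hedged sketch (these commands are not from the report, but use only names shown in the pod description above), the failing init container and its entrypoint can be inspected directly. The init container specialresource-runtime-validation-nvidia-gpu runs /bin/entrypoint.sh from the ConfigMap specialresource-driver-validation-entrypoint-nvidia-gpu, so its previous-run logs and the ConfigMap contents show what the validation actually executes:

```shell
# Pod name from the report; substitute the current one on a live cluster.
POD=nvidia-gpu-device-plugin-2q2nh
# Logs from the last failed run of the validation init container:
oc logs -n nvidia-gpu "$POD" \
  -c specialresource-runtime-validation-nvidia-gpu --previous
# The entrypoint script the init container runs comes from this ConfigMap:
oc get configmap -n nvidia-gpu \
  specialresource-driver-validation-entrypoint-nvidia-gpu -o yaml
```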

Comment 4 Zvonko Kosic 2020-07-08 13:44:19 UTC
Now that the hooks dir is fixed and verified, we need to see why the NVIDIA containers are failing.
The drivers work on G4 and M60 GPUs, so the drivers are not the issue.

Comment 7 Zvonko Kosic 2020-11-05 13:52:38 UTC
Yes this is fixed in SRO. We are in sync with upstream.

Comment 9 Zvonko Kosic 2021-01-18 12:07:14 UTC
Walid, please use the master branch, not simple-kmod-v2; that one is old.

Comment 10 Zvonko Kosic 2021-02-09 12:02:29 UTC
Do not use SRO to deploy the NVIDIA stack; please use the official NVIDIA GPU Operator.