Bug 1847805 - OCP 4.5: Failed to deploy SRO on entitled cluster - nvidia-gpu-device-plugin pod in Init:CrashLoopBackOff
Summary: OCP 4.5: Failed to deploy SRO on entitled cluster - nvidia-gpu-device-plugin...
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Special Resource Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Zvonko Kosic
QA Contact: Walid A.
URL:
Whiteboard:
Depends On: 1854200
Blocks:
 
Reported: 2020-06-17 05:53 UTC by Walid A.
Modified: 2023-12-15 18:10 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-09 12:02:29 UTC
Target Upstream Version:
Embargoed:



Description Walid A. 2020-06-17 05:53:56 UTC
Description of problem:
On an OCP 4.5 IPI-installed cluster on AWS, after setting up the cluster-wide entitlement, the Special Resource Operator (SRO) failed to deploy successfully from GitHub, with the nvidia-gpu-device-plugin pod stuck in Init:CrashLoopBackOff.

Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.5.0-0.nightly-2020-06-01-081609
Server Version: 4.5.0-0.nightly-2020-06-01-165039
Kubernetes Version: v1.18.3+a637491


How reproducible:
Every time

Steps to Reproduce:
1.  IPI install of OCP 4.5 on AWS
2.  Set up the cluster-wide entitlement according to the following doc, especially the section on cluster-wide entitlements: https://www.openshift.com/blog/how-to-use-entitled-image-builds-to-build-drivercontainers-with-ubi-on-openshift (a sketch for verifying the entitlement follows these steps)
3.  Deploy Node Feature Discovery (NFD 4.5) Operator from OperatorHub
4.  Create a new machineset to add a GPU node (g3.4xlarge)
5.  Deploy SRO operator:
      cd $GOPATH/src/github.com/openshift-psap
      git clone https://github.com/openshift-psap/special-resource-operator.git
      cd special-resource-operator
      PULLPOLICY=Always make deploy
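
A minimal sketch (not part of the original steps) for checking that the cluster-wide entitlement actually works before deploying SRO; the pod name, the UBI 8 image, and the dnf check are illustrative assumptions based on the blog post referenced in step 2:

# Illustrative check: on an entitled cluster, a plain UBI 8 pod should see Red Hat
# repositories through the entitlement certificates mounted into containers.
oc run entitlement-test --rm -it --restart=Never \
  --image=registry.access.redhat.com/ubi8/ubi:latest \
  -- dnf repolist --enabled
# An empty repository list usually means the entitlement certificates were not
# propagated to the worker nodes.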

# oc describe pod nvidia-gpu-device-plugin-2q2nh -n nvidia-gpu
Name:         nvidia-gpu-device-plugin-2q2nh
Namespace:    nvidia-gpu
Priority:     0
Node:         ip-10-0-148-145.us-west-2.compute.internal/10.0.148.145
Start Time:   Tue, 16 Jun 2020 20:51:42 +0000
Labels:       app=nvidia-gpu-device-plugin
              controller-revision-hash=6f66d85f7b
              pod-template-generation=1
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.2.29"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.2.29"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: privileged
              scheduler.alpha.kubernetes.io/critical-pod: 
Status:       Pending
IP:           10.131.2.29
IPs:
  IP:           10.131.2.29
Controlled By:  DaemonSet/nvidia-gpu-device-plugin
Init Containers:
  specialresource-runtime-validation-nvidia-gpu:
    Container ID:  cri-o://9f527b6ba3868f6f0cda3b06b579c1da7dfb94c31117cf45145ea292df65c716
    Image:         nvidia/samples:cuda10.2-vectorAdd
    Image ID:      docker.io/nvidia/samples@sha256:19f202eecbfa5211be8ce0f0dc02deb44a25a7e3584c993b22b1654f3569ffaf
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/entrypoint.sh
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Jun 2020 05:36:19 +0000
      Finished:     Wed, 17 Jun 2020 05:36:19 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Jun 2020 05:31:07 +0000
      Finished:     Wed, 17 Jun 2020 05:31:07 +0000
    Ready:          False
    Restart Count:  107
    Environment:    <none>
    Mounts:
      /bin/entrypoint.sh from init-entrypoint (ro,path="entrypoint.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-gpu-device-plugin-token-gcxlz (ro)
Containers:
  nvidia-gpu-device-plugin-ctr:
    Container ID:   
    Image:          nvidia/k8s-device-plugin:1.11
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin-nvidia-gpu (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-gpu-device-plugin-token-gcxlz (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  device-plugin-nvidia-gpu:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  init-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      specialresource-driver-validation-entrypoint-nvidia-gpu
    Optional:  false
  nvidia-gpu-device-plugin-token-gcxlz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nvidia-gpu-device-plugin-token-gcxlz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  feature.node.kubernetes.io/pci-10de.present=true
                 node-role.kubernetes.io/worker=
Tolerations:     
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason   Age                    From                                                 Message
  ----     ------   ----                   ----                                                 -------
  Normal   Pulling  164m (x76 over 8h)     kubelet, ip-10-0-148-145.us-west-2.compute.internal  Pulling image "nvidia/samples:cuda10.2-vectorAdd"
  Warning  BackOff  4m37s (x2382 over 8h)  kubelet, ip-10-0-148-145.us-west-2.compute.internal  Back-off restarting failed container


Actual results:
# oc get pods -n nvidia-gpu
NAME                                         READY   STATUS                  RESTARTS   AGE
nvidia-gpu-device-plugin-2q2nh               0/1     Init:CrashLoopBackOff   6          8m44s
nvidia-gpu-driver-build-1-build              0/1     Completed               0          16m
nvidia-gpu-driver-container-rhel8-gd4f6      1/1     Running                 0          14m
nvidia-gpu-runtime-enablement-rzddc          1/1     Running                 0          12m
special-resource-operator-76b658c584-q25fs   1/1     Running                 0          17m


Expected results:
The NVIDIA GPU driver stack deploys successfully, with the nvidia-gpu-device-plugin and nvidia-gpu-device-plugin-validation pods Running.


Additional info:
A link to events, pod logs, operator logs, and build logs gathered with oc commands is provided in a subsequent comment.
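
For reference, a short sketch of oc commands that could be used to collect that output; the pod, container, and ConfigMap names are taken from the oc describe output above, and --previous simply retrieves the last failed run:

# Logs from the last failed run of the runtime-validation init container:
oc logs nvidia-gpu-device-plugin-2q2nh -n nvidia-gpu \
  -c specialresource-runtime-validation-nvidia-gpu --previous
# The validation entrypoint script the init container executes (mounted at /bin/entrypoint.sh):
oc get configmap specialresource-driver-validation-entrypoint-nvidia-gpu -n nvidia-gpu -o yaml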

Comment 4 Zvonko Kosic 2020-07-08 13:44:19 UTC
Now that the hooks dir is fixed and verified, we need to see why the NVIDIA containers are failing.
The drivers work on G4 and M60 GPUs, so the drivers themselves are not the issue.

Comment 7 Zvonko Kosic 2020-11-05 13:52:38 UTC
Yes, this is fixed in SRO. We are in sync with upstream.

Comment 9 Zvonko Kosic 2021-01-18 12:07:14 UTC
Walid, please use the master branch, not simple-kmod-v2; that one is old.
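
A minimal sketch of what that looks like, reusing the clone location from the steps to reproduce above (only the branch switch is new here):

cd $GOPATH/src/github.com/openshift-psap/special-resource-operator
git checkout master
git pull origin master
PULLPOLICY=Always make deploy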

Comment 10 Zvonko Kosic 2021-02-09 12:02:29 UTC
Do not use SRO to deploy the NVIDIA stack; please use the official NVIDIA GPU Operator instead.
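
For anyone landing here later, a hedged sketch of subscribing to the GPU Operator through OLM; the namespace, package name, and channel are assumptions based on the certified operator catalog, not details stated in this bug:

oc create namespace nvidia-gpu-operator
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: stable                # assumed channel; check the catalog for the current one
  name: gpu-operator-certified   # assumed package name in the certified-operators catalog
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF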

