Bug 1847805 - OCP 4.5: Failed to deploy SRO on entitled cluster - nvidia-gpu-device-plugin pod in Init:CrashLoopBackOff
Summary: OCP 4.5: Failed to deploy SRO on entitled cluster - nvidia-gpu-device-plugin pod in Init:CrashLoopBackOff
Keywords:
Status: ASSIGNED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Special Resources Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Zvonko Kosic
QA Contact: Walid A.
URL:
Whiteboard:
Depends On: 1854200
Blocks:
 
Reported: 2020-06-17 05:53 UTC by Walid A.
Modified: 2020-08-25 22:00 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:



Description Walid A. 2020-06-17 05:53:56 UTC
Description of problem:
On an OCP 4.5 IPI-installed cluster on AWS, after setting up cluster-wide entitlement, the Special Resources Operator (SRO) failed to deploy successfully from GitHub: the nvidia-gpu-device-plugin pod is stuck in Init:CrashLoopBackOff.

Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.5.0-0.nightly-2020-06-01-081609
Server Version: 4.5.0-0.nightly-2020-06-01-165039
Kubernetes Version: v1.18.3+a637491


How reproducible:
Every time

Steps to Reproduce:
1.  IPI install of OCP 4.5 on AWS
2.  Set up cluster-wide entitlement according to https://www.openshift.com/blog/how-to-use-entitled-image-builds-to-build-drivercontainers-with-ubi-on-openshift, especially the section on cluster-wide entitlements (a quick verification sketch follows these steps)
3.  Deploy Node Feature Discovery (NFD 4.5) Operator from OperatorHub
4.  Create a new machineset to add a GPU node (g3.4xlarge)
5.  Deploy SRO operator:
      cd $GOPATH/src/github.com/openshift-psap
      git clone https://github.com/openshift-psap/special-resource-operator.git
      cd special-resource-operator
      PULLPOLICY=Always make deploy
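
Before deploying SRO (step 5), it may help to confirm the prerequisites took effect. This is a rough sketch, not part of the original reproduce steps; the "entitlement-check" pod name and UBI image tag below are illustrative assumptions, and the blog post linked in step 2 remains the authoritative entitlement procedure.

Check that NFD has labeled the GPU node (this is the label the device-plugin DaemonSet selects on, per the pod spec below):
# oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true

Entitlement smoke test: a UBI8 pod on an entitled cluster should be able to find kernel-devel in the RHEL repos, while an unentitled one only sees the UBI repos:
# oc run entitlement-check --rm -it --restart=Never --image=registry.access.redhat.com/ubi8:latest -- dnf search kernel-devel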

# oc describe pod nvidia-gpu-device-plugin-2q2nh -n nvidia-gpu
Name:         nvidia-gpu-device-plugin-2q2nh
Namespace:    nvidia-gpu
Priority:     0
Node:         ip-10-0-148-145.us-west-2.compute.internal/10.0.148.145
Start Time:   Tue, 16 Jun 2020 20:51:42 +0000
Labels:       app=nvidia-gpu-device-plugin
              controller-revision-hash=6f66d85f7b
              pod-template-generation=1
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.2.29"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.2.29"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: privileged
              scheduler.alpha.kubernetes.io/critical-pod: 
Status:       Pending
IP:           10.131.2.29
IPs:
  IP:           10.131.2.29
Controlled By:  DaemonSet/nvidia-gpu-device-plugin
Init Containers:
  specialresource-runtime-validation-nvidia-gpu:
    Container ID:  cri-o://9f527b6ba3868f6f0cda3b06b579c1da7dfb94c31117cf45145ea292df65c716
    Image:         nvidia/samples:cuda10.2-vectorAdd
    Image ID:      docker.io/nvidia/samples@sha256:19f202eecbfa5211be8ce0f0dc02deb44a25a7e3584c993b22b1654f3569ffaf
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/entrypoint.sh
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Jun 2020 05:36:19 +0000
      Finished:     Wed, 17 Jun 2020 05:36:19 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 17 Jun 2020 05:31:07 +0000
      Finished:     Wed, 17 Jun 2020 05:31:07 +0000
    Ready:          False
    Restart Count:  107
    Environment:    <none>
    Mounts:
      /bin/entrypoint.sh from init-entrypoint (ro,path="entrypoint.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-gpu-device-plugin-token-gcxlz (ro)
Containers:
  nvidia-gpu-device-plugin-ctr:
    Container ID:   
    Image:          nvidia/k8s-device-plugin:1.11
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin-nvidia-gpu (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-gpu-device-plugin-token-gcxlz (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  device-plugin-nvidia-gpu:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  init-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      specialresource-driver-validation-entrypoint-nvidia-gpu
    Optional:  false
  nvidia-gpu-device-plugin-token-gcxlz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nvidia-gpu-device-plugin-token-gcxlz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  feature.node.kubernetes.io/pci-10de.present=true
                 node-role.kubernetes.io/worker=
Tolerations:     
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason   Age                    From                                                 Message
  ----     ------   ----                   ----                                                 -------
  Normal   Pulling  164m (x76 over 8h)     kubelet, ip-10-0-148-145.us-west-2.compute.internal  Pulling image "nvidia/samples:cuda10.2-vectorAdd"
  Warning  BackOff  4m37s (x2382 over 8h)  kubelet, ip-10-0-148-145.us-west-2.compute.internal  Back-off restarting failed container
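
The failing container is the specialresource-runtime-validation-nvidia-gpu init container (exit code 1, 107 restarts at the time of the describe above). Its output can be pulled directly with oc logs -c; the pod name is from this run and will differ on a fresh reproduce:

# oc logs nvidia-gpu-device-plugin-2q2nh -n nvidia-gpu -c specialresource-runtime-validation-nvidia-gpu
# oc logs nvidia-gpu-device-plugin-2q2nh -n nvidia-gpu -c specialresource-runtime-validation-nvidia-gpu --previous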


Actual results:
# oc get pods -n nvidia-gpu
NAME                                         READY   STATUS                  RESTARTS   AGE
nvidia-gpu-device-plugin-2q2nh               0/1     Init:CrashLoopBackOff   6          8m44s
nvidia-gpu-driver-build-1-build              0/1     Completed               0          16m
nvidia-gpu-driver-container-rhel8-gd4f6      1/1     Running                 0          14m
nvidia-gpu-runtime-enablement-rzddc          1/1     Running                 0          12m
special-resource-operator-76b658c584-q25fs   1/1     Running                 0          17m


Expected results:
The NVIDIA GPU driver stack deploys successfully, with the nvidia-gpu-device-plugin and nvidia-gpu-device-plugin-validation pods Running.


Additional info:
A link to the events, pod logs, operator logs, and build logs from oc commands is provided in a subsequent comment.

Comment 4 Zvonko Kosic 2020-07-08 13:44:19 UTC
Now that the hooks dir is fixed and verified, we need to see why the NVIDIA containers are failing.
The drivers work on G4 and M60 GPUs, so the drivers themselves are not the issue.
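
A couple of follow-up checks that might help narrow it down, sketched against the pod names from the description (names change per run, and the presence of nvidia-smi inside the driver container is an assumption):

# oc exec -n nvidia-gpu nvidia-gpu-driver-container-rhel8-gd4f6 -- nvidia-smi
# oc logs -n nvidia-gpu nvidia-gpu-runtime-enablement-rzddc

The first confirms the kernel module is loaded and the GPU is visible on the node; the second looks for errors from the runtime/hook enablement that the vectorAdd validation init container depends on.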

