Description of problem:
On an OCP 4.2 cluster in AWS (3 master and 2 worker nodes, m5.xlarge), the Special Resource Operator (SRO) fails to deploy the NVIDIA driver stack, even though the NFD (Node Feature Discovery) operator was deployed successfully beforehand. Before deploying SRO, the cluster was expanded by adding a new machineset for a g3.4xlarge (also tested with g3.8xlarge) GPU-enabled instance, which was appropriately labeled by NFD.
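For reference, the NFD labels on the new GPU node can be confirmed with something like the following (a rough sketch; <gpu-node> is a placeholder for the node name, and grepping for the NVIDIA PCI vendor ID 10de is an assumption, since the exact label key depends on the NFD version):

# oc get node <gpu-node> --show-labels | tr ',' '\n' | grep -i 10de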
Errors seen:

# oc get pods -n openshift-sro
NAME                            READY   STATUS    RESTARTS   AGE
nvidia-driver-daemonset-ckjjp   1/1     Running   0          3m21s
nvidia-driver-validation        0/1     Error     0          2m41s

# oc logs -n openshift-sro nvidia-driver-validation
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]

# oc get events -n openshift-sro
LAST SEEN  TYPE     REASON            OBJECT                              MESSAGE
3m17s      Normal   Scheduled         pod/nvidia-driver-daemonset-ckjjp   Successfully assigned openshift-sro/nvidia-driver-daemonset-ckjjp to ip-10-0-128-14.us-west-2.compute.internal
3m15s      Normal   Pulling           pod/nvidia-driver-daemonset-ckjjp   Pulling image "quay.io/zvonkok/nvidia-driver:v410.79-4.18.0-80.7.2.el8_0.x86_64"
2m40s      Normal   Pulled            pod/nvidia-driver-daemonset-ckjjp   Successfully pulled image "quay.io/zvonkok/nvidia-driver:v410.79-4.18.0-80.7.2.el8_0.x86_64"
2m40s      Normal   Created           pod/nvidia-driver-daemonset-ckjjp   Created container nvidia-driver-ctr
2m40s      Normal   Started           pod/nvidia-driver-daemonset-ckjjp   Started container nvidia-driver-ctr
3m32s      Normal   Scheduled         pod/nvidia-driver-daemonset-h847x   Successfully assigned openshift-sro/nvidia-driver-daemonset-h847x to ip-10-0-128-14.us-west-2.compute.internal
89s        Warning  FailedMount       pod/nvidia-driver-daemonset-h847x   Unable to mount volumes for pod "nvidia-driver-daemonset-h847x_openshift-sro(2b3fbc53-c37a-11e9-beb7-0612e649dc6a)": timeout expired waiting for volumes to attach or mount for pod "openshift-sro"/"nvidia-driver-daemonset-h847x". list of unmounted volumes=[host run-nvidia config nvidia-driver-token-vm5zq]. list of unattached volumes=[host run-nvidia config nvidia-driver-token-vm5zq]
3m32s      Normal   SuccessfulCreate  daemonset/nvidia-driver-daemonset   Created pod: nvidia-driver-daemonset-h847x
3m27s      Warning  FailedCreate      daemonset/nvidia-driver-daemonset   Error creating: pods "nvidia-driver-daemonset-" is forbidden: error looking up service account openshift-sro/nvidia-driver: serviceaccount "nvidia-driver" not found
3m22s      Warning  FailedCreate      daemonset/nvidia-driver-daemonset   Error creating: pods "nvidia-driver-daemonset-" is forbidden: error looking up service account openshift-sro/nvidia-driver: serviceaccount "nvidia-driver" not found
3m17s      Normal   SuccessfulCreate  daemonset/nvidia-driver-daemonset   Created pod: nvidia-driver-daemonset-ckjjp
2m37s      Normal   Scheduled         pod/nvidia-driver-validation        Successfully assigned openshift-sro/nvidia-driver-validation to ip-10-0-128-14.us-west-2.compute.internal
2m35s      Normal   Pulling           pod/nvidia-driver-validation        Pulling image "quay.io/zvonkok/cuda-vector-add:v0.1"
98s        Normal   Pulled            pod/nvidia-driver-validation        Successfully pulled image "quay.io/zvonkok/cuda-vector-add:v0.1"
98s        Normal   Created           pod/nvidia-driver-validation        Created container cuda-vector-add
98s        Normal   Started           pod/nvidia-driver-validation        Started container cuda-vector-add

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-20-111632

How reproducible:
Reproduced 3 times, on the nightly builds from 2019-08-18, 08-19, and 08-20.

Steps to Reproduce:
1. Install an OCP 4.2 cluster with nightly build 4.2.0-0.nightly-2019-08-20-111632 on AWS, instance type m5.xlarge, 3 master and 2 worker nodes.
2. Deploy the NFD operator:
   - export GOPATH=/root/go
   - cd $GOPATH/src/github.com/openshift
   - git clone https://github.com/openshift/cluster-nfd-operator.git
   - cd cluster-nfd-operator
   - make deploy
3. Ensure the NFD features are working; check the labels on all nodes.
4. Create a machineset to add a new GPU g3.4xlarge node (a rough sketch of one way to do this is included at the end of this description).
5. Deploy SRO (Special Resource Operator):
   - cd $GOPATH/src/github.com
   - git clone https://github.com/zvonkok/special-resource-operator.git
   - cd special-resource-operator
   - make deploy
6. oc get pods -n openshift-sro-operator, and make sure the openshift-sro-operator pod is running.
7. oc get pods -n openshift-sro

Actual results:
# oc get pods -n openshift-sro
NAME                            READY   STATUS    RESTARTS   AGE
nvidia-driver-daemonset-ckjjp   1/1     Running   0          3m21s
nvidia-driver-validation        0/1     Error     0          2m41s

Expected results:
All of the pods of the nvidia-driver stack should be created:
nvidia-device-plugin-daemonset-zt8jf   1/1   Running
nvidia-device-plugin-validation        0/1   Completed
nvidia-driver-daemonset-jpgs4          1/1   Running
nvidia-driver-validation               0/1   Completed

Additional info:
Links to logs will be in the next comment.
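As referenced in step 4 of the reproduction steps, one possible way to create the GPU machineset (a rough sketch with placeholder names, not necessarily the exact commands used) is to copy an existing worker machineset and edit it:

# oc get machinesets -n openshift-machine-api
# oc get machineset <existing-worker-machineset> -n openshift-machine-api -o yaml > gpu-machineset.yaml
(edit gpu-machineset.yaml: set a new metadata.name and matching selector/template labels, spec.replicas: 1, providerSpec.value.instanceType: g3.4xlarge, and clear the status and unique metadata fields such as uid and resourceVersion)
# oc apply -f gpu-machineset.yaml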
This was fixed with the latest driver-container update. Please have a look.
Yes, I confirmed it is fixed with the latest driver-container update.
Verified that we can successfully deploy SRO on OCP 4.2 with build 4.2.0-0.nightly-2019-08-28-235925
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922