Bug 1744111 - OCP 4.2 special resource operator (SRO) fails to deploy with nvidia-driver-validation Failed to allocate device vector A error
Summary: OCP 4.2 special resource operator (SRO) fails to deploy with nvidia-driver-v...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Feature Discovery Operator
Version: 4.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Zvonko Kosic
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-21 11:34 UTC by Walid A.
Modified: 2019-10-16 06:37 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:37:03 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:37:11 UTC)

Description Walid A. 2019-08-21 11:34:10 UTC
Description of problem:
On a 4.2 OCP cluster in AWS (3 master and 2 worker m5.xlarge nodes), the special-resource-operator (SRO) fails to deploy the nvidia driver stack after successfully deploying the NFD (Node Feature Discovery) operator. Before deploying SRO, the cluster was expanded by adding a new machineset for a g3.4xlarge (also tested with g3.8xlarge) GPU-enabled instance, which was appropriately labeled by NFD.
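
For reference, one quick way to confirm the NFD labeling on the GPU node (the exact label name depends on the NFD configuration; pci-10de is a common form, 10de being the NVIDIA PCI vendor ID, and <gpu-node-name> is a placeholder):

# oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true
# oc describe node <gpu-node-name> | grep feature.node.kubernetes.io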

Errors seen:

# oc get pods -n openshift-sro
NAME                            READY   STATUS    RESTARTS   AGE
nvidia-driver-daemonset-ckjjp   1/1     Running   0          3m21s
nvidia-driver-validation        0/1     Error     0          2m41s

# oc logs -n openshift-sro nvidia-driver-validation
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]

# oc get events -n openshift-sro
LAST SEEN   TYPE      REASON             OBJECT                              MESSAGE
3m17s       Normal    Scheduled          pod/nvidia-driver-daemonset-ckjjp   Successfully assigned openshift-sro/nvidia-driver-daemonset-ckjjp to ip-10-0-128-14.us-west-2.compute.internal
3m15s       Normal    Pulling            pod/nvidia-driver-daemonset-ckjjp   Pulling image "quay.io/zvonkok/nvidia-driver:v410.79-4.18.0-80.7.2.el8_0.x86_64"
2m40s       Normal    Pulled             pod/nvidia-driver-daemonset-ckjjp   Successfully pulled image "quay.io/zvonkok/nvidia-driver:v410.79-4.18.0-80.7.2.el8_0.x86_64"
2m40s       Normal    Created            pod/nvidia-driver-daemonset-ckjjp   Created container nvidia-driver-ctr
2m40s       Normal    Started            pod/nvidia-driver-daemonset-ckjjp   Started container nvidia-driver-ctr
3m32s       Normal    Scheduled          pod/nvidia-driver-daemonset-h847x   Successfully assigned openshift-sro/nvidia-driver-daemonset-h847x to ip-10-0-128-14.us-west-2.compute.internal
89s         Warning   FailedMount        pod/nvidia-driver-daemonset-h847x   Unable to mount volumes for pod "nvidia-driver-daemonset-h847x_openshift-sro(2b3fbc53-c37a-11e9-beb7-0612e649dc6a)": timeout expired waiting for volumes to attach or mount for pod "openshift-sro"/"nvidia-driver-daemonset-h847x". list of unmounted volumes=[host run-nvidia config nvidia-driver-token-vm5zq]. list of unattached volumes=[host run-nvidia config nvidia-driver-token-vm5zq]
3m32s       Normal    SuccessfulCreate   daemonset/nvidia-driver-daemonset   Created pod: nvidia-driver-daemonset-h847x
3m27s       Warning   FailedCreate       daemonset/nvidia-driver-daemonset   Error creating: pods "nvidia-driver-daemonset-" is forbidden: error looking up service account openshift-sro/nvidia-driver: serviceaccount "nvidia-driver" not found
3m22s       Warning   FailedCreate       daemonset/nvidia-driver-daemonset   Error creating: pods "nvidia-driver-daemonset-" is forbidden: error looking up service account openshift-sro/nvidia-driver: serviceaccount "nvidia-driver" not found
3m17s       Normal    SuccessfulCreate   daemonset/nvidia-driver-daemonset   Created pod: nvidia-driver-daemonset-ckjjp
2m37s       Normal    Scheduled          pod/nvidia-driver-validation        Successfully assigned openshift-sro/nvidia-driver-validation to ip-10-0-128-14.us-west-2.compute.internal
2m35s       Normal    Pulling            pod/nvidia-driver-validation        Pulling image "quay.io/zvonkok/cuda-vector-add:v0.1"
98s         Normal    Pulled             pod/nvidia-driver-validation        Successfully pulled image "quay.io/zvonkok/cuda-vector-add:v0.1"
98s         Normal    Created            pod/nvidia-driver-validation        Created container cuda-vector-add
98s         Normal    Started            pod/nvidia-driver-validation        Started container cuda-vector-add

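Note: assuming the driver image ships nvidia-smi, the driver version actually loaded by the daemonset can be checked directly (pod name taken from the output above); the validation error suggests it is older than what the CUDA runtime in the cuda-vector-add image expects:

# oc exec -n openshift-sro nvidia-driver-daemonset-ckjjp -- nvidia-smi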

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-20-111632

How reproducible:
Reproduced 3 times on 3 nightly builds from 2019-08-18, 08-19, and 08-20

Steps to Reproduce:
1. Install OCP 4.2 cluster with nightly build 4.2.0-0.nightly-2019-08-20-111632 on AWS, m5.xlarge instance type, 3 master and 2 worker nodes
2. deploy NFD operator:
   - export GOPATH=/root/go
   - cd $GOPATH/src/github.com/openshift
   - git clone https://github.com/openshift/cluster-nfd-operator.git
   - cd cluster-nfd-operator
   - make deploy
3. Ensure NFD is working; check the NFD labels on all nodes
4. create a machineset to add a new GPU-enabled g3.4xlarge node (see the sketch after these steps)
5. deploy SRO special resource operator
   - cd $GOPATH/src/github.com
   - git clone https://github.com/zvonkok/special-resource-operator.git
   - cd special-resource-operator
   - make deploy
6. oc get pods -n openshift-sro-operator
   make sure the openshift-sro-operator pod is running
7. oc get pods -n openshift-sro
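
Sketch for step 4, since the machineset used is not attached here (names are illustrative; the usual approach is to copy an existing worker MachineSet and change the instance type):
   - oc get machineset -n openshift-machine-api
   - oc get machineset -n openshift-machine-api <existing-worker-machineset> -o yaml > gpu-machineset.yaml
   - edit gpu-machineset.yaml: rename metadata.name and the matching machine.openshift.io/cluster-api-machineset labels, set spec.template.spec.providerSpec.value.instanceType to g3.4xlarge (or g3.8xlarge), and set spec.replicas to 1
   - oc apply -f gpu-machineset.yaml
   - wait for the new node to become Ready, then confirm the NFD labels as in the description above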
   

Actual results:
# oc get pods -n openshift-sro
NAME                            READY   STATUS    RESTARTS   AGE
nvidia-driver-daemonset-ckjjp   1/1     Running   0          3m21s
nvidia-driver-validation        0/1     Error     0          2m41s

Expected results:
Should see all of these pods created for the nvidia-driver stack:
nvidia-device-plugin-daemonset-zt8jf        1/1     Running     
nvidia-device-plugin-validation             0/1     Completed
nvidia-driver-daemonset-jpgs4               1/1     Running
nvidia-driver-validation                    0/1     Completed
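
On a successful deployment, both validation pods should reach Completed and their logs should show the CUDA vectorAdd sample passing (it prints "Test PASSED") rather than the allocation error above:

# oc logs -n openshift-sro nvidia-driver-validation
# oc logs -n openshift-sro nvidia-device-plugin-validation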



Additional info:
Links to logs will be in the next comment.

Comment 2 Zvonko Kosic 2019-08-30 12:23:25 UTC
This was fixed with the latest driver-container update. Please have a look.

Comment 3 Walid A. 2019-08-30 18:47:23 UTC
Yes, I confirmed it is fixed with the latest driver-container update.

Comment 4 Walid A. 2019-09-04 11:52:06 UTC
Verified that we can successfully deploy SRO on OCP 4.2 with build 4.2.0-0.nightly-2019-08-28-235925

Comment 5 errata-xmlrpc 2019-10-16 06:37:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

