Description of problem:

When deploying the Special Resource Operator (SRO) on an OCP 4.3 cluster, the NVIDIA driver daemonset failed to find an image matching the new kernel version (4.18.0-147.el8.x86_64) on the GPU-enabled worker node:

# oc get pods -n openshift-sro
NAME                                         READY   STATUS         RESTARTS   AGE
nvidia-driver-daemonset-wgkd5                0/1     ErrImagePull   0          14s
special-resource-operator-5b9d778bf8-p7f7t   1/1     Running        0          28s

# oc get events -n openshift-sro
LAST SEEN   TYPE      REASON              OBJECT                                            MESSAGE
<unknown>   Normal    Scheduled           pod/nvidia-driver-daemonset-wgkd5                 Successfully assigned openshift-sro/nvidia-driver-daemonset-wgkd5 to ip-10-0-134-112.us-west-1.compute.internal
3s          Normal    Pulling             pod/nvidia-driver-daemonset-wgkd5                 Pulling image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.el8.x86_64"
3s          Warning   Failed              pod/nvidia-driver-daemonset-wgkd5                 Failed to pull image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.el8.x86_64": rpc error: code = Unknown desc = Error reading manifest v430.34-4.18.0-147.el8.x86_64 in quay.io/openshift-psap/nvidia-driver: manifest unknown: manifest unknown
3s          Warning   Failed              pod/nvidia-driver-daemonset-wgkd5                 Error: ErrImagePull
17s         Normal    BackOff             pod/nvidia-driver-daemonset-wgkd5                 Back-off pulling image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.el8.x86_64"
17s         Warning   Failed              pod/nvidia-driver-daemonset-wgkd5                 Error: ImagePullBackOff
27s         Normal    SuccessfulCreate    daemonset/nvidia-driver-daemonset                 Created pod: nvidia-driver-daemonset-wgkd5
<unknown>   Normal    Scheduled           pod/special-resource-operator-5b9d778bf8-p7f7t    Successfully assigned openshift-sro/special-resource-operator-5b9d778bf8-p7f7t to ip-10-0-134-112.us-west-1.compute.internal
33s         Normal    Pulling             pod/special-resource-operator-5b9d778bf8-p7f7t    Pulling image "quay.io/openshift-psap/special-resource-operator:release-4.2"
28s         Normal    Pulled              pod/special-resource-operator-5b9d778bf8-p7f7t    Successfully pulled image "quay.io/openshift-psap/special-resource-operator:release-4.2"
28s         Normal    Created             pod/special-resource-operator-5b9d778bf8-p7f7t    Created container special-resource-operator
28s         Normal    Started             pod/special-resource-operator-5b9d778bf8-p7f7t    Started container special-resource-operator
41s         Normal    SuccessfulCreate    replicaset/special-resource-operator-5b9d778bf8   Created pod: special-resource-operator-5b9d778bf8-p7f7t
41s         Normal    ScalingReplicaSet   deployment/special-resource-operator              Scaled up replica set special-resource-operator-5b9d778bf8 to 1

Note: there was no release-4.3 branch for SRO, so I used the latest available branch, release-4.2, to deploy SRO.

Version-Release number of selected component (if applicable):

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-02-092336   True        False         23h     Cluster version is 4.3.0-0.nightly-2019-11-02-092336

Server Version: 4.3.0-0.nightly-2019-11-02-092336
Kubernetes Version: v1.16.2

How reproducible:
Always

Steps to Reproduce:
1. IPI install an OCP cluster (3 masters and 3 worker nodes) on AWS with payload registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-02-092336
2. Install the NFD operator:
   cd $GOPATH/src/github.com/openshift
   git clone https://github.com/openshift/cluster-nfd-operator.git
   cd cluster-nfd-operator
   git checkout release-4.3
   make deploy
3. Install the SRO operator:
   cd $GOPATH/src/github.com/openshift-psap
   git clone https://github.com/openshift-psap/special-resource-operator.git
   cd special-resource-operator
   git checkout release-4.2    <=== Note: could not find release-4.3
   make deploy

Actual results:

Failed to pull image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.el8.x86_64": rpc error: code = Unknown desc = Error reading manifest v430.34-4.18.0-147.el8.x86_64 in quay.io/openshift-psap/nvidia-driver: manifest unknown: manifest unknown

Expected results:

The NVIDIA driver stack is deployed:

# oc get pods -n openshift-sro
NAME                                         READY   STATUS      RESTARTS   AGE
nvidia-dcgm-exporter-49bgx                   2/2     Running     0
nvidia-device-plugin-daemonset-khq4n         1/1     Running     0
nvidia-device-plugin-validation              0/1     Completed   0
nvidia-driver-daemonset-9tmb9                1/1     Running     0
nvidia-driver-validation                     0/1     Completed   0
nvidia-feature-discovery-4f5q4               1/1     Running     0
nvidia-grafana-67bdb6d6-s62dl                1/1     Running     0
special-resource-operator-77cd96658f-b2mk5   1/1     Running     0

Additional info:
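For context, the failure mode is mechanical: the driver daemonset derives its image tag from the node's kernel version, so a kernel bump with no matching tag in the registry produces exactly this "manifest unknown" error. A hedged sketch of the tag construction (the variable names are mine; the v<driver>-<kernel> tag scheme is taken from the events above, not from SRO source):

```shell
# Sketch of how the failing tag is composed (scheme inferred from the
# events above, not copied from the SRO code).
DRIVER_VERSION="430.34"
KERNEL_VERSION="4.18.0-147.el8.x86_64"   # e.g. oc debug node/<node> -- uname -r
IMAGE="quay.io/openshift-psap/nvidia-driver:v${DRIVER_VERSION}-${KERNEL_VERSION}"
echo "$IMAGE"
# -> quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.el8.x86_64

# Whether that tag exists can be checked without scheduling a pod, e.g.:
#   skopeo inspect docker://"$IMAGE"
```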
Depends on https://github.com/cri-o/cri-o/pull/3016 and https://github.com/openshift/machine-config-operator/pull/1299
Verified SRO is successfully deployed on an OCP 4.3 IPI cluster with FIPS enabled:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-10-034925   True        False         19h     Cluster version is 4.3.0-0.nightly-2019-12-10-034925

# oc version
Client Version: openshift-clients-4.3.0-201910250623-42-gc276ecb7
Server Version: 4.3.0-0.nightly-2019-12-10-034925
Kubernetes Version: v1.16.2

# oc get pods -n openshift-sro
NAME                                         READY   STATUS      RESTARTS   AGE
nvidia-dcgm-exporter-wcklj                   2/2     Running     0          17h
nvidia-device-plugin-daemonset-t6fw4         1/1     Running     0          17h
nvidia-device-plugin-validation              0/1     Completed   0          17h
nvidia-driver-daemonset-md77q                1/1     Running     0          17h
nvidia-driver-internal-1-build               0/1     Completed   0          18h
nvidia-driver-validation                     0/1     Completed   0          17h
nvidia-feature-discovery-5xpqv               1/1     Running     0          17h
nvidia-grafana-5688f57fbc-b2sxk              1/1     Running     0          17h
special-resource-operator-5c8866f7cf-vg4pb   1/1     Running     0          18h

Also executed GPU workloads successfully.
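As a side note, the pod-by-pod check above can be scripted; a minimal sketch (the helper name check_pods is my own, not part of SRO):

```shell
# Minimal sketch (not part of SRO): succeed only if every pod line shows a
# STATUS of Running or Completed. Feed it the output of:
#   oc get pods -n openshift-sro --no-headers
check_pods() {
  awk '$3 != "Running" && $3 != "Completed" { bad++ }
       END { if (bad > 0) exit 1 }'
}

# Example with canned input in the "oc get pods --no-headers" format:
printf '%s\n' \
  'nvidia-driver-daemonset-md77q 1/1 Running 0 17h' \
  'nvidia-driver-validation 0/1 Completed 0 17h' \
  | check_pods && echo "all pods healthy"
# -> all pods healthy
```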
Hello,

I apologize for the interruption, and if this is the wrong place, but I am the RHOCP TAM for NEC and have just received a question from them regarding the availability of the image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.0.3.el8_1.x86_64", as below.

They are trying to deploy SRO based on the current version of the NVIDIA document [0], and they are asking whether the currently published documentation works as described (or whether it is still being verified on this BZ).

My understanding as TAM is that the steps in the currently published version are for an older release, that they are likely affected by the newer steps being prepared in this BZ, and that they therefore do not work at the moment. Is that correct?

I am sorry for bothering you, but could you help clarify the current status of this?

NEC's question:
~~~
We are currently evaluating GPUs based on the following article (currently published version).
[0] https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html#openshift-gpu-support

We succeeded in deploying the SRO operator, but afterwards, when it tried to pull the nvidia-driver image, we received the following unauthorized error:

Failed to pull image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.0.3.el8_1.x86_64": rpc error: code = Unknown desc = Error reading manifest v430.34-4.18.0-147.0.3.el8_1.x86_64 in quay.io/openshift-psap/nvidia-driver: unauthorized: access to the requested resource is not authorized

We assume the pull is failing due to missing credentials for quay.io/openshift-psap, but we cannot find any related information about this issue anywhere.

Bug 1769174 suggests the images were pulled successfully, but we could not reproduce that on our side. We would therefore like to report this issue, but it is unclear where to report it.

We assume we should report it to Red Hat, since bug 1769174 describes the related steps, but would you please clarify whether Red Hat or NVIDIA is responsible for the images in quay.io/openshift-psap, and let us know to which of them (and where) we should report this issue?
~~~

I am grateful for your help and clarification.

Thank you,

BR,
Masaki
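Worth noting: the error in this comment ("unauthorized") differs from the one in the original report ("manifest unknown"). A small hedged sketch of telling them apart from the kubelet message (the helper name is mine, not from any Red Hat or NVIDIA tooling):

```shell
# Hypothetical helper: classify a pull-failure message.
# "manifest unknown"  -> the tag does not exist in the repository.
# "unauthorized"      -> the registry refused access; on quay.io this is
#                        also what a private (or deleted) repository
#                        returns to anonymous pulls.
classify_pull_error() {
  case "$1" in
    *"manifest unknown"*) echo "tag-missing" ;;
    *"unauthorized"*)     echo "auth-or-private-repo" ;;
    *)                    echo "other" ;;
  esac
}

classify_pull_error "unauthorized: access to the requested resource is not authorized"
# -> auth-or-private-repo
```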
(In reply to Masaki Furuta from comment #5)
> ...
> would you please clarify which of Red Hat or NVIDIA is responsible for the
> images in quay.io/openshift-psap, and please let us know which (and where)
> of Red Hat or NVIDIA we should report this issue to ?

Masaki,

please use

https://github.com/openshift-psap/special-resource-operator

The master branch works for OpenShift 4.0 - 4.2.
(In reply to Zvonko Kosic from comment #6)
...
> please use
>
> https://github.com/openshift-psap/special-resource-operator
>
> the master branch works for openshift-4.0 - 4.2

Hi Zvonko Kosic,

Thank you for your help and for responding to my sudden question, which originally came from NEC.

According to additional feedback I have received from NEC (I am going to attach their screen dump to this BZ privately, for Red Hat internal use only), they still hit the same issue even with the repository you suggested:

"Failed to pull image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.0.3.el8_1.x86_64": rpc error: code = Unknown desc = Error reading manifest v430.34-4.18.0-147.0.3.el8_1.x86_64 in quay.io/openshift-psap/nvidia-driver: unauthorized: access to the requested resource is not authorized"

I am sorry if I am mistaken, but as far as I can see in [0] and [1] (commit 5c5c90e [2]), both the master and release-4.2 branches refer to the same latest 4.3 image. Shouldn't the release-4.2 branch refer to a 4.2 image?

[0] special-resource-operator/0001-state-driver.yaml at master
https://github.com/openshift-psap/special-resource-operator/blob/master/assets/0001-state-driver.yaml#L144
[1] special-resource-operator/0001-state-driver.yaml at release-4.2
https://github.com/openshift-psap/special-resource-operator/blob/release-4.2/assets/0001-state-driver.yaml#L144
[2] Added building on cluster (commit 5c5c90e)
https://github.com/openshift-psap/special-resource-operator/commit/5c5c90ef09a09335c79af269ba7e7b2b76b7c5c7

I am sorry for mixing questions about two versions (4.2 and 4.3) on this 4.3 BZ.

If this is not a quick fix and will take some time, then to avoid mixing them further and to keep questions about a different version off this 4.3 BZ, may I file/clone this BZ for 4.2 (and could you advise me whether that is possible)?

I am grateful for your continued help and support.

Thank you,

BR,
Masaki
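The discrepancy described in [0]/[1] can be checked mechanically per branch; a hedged sketch (the helper name extract_images is mine; the asset path is taken from the links above):

```shell
# Sketch: list the container image references in a state manifest, so the
# master and release-4.2 copies of assets/0001-state-driver.yaml can be
# compared after checking out each branch.
extract_images() {
  grep -E '^[[:space:]]*image:' "$1" \
    | sed -E 's/^[[:space:]]*image:[[:space:]]*//'
}

# Usage (per branch):
#   git checkout release-4.2
#   extract_images assets/0001-state-driver.yaml
```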
(In reply to Masaki Furuta from comment #7)
> ...
> Could I file/clone this BZ for 4.2
> (and would you please advise me on that if it's possible) ?

I have cloned this BZ for 4.2 as the private bug below:

- Bug 1794257 – OCP 4.2: the master branch of https://github.com/openshift-psap/special-resource-operator for OpenShift 4.0 - 4.2 fails with "Failed to pull image quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.0.3.el8_1.x86_64"
https://bugzilla.redhat.com/show_bug.cgi?id=1794257

Would you please take a look?

I am grateful for your continued help and support.

Thank you,

BR,
Masaki
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days