Bug 1769174 - OCP 4.3 : Special Resources Operator (SRO) fails to deploy the NVIDIA driver daemonset, need new driver image matching new kernel version on worker nodes
Summary: OCP 4.3 : Special Resources Operator (SRO) fails to deploy the NVIDIA driver...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Special Resource Operator
Version: 4.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.3.0
Assignee: Zvonko Kosic
QA Contact: Walid A.
URL:
Whiteboard:
Depends On: 1777838
Blocks: 1794257
 
Reported: 2019-11-06 02:58 UTC by Walid A.
Modified: 2023-09-14 05:45 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1794257
Environment:
Last Closed: 2020-01-23 11:11:16 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:11:34 UTC

Description Walid A. 2019-11-06 02:58:58 UTC
Description of problem:
When deploying the Special Resource Operator on an OCP 4.3 cluster, the NVIDIA driver daemonset fails to pull a driver image matching the new kernel version on the GPU-enabled worker node (a quick manual check for this is sketched after the event listing below):

4.18.0-147.el8.x86_64

# oc get pods -n openshift-sro
NAME                                         READY   STATUS         RESTARTS   AGE
nvidia-driver-daemonset-wgkd5                0/1     ErrImagePull   0          14s
special-resource-operator-5b9d778bf8-p7f7t   1/1     Running        0          28s

# oc get events -n openshift-sro
LAST SEEN   TYPE      REASON              OBJECT                                            MESSAGE
<unknown>   Normal    Scheduled           pod/nvidia-driver-daemonset-wgkd5                 Successfully assigned openshift-sro/nvidia-driver-daemonset-wgkd5 to ip-10-0-134-112.us-west-1.compute.internal
3s          Normal    Pulling             pod/nvidia-driver-daemonset-wgkd5                 Pulling image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.el8.x86_64"
3s          Warning   Failed              pod/nvidia-driver-daemonset-wgkd5                 Failed to pull image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.el8.x86_64": rpc error: code = Unknown desc = Error reading manifest v430.34-4.18.0-147.el8.x86_64 in quay.io/openshift-psap/nvidia-driver: manifest unknown: manifest unknown
3s          Warning   Failed              pod/nvidia-driver-daemonset-wgkd5                 Error: ErrImagePull
17s         Normal    BackOff             pod/nvidia-driver-daemonset-wgkd5                 Back-off pulling image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.el8.x86_64"
17s         Warning   Failed              pod/nvidia-driver-daemonset-wgkd5                 Error: ImagePullBackOff
27s         Normal    SuccessfulCreate    daemonset/nvidia-driver-daemonset                 Created pod: nvidia-driver-daemonset-wgkd5
<unknown>   Normal    Scheduled           pod/special-resource-operator-5b9d778bf8-p7f7t    Successfully assigned openshift-sro/special-resource-operator-5b9d778bf8-p7f7t to ip-10-0-134-112.us-west-1.compute.internal
33s         Normal    Pulling             pod/special-resource-operator-5b9d778bf8-p7f7t    Pulling image "quay.io/openshift-psap/special-resource-operator:release-4.2"
28s         Normal    Pulled              pod/special-resource-operator-5b9d778bf8-p7f7t    Successfully pulled image "quay.io/openshift-psap/special-resource-operator:release-4.2"
28s         Normal    Created             pod/special-resource-operator-5b9d778bf8-p7f7t    Created container special-resource-operator
28s         Normal    Started             pod/special-resource-operator-5b9d778bf8-p7f7t    Started container special-resource-operator
41s         Normal    SuccessfulCreate    replicaset/special-resource-operator-5b9d778bf8   Created pod: special-resource-operator-5b9d778bf8-p7f7t
41s         Normal    ScalingReplicaSet   deployment/special-resource-operator              Scaled up replica set special-resource-operator-5b9d778bf8 to 1
# 
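
The failing tag is derived from the worker node's kernel version, so a quick manual check is to read the kernel version from the node object and ask the registry whether a matching driver tag exists. A rough sketch, assuming skopeo is available on the workstation and using the v<driver-version>-<kernel-version> tag scheme from the error above:

  oc get node ip-10-0-134-112.us-west-1.compute.internal -o jsonpath='{.status.nodeInfo.kernelVersion}{"\n"}'
  skopeo inspect docker://quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.el8.x86_64

A "manifest unknown" error from skopeo confirms that no driver image has been built for this kernel yet.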

Also, there was no release-4.3 branch for SRO, so I used the latest branch, release-4.2, to deploy SRO.

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-02-092336   True        False         23h     Cluster version is 4.3.0-0.nightly-2019-11-02-092336
Server Version: 4.3.0-0.nightly-2019-11-02-092336
Kubernetes Version: v1.16.2


How reproducible:
Always

Steps to Reproduce:
1. IPI install OCP cluster 3 masters and 3 worker nodes on AWS with payload registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-02-092336
2. install NFD operator
   cd $GOPATH/src/github.com/openshift
   git clone https://github.com/openshift/cluster-nfd-operator.git
   cd cluster-nfd-operator
   git checkout release-4.3
   make deploy
3. install SRO operator:
   cd $GOPATH/src/github.com/openshift-psap
   git clone https://github.com/openshift-psap/special-resource-operator.git
   cd special-resource-operator
   git checkout release-4.2    # <== Note: could not find a release-4.3 branch
   make deploy
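
   Optional sanity check (a sketch; it assumes the driver container is the first container in the daemonset's pod spec): after "make deploy", read back the exact driver image the daemonset requests:
   oc get daemonset nvidia-driver-daemonset -n openshift-sro -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'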
   

Actual results:
Failed to pull image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.el8.x86_64": rpc error: code = Unknown desc = Error reading manifest v430.34-4.18.0-147.el8.x86_64 in quay.io/openshift-psap/nvidia-driver: manifest unknown: manifest unknown


Expected results:
The NVIDIA driver stack is deployed, e.g.:
# oc get pods -n openshift-sro
NAME                                         READY   STATUS      RESTARTS   AGE
nvidia-dcgm-exporter-49bgx                   2/2     Running     0          
nvidia-device-plugin-daemonset-khq4n         1/1     Running     0          
nvidia-device-plugin-validation              0/1     Completed   0          
nvidia-driver-daemonset-9tmb9                1/1     Running     0          
nvidia-driver-validation                     0/1     Completed   0          
nvidia-feature-discovery-4f5q4               1/1     Running     0          
nvidia-grafana-67bdb6d6-s62dl                1/1     Running     0          
special-resource-operator-77cd96658f-b2mk5   1/1     Running     0          


Additional info:

Comment 4 Walid A. 2019-12-11 15:03:52 UTC
Verified that SRO is successfully deployed on an OCP 4.3 IPI cluster with FIPS enabled:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-10-034925   True        False         19h     Cluster version is 4.3.0-0.nightly-2019-12-10-034925

# oc version
Client Version: openshift-clients-4.3.0-201910250623-42-gc276ecb7
Server Version: 4.3.0-0.nightly-2019-12-10-034925
Kubernetes Version: v1.16.2

# oc get pods -n openshift-sro
NAME                                         READY   STATUS      RESTARTS   AGE
nvidia-dcgm-exporter-wcklj                   2/2     Running     0          17h
nvidia-device-plugin-daemonset-t6fw4         1/1     Running     0          17h
nvidia-device-plugin-validation              0/1     Completed   0          17h
nvidia-driver-daemonset-md77q                1/1     Running     0          17h
nvidia-driver-internal-1-build               0/1     Completed   0          18h
nvidia-driver-validation                     0/1     Completed   0          17h
nvidia-feature-discovery-5xpqv               1/1     Running     0          17h
nvidia-grafana-5688f57fbc-b2sxk              1/1     Running     0          17h
special-resource-operator-5c8866f7cf-vg4pb   1/1     Running     0          18h

Also executed GPU workloads successfully.
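
For reference, a minimal GPU smoke test along these lines can be used once the stack is up. This is only a sketch and not the exact workload run here: the nvidia/cuda image tag is a hypothetical choice, nvidia.com/gpu is the resource advertised by the device plugin, and whether nvidia-smi is available inside the container depends on the container runtime hook:

cat <<'EOF' | oc create -n openshift-sro -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
oc logs -n openshift-sro gpu-smoke-test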

Comment 5 Masaki Furuta ( RH ) 2020-01-16 01:34:05 UTC
Hello,

I am sorry for the interruption, and apologies if this is the wrong place, but I am the RHOCP TAM for NEC and have just received the question below from them regarding the availability of the image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.0.3.el8_1.x86_64":

They are trying to deploy SRO based on the current version of the NVIDIA document [0], and they are asking whether the currently published documentation works as described (or whether it is still to be verified on this BZ).

My understanding (as the TAM) is that the steps in the currently published version target an older release, and that they are affected by the newer steps being prepared in this BZ, so they may not work at the moment.

Is my understanding correct?

I am sorry for bothering you, but may I ask for your help in clarifying the current status of this, if you do not mind?

NEC's question:
~~~
We are currently evaluating GPU support based on the following article (currently published version).
 [0] https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html#openshift-gpu-support

We succeeded in deploying the SRO operator, but after that, when it tried to download the nvidia-driver image, we received the following unauthorized error:
  Failed to pull image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.0.3.el8_1.x86_64": rpc error: code = Unknown desc = Error reading manifest v430.34-4.18.0-147.0.3.el8_1.x86_64 in quay.io/openshift-psap/nvidia-driver: unauthorized: access to the requested resource is not authorized

We assume the download is probably failing due to missing credentials for quay.io/openshift-psap, but we cannot find any related information about this issue anywhere.

In bug 1769174 the images appear to have been pulled successfully, but we could not get this to work on our side. We would therefore like to report this issue, but it is unclear where to report it.

We assume we should report it to Red Hat, since bug 1769174 describes the related steps, but would you please clarify whether Red Hat or NVIDIA is responsible for the images in quay.io/openshift-psap, and let us know to which of them (and where) we should report this issue?
~~~

I am grateful for your help and clarification.

Thank you,

BR,
Masaki

Comment 6 Zvonko Kosic 2020-01-16 01:54:26 UTC
(In reply to Masaki Furuta from comment #5)
> ...

Masaki, 

please use

https://github.com/openshift-psap/special-resource-operator 

The master branch works for OpenShift 4.0 - 4.2.
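
I.e., following the same deploy steps as in the description, but from master (sketch):

  cd $GOPATH/src/github.com/openshift-psap/special-resource-operator
  git checkout master
  make deploy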

Comment 7 Masaki Furuta ( RH ) 2020-01-20 06:42:32 UTC
(In reply to Zvonko Kosic from comment #6)
...
> Masaki, 
> 
> please use
> 
> https://github.com/openshift-psap/special-resource-operator 
> 
> the master branch works for openshift-4.0 - 4.2

Hi Zvonko Kosic,

Thank you for your help and for responding to my sudden question, which originally came from NEC.

According to additional feedback I have received from NEC (I will attach NEC's screen dump to this BZ privately, for Red Hat internal use only), they still hit the same issue even when using the repo you suggested:
   "Failed to pull image "quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.0.3.el8_1.x86_64": rpc error: code = Unknown desc = Error reading manifest v430.34-4.18.0-147.0.3.el8_1.x86_64 in quay.io/openshift-psap/nvidia-driver: unauthorized: access to the requested resource is not authorized".

I am sorry if I am mistaken, but as we can see in [0] and [1] below (commit 5c5c90e [2]), both the master and release-4.2 branches appear to refer to the same latest 4.3 image; shouldn't the release-4.2 branch refer to a 4.2 image? (A quick way to tell the "manifest unknown" and "unauthorized" pull failures apart is sketched after the references below.)
 
  [0] special-resource-operator/0001-state-driver.yaml at master · openshift-psap/special-resource-operator
      https://github.com/openshift-psap/special-resource-operator/blob/master/assets/0001-state-driver.yaml#L144

  [1] special-resource-operator/0001-state-driver.yaml at release-4.2 · openshift-psap/special-resource-operator
      https://github.com/openshift-psap/special-resource-operator/blob/release-4.2/assets/0001-state-driver.yaml#L144

  [2] Added building on cluster · openshift-psap/special-resource-operator@5c5c90e
      https://github.com/openshift-psap/special-resource-operator/commit/5c5c90ef09a09335c79af269ba7e7b2b76b7c5c7
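
For what it's worth, the two pull failures seen in this BZ can be distinguished from the registry side. A rough check, assuming skopeo is available and using the tag from the error above: "manifest unknown" means the tag does not exist in the repository, while "unauthorized" usually means the repository is private or credentials are missing. If the repository is private, the inspect can be retried with skopeo's --creds option (USER:PASS below is a placeholder):

    skopeo inspect docker://quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.0.3.el8_1.x86_64
    skopeo inspect --creds USER:PASS docker://quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.0.3.el8_1.x86_64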

I am sorry for mixing questions about two versions (4.2 and 4.3) into this 4.3 BZ.

If this is not a quick question/fix (i.e. it will take some time), then to avoid mixing the versions further and to keep questions about a different version out of this 4.3 BZ, could I file/clone this BZ for 4.2 (and would you please advise me whether that is possible)?

I am grateful for your continued help and support.

Thank you,

BR,
Masaki

Comment 10 Masaki Furuta ( RH ) 2020-01-23 05:32:38 UTC
(In reply to Masaki Furuta from comment #7)
> (In reply to Zvonko Kosic from comment #6)
> ...
...
> If this is not a quick question/fix (i.e. it will take some time), then to
> avoid mixing the versions further and to keep questions about a different
> version out of this 4.3 BZ, could I file/clone this BZ for 4.2 (and would
> you please advise me whether that is possible)?

I have cloned this BZ for 4.2 as a private bug below:

   - 1794257 – OCP 4.2 : the master branch on https://github.com/openshift-psap/special-resource-operator for openshift-4.0 - 4.2 failed with "Failed to pull image quay.io/openshift-psap/nvidia-driver:v430.34-4.18.0-147.0.3.el8_1.x86_64"
     https://bugzilla.redhat.com/show_bug.cgi?id=1794257

Would you please take a look at it?

I am grateful for your continued help and support.

Thank you,

BR,
Masaki

Comment 11 errata-xmlrpc 2020-01-23 11:11:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 12 Red Hat Bugzilla 2023-09-14 05:45:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

