Description of problem:
Unable to apply 4.9.0-0.nightly-2021-09-06-055314: the cluster operator storage has not yet successfully rolled out

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-09-06-055314

How reproducible:
once

Steps to Reproduce:
1. Install log:
09-07 10:19:44.005 level=info msg=Cluster operator etcd RecentBackup is Unknown with ControllerStarted:
09-07 10:19:44.005 level=info msg=Cluster operator insights Disabled is False with AsExpected:
09-07 10:19:44.005 level=info msg=Cluster operator network ManagementStateDegraded is False with :
09-07 10:19:44.005 level=error msg=Cluster operator storage Degraded is True with CSIDriverStarter_SyncError: CSIDriverStarterDegraded: [no matches for kind "Role" in version "rbac.authorization.k8s.io/v1", no matches for kind "RoleBinding" in version "rbac.authorization.k8s.io/v1", no matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1", no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1"]
message: 'CSIDriverStarterDegraded: [no matches for kind "Role" in version "rbac.authorization.k8s.io/v1", no matches for kind "RoleBinding" in version "rbac.authorization.k8s.io/v1", no matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1", no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1"]'
reason: CSIDriverStarter_SyncError

2.
oc get co etcd storage
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd      4.9.0-0.nightly-2021-09-06-055314   True        False         False      4h49m
storage   4.9.0-0.nightly-2021-09-06-055314   False       True          True       4h50m   GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment

oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
gcp-pd-csi-driver-controller-f79d97985-pglqj   10/10   Running   0          4h45m
gcp-pd-csi-driver-node-c64t2                   3/3     Running   0          4h45m
gcp-pd-csi-driver-operator-549bd69cbd-v89qr    1/1     Running   0          4h46m

3. oc logs gcp-pd-csi-driver-operator-549bd69cbd-v89qr
I0907 01:34:08.200756 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"gcp-pd-csi-driver-operator", UID:"63478839-002e-44c6-8149-c501548c7805", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RoleBindingCreated' Created RoleBinding.rbac.authorization.k8s.io/gcp-pd-csi-driver-prometheus -n openshift-cluster-csi-drivers because it was missing
E0907 01:35:14.050917 1 base_controller.go:272] StaticResourceController reconciliation failed: "servicemonitor.yaml" (string): etcdserver: leader changed
W0907 01:35:14.584580 1 base_controller.go:236] Updating status of "GCPPDDriverNodeServiceController" failed: Operation cannot be fulfilled on clustercsidrivers.operator.openshift.io "pd.csi.storage.gke.io": the object has been modified; please apply your changes to the latest version and try again
E0907 01:35:14.584628 1 base_controller.go:272] GCPPDDriverNodeServiceController reconciliation failed: etcdserver: leader changed
W0907 01:35:14.598068 1 base_controller.go:236] Updating status of "GCPPDDriverControllerServiceController" failed: Operation cannot be fulfilled on clustercsidrivers.operator.openshift.io "pd.csi.storage.gke.io": the object has been modified; please apply your changes to the latest version and try again
E0907 01:35:14.598109 1 base_controller.go:272] GCPPDDriverControllerServiceController reconciliation failed: etcdserver: leader changed
E0907 01:35:14.711187 1 base_controller.go:272] StaticResourceController reconciliation failed: ["rbac/controller_privileged_binding.yaml" (string): etcdserver: leader changed, Operation cannot be fulfilled on clustercsidrivers.operator.openshift.io "pd.csi.storage.gke.io": the object has been modified; please apply your changes to the latest version and try again]

Actual results:
Unable to apply 4.9.0-0.nightly-2021-09-06-055314: the cluster operator storage has not yet successfully rolled out

Expected results:
Cluster operator storage should install successfully.

Master Log:
Node Log (of failed PODs):
PV Dump:
PVC Dump:
StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
http://virt-openshift-05.lab.eng.nay.redhat.com/chaoyang/must-gather.tar
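The Degraded condition quoted above can also be pulled out of the operator's status programmatically rather than read from the table. A minimal, self-contained sketch (plain Go, no client-go; the `clusterOperator`/`condition` structs and the trimmed sample JSON are illustrative, assuming input like `oc get clusteroperator storage -o json`):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the fields of a ClusterOperator status condition
// that matter for this report.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// clusterOperator models just enough of the ClusterOperator object
// to reach status.conditions.
type clusterOperator struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// degradedCondition decodes the JSON and returns the Degraded
// condition, or nil if the operator does not report one.
func degradedCondition(data []byte) (*condition, error) {
	var co clusterOperator
	if err := json.Unmarshal(data, &co); err != nil {
		return nil, err
	}
	for _, c := range co.Status.Conditions {
		if c.Type == "Degraded" {
			c := c
			return &c, nil
		}
	}
	return nil, nil
}

func main() {
	// Trimmed sample resembling the failing storage CO above.
	sample := []byte(`{"status":{"conditions":[
	  {"type":"Degraded","status":"True","reason":"CSIDriverStarter_SyncError",
	   "message":"CSIDriverStarterDegraded: no matches for kind \"Role\" in version \"rbac.authorization.k8s.io/v1\""}]}}`)
	c, err := degradedCondition(sample)
	if err != nil || c == nil {
		fmt.Println("no Degraded condition")
		return
	}
	fmt.Printf("%s=%s reason=%s\n", c.Type, c.Status, c.Reason)
	// prints: Degraded=True reason=CSIDriverStarter_SyncError
}
```

The same decoding works on the full object from the must-gather, since unknown fields are simply ignored by encoding/json.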
It looks like the API server returned 'no matches for kind "Role" in version "rbac.authorization.k8s.io/v1"' when CSO / its RESTMapper was trying to get related objects. The error was cached by the RESTMapper, and once the API server did start serving the RBAC API, the call was not retried. I can't reproduce the issue; CSO has to hit exactly the window when the RBAC API is not yet registered at the API server.
Experimental PR upstream: https://github.com/kubernetes/kubernetes/pull/104814
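The failure mode above can be sketched without client-go. This is a minimal, illustrative model of a discovery-backed mapper that caches negative "no matches for kind" results, plus the Reset-style invalidation that lets a later sync recover; `cachedMapper` and `discovery` are hypothetical names, not CSO's actual types:

```go
package main

import (
	"errors"
	"fmt"
)

// errNoMatch mimics the error seen while the RBAC API group is not
// yet registered at the API server.
var errNoMatch = errors.New(`no matches for kind "Role" in version "rbac.authorization.k8s.io/v1"`)

// discovery simulates the API server: the RBAC group only appears
// after readyAfter lookups have already happened.
type discovery struct {
	calls      int
	readyAfter int
}

func (d *discovery) lookup(kind string) (string, error) {
	d.calls++
	if d.calls <= d.readyAfter {
		return "", errNoMatch
	}
	return "rbac.authorization.k8s.io/v1/" + kind, nil
}

// cachedMapper caches lookup results, including failures - the
// behaviour that kept the operator Degraded after RBAC appeared.
type cachedMapper struct {
	d     *discovery
	hits  map[string]string
	miss  map[string]error
}

func newCachedMapper(d *discovery) *cachedMapper {
	return &cachedMapper{d: d, hits: map[string]string{}, miss: map[string]error{}}
}

func (m *cachedMapper) mapping(kind string) (string, error) {
	if r, ok := m.hits[kind]; ok {
		return r, nil
	}
	if err, ok := m.miss[kind]; ok {
		return "", err // cached negative result: never re-queried
	}
	r, err := m.d.lookup(kind)
	if err != nil {
		m.miss[kind] = err
		return "", err
	}
	m.hits[kind] = r
	return r, nil
}

// reset drops all cached entries so the next call re-queries
// discovery, in the spirit of RESTMapper Reset() in the upstream fix.
func (m *cachedMapper) reset() {
	m.hits = map[string]string{}
	m.miss = map[string]error{}
}

func main() {
	d := &discovery{readyAfter: 1} // RBAC shows up after the first lookup
	m := newCachedMapper(d)

	if _, err := m.mapping("Role"); err != nil {
		fmt.Println("first sync:", err)
	}
	// Without a reset, the stale negative entry is served forever.
	if _, err := m.mapping("Role"); err != nil {
		fmt.Println("second sync (cached):", err)
	}
	// Resetting on a no-match error lets the retry succeed.
	m.reset()
	if r, err := m.mapping("Role"); err == nil {
		fmt.Println("after reset:", r)
	}
}
```

The point of the sketch: the second sync fails without ever contacting discovery again, which is why the operator stayed Degraded until something invalidated the cache.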
Test status: PASS
Tested on build: 4.10.0-0.nightly-2021-11-21-005535 for the below-mentioned flexy templates:
1) aos-4_10/upi-on-gcp/versioned-installer-csidriver-sno-ci
2) aos-4_10/upi-on-gcp/versioned-installer-sno

Result:

Additional info: Created sc, pvc, pod and checked for Running status.

1) aos-4_10/upi-on-gcp/versioned-installer-csidriver-sno-ci
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/54221/parameters/
versioned-installer-csidriver-sno-ci

rohitpatil@ropatil-mac Downloads % oc get nodes
NAME                                                     STATUS   ROLES           AGE   VERSION
ropatil22112021-dzqgn-master-0.c.openshift-qe.internal   Ready    master,worker   44m   v1.22.1+35a59a5

rohitpatil@ropatil-mac Downloads % oc get co etcd storage
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd      4.10.0-0.nightly-2021-11-21-005535   True        False         False      51m
storage   4.10.0-0.nightly-2021-11-21-005535   True        False         False      52m

rohitpatil@ropatil-mac Downloads % oc get sc,pvc,pod -n testgcp
NAME                                                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
storageclass.storage.k8s.io/ebs                      pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   false                  75s
storageclass.storage.k8s.io/standard                 kubernetes.io/gce-pd    Delete          WaitForFirstConsumer   true                   46m
storageclass.storage.k8s.io/standard-csi (default)   pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   45m

NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc-csi   Bound    pvc-76128e36-decd-4416-a0f4-7ef763b4e9f0   1Gi        RWO            ebs            58s

NAME                            READY   STATUS    RESTARTS   AGE
pod/mydep-csi-57b5dd78b-kd9lc   1/1     Running   0          46s

2) aos-4_10/upi-on-gcp/versioned-installer-sno
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/54222/parameters/

rohitpatil@ropatil-mac Downloads % oc get nodes
NAME                                                     STATUS   ROLES           AGE   VERSION
ropatil2211-sno-457bw-master-0.c.openshift-qe.internal   Ready    master,worker   33m   v1.22.1+35a59a5

rohitpatil@ropatil-mac Downloads % oc get co etcd storage
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd      4.10.0-0.nightly-2021-11-21-005535   True        False         False      51m
storage   4.10.0-0.nightly-2021-11-21-005535   True        False         False      52m

rohitpatil@ropatil-mac Downloads % oc get sc,pvc,pod -n testgcp
NAME                                                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
storageclass.storage.k8s.io/ebs                      pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   false                  2m39s
storageclass.storage.k8s.io/standard (default)       kubernetes.io/gce-pd    Delete          WaitForFirstConsumer   true                   39m
storageclass.storage.k8s.io/standard-csi             pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   39m

NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc-csi   Bound    pvc-d36f7ccb-d8a8-4e01-bbc5-c6d129acec7c   1Gi        RWO            ebs            106s

NAME                            READY   STATUS    RESTARTS   AGE
pod/mydep-csi-57b5dd78b-gwskq   1/1     Running   0          47s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056