Description of problem:
Unable to apply 4.9.0-0.nightly-2021-09-06-055314: the cluster operator storage has not yet successfully rolled out

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-09-06-055314

How reproducible:
once

Steps to Reproduce:
1. Install log:
09-07 10:19:44.005 level=info msg=Cluster operator etcd RecentBackup is Unknown with ControllerStarted:
09-07 10:19:44.005 level=info msg=Cluster operator insights Disabled is False with AsExpected:
09-07 10:19:44.005 level=info msg=Cluster operator network ManagementStateDegraded is False with :
09-07 10:19:44.005 level=error msg=Cluster operator storage Degraded is True with CSIDriverStarter_SyncError: CSIDriverStarterDegraded: [no matches for kind "Role" in version "rbac.authorization.k8s.io/v1", no matches for kind "RoleBinding" in version "rbac.authorization.k8s.io/v1", no matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1", no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1"]
message: 'CSIDriverStarterDegraded: [no matches for kind "Role" in version "rbac.authorization.k8s.io/v1", no matches for kind "RoleBinding" in version "rbac.authorization.k8s.io/v1", no matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1", no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1"]'
reason: CSIDriverStarter_SyncError

2.
oc get co etcd storage
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd      4.9.0-0.nightly-2021-09-06-055314   True        False         False      4h49m
storage   4.9.0-0.nightly-2021-09-06-055314   False       True          True       4h50m   GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment

oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
gcp-pd-csi-driver-controller-f79d97985-pglqj   10/10   Running   0          4h45m
gcp-pd-csi-driver-node-c64t2                   3/3     Running   0          4h45m
gcp-pd-csi-driver-operator-549bd69cbd-v89qr    1/1     Running   0          4h46m

3. oc logs gcp-pd-csi-driver-operator-549bd69cbd-v89qr
I0907 01:34:08.200756 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"gcp-pd-csi-driver-operator", UID:"63478839-002e-44c6-8149-c501548c7805", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RoleBindingCreated' Created RoleBinding.rbac.authorization.k8s.io/gcp-pd-csi-driver-prometheus -n openshift-cluster-csi-drivers because it was missing
E0907 01:35:14.050917 1 base_controller.go:272] StaticResourceController reconciliation failed: "servicemonitor.yaml" (string): etcdserver: leader changed
W0907 01:35:14.584580 1 base_controller.go:236] Updating status of "GCPPDDriverNodeServiceController" failed: Operation cannot be fulfilled on clustercsidrivers.operator.openshift.io "pd.csi.storage.gke.io": the object has been modified; please apply your changes to the latest version and try again
E0907 01:35:14.584628 1 base_controller.go:272] GCPPDDriverNodeServiceController reconciliation failed: etcdserver: leader changed
W0907 01:35:14.598068 1 base_controller.go:236] Updating status of "GCPPDDriverControllerServiceController" failed: Operation cannot be fulfilled on clustercsidrivers.operator.openshift.io "pd.csi.storage.gke.io": the object has been modified; please apply your changes to the latest version and try again
E0907 01:35:14.598109 1 base_controller.go:272] GCPPDDriverControllerServiceController reconciliation failed: etcdserver: leader changed
E0907 01:35:14.711187 1 base_controller.go:272] StaticResourceController reconciliation failed: ["rbac/controller_privileged_binding.yaml" (string): etcdserver: leader changed, Operation cannot be fulfilled on clustercsidrivers.operator.openshift.io "pd.csi.storage.gke.io": the object has been modified; please apply your changes to the latest version and try again]

Actual results:
Unable to apply 4.9.0-0.nightly-2021-09-06-055314: the cluster operator storage has not yet successfully rolled out

Expected results:
Cluster operator storage should install successfully.

Master Log:
Node Log (of failed PODs):
PV Dump:
PVC Dump:
StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
http://virt-openshift-05.lab.eng.nay.redhat.com/chaoyang/must-gather.tar
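The Degraded condition quoted above can also be pulled out of the operator's status programmatically rather than read from the table. A minimal, self-contained sketch (plain Go, no client-go; the `clusterOperator`/`condition` structs and the trimmed sample JSON are illustrative, assuming input like `oc get clusteroperator storage -o json`):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the fields of a ClusterOperator status condition
// that matter for this report.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// clusterOperator models just enough of the ClusterOperator object
// to reach status.conditions.
type clusterOperator struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// degradedCondition decodes the JSON and returns the Degraded
// condition, or nil if the operator does not report one.
func degradedCondition(data []byte) (*condition, error) {
	var co clusterOperator
	if err := json.Unmarshal(data, &co); err != nil {
		return nil, err
	}
	for _, c := range co.Status.Conditions {
		if c.Type == "Degraded" {
			c := c
			return &c, nil
		}
	}
	return nil, nil
}

func main() {
	// Trimmed sample resembling the failing storage CO above.
	sample := []byte(`{"status":{"conditions":[
	  {"type":"Degraded","status":"True","reason":"CSIDriverStarter_SyncError",
	   "message":"CSIDriverStarterDegraded: no matches for kind \"Role\" in version \"rbac.authorization.k8s.io/v1\""}]}}`)
	c, err := degradedCondition(sample)
	if err != nil || c == nil {
		fmt.Println("no Degraded condition")
		return
	}
	fmt.Printf("%s=%s reason=%s\n", c.Type, c.Status, c.Reason)
	// prints: Degraded=True reason=CSIDriverStarter_SyncError
}
```

The same decoding works on the full object from the must-gather, since unknown fields are simply ignored by encoding/json.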
It looks like the API server returned 'no matches for kind "Role" in version "rbac.authorization.k8s.io/v1"' when CSO / its RESTMapper was trying to get related objects. The error was cached by the RESTMapper, and once the API server did start serving the RBAC API, the call was not retried. I can't reproduce the issue; CSO has to hit exactly the window when the RBAC API is not yet registered at the API server.
Experimental PR upstream: https://github.com/kubernetes/kubernetes/pull/104814
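The failure mode above can be sketched without client-go. This is a minimal, illustrative model of a discovery-backed mapper that caches negative "no matches for kind" results, plus the Reset-style invalidation that lets a later sync recover; `cachedMapper` and `discovery` are hypothetical names, not CSO's actual types:

```go
package main

import (
	"errors"
	"fmt"
)

// errNoMatch mimics the error seen while the RBAC API group is not
// yet registered at the API server.
var errNoMatch = errors.New(`no matches for kind "Role" in version "rbac.authorization.k8s.io/v1"`)

// discovery simulates the API server: the RBAC group only appears
// after readyAfter lookups have already happened.
type discovery struct {
	calls      int
	readyAfter int
}

func (d *discovery) lookup(kind string) (string, error) {
	d.calls++
	if d.calls <= d.readyAfter {
		return "", errNoMatch
	}
	return "rbac.authorization.k8s.io/v1/" + kind, nil
}

// cachedMapper caches lookup results, including failures - the
// behaviour that kept the operator Degraded after RBAC appeared.
type cachedMapper struct {
	d     *discovery
	hits  map[string]string
	miss  map[string]error
}

func newCachedMapper(d *discovery) *cachedMapper {
	return &cachedMapper{d: d, hits: map[string]string{}, miss: map[string]error{}}
}

func (m *cachedMapper) mapping(kind string) (string, error) {
	if r, ok := m.hits[kind]; ok {
		return r, nil
	}
	if err, ok := m.miss[kind]; ok {
		return "", err // cached negative result: never re-queried
	}
	r, err := m.d.lookup(kind)
	if err != nil {
		m.miss[kind] = err
		return "", err
	}
	m.hits[kind] = r
	return r, nil
}

// reset drops all cached entries so the next call re-queries
// discovery, in the spirit of RESTMapper Reset() in the upstream fix.
func (m *cachedMapper) reset() {
	m.hits = map[string]string{}
	m.miss = map[string]error{}
}

func main() {
	d := &discovery{readyAfter: 1} // RBAC shows up after the first lookup
	m := newCachedMapper(d)

	if _, err := m.mapping("Role"); err != nil {
		fmt.Println("first sync:", err)
	}
	// Without a reset, the stale negative entry is served forever.
	if _, err := m.mapping("Role"); err != nil {
		fmt.Println("second sync (cached):", err)
	}
	// Resetting on a no-match error lets the retry succeed.
	m.reset()
	if r, err := m.mapping("Role"); err == nil {
		fmt.Println("after reset:", r)
	}
}
```

The point of the sketch: the second sync fails without ever contacting discovery again, which is why the operator stayed Degraded until something invalidated the cache.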
Test status: PASS
Tested on build: 4.10.0-0.nightly-2021-11-21-005535 for the below-mentioned flexy templates:
1) aos-4_10/upi-on-gcp/versioned-installer-csidriver-sno-ci
2) aos-4_10/upi-on-gcp/versioned-installer-sno

Result:

Additional info: Created sc, pvc, pod and checked for Running status.

1) aos-4_10/upi-on-gcp/versioned-installer-csidriver-sno-ci
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/54221/parameters/
versioned-installer-csidriver-sno-ci

rohitpatil@ropatil-mac Downloads % oc get nodes
NAME                                                     STATUS   ROLES           AGE   VERSION
ropatil22112021-dzqgn-master-0.c.openshift-qe.internal   Ready    master,worker   44m   v1.22.1+35a59a5

rohitpatil@ropatil-mac Downloads % oc get co etcd storage
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd      4.10.0-0.nightly-2021-11-21-005535   True        False         False      51m
storage   4.10.0-0.nightly-2021-11-21-005535   True        False         False      52m

rohitpatil@ropatil-mac Downloads % oc get sc,pvc,pod -n testgcp
NAME                                                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
storageclass.storage.k8s.io/ebs                      pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   false                  75s
storageclass.storage.k8s.io/standard                 kubernetes.io/gce-pd    Delete          WaitForFirstConsumer   true                   46m
storageclass.storage.k8s.io/standard-csi (default)   pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   45m

NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc-csi   Bound    pvc-76128e36-decd-4416-a0f4-7ef763b4e9f0   1Gi        RWO            ebs            58s

NAME                            READY   STATUS    RESTARTS   AGE
pod/mydep-csi-57b5dd78b-kd9lc   1/1     Running   0          46s

2) aos-4_10/upi-on-gcp/versioned-installer-sno
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/54222/parameters/

rohitpatil@ropatil-mac Downloads % oc get nodes
NAME                                                     STATUS   ROLES           AGE   VERSION
ropatil2211-sno-457bw-master-0.c.openshift-qe.internal   Ready    master,worker   33m   v1.22.1+35a59a5

rohitpatil@ropatil-mac Downloads % oc get co etcd storage
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd      4.10.0-0.nightly-2021-11-21-005535   True        False         False      51m
storage   4.10.0-0.nightly-2021-11-21-005535   True        False         False      52m

rohitpatil@ropatil-mac Downloads % oc get sc,pvc,pod -n testgcp
NAME                                                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
storageclass.storage.k8s.io/ebs                      pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   false                  2m39s
storageclass.storage.k8s.io/standard (default)       kubernetes.io/gce-pd    Delete          WaitForFirstConsumer   true                   39m
storageclass.storage.k8s.io/standard-csi             pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   39m

NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc-csi   Bound    pvc-d36f7ccb-d8a8-4e01-bbc5-c6d129acec7c   1Gi        RWO            ebs            106s

NAME                            READY   STATUS    RESTARTS   AGE
pod/mydep-csi-57b5dd78b-gwskq   1/1     Running   0          47s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056