Bug 1961120

Summary: CSI driver operators fail when upgrading a cluster
Product: OpenShift Container Platform Reporter: Jan Safranek <jsafrane>
Component: Storage Assignee: melbeher
Storage sub component: Operators QA Contact: Chao Yang <chaoyang>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: aos-bugs, chaoyang, wking
Version: 4.8   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 23:08:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jan Safranek 2021-05-17 10:00:00 UTC
Description of problem:
Cluster upgrade from 4.7 to 4.8-ish version failed with:

Operator degraded (AWSEBSCSIDriverOperatorCR_AWSEBSDriverServiceMonitorController_SyncError): AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverServiceMonitorControllerDegraded: "servicemonitor.yaml" (string): servicemonitors.monitoring.coreos.com "aws-ebs-csi-driver-controller-monitor" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:aws-ebs-csi-driver-operator" cannot update resource "servicemonitors" in API group "monitoring.coreos.com" in the namespace "openshift-cluster-csi-drivers"
AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverServiceMonitorControllerDegraded: 

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1393205025966133248

The reason is that during the upgrade the ServiceMonitor is updated, not created, and the operator's Role does not grant the "update" permission on it:

https://github.com/openshift/cluster-storage-operator/blob/195603230796c2a7189f6daf45cb58b1a4fb72a3/assets/csidriveroperators/aws-ebs/03_role.yaml#L34-L40

Please check all storage operators (CSI driver operators, CSI snapshot controller operator, vsphere-problem-detector) and ensure that they have permissions to update / patch / maybe delete ServiceMonitor.
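
A minimal sketch of the check requested above, for illustration only. It takes RBAC rules in the same shape as the `rules:` list of a Role manifest and reports which verbs are still missing for ServiceMonitor. The helper name `missing_verbs` and both rule sets are hypothetical; they are not copied from 03_role.yaml or from the fix. The sketch ignores `resourceNames` restrictions.

```python
def missing_verbs(rules, api_group, resource, required):
    """Return the required verbs that no matching rule grants."""
    granted = set()
    for rule in rules:
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        # A rule applies if it names the group/resource or uses the "*" wildcard.
        if api_group not in groups and "*" not in groups:
            continue
        if resource not in resources and "*" not in resources:
            continue
        granted.update(rule.get("verbs", []))
    if "*" in granted:
        return set()
    return set(required) - granted


# Hypothetical rules: enough to create a ServiceMonitor on a fresh install,
# but not to update one that already exists after an upgrade.
rules_before = [
    {
        "apiGroups": ["monitoring.coreos.com"],
        "resources": ["servicemonitors"],
        "verbs": ["get", "list", "watch", "create"],
    },
]

# Hypothetical rules with the verbs this report asks for added.
rules_after = [
    {
        "apiGroups": ["monitoring.coreos.com"],
        "resources": ["servicemonitors"],
        "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
    },
]

required = ["get", "create", "update", "patch", "delete"]

print(sorted(missing_verbs(rules_before, "monitoring.coreos.com", "servicemonitors", required)))
print(sorted(missing_verbs(rules_after, "monitoring.coreos.com", "servicemonitors", required)))
```

On a live cluster, a comparable per-verb check can be done with `oc auth can-i update servicemonitors.monitoring.coreos.com -n openshift-cluster-csi-drivers --as=system:serviceaccount:openshift-cluster-csi-drivers:aws-ebs-csi-driver-operator`.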

Comment 1 melbeher 2021-05-17 16:23:06 UTC
CSI driver operators and vsphere-problem-detector have been fixed here: https://github.com/openshift/cluster-storage-operator/pull/167

Comment 2 melbeher 2021-05-17 16:30:37 UTC
Local Storage Operator has been fixed here: https://github.com/openshift/local-storage-operator/pull/237

Comment 4 Chao Yang 2021-05-24 08:05:51 UTC
Passed on AWS when upgrading from 4.7.0-0.nightly-2021-05-20-112118 to 4.8.0-0.nightly-2021-05-21-233425.
aws-ebs-csi-driver-operator and local-storage-operator passed.

Comment 5 Chao Yang 2021-05-25 01:44:26 UTC
Passed for vsphere-problem-detector and the CSI snapshot controller operator.

Comment 8 errata-xmlrpc 2021-07-27 23:08:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438