+++ This bug was initially created as a clone of Bug #1907812 +++

Description of problem:
4.6.8 successfully upgrades to the latest 4.7.0-0.nightly-2020-12-15-042043. The subsequent downgrade to 4.6 then gets stuck at:
"Unable to apply 4.6.8: the cluster operator storage is degraded"
Adding TestBlocker because this blocks testing of epic issue MSTR-1055.

Version-Release number of selected component (if applicable):
4.6.8 upgraded to 4.7.0-0.nightly-2020-12-15-042043, then downgraded back to 4.6.8

How reproducible:
Tried once so far

Steps to Reproduce:
1. Successfully install a 4.6.8 IPI AWS env
2. Successfully upgrade to 4.7.0-0.nightly-2020-12-15-042043
3. Then downgrade to 4.6.8

Actual results:
Step 3 fails with clusteroperator storage stuck:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-15-042043   True        True          116m    Unable to apply 4.6.8: the cluster operator storage is degraded

$ oc describe co storage
Name:         storage
...
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-12-15T07:04:10Z
    Message:               AWSEBSCSIDriverOperatorCRDegraded: ResourceSyncControllerDegraded: configmaps "kube-cloud-config" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:aws-ebs-csi-driver-operator" cannot get resource "configmaps" in API group "" in the namespace "openshift-config-managed"
    Reason:                AWSEBSCSIDriverOperatorCR_ResourceSyncController_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-12-15T07:04:47Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-12-15T06:05:22Z
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-12-15T02:24:45Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:
    Name:      openshift-cluster-storage-operator
    Resource:  namespaces
    Group:
    Name:      openshift-cluster-csi-drivers
    Resource:  namespaces
    Group:
    Name:      openshift-manila-csi-driver
    Resource:  namespaces
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  storages
    Group:     operator.openshift.io
    Name:      ebs.csi.aws.com
    Resource:  clustercsidrivers
    Group:     operator.openshift.io
    Name:      csi.ovirt.org
    Resource:  clustercsidrivers
    Group:     operator.openshift.io
    Name:      manila.csi.openstack.org
    Resource:  clustercsidrivers
  Versions:
    Name:     operator
    Version:  4.6.8
    Name:     AWSEBSCSIDriverOperator
    Version:  4.6.8
Events:       <none>

Expected results:
Should downgrade successfully

Additional info:
In the past, 4.6 to 4.5 downgrade bugs were found in other clusteroperators (bug 1868376, bug 1885848, bug 1877316), and they were fixed. The 4.7 to 4.6 downgrade should succeed too.

--- Additional comment from Jan Safranek on 2020-12-15 17:20:57 UTC ---

Xingxing, please attach must-gather next time, it will speed up investigation a lot!

Working theory:

1. In 4.7, we introduced syncing of the AWS CA bundle into the driver namespace, so the driver can talk to the AWS API.
   https://github.com/openshift/aws-ebs-csi-driver-operator/pull/102

2. When downgrading to 4.6, the 4.6 RBAC is applied first.

3. The 4.7 AWS operator is still running and tries to sync the CA bundle, but it no longer has the RBAC to do so. The sync fails and the corresponding condition is raised:

     message: 'configmaps "kube-cloud-config" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:aws-ebs-csi-driver-operator" cannot get resource "configmaps" in API group "" in the namespace "openshift-config-managed"'
     reason: Error
     status: "True"
     type: ResourceSyncControllerDegraded

4. The CSO Deployment is downgraded to 4.6, which downgrades the AWS operator to the 4.6 version.

5. The 4.6 AWS operator runs just fine, but it does not sync the CA bundle, nor does it run any other syncer, and *nothing clears the ResourceSyncControllerDegraded condition* -> the operator is degraded forever.

Workaround:

  oc delete clustercsidriver --all

CSO will re-create it and everything should re-sync.

Brainstorming some solutions:

I. The operator somehow clears all conditions it does not manage. But how does it know which ones those are?

II. Deploy the 4.7 RBAC for syncing the CA bundle as a separate ClusterRole / ClusterRoleBinding. A downgrade won't remove it, so the 4.7 operator won't report Degraded. In other words, when adding anything to RBAC, always add it as a separate ClusterRole to prevent similar errors in the future.

--- Additional comment from Jan Safranek on 2021-01-05 10:08:55 UTC ---

There must be two fixes:
* In 4.7: use separate RBAC objects for the kube-cloud-config config map, so they are not removed when downgrading to 4.6.
* In 4.6.z: remove the ResourceSyncControllerDegraded condition if it was set by the 4.7 version of the operator for any reason.
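The 4.6.z half of the fix (dropping a condition left behind by the 4.7 operator) can be illustrated with a minimal sketch. This is not the operator's actual code: the dict-based condition shape, the helper names `remove_stale_conditions` and `is_degraded`, and the condition type `ExampleControllerAvailable` are hypothetical, and the sketch assumes the overall Degraded state is derived from any condition whose type ends in "Degraded" with status "True".

```python
from typing import Dict, FrozenSet, List

Condition = Dict[str, str]

# Condition types that the running (4.6) operator has no controller for.
# Only ResourceSyncControllerDegraded comes from this bug report; treating it
# as a fixed, hard-coded list is an assumption made for this illustration.
STALE_CONDITION_TYPES: FrozenSet[str] = frozenset({"ResourceSyncControllerDegraded"})


def remove_stale_conditions(
    conditions: List[Condition],
    stale_types: FrozenSet[str] = STALE_CONDITION_TYPES,
) -> List[Condition]:
    """Return a new list without conditions that no controller will ever update."""
    return [c for c in conditions if c.get("type") not in stale_types]


def is_degraded(conditions: List[Condition]) -> bool:
    """Assumed aggregation rule: any '*Degraded' condition with status 'True' wins."""
    return any(
        c.get("type", "").endswith("Degraded") and c.get("status") == "True"
        for c in conditions
    )


if __name__ == "__main__":
    # Status as left behind by the 4.7 operator after the RBAC was downgraded.
    leftover = [
        {"type": "ResourceSyncControllerDegraded", "status": "True", "reason": "Error"},
        {"type": "ExampleControllerAvailable", "status": "True"},  # hypothetical
    ]
    print("degraded before cleanup:", is_degraded(leftover))
    print("degraded after cleanup:", is_degraded(remove_stale_conditions(leftover)))
```

A hard-coded list sidesteps the "how does it know?" question from solution I: the 4.6.z operator only needs to know the specific condition types that a newer version may leave behind.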
Sorry, missed bug #1900239 *** This bug has been marked as a duplicate of bug 1900239 ***
Sorry again, wrong bug.
Waiting for 1907812 to get VERIFIED by QA. BTW, this BZ is just a preventive measure; 1907812 should be enough to fix this.
Tried to verify this PR pre-merge; the PR LGTM. The 4.6->4.7->4.6 path completes successfully for CSO:

$ oc get co storage
NAME      VERSION                                           AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.6.0-0.ci.test-2021-02-08-085615-ci-ln-ikkh8ck   True        False         False      6m35s
Verified with: 4.6.0-0.nightly-2021-02-13-034601
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.18 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0510