+++ This bug was initially created as a clone of Bug #1907812 +++
Description of problem:
A 4.6.8 cluster upgrades successfully to the latest 4.7.0-0.nightly-2020-12-15-042043. The subsequent downgrade to 4.6.8 then gets stuck with:
“Unable to apply 4.6.8: the cluster operator storage is degraded”
Adding TestBlocker because this blocks testing of epic MSTR-1055.
Version-Release number of selected component (if applicable):
4.6.8 upgrade to 4.7.0-0.nightly-2020-12-15-042043, then downgrade back to 4.6.8
How reproducible:
Tried once so far
Steps to Reproduce:
1. Successfully install 4.6.8 IPI AWS env
2. Successfully upgrade to 4.7.0-0.nightly-2020-12-15-042043
3. Then downgrade to 4.6.8
Actual results:
Step 3 fails with the storage clusteroperator stuck in Degraded:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-15-042043   True        True          116m    Unable to apply 4.6.8: the cluster operator storage is degraded
$ oc describe co storage
Last Transition Time: 2020-12-15T07:04:10Z
Message: AWSEBSCSIDriverOperatorCRDegraded: ResourceSyncControllerDegraded: configmaps "kube-cloud-config" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:aws-ebs-csi-driver-operator" cannot get resource "configmaps" in API group "" in the namespace "openshift-config-managed"
Last Transition Time: 2020-12-15T07:04:47Z
Last Transition Time: 2020-12-15T06:05:22Z
Last Transition Time: 2020-12-15T02:24:45Z
Expected results:
The downgrade should succeed.
Additional info:
In the past, 4.6 to 4.5 downgrade bugs were found in other clusteroperators (bug 1868376, bug 1885848, bug 1877316), and they were fixed. The 4.7 to 4.6 downgrade should succeed too.
--- Additional comment from Jan Safranek on 2020-12-15 17:20:57 UTC ---
Xingxing, please attach a must-gather next time; it will speed up the investigation a lot!
1. In 4.7, we introduced syncing of AWS CA bundle into the driver namespace, so the driver can talk to AWS API. https://github.com/openshift/aws-ebs-csi-driver-operator/pull/102
2. When downgrading to 4.6, 4.6 RBAC is applied first.
3. The 4.7 AWS operator is still running and tries to sync the CA bundle, but it no longer has the RBAC permissions to do so. The sync fails and the corresponding condition is raised:
message: 'configmaps "kube-cloud-config" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:aws-ebs-csi-driver-operator"
cannot get resource "configmaps" in API group "" in the namespace "openshift-config-managed"'
4. The CSO Deployment is downgraded to 4.6, which in turn downgrades the AWS operator to its 4.6 version.
5. The 4.6 AWS operator runs just fine, but it does not sync the CA bundle, nor does it run any other syncer, so *nothing clears the ResourceSyncControllerDegraded condition*
-> the operator stays degraded forever.
Workaround: oc delete clustercsidriver --all
CSO will re-create it and everything should re-sync.
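To make the sequence above concrete, here is a minimal Go sketch of how a condition can outlive the controller that set it. It is a simplified model, not the operator's actual code: the `Condition` type, the helper names, and `SomeOtherControllerDegraded` are hypothetical, and the rule "any True condition whose type ends in Degraded degrades the operator" is an assumption about how the conditions are aggregated.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a simplified, hypothetical stand-in for an operator status condition.
type Condition struct {
	Type   string
	Status string // "True" or "False"
}

// setCondition inserts the condition or updates the existing entry of the same type.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// isDegraded reports whether any condition with a "Degraded" suffix is True.
// This aggregation rule is an assumption made for the sketch.
func isDegraded(conds []Condition) bool {
	for _, c := range conds {
		if strings.HasSuffix(c.Type, "Degraded") && c.Status == "True" {
			return true
		}
	}
	return false
}

// simulateDowngrade replays steps 3 to 5 against an in-memory condition list.
func simulateDowngrade() []Condition {
	var conds []Condition
	// Step 3: the 4.7 operator fails to sync the CA bundle and raises its condition.
	conds = setCondition(conds, Condition{Type: "ResourceSyncControllerDegraded", Status: "True"})
	// Step 5: the 4.6 operator only updates conditions of controllers it runs.
	// "SomeOtherControllerDegraded" is a hypothetical example of such a condition.
	conds = setCondition(conds, Condition{Type: "SomeOtherControllerDegraded", Status: "False"})
	return conds
}

func main() {
	conds := simulateDowngrade()
	fmt.Println("degraded:", isDegraded(conds))
}
```

In this model the 4.6 operator never touches ResourceSyncControllerDegraded, so the aggregated state stays degraded until something deletes or resets that entry, which is what the workaround achieves by having the object re-created.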
Brainstorming some solutions:
I. The operator somehow clears all conditions it does not manage. But how does it know which conditions those are?
II. Deploy 4.7 RBAC to sync the CA bundle as a separate ClusterRole / ClusterRoleBinding. Downgrade won't remove it and 4.7 operator won't report Degraded. In other words, when adding anything to RBAC, always add it as a separate ClusterRole to prevent similar errors in the future.
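Option II can be illustrated with a small Go model of the rollback. This is only a sketch under stated assumptions: applying a release's manifests overwrites same-named objects and leaves unknown objects alone (as the comment above says, "Downgrade won't remove it"), and the role names `operator-role` and `operator-cloud-config-role` as well as the placeholder secrets rule are hypothetical.

```go
package main

import "fmt"

// Rule is a simplified RBAC permission: a verb on a resource in a namespace.
type Rule struct {
	Verb, Resource, Namespace string
}

// applyManifests overwrites every object named in the manifests and leaves
// objects it does not know about untouched (assumption of this sketch).
func applyManifests(cluster, manifests map[string][]Rule) {
	for name, rules := range manifests {
		cluster[name] = rules
	}
}

// allowed reports whether any role currently in the cluster grants the rule.
func allowed(cluster map[string][]Rule, want Rule) bool {
	for _, rules := range cluster {
		for _, r := range rules {
			if r == want {
				return true
			}
		}
	}
	return false
}

// canSyncAfterDowngrade installs 4.7 RBAC, rolls back to 4.6 RBAC, and reports
// whether the still-running 4.7 operator may read kube-cloud-config.
func canSyncAfterDowngrade(separateRole bool) bool {
	getCloudConfig := Rule{"get", "configmaps", "openshift-config-managed"}
	// Placeholder for permissions shared by both releases (hypothetical).
	base := []Rule{{"get", "secrets", "openshift-cluster-csi-drivers"}}

	release46 := map[string][]Rule{"operator-role": base}
	release47 := map[string][]Rule{}
	if separateRole {
		release47["operator-role"] = base
		release47["operator-cloud-config-role"] = []Rule{getCloudConfig}
	} else {
		release47["operator-role"] = append(append([]Rule{}, base...), getCloudConfig)
	}

	cluster := map[string][]Rule{}
	applyManifests(cluster, release47) // upgrade to 4.7
	applyManifests(cluster, release46) // downgrade: 4.6 RBAC is applied first
	return allowed(cluster, getCloudConfig)
}

func main() {
	fmt.Println("rule added to existing role:", canSyncAfterDowngrade(false))
	fmt.Println("rule in separate role:      ", canSyncAfterDowngrade(true))
}
```

With the rule folded into the existing role, the 4.6 manifests overwrite it and the permission disappears; with a separate role, the rollback never touches it and the 4.7 operator can keep syncing until it is replaced.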
--- Additional comment from Jan Safranek on 2021-01-05 10:08:55 UTC ---
There must be two fixes:
* in 4.7: use separate RBAC objects for the kube-cloud-config config map, so they are not removed when downgrading to 4.6
* in 4.6.z: remove the ResourceSyncControllerDegraded condition if it was set by the 4.7 version of the operator for any reason.
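The 4.6.z half of the fix amounts to dropping a condition by type. A minimal Go sketch of that idea follows; it is not the code from the actual patch, and the `Condition` type, the `removeCondition` helper, and `SomeOtherControllerDegraded` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for an operator status condition.
type Condition struct {
	Type   string
	Status string
}

// removeCondition returns a copy of conds without any entry of the given type.
// It is a no-op when the condition was never set, so it is safe to run on
// clusters that were not downgraded from 4.7.
func removeCondition(conds []Condition, condType string) []Condition {
	kept := make([]Condition, 0, len(conds))
	for _, c := range conds {
		if c.Type != condType {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	conds := []Condition{
		{Type: "ResourceSyncControllerDegraded", Status: "True"}, // left behind by 4.7
		{Type: "SomeOtherControllerDegraded", Status: "False"},   // hypothetical 4.6 condition
	}
	conds = removeCondition(conds, "ResourceSyncControllerDegraded")
	fmt.Println(len(conds), conds[0].Type)
}
```

Removing the condition unconditionally, rather than only when it is True, matches the "for any reason" wording above: the 4.6 operator does not own this condition, so any value it finds there is stale.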
Sorry, missed bug #1900239
*** This bug has been marked as a duplicate of bug 1900239 ***
Sorry again, wrong bug.
Waiting for 1907812 to get VERIFIED by QA. BTW, this BZ is just a preventive measure; 1907812 should be enough to fix this.
Tried to verify this PR pre-merge; the PR LGTM.
Upgrade path: 4.6 -> 4.7 -> 4.6
CSO upgrades successfully:
$ oc get co storage
NAME      VERSION                                           AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.6.0-0.ci.test-2021-02-08-085615-ci-ln-ikkh8ck   True        False         False      6m35s
Verified with: 4.6.0-0.nightly-2021-02-13-034601
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6.18 bug fix update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.