Bug 1912720

Summary: [4.6] 4.7 to 4.6 downgrade stuck in clusteroperator storage
Product: OpenShift Container Platform Reporter: Jan Safranek <jsafrane>
Component: Storage Assignee: Jan Safranek <jsafrane>
Storage sub component: Operators QA Contact: Qin Ping <piqin>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: aos-bugs, jsafrane, piqin, wduan, xxia, yunjiang
Version: 4.7 Keywords: Reopened
Target Milestone: ---   
Target Release: 4.6.z   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1907812 Environment:
Last Closed: 2021-02-22 13:54:32 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1907812    
Bug Blocks:    

Description Jan Safranek 2021-01-05 10:35:00 UTC
+++ This bug was initially created as a clone of Bug #1907812 +++

Description of problem:
4.6.8 upgrades successfully to the latest 4.7.0-0.nightly-2020-12-15-042043. The subsequent downgrade to 4.6 then gets stuck with:
“Unable to apply 4.6.8: the cluster operator storage is degraded”

Adding TestBlocker because this blocks testing of epic issue MSTR-1055.

Version-Release number of selected component (if applicable):
4.6.8 upgrade to 4.7.0-0.nightly-2020-12-15-042043, then downgrade back to 4.6.8

How reproducible:
Tried once so far

Steps to Reproduce:
1. Successfully install 4.6.8 IPI AWS env
2. Successfully upgrade to 4.7.0-0.nightly-2020-12-15-042043
3. Then downgrade to 4.6.8

Actual results:
Step 3 fails with clusteroperator storage stuck:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-15-042043   True        True          116m    Unable to apply 4.6.8: the cluster operator storage is degraded

$ oc describe co storage
Name:         storage
    Last Transition Time:  2020-12-15T07:04:10Z
    Message:               AWSEBSCSIDriverOperatorCRDegraded: ResourceSyncControllerDegraded: configmaps "kube-cloud-config" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:aws-ebs-csi-driver-operator" cannot get resource "configmaps" in API group "" in the namespace "openshift-config-managed"
    Reason:                AWSEBSCSIDriverOperatorCR_ResourceSyncController_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-12-15T07:04:47Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-12-15T06:05:22Z
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-12-15T02:24:45Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Name:      openshift-cluster-storage-operator
    Resource:  namespaces
    Name:      openshift-cluster-csi-drivers
    Resource:  namespaces
    Name:      openshift-manila-csi-driver
    Resource:  namespaces
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  storages
    Group:     operator.openshift.io
    Name:      ebs.csi.aws.com
    Resource:  clustercsidrivers
    Group:     operator.openshift.io
    Name:      csi.ovirt.org
    Resource:  clustercsidrivers
    Group:     operator.openshift.io
    Name:      manila.csi.openstack.org
    Resource:  clustercsidrivers
    Name:     operator
    Version:  4.6.8
    Name:     AWSEBSCSIDriverOperator
    Version:  4.6.8
Events:       <none>

Expected results:
The downgrade should succeed.

Additional info:
In the past, 4.6 to 4.5 downgrade bugs were found in other clusteroperators: bug 1868376, bug 1885848, bug 1877316, and they were fixed. 4.7 to 4.6 downgrade should succeed too.

--- Additional comment from Jan Safranek on 2020-12-15 17:20:57 UTC ---

Xingxing, please attach must-gather next time, it will speed up investigation a lot!

Working theory:

1. In 4.7, we introduced syncing of the AWS CA bundle into the driver namespace, so the driver can talk to the AWS API. https://github.com/openshift/aws-ebs-csi-driver-operator/pull/102

2. When downgrading to 4.6, 4.6 RBAC is applied first.

3. The 4.7 AWS operator is still running and tries to sync the CA bundle, but it no longer has the RBAC permissions to do so. The sync fails and the corresponding condition is raised:

      message: 'configmaps "kube-cloud-config" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:aws-ebs-csi-driver-operator"
        cannot get resource "configmaps" in API group "" in the namespace "openshift-config-managed"'
      reason: Error
      status: "True"
      type: ResourceSyncControllerDegraded

4. CSO Deployment is downgraded to 4.6, which downgrades AWS operator to 4.6 version.

5. The 4.6 AWS operator runs just fine, but it does not sync the CA bundle, nor does it run any other syncer, so *nothing clears the ResourceSyncControllerDegraded condition*
-> the operator stays degraded forever.
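
The sequence in steps 1-5 can be modelled in a few lines. This is a toy sketch, not the operator's code: conditions are plain dictionaries, `is_degraded` and `run_sync` are hypothetical names, and it assumes an aggregation rule where any condition whose type ends in "Degraded" with status "True" marks the operator as degraded.

```python
# Toy model of the stuck condition; not the real operator code.
# Assumption: the aggregated Degraded status is True when any condition
# whose type ends in "Degraded" has status "True".

def is_degraded(conditions):
    return any(
        c["type"].endswith("Degraded") and c["status"] == "True"
        for c in conditions
    )

def run_sync(conditions, managed):
    """One sync pass: controllers rewrite only the conditions they manage.

    `managed` maps condition type -> status reported in this pass.
    Conditions of any other type are left untouched.
    """
    kept = [c for c in conditions if c["type"] not in managed]
    fresh = [{"type": t, "status": s} for t, s in managed.items()]
    return kept + fresh

# 4.7 operator after the 4.6 RBAC was applied: the CA bundle sync fails.
conditions = run_sync([], {"ResourceSyncControllerDegraded": "True"})

# 4.6 operator: it has no resource syncer, so it never writes (or clears)
# that condition. "SomeOtherControllerDegraded" is a made-up placeholder.
conditions = run_sync(conditions, {"SomeOtherControllerDegraded": "False"})

print(is_degraded(conditions))  # True: the stale condition is still present
```

Because each pass only touches the condition types it manages, the condition written by the 4.7 operator survives every later pass of the 4.6 operator.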

Workaround: oc delete clustercsidriver --all
CSO will re-create it and everything should re-sync.

Brainstorming some solutions:
I. The operator somehow clears all conditions it does not manage. But how does it know?
II. Deploy 4.7 RBAC to sync the CA bundle as a separate ClusterRole / ClusterRoleBinding. Downgrade won't remove it and 4.7 operator won't report Degraded. In other words, when adding anything to RBAC, always add it as a separate ClusterRole to prevent similar errors in the future.
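
Option II can be illustrated with a small model. This is only a sketch under the assumption stated in option II, namely that applying the 4.6 manifests overwrites the objects 4.6 ships and leaves other objects alone; it is not how the cluster-version operator is implemented, and all role names and rule strings are hypothetical placeholders.

```python
# Toy model of applying release manifests by object name.
# Assumption (from option II above): a downgrade overwrites objects that
# the target release ships and does not remove objects it does not know.

def apply_release(cluster, manifests):
    updated = dict(cluster)
    updated.update(manifests)
    return updated

def can_get_cloud_config(cluster):
    return any("get kube-cloud-config" in rules for rules in cluster.values())

# Role names and rule strings are hypothetical placeholders.
release_46 = {"operator-role": ["manage driver"]}

# Variant A: 4.7 adds the new rule to the existing role.
release_47_merged = {
    "operator-role": ["manage driver", "get kube-cloud-config"],
}

# Variant B: 4.7 ships the new rule as a separate role.
release_47_separate = {
    "operator-role": ["manage driver"],
    "operator-cloud-config-role": ["get kube-cloud-config"],
}

def after_downgrade(release_47):
    cluster = apply_release({}, release_47)      # upgrade to 4.7
    return apply_release(cluster, release_46)    # downgrade to 4.6

print("merged:", can_get_cloud_config(after_downgrade(release_47_merged)))
print("separate:", can_get_cloud_config(after_downgrade(release_47_separate)))
```

In the merged variant the downgrade overwrites the role and the permission disappears while the 4.7 operator is still running; in the separate variant the extra role is left in place.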

--- Additional comment from Jan Safranek on 2021-01-05 10:08:55 UTC ---

There must be two fixes:

* in 4.7: use separate RBAC objects for the kube-cloud-config config map, so they are not removed when downgrading to 4.6
* in 4.6.z: remove the ResourceSyncControllerDegraded condition if it has been set by the 4.7 version of the operator for any reason.
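
The 4.6.z half of the fix amounts to dropping one condition type from the operator status. A minimal sketch of that idea follows, assuming conditions are held as a list of dictionaries; `remove_condition` is a hypothetical helper, not the function used in the actual patch, and the sample conditions are illustrative.

```python
# Sketch of the 4.6.z idea: drop a condition type that the 4.6 operator
# does not own. Helper name and sample data are illustrative only.
STALE_TYPE = "ResourceSyncControllerDegraded"

def remove_condition(conditions, condition_type):
    """Return a new list without any condition of the given type."""
    return [c for c in conditions if c["type"] != condition_type]

conditions = [
    {"type": STALE_TYPE, "status": "True", "reason": "Error"},
    {"type": "Available", "status": "True", "reason": "AsExpected"},
]

cleaned = remove_condition(conditions, STALE_TYPE)
print([c["type"] for c in cleaned])
```

The removal is unconditional on status, which matches the "for any reason" wording above: the 4.6 operator never sets this condition itself, so any occurrence is stale.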

Comment 1 Jan Safranek 2021-01-05 15:05:45 UTC
Sorry, missed bug #1900239

*** This bug has been marked as a duplicate of bug 1900239 ***

Comment 2 Jan Safranek 2021-01-05 15:05:55 UTC
Sorry again, wrong bug.

Comment 3 Jan Safranek 2021-01-15 11:05:05 UTC
Waiting for 1907812 to get VERIFIED by QA. BTW, this BZ is just a preventive measure; 1907812 should be enough to fix this.

Comment 4 Qin Ping 2021-02-08 11:54:03 UTC
Tried to verify this PR pre-merge; the PR LGTM.

The upgrade path 4.6->4.7->4.6

CSO upgraded successfully:
$ oc get co storage
NAME      VERSION                                           AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.6.0-0.ci.test-2021-02-08-085615-ci-ln-ikkh8ck   True        False         False      6m35s

Comment 7 Qin Ping 2021-02-18 03:22:37 UTC
Verified with: 4.6.0-0.nightly-2021-02-13-034601

Comment 9 errata-xmlrpc 2021-02-22 13:54:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.