Bug 1912720 - [4.6] 4.7 to 4.6 downgrade stuck in clusteroperator storage
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Jan Safranek
QA Contact: Qin Ping
Depends On: 1907812
Reported: 2021-01-05 10:35 UTC by Jan Safranek
Modified: 2021-02-22 13:54 UTC

Clone Of: 1907812
Last Closed: 2021-02-22 13:54:32 UTC

Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift aws-ebs-csi-driver-operator pull 105 | 0 | None | closed | Bug 1912720: Remove stale ResourceSyncControllerDegraded condition | 2021-02-18 01:21:18 UTC
Red Hat Product Errata | RHBA-2021:0510 | 0 | None | None | None | 2021-02-22 13:54:47 UTC

Description Jan Safranek 2021-01-05 10:35:00 UTC
+++ This bug was initially created as a clone of Bug #1907812 +++

Description of problem:
A 4.6.8 cluster upgrades successfully to the latest 4.7.0-0.nightly-2020-12-15-042043. The subsequent downgrade to 4.6.8 then gets stuck with:
"Unable to apply 4.6.8: the cluster operator storage is degraded"

Adding TestBlocker because this blocks testing of epic MSTR-1055.

Version-Release number of selected component (if applicable):
4.6.8 upgraded to 4.7.0-0.nightly-2020-12-15-042043, then downgraded back to 4.6.8

How reproducible:
Tried once so far

Steps to Reproduce:
1. Install a 4.6.8 IPI cluster on AWS (succeeds).
2. Upgrade to 4.7.0-0.nightly-2020-12-15-042043 (succeeds).
3. Downgrade to 4.6.8.

Actual results:
Step 3 fails; the storage clusteroperator stays degraded:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-15-042043   True        True          116m    Unable to apply 4.6.8: the cluster operator storage is degraded

$ oc describe co storage
Name:         storage
...
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-12-15T07:04:10Z
    Message:               AWSEBSCSIDriverOperatorCRDegraded: ResourceSyncControllerDegraded: configmaps "kube-cloud-config" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:aws-ebs-csi-driver-operator" cannot get resource "configmaps" in API group "" in the namespace "openshift-config-managed"
    Reason:                AWSEBSCSIDriverOperatorCR_ResourceSyncController_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-12-15T07:04:47Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-12-15T06:05:22Z
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-12-15T02:24:45Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:
    Name:      openshift-cluster-storage-operator
    Resource:  namespaces
    Group:
    Name:      openshift-cluster-csi-drivers
    Resource:  namespaces
    Group:
    Name:      openshift-manila-csi-driver
    Resource:  namespaces
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  storages
    Group:     operator.openshift.io
    Name:      ebs.csi.aws.com
    Resource:  clustercsidrivers
    Group:     operator.openshift.io
    Name:      csi.ovirt.org
    Resource:  clustercsidrivers
    Group:     operator.openshift.io
    Name:      manila.csi.openstack.org
    Resource:  clustercsidrivers
  Versions:
    Name:     operator
    Version:  4.6.8
    Name:     AWSEBSCSIDriverOperator
    Version:  4.6.8
Events:       <none>


Expected results:
The downgrade should succeed.

Additional info:
In the past, 4.6 to 4.5 downgrade bugs were found and fixed in other clusteroperators (bug 1868376, bug 1885848, bug 1877316). A 4.7 to 4.6 downgrade should succeed too.

--- Additional comment from Jan Safranek on 2020-12-15 17:20:57 UTC ---

Xingxing, please attach a must-gather next time; it will speed up the investigation a lot!

Working theory:

1. In 4.7, we introduced syncing of the AWS CA bundle into the driver namespace, so the driver can talk to the AWS API. https://github.com/openshift/aws-ebs-csi-driver-operator/pull/102

2. When downgrading to 4.6, 4.6 RBAC is applied first.

3. The 4.7 AWS operator is still running and tries to sync the CA bundle, but it no longer has the RBAC permissions to do so. The sync fails and the corresponding condition is set:

      message: 'configmaps "kube-cloud-config" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:aws-ebs-csi-driver-operator"
        cannot get resource "configmaps" in API group "" in the namespace "openshift-config-managed"'
      reason: Error
      status: "True"
      type: ResourceSyncControllerDegraded

4. The CSO Deployment is downgraded to 4.6, which in turn downgrades the AWS operator to its 4.6 version.

5. The 4.6 AWS operator runs fine, but it does not sync the CA bundle, nor does it run any other syncer, so *nothing clears the ResourceSyncControllerDegraded condition*
-> the operator is degraded forever.
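
The sequence above can be illustrated with a small, self-contained Python model. It is only a sketch under stated assumptions: plain dicts stand in for the conditions stored in the ClusterCSIDriver status, the rule that any True condition whose type ends in "Degraded" makes the operator Degraded is an assumption about how the status is aggregated, and `ExampleControllerDegraded` is a hypothetical condition name. None of this is code from the operator.

```python
# Illustrative model only: plain dicts stand in for the conditions in the
# ClusterCSIDriver status; this is not code from the operator.

def set_condition(conditions, cond_type, status, message=""):
    """Insert or update a condition, keyed by its type."""
    conditions[cond_type] = {"status": status, "message": message}


def is_degraded(conditions):
    """Assumed aggregation rule: any '*Degraded' condition that is True wins."""
    return any(
        cond["status"] == "True"
        for cond_type, cond in conditions.items()
        if cond_type.endswith("Degraded")
    )


def sync_4_7(conditions, has_rbac):
    """4.7 operator: runs the CA bundle syncer and reports the result."""
    if has_rbac:
        set_condition(conditions, "ResourceSyncControllerDegraded", "False")
    else:
        set_condition(
            conditions,
            "ResourceSyncControllerDegraded",
            "True",
            'configmaps "kube-cloud-config" is forbidden',
        )


def sync_4_6(conditions):
    """4.6 operator: has no CA bundle syncer, so it never touches
    ResourceSyncControllerDegraded. 'ExampleControllerDegraded' is a
    hypothetical name for a condition that 4.6 does manage."""
    set_condition(conditions, "ExampleControllerDegraded", "False")


conditions = {}
sync_4_7(conditions, has_rbac=True)   # 4.7 running with 4.7 RBAC
sync_4_7(conditions, has_rbac=False)  # steps 2-3: 4.6 RBAC applied, 4.7 operator still running
sync_4_6(conditions)                  # steps 4-5: operator downgraded to 4.6
print(is_degraded(conditions))        # True: the stale condition is never cleared
```

In this model the 4.6 operator reports only the conditions it knows about, so the True condition written by 4.7 survives the downgrade.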

Workaround: oc delete clustercsidriver --all
CSO will re-create it and everything should re-sync.

Brainstorming some solutions:
I. The operator somehow clears all conditions it does not manage. But how does it know?
II. Deploy the 4.7 RBAC for syncing the CA bundle as a separate ClusterRole / ClusterRoleBinding. A downgrade won't remove it, so the 4.7 operator won't report Degraded. In other words, when adding anything to RBAC, always add it as a separate ClusterRole to prevent similar errors in the future.

--- Additional comment from Jan Safranek on 2021-01-05 10:08:55 UTC ---

There must be two fixes:

* in 4.7: use separate RBAC objects for the kube-cloud-config config map, so they are not removed when downgrading to 4.6
* in 4.6.z: remove the ResourceSyncControllerDegraded condition if it was set by the 4.7 version of the operator for any reason.
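
The 4.6.z half of the fix can be sketched as follows. This is a minimal Python illustration of the idea, not the operator's actual code: `remove_stale_conditions`, `STALE_CONDITION_TYPES`, the list-of-dicts layout, and the `ExampleControllerAvailable` condition are all hypothetical stand-ins for the conditions in the ClusterCSIDriver status.

```python
# Hypothetical sketch: drop condition types that the running operator
# version no longer manages, so a value left by a newer version cannot linger.

STALE_CONDITION_TYPES = {"ResourceSyncControllerDegraded"}


def remove_stale_conditions(conditions, stale_types=STALE_CONDITION_TYPES):
    """Return (kept_conditions, changed) with stale condition types removed."""
    kept = [c for c in conditions if c["type"] not in stale_types]
    return kept, len(kept) != len(conditions)


conditions = [
    {"type": "ResourceSyncControllerDegraded", "status": "True", "reason": "Error"},
    {"type": "ExampleControllerAvailable", "status": "True", "reason": "AsExpected"},
]

cleaned, changed = remove_stale_conditions(conditions)
print(changed)                       # True
print([c["type"] for c in cleaned])  # ['ExampleControllerAvailable']
```

Returning a `changed` flag would let a caller skip the status update when there is nothing to remove, so the cleanup is safe to run on every sync.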

Comment 1 Jan Safranek 2021-01-05 15:05:45 UTC
Sorry, missed bug #1900239

*** This bug has been marked as a duplicate of bug 1900239 ***

Comment 2 Jan Safranek 2021-01-05 15:05:55 UTC
Sorry again, wrong bug.

Comment 3 Jan Safranek 2021-01-15 11:05:05 UTC
Waiting for 1907812 to get VERIFIED by QA. BTW, this BZ is just a preventive measure; 1907812 should be enough to fix this.

Comment 4 Qin Ping 2021-02-08 11:54:03 UTC
Tried to verify this PR pre-merge; the PR LGTM.

Upgrade path: 4.6 -> 4.7 -> 4.6

CSO was updated successfully:
$ oc get co storage
NAME      VERSION                                           AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.6.0-0.ci.test-2021-02-08-085615-ci-ln-ikkh8ck   True        False         False      6m35s

Comment 7 Qin Ping 2021-02-18 03:22:37 UTC
Verified with: 4.6.0-0.nightly-2021-02-13-034601

Comment 9 errata-xmlrpc 2021-02-22 13:54:32 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.18 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0510

