Bug 1794360 - Operators were not reset during downgrade from v4.4 to v4.3
Summary: Operators were not reset during downgrade from v4.4 to v4.3
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Abhinav Dahiya
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-23 11:59 UTC by liujia
Modified: 2020-06-18 12:57 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-24 18:34:23 UTC
Target Upstream Version:
Embargoed:



Description liujia 2020-01-23 11:59:23 UTC
Description of problem:
After a downgrade from v4.4 to v4.3.0, the kube-storage-version-migrator operator
was not reset and still reported v4.4.

Version-Release number of the following components:
4.4.0-0.nightly-2020-01-23-054055

How reproducible:
always

Steps to Reproduce:
1. Install v4.3.0; there is no kube-storage-version-migrator operator in "oc get co".
# ./oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0     True        False         False      3m39s
cloud-credential                           4.3.0     True        False         False      24m
cluster-autoscaler                         4.3.0     True        False         False      8m44s
console                                    4.3.0     True        False         False      4m45s
dns                                        4.3.0     True        False         False      13m
image-registry                             4.3.0     True        False         False      8m16s
ingress                                    4.3.0     True        False         False      9m7s
insights                                   4.3.0     True        False         False      19m
kube-apiserver                             4.3.0     True        False         False      12m
kube-controller-manager                    4.3.0     True        False         False      11m
kube-scheduler                             4.3.0     True        False         False      12m
machine-api                                4.3.0     True        False         False      18m
machine-config                             4.3.0     True        False         False      11m
marketplace                                4.3.0     True        False         False      9m20s
monitoring                                 4.3.0     True        False         False      3m23s
network                                    4.3.0     True        False         False      19m
node-tuning                                4.3.0     True        False         False      10m
openshift-apiserver                        4.3.0     True        False         False      11m
openshift-controller-manager               4.3.0     True        False         False      12m
openshift-samples                          4.3.0     True        False         False      8m21s
operator-lifecycle-manager                 4.3.0     True        False         False      18m
operator-lifecycle-manager-catalog         4.3.0     True        False         False      18m
operator-lifecycle-manager-packageserver   4.3.0     True        False         False      11m
service-ca                                 4.3.0     True        False         False      19m
service-catalog-apiserver                  4.3.0     True        False         False      10m
service-catalog-controller-manager         4.3.0     True        False         False      10m
storage                                    4.3.0     True        False         False      10m

2. Upgrade from v4.3.0 to v4.4.0-0.nightly-2020-01-23-054055; a new kube-storage-version-migrator operator is created.
...
machine-api                                4.4.0-0.nightly-2020-01-23-054055   True        False         False      53m
machine-config                             4.4.0-0.nightly-2020-01-23-054055   True        False         False      46m
marketplace                                4.4.0-0.nightly-2020-01-23-054055   True        False         False      81s
...

3. Trigger a downgrade from v4.4.0-0.nightly-2020-01-23-054055 to v4.3.0 (a sketch of the command used is shown after these steps).
4. Check the cluster operators; only kube-storage-version-migrator is still at the wrong version. It should have been removed, since it is not included in the v4.3.0 payload.
# ./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0                               True        False         False      94m
cloud-credential                           4.3.0                               True        False         False      114m
cluster-autoscaler                         4.3.0                               True        False         False      99m
console                                    4.3.0                               True        False         False      8m11s
dns                                        4.3.0                               True        False         False      104m
image-registry                             4.3.0                               True        False         False      12m
ingress                                    4.3.0                               True        False         False      12m
insights                                   4.3.0                               True        False         False      110m
kube-apiserver                             4.3.0                               True        False         False      103m
kube-controller-manager                    4.3.0                               True        False         False      102m
kube-scheduler                             4.3.0                               True        False         False      103m
kube-storage-version-migrator              4.4.0-0.nightly-2020-01-23-054055   True        False         False      12m
machine-api                                4.3.0                               True        False         False      109m
machine-config                             4.3.0                               True        False         False      102m
marketplace                                4.3.0                               True        False         False      11m
monitoring                                 4.3.0                               True        False         False      4m42s
network                                    4.3.0                               True        False         False      110m
node-tuning                                4.3.0                               True        False         False      7m58s
openshift-apiserver                        4.3.0                               True        False         False      5m51s
openshift-controller-manager               4.3.0                               True        False         False      103m
openshift-samples                          4.3.0                               True        False         False      26m
operator-lifecycle-manager                 4.3.0                               True        False         False      109m
operator-lifecycle-manager-catalog         4.3.0                               True        False         False      109m
operator-lifecycle-manager-packageserver   4.3.0                               True        False         False      6m34s
service-ca                                 4.3.0                               True        False         False      110m
service-catalog-apiserver                  4.3.0                               True        False         False      101m
service-catalog-controller-manager         4.3.0                               True        False         False      101m
storage                                    4.3.0                               True        False         False      26m


# ./oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2020-01-23T11:24:15Z|2020-01-23T11:52:24Z|Completed|4.3.0
2020-01-23T10:26:28Z|2020-01-23T11:11:15Z|Completed|4.4.0-0.nightly-2020-01-23-054055
2020-01-23T10:01:09Z|2020-01-23T10:22:13Z|Completed|4.3.0

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0     True        False         4m55s   Cluster version is 4.3.0
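
For reference, a rough sketch of how the downgrade in step 3 can be triggered. The release pullspec is illustrative, and the --allow-explicit-upgrade/--force flags are assumptions about what an explicit move to an image outside the upgrade graph typically requires:

# ./oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release:4.3.0-x86_64 --allow-explicit-upgrade --force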

Actual results:
The kube-storage-version-migrator operator is left reporting the wrong version after the downgrade.

Expected results:
kube-storage-version-migrator operator should be reset.

Additional info:

Comment 1 Abhinav Dahiya 2020-01-29 01:05:58 UTC
We don't prune components that are no longer required, so the downgrade will leave the new resources around.

Comment 2 liujia 2020-02-06 03:19:47 UTC
Two questions lead me to reopen the bug:
1. What does a downgrade mean for users? In my understanding it should be A-B-A (this is from DEV's explanation of downgrade), not A-B-mixed A/B. How do you think?
2. The cluster version looks inconsistent if the downgrade leaves new resources around: "oc get clusterversion" shows v4.3, but a v4.4 version still shows in the "oc get co" list. The CVO has the responsibility to sync the cluster to align with the original payload (which does not include kube-storage-version-migrator).

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0     True        False         4m55s   Cluster version is 4.3.0

# ./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0                               True        False         False      94m
cloud-credential                           4.3.0                               True        False         False      114m
cluster-autoscaler                         4.3.0                               True        False         False      99m
console                                    4.3.0                               True        False         False      8m11s
dns                                        4.3.0                               True        False         False      104m
image-registry                             4.3.0                               True        False         False      12m
ingress                                    4.3.0                               True        False         False      12m
insights                                   4.3.0                               True        False         False      110m
kube-apiserver                             4.3.0                               True        False         False      103m
kube-controller-manager                    4.3.0                               True        False         False      102m
kube-scheduler                             4.3.0                               True        False         False      103m
kube-storage-version-migrator              4.4.0-0.nightly-2020-01-23-054055   True        False         False      12m
machine-api                                4.3.0                               True        False         False      109m
machine-config                             4.3.0                               True        False         False      102m
marketplace                                4.3.0                               True        False         False      11m
monitoring                                 4.3.0                               True        False         False      4m42s
network                                    4.3.0                               True        False         False      110m
node-tuning                                4.3.0                               True        False         False      7m58s
openshift-apiserver                        4.3.0                               True        False         False      5m51s
openshift-controller-manager               4.3.0                               True        False         False      103m
openshift-samples                          4.3.0                               True        False         False      26m
operator-lifecycle-manager                 4.3.0                               True        False         False      109m
operator-lifecycle-manager-catalog         4.3.0                               True        False         False      109m
operator-lifecycle-manager-packageserver   4.3.0                               True        False         False      6m34s
service-ca                                 4.3.0                               True        False         False      110m
service-catalog-apiserver                  4.3.0                               True        False         False      101m
service-catalog-controller-manager         4.3.0                               True        False         False      101m
storage                                    4.3.0                               True        False         False      26m
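
For reference, one way to list the operators whose reported version differs from the cluster version (a rough sketch, assuming jq is available; this command is not from the original report):

# current=$(./oc get clusterversion version -o jsonpath='{.status.desired.version}')
# ./oc get co -o json | jq -r --arg v "$current" '.items[] | select((.status.versions[]? | select(.name == "operator") | .version) != $v) | .metadata.name'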

Comment 3 liujia 2020-02-06 03:26:22 UTC
If we do not regard an upgrade (A-B) as complete and successful while one of the operators shows a version different from the target version, then we should not regard a downgrade (B-A) as successful while one of the operators shows a version different from the target version. Whether upgrading or downgrading, the task is the same: sync the cluster to comply with the target payload through the same command, "oc adm upgrade --to-image".

Comment 5 Abhinav Dahiya 2020-02-06 20:45:33 UTC
> https://docs.openshift.com/container-platform/4.2/updating/updating-cluster-cli.html#update-upgrading-cli_updating-cluster-cli

If an upgrade fails, the Operator stops and reports the status of the failing component. Rolling your cluster back to a previous version is not supported. If your upgrade fails, contact Red Hat support.

We do not support rollbacks, and it is expected that on rollback, new operators that were added during the forward upgrade stay running unless cleaned up manually.

Comment 7 Lalatendu Mohanty 2020-02-07 06:03:24 UTC
In the documentation we are pretty clear that rollback is not supported, as Abhinav mentioned. So this cannot be a bug.

"If an upgrade fails, the Operator stops and reports the status of the failing component. Rolling your cluster back to a previous version is not supported. If your upgrade fails, contact Red Hat support." 

OpenShift 4 is made of many operators and controllers, and the controllers have CRDs. So rolling back in this context is different from rolling back a standalone piece of software. There is no way we can remove everything around an operator without manual steps.

That being said, I think QE should test rollback just to make sure that when a cluster is rolled back it is still usable. It is not required from the customer's point of view but will be useful for CEE folks.

Comment 8 liujia 2020-02-07 06:37:07 UTC
(In reply to Lalatendu Mohanty from comment #7)
> In the documentation we are pretty clear that rollback is not supported as
> mentioned by Abhinav. So this can not be a bug.
Normally we can decide whether an issue is a bug according to the official docs. But downgrade is a little special: both QE and DEV should have a shared understanding that we support/test/track downgrade issues internally/technically. If we judge a downgrade bug only against our official doc, several downgrade bugs would be considered NOTABUG, such as https://bugzilla.redhat.com/show_bug.cgi?id=1791863. But this is not what we want.

> 
> OpenShift 4 is made of many operators and controllers. Controllers have
> CRDs. So rolling back means in this context is different than rolling back
> of a stand alone software. There is no way we can remove everything around
> an operator without manual steps. 
We can accept that manual steps are the only way to handle this issue, but then how/where can the information "manual steps are still needed after downgrade" be found by users/admins? No hint or warning is shown or documented, so how would they know what to do correctly?

> 
> That being said I think QE should test rollback just to make sure when a
> cluster is rolled back it is still usable. It is not required from customer
> point of view but will be useful for CEE folks.

To be honest, it's difficult to check that "it is still usable"; it's too abstract. Can you give some checklists? For a cluster with operators at two mixed versions, does usable mean "both the v4.3 and v4.4 operators should work well" or "only the current v4.3 operators work well"? And "can this mixed 4.3/4.4 cluster be upgraded a second time to another version?"

Comment 9 W. Trevor King 2020-02-07 18:31:04 UTC
> how/where this info "manual steps still needed after downgrade" can be found for users/admin?

The answer would probably be "delete the objects installed for the 4.4-only operator".  For example, [1].  But you'd need to check with each operator to confirm that; it's possible that just blowing them away would break your cluster.
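
For illustration only, such a cleanup for kube-storage-version-migrator might look roughly like this; the namespace and ClusterOperator names below are assumptions based on the operator's usual layout, so confirm them against the 4.4 manifests before deleting anything:

# ./oc delete namespace openshift-kube-storage-version-migrator
# ./oc delete namespace openshift-kube-storage-version-migrator-operator
# ./oc delete clusteroperator kube-storage-version-migrator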

> To be honest, it's difficult to check "it is still usable", it's two abstract, can u give some check lists?

Probably "I can run the e2e suite against this cluster without it failing" (or whatever analogous QE suite).  If the only issue is that there are some operators which report themselves at the future version and do not roll back, that doesn't seem like an issue that would impact customer workloads.

> When a cluster with mixed two versions operators, usable means "both v4.3 and v4.4 operators should work well or only current v4.3 operators work well".

I don't think we care about how well the 4.4 operators are working, as long as they do not impact the cluster functionality.

> And "if this mixed 4.3&4.4 cluster can be upgraded for the 2nd time to another version?"

That's an important criterion, yes.  If a 4.4 operator gets mad and sets Upgradeable=False to block a second attempt at 4.3 -> 4.4, we'd want to get that sorted.  Although one way to get it sorted would be getting removal procedures from each of the new operators, so a cluster admin could roll back, clean up, and then take a second run at 4.4.

[1]: https://github.com/openshift/cluster-etcd-operator/tree/f74a0a8d7215cb30e226d665a6c760dc3a9be6b6/manifests

Comment 10 Lalatendu Mohanty 2020-02-10 07:05:34 UTC
Looks like we have two broad questions to answer

1. Do we need to test for rollbacks or downgrades? If yes, where is the line drawn because in the official documents we do not support downgrades.

2. Do we need to document steps around a cluster is downgraded and so that it can be tried for upgrade again?

I am planning to talk to architects and group leads around this and get back to you. 

Regarding how to check if the cluster is usable, it should be a test suite that checks cluster health and runs some sanity tests. As Trevor has suggested, the e2e tests fit this idea.

Comment 11 liujia 2020-02-10 07:15:00 UTC
(In reply to Lalatendu Mohanty from comment #10)
> Looks like we have two broad questions to answer
> 
> 1. Do we need to test for rollbacks or downgrades? If yes, where is the line
> drawn because in the official documents we do not support downgrades.
> 
> 2. Do we need to document steps around a cluster is downgraded and so that
> it can be tried for upgrade again?
> 
+1024 agree.

Comment 12 W. Trevor King 2020-02-11 22:35:55 UTC
> 1. Do we need to test for rollbacks or downgrades?

In this space we do run rollback/downgrade tests in CI; search for 'TEST_OPTIONS=abort-at=...' [1], and see [2,3].  Although what we run in CI and what we cover in QE can be largely orthogonal.

[1]: https://github.com/openshift/release/search?q=TEST_OPTIONS%3Dabort-at&unscoped_q=TEST_OPTIONS%3Dabort-at
[2]: https://github.com/openshift/origin/pull/22726
[3]: https://github.com/openshift/origin/blob/ed88d3e228cfab53a4a2d9074a040aed4cc76d34/cmd/openshift-tests/openshift-tests.go#L150-L151
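
As a rough sketch of what such a CI job runs (the suite name, flags, and environment handling here are assumptions; [3] is the authoritative option parsing):

# export TEST_OPTIONS="abort-at=50"
# openshift-tests run-upgrade all --to-image "${RELEASE_IMAGE_TARGET}" --options "${TEST_OPTIONS}"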

Comment 13 Scott Dodson 2020-02-14 15:01:30 UTC
There is no expectation that CVO will remove cluster operators on downgrade. I'm seeking guidance from the architects team on the scope of expectations for CVO during downgrades. I'm moving this to 4.5 while we sort that out.

Comment 14 Scott Dodson 2020-02-24 18:34:23 UTC
Any cluster operator removal will be handled via one-off downgrade documentation. We should expect to see such for Cluster Etcd Operator. At this point in time CVO will not pursue removal of operators.

Any downgrade that results in failing e2e test should be pursued as independent bugs on the affected operator.

Comment 15 liujia 2020-02-25 02:20:09 UTC
(In reply to Scott Dodson from comment #14)
> Any cluster operator removal will be handled via one-off downgrade
> documentation. We should expect to see such for Cluster Etcd Operator. At
> this point in time CVO will not pursue removal of operators.

Sorry, I'm confused. Does it mean this downgrade issue should be fixed through documentation that describes the steps for removing all the extra operators? If so, I think we should not close the bug as NOTABUG, but rather change it to a doc bug for further tracking. How do you think?

> Any downgrade that results in failing e2e test should be pursued as
> independent bugs on the affected operator.

Does this mean the pass criterion for a downgrade test is that the e2e tests do not fail? This is important for QE to run this kind of test.

Comment 16 Johnny Liu 2020-02-25 06:41:05 UTC
From the above discussion, QE still does not have a clear conclusion on whether downgrade testing needs to be run any more.

Comment 10 is a good summary for QE's question.

Without good guidance on which downgrade issues are treated as bugs (some are closed as NOTABUG), QE will struggle to decide whether running downgrade testing is worthwhile.

Comment 17 Scott Dodson 2020-02-25 20:06:56 UTC
(In reply to liujia from comment #15)
> (In reply to Scott Dodson from comment #14)
> > Any cluster operator removal will be handled via one-off downgrade
> > documentation. We should expect to see such for Cluster Etcd Operator. At
> > this point in time CVO will not pursue removal of operators.
> 
> Sorry, it's confused. does it mean this downgrade issue should be fixed
> through documentation, in which described the steps about all extra
> operators removal? If that, i think we should not close the bug with
> NOTABUG, but better to change it to doc for further track. HDYT?

I think we need one bug per operator assigned to the component for those operators so that those teams are obligated to write up KCS articles for any cleanup in their downgrade. I know the etcd team is already aware of this and working on a procedure for the downgrade as noted here https://github.com/openshift/enhancements/pull/207

> > Any downgrade that results in failing e2e test should be pursued as
> > independent bugs on the affected operator.
> 
> Does this mean, the pass criterion of downgrade test is e2e test not
> failing? This is important for QE to do this kind of test.

We're working on sorting out the requirements but the initial thoughts are that downgrading from B -> A would include running e2e tests from version A after the downgrade was complete.
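
As a sketch of that criterion, the usual conformance invocation from version A's test binary is assumed here rather than taken from this bug:

# openshift-tests run openshift/conformance/parallel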

Comment 18 liujia 2020-02-26 02:35:38 UTC
(In reply to Scott Dodson from comment #17)
> 
> I think we need one bug per operator assigned to the component for those
> operators so that those teams are obligated to write up KCS articles for any
> cleanup in their downgrade. I know the etcd team is already aware of this
> and working on a procedure for the downgrade as noted here
> https://github.com/openshift/enhancements/pull/207
> 

cc @geliu @kewang @chaoyang for further tracking per operator if needed, thanks.

