Description of problem:
After a downgrade from v4.4 to v4.3.0, the kube-storage-version-migrator cluster operator was not removed and still reports v4.4.

Version-Release number of the following components:
4.4.0-0.nightly-2020-01-23-054055

How reproducible:
Always

Steps to Reproduce:
1. Install v4.3.0; there is no kube-storage-version-migrator operator in "oc get co":

# ./oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0     True        False         False      3m39s
cloud-credential                           4.3.0     True        False         False      24m
cluster-autoscaler                         4.3.0     True        False         False      8m44s
console                                    4.3.0     True        False         False      4m45s
dns                                        4.3.0     True        False         False      13m
image-registry                             4.3.0     True        False         False      8m16s
ingress                                    4.3.0     True        False         False      9m7s
insights                                   4.3.0     True        False         False      19m
kube-apiserver                             4.3.0     True        False         False      12m
kube-controller-manager                    4.3.0     True        False         False      11m
kube-scheduler                             4.3.0     True        False         False      12m
machine-api                                4.3.0     True        False         False      18m
machine-config                             4.3.0     True        False         False      11m
marketplace                                4.3.0     True        False         False      9m20s
monitoring                                 4.3.0     True        False         False      3m23s
network                                    4.3.0     True        False         False      19m
node-tuning                                4.3.0     True        False         False      10m
openshift-apiserver                        4.3.0     True        False         False      11m
openshift-controller-manager               4.3.0     True        False         False      12m
openshift-samples                          4.3.0     True        False         False      8m21s
operator-lifecycle-manager                 4.3.0     True        False         False      18m
operator-lifecycle-manager-catalog         4.3.0     True        False         False      18m
operator-lifecycle-manager-packageserver   4.3.0     True        False         False      11m
service-ca                                 4.3.0     True        False         False      19m
service-catalog-apiserver                  4.3.0     True        False         False      10m
service-catalog-controller-manager         4.3.0     True        False         False      10m
storage                                    4.3.0     True        False         False      10m

2. Upgrade v4.3.0 to v4.4.0-0.nightly-2020-01-23-054055; the new kube-storage-version-migrator operator is created.

...
machine-api      4.4.0-0.nightly-2020-01-23-054055   True   False   False   53m
machine-config   4.4.0-0.nightly-2020-01-23-054055   True   False   False   46m
marketplace      4.4.0-0.nightly-2020-01-23-054055   True   False   False   81s
...

3. Trigger a downgrade from v4.4.0-0.nightly-2020-01-23-054055 to v4.3.0.
4. Check "oc get co": only kube-storage-version-migrator is still at the wrong version. It should have been cleaned up, since it is not included in v4.3.0.

# ./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0                               True        False         False      94m
cloud-credential                           4.3.0                               True        False         False      114m
cluster-autoscaler                         4.3.0                               True        False         False      99m
console                                    4.3.0                               True        False         False      8m11s
dns                                        4.3.0                               True        False         False      104m
image-registry                             4.3.0                               True        False         False      12m
ingress                                    4.3.0                               True        False         False      12m
insights                                   4.3.0                               True        False         False      110m
kube-apiserver                             4.3.0                               True        False         False      103m
kube-controller-manager                    4.3.0                               True        False         False      102m
kube-scheduler                             4.3.0                               True        False         False      103m
kube-storage-version-migrator              4.4.0-0.nightly-2020-01-23-054055   True        False         False      12m
machine-api                                4.3.0                               True        False         False      109m
machine-config                             4.3.0                               True        False         False      102m
marketplace                                4.3.0                               True        False         False      11m
monitoring                                 4.3.0                               True        False         False      4m42s
network                                    4.3.0                               True        False         False      110m
node-tuning                                4.3.0                               True        False         False      7m58s
openshift-apiserver                        4.3.0                               True        False         False      5m51s
openshift-controller-manager               4.3.0                               True        False         False      103m
openshift-samples                          4.3.0                               True        False         False      26m
operator-lifecycle-manager                 4.3.0                               True        False         False      109m
operator-lifecycle-manager-catalog         4.3.0                               True        False         False      109m
operator-lifecycle-manager-packageserver   4.3.0                               True        False         False      6m34s
service-ca                                 4.3.0                               True        False         False      110m
service-catalog-apiserver                  4.3.0                               True        False         False      101m
service-catalog-controller-manager         4.3.0                               True        False         False      101m
storage                                    4.3.0                               True        False         False      26m

# ./oc get clusterversion -o json | jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2020-01-23T11:24:15Z|2020-01-23T11:52:24Z|Completed|4.3.0
2020-01-23T10:26:28Z|2020-01-23T11:11:15Z|Completed|4.4.0-0.nightly-2020-01-23-054055
2020-01-23T10:01:09Z|2020-01-23T10:22:13Z|Completed|4.3.0

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0     True        False         4m55s   Cluster version is 4.3.0

Actual results:
The kube-storage-version-migrator operator is at a wrong version after the downgrade.

Expected results:
The kube-storage-version-migrator operator should be reset (removed along with the rest of the v4.4 payload).
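The manual check in step 4 (comparing each ClusterOperator's version against the target cluster version) can be scripted. The sketch below is not from the bug report; it parses a trimmed, hypothetical stand-in for `oc get co -o json` output, and the function name `stale_operators` is mine.

```python
import json

# Trimmed, invented sample of `oc get co -o json` output; the field layout
# (.items[].status.versions with a "name"/"version" pair) follows the real
# ClusterOperator API, but the data itself is illustrative.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "kube-apiserver"},
     "status": {"versions": [{"name": "operator", "version": "4.3.0"}]}},
    {"metadata": {"name": "kube-storage-version-migrator"},
     "status": {"versions": [{"name": "operator",
                              "version": "4.4.0-0.nightly-2020-01-23-054055"}]}}
  ]
}
""")

def stale_operators(co_list, cluster_version):
    """Return names of ClusterOperators not reporting cluster_version."""
    stale = []
    for item in co_list["items"]:
        versions = {v["name"]: v["version"]
                    for v in item["status"].get("versions", [])}
        if versions.get("operator") != cluster_version:
            stale.append(item["metadata"]["name"])
    return stale

print(stale_operators(sample, "4.3.0"))
# prints ['kube-storage-version-migrator']
```

In a real cluster you would feed this the actual `oc get co -o json` output; on the cluster described above it would flag only the leftover migrator operator.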
We don't prune components that are no longer required, so the downgrade will leave the new resources in place.
Two questions lead me to reopen this bug:

1. What does "downgrade" mean for users? In my understanding it should be A -> B -> A (this is from DEV's explanation of downgrade), not A -> B -> a mix of A and B. WDYT?

2. The cluster version is inconsistent if the downgrade leaves new resources around: "oc get clusterversion" shows v4.3, but a v4.4 version still appears in the "oc get co" list. CVO has the responsibility to sync the cluster to match the original payload (which does not include kube-storage-version-migrator).

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0     True        False         4m55s   Cluster version is 4.3.0

# ./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0                               True        False         False      94m
cloud-credential                           4.3.0                               True        False         False      114m
cluster-autoscaler                         4.3.0                               True        False         False      99m
console                                    4.3.0                               True        False         False      8m11s
dns                                        4.3.0                               True        False         False      104m
image-registry                             4.3.0                               True        False         False      12m
ingress                                    4.3.0                               True        False         False      12m
insights                                   4.3.0                               True        False         False      110m
kube-apiserver                             4.3.0                               True        False         False      103m
kube-controller-manager                    4.3.0                               True        False         False      102m
kube-scheduler                             4.3.0                               True        False         False      103m
kube-storage-version-migrator              4.4.0-0.nightly-2020-01-23-054055   True        False         False      12m
machine-api                                4.3.0                               True        False         False      109m
machine-config                             4.3.0                               True        False         False      102m
marketplace                                4.3.0                               True        False         False      11m
monitoring                                 4.3.0                               True        False         False      4m42s
network                                    4.3.0                               True        False         False      110m
node-tuning                                4.3.0                               True        False         False      7m58s
openshift-apiserver                        4.3.0                               True        False         False      5m51s
openshift-controller-manager               4.3.0                               True        False         False      103m
openshift-samples                          4.3.0                               True        False         False      26m
operator-lifecycle-manager                 4.3.0                               True        False         False      109m
operator-lifecycle-manager-catalog         4.3.0                               True        False         False      109m
operator-lifecycle-manager-packageserver   4.3.0                               True        False         False      6m34s
service-ca                                 4.3.0                               True        False         False      110m
service-catalog-apiserver                  4.3.0                               True        False         False      101m
service-catalog-controller-manager         4.3.0                               True        False         False      101m
storage                                    4.3.0                               True        False         False      26m
If we do not consider an upgrade (A -> B) complete and successful while any operator reports a version different from the target, then we should not consider a downgrade (B -> A) successful while any operator reports a version different from the target. Whether upgrading or downgrading, the operation is the same: syncing the cluster to comply with the target payload via the same command, "oc adm upgrade --to-image".
> https://docs.openshift.com/container-platform/4.2/updating/updating-cluster-cli.html#update-upgrading-cli_updating-cluster-cli
> If an upgrade fails, the Operator stops and reports the status of the failing component. Rolling your cluster back to a previous version is not supported. If your upgrade fails, contact Red Hat support.

We do not support rollbacks, and it is expected that on rollback, new operators that were added during the upgrade keep running unless cleaned up manually.
In the documentation we are pretty clear that rollback is not supported, as Abhinav mentioned, so this cannot be a bug:

"If an upgrade fails, the Operator stops and reports the status of the failing component. Rolling your cluster back to a previous version is not supported. If your upgrade fails, contact Red Hat support."

OpenShift 4 is made up of many operators and controllers, and controllers have CRDs, so "rolling back" means something different in this context than rolling back a standalone piece of software. There is no way we can remove everything around an operator without manual steps.

That being said, I think QE should test rollback just to make sure that a rolled-back cluster is still usable. It is not required from the customer's point of view, but it will be useful for CEE folks.
(In reply to Lalatendu Mohanty from comment #7)
> In the documentation we are pretty clear that rollback is not supported as
> mentioned by Abhinav. So this can not be a bug.

Normally we can decide whether an issue is a bug according to the official docs. But downgrade is a little special: QE and dev should reach a shared understanding that we support/test/track downgrade issues internally and technically. If we judged downgrade bugs only against the official docs, several downgrade bugs would have to be considered NOTABUG, such as https://bugzilla.redhat.com/show_bug.cgi?id=1791863. But this is not what we want.

> OpenShift 4 is made of many operators and controllers. Controllers have
> CRDs. So rolling back means in this context is different than rolling back
> of a stand alone software. There is no way we can remove everything around
> an operator without manual steps.

We can accept that manual steps are the only way to resolve this issue, but then how/where can users/admins find the information that "manual steps are still needed after downgrade"? No hint or warning is shown, so how would they know what to do?

> That being said I think QE should test rollback just to make sure when a
> cluster is rolled back it is still usable. It is not required from customer
> point of view but will be useful for CEE folks.

To be honest, it's difficult to check "it is still usable"; that's too abstract. Can you give a checklist? For a cluster with operators at two mixed versions, does "usable" mean "both the v4.3 and v4.4 operators work well" or "only the current v4.3 operators work well"? And can this mixed 4.3/4.4 cluster be upgraded a second time to another version?
> how/where this info "manual steps still needed after downgrade" can be found for users/admin?

Probably something like "delete the objects installed for the 4.4-only operator". For example, [1]. But you'd need to check with each operator to confirm that; it's possible that just blowing them away would break your cluster.

> To be honest, it's difficult to check "it is still usable", it's too abstract, can you give some check lists?

Probably "I can run the e2e suite against this cluster without it failing" (or whatever the analogous QE suite is). If the only issue is that some operators report themselves at the future version and do not roll back, that doesn't seem like an issue that would impact customer workloads.

> When a cluster with mixed two versions operators, usable means "both v4.3 and v4.4 operators should work well or only current v4.3 operators work well".

I don't think we care how well the 4.4 operators are working, as long as they do not impact cluster functionality.

> And "if this mixed 4.3&4.4 cluster can be upgraded for the 2nd time to another version?"

That's an important criterion, yes. If a 4.4 operator gets mad and sets Upgradeable=False to block a second attempt at 4.3 -> 4.4, we'd want to get that sorted. Although one way to sort it out would be getting removal procedures from each of the new operators, so a cluster admin could roll back, clean up, and then take a second run at 4.4.

[1]: https://github.com/openshift/cluster-etcd-operator/tree/f74a0a8d7215cb30e226d665a6c760dc3a9be6b6/manifests
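The Upgradeable=False concern above can be checked programmatically before retrying the upgrade. This is a sketch of my own, not from the thread: the condition types mirror the real ClusterOperator status API, but the sample data and the `upgrade_blockers` helper are invented for illustration.

```python
import json

# Hypothetical ClusterOperator status fragment. The "conditions" field layout
# follows the real ClusterOperator API; the message and values are invented.
sample_co = json.loads("""
{
  "metadata": {"name": "kube-storage-version-migrator"},
  "status": {"conditions": [
    {"type": "Available", "status": "True"},
    {"type": "Upgradeable", "status": "False",
     "message": "leftover 4.4 operator refuses a second upgrade attempt"}
  ]}
}
""")

def upgrade_blockers(co):
    """Return (name, message) if this operator sets Upgradeable=False, else None."""
    for cond in co["status"].get("conditions", []):
        if cond["type"] == "Upgradeable" and cond["status"] == "False":
            return (co["metadata"]["name"], cond.get("message", ""))
    return None

blocker = upgrade_blockers(sample_co)
if blocker:
    print(f"{blocker[0]} blocks upgrade: {blocker[1]}")
```

Running a check like this across every ClusterOperator after a rollback would tell the admin whether a leftover operator would block the second 4.3 -> 4.4 attempt.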
Looks like we have two broad questions to answer:

1. Do we need to test rollbacks or downgrades? If yes, where is the line drawn, given that the official documents say we do not support downgrades?

2. Do we need to document steps for a downgraded cluster so that the upgrade can be attempted again?

I am planning to talk to the architects and group leads about this and get back to you. Regarding how to check whether the cluster is usable: it should be a test suite that checks cluster health and runs some sanity tests. As Trevor suggested, the e2e tests fit this idea.
(In reply to Lalatendu Mohanty from comment #10)
> Looks like we have two broad questions to answer
>
> 1. Do we need to test for rollbacks or downgrades? If yes, where is the line
> drawn because in the official documents we do not support downgrades.
>
> 2. Do we need to document steps around a cluster is downgraded and so that
> it can be tried for upgrade again?

+1024, agree.
> 1. Do we need to test for rollbacks or downgrades?

In this space, we do run rollback/downgrade tests in CI; search for 'TEST_OPTIONS=abort-at=...' [1], and see [2,3]. Although what we run in CI and what we cover in QE can be largely orthogonal.

[1]: https://github.com/openshift/release/search?q=TEST_OPTIONS%3Dabort-at&unscoped_q=TEST_OPTIONS%3Dabort-at
[2]: https://github.com/openshift/origin/pull/22726
[3]: https://github.com/openshift/origin/blob/ed88d3e228cfab53a4a2d9074a040aed4cc76d34/cmd/openshift-tests/openshift-tests.go#L150-L151
There is no expectation that CVO will remove cluster operators on downgrade. I'm seeking guidance from the architects team on the scope of expectations for CVO during downgrades. I'm moving this to 4.5 while we sort that out.
Any cluster operator removal will be handled via one-off downgrade documentation. We should expect to see such for Cluster Etcd Operator. At this point in time CVO will not pursue removal of operators. Any downgrade that results in failing e2e test should be pursued as independent bugs on the affected operator.
(In reply to Scott Dodson from comment #14)
> Any cluster operator removal will be handled via one-off downgrade
> documentation. We should expect to see such for Cluster Etcd Operator. At
> this point in time CVO will not pursue removal of operators.

Sorry, I'm confused: does this mean the downgrade issue should be fixed through documentation that describes the removal steps for all extra operators? If so, I think we should not close the bug as NOTABUG, but instead move it to documentation for further tracking. HDYT?

> Any downgrade that results in failing e2e test should be pursued as
> independent bugs on the affected operator.

Does this mean the pass criterion for a downgrade test is that the e2e tests do not fail? This is important for QE to know in order to run this kind of test.
From the discussion above, QE still has no clear conclusion on whether we need to keep running downgrade testing. Comment 10 is a good summary of QE's questions. Without clear guidance on which downgrade issues are treated as bugs (some are closed as NOTABUG), QE will struggle to justify running downgrade testing at all.
(In reply to liujia from comment #15)
> (In reply to Scott Dodson from comment #14)
> > Any cluster operator removal will be handled via one-off downgrade
> > documentation. We should expect to see such for Cluster Etcd Operator. At
> > this point in time CVO will not pursue removal of operators.
>
> Sorry, it's confused. does it mean this downgrade issue should be fixed
> through documentation, in which described the steps about all extra
> operators removal? If that, i think we should not close the bug with
> NOTABUG, but better to change it to doc for further track. HDYT?

I think we need one bug per operator, assigned to the component for those operators, so that those teams are obligated to write up KCS articles for any cleanup needed on downgrade. I know the etcd team is already aware of this and working on a procedure for the downgrade, as noted here: https://github.com/openshift/enhancements/pull/207

> > Any downgrade that results in failing e2e test should be pursued as
> > independent bugs on the affected operator.
>
> Does this mean, the pass criterion of downgrade test is e2e test not
> failing? This is important for QE to do this kind of test.

We're still sorting out the requirements, but the initial thought is that a downgrade from B -> A would include running the e2e tests from version A after the downgrade completes.
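Before running version A's e2e suite, the first gate is that CVO reports the downgrade itself as complete. The jq query used in the original report (`.status.history`) can be replicated in a small sketch; the helper name and the assumption that CVO prepends the newest history entry first (which matches the output shown in the report) are mine.

```python
import json

# ClusterVersion history copied from the reproduction steps above,
# newest entry first, as CVO reports it.
history = json.loads("""
[
 {"startedTime": "2020-01-23T11:24:15Z", "completionTime": "2020-01-23T11:52:24Z",
  "state": "Completed", "version": "4.3.0"},
 {"startedTime": "2020-01-23T10:26:28Z", "completionTime": "2020-01-23T11:11:15Z",
  "state": "Completed", "version": "4.4.0-0.nightly-2020-01-23-054055"},
 {"startedTime": "2020-01-23T10:01:09Z", "completionTime": "2020-01-23T10:22:13Z",
  "state": "Completed", "version": "4.3.0"}
]
""")

def downgrade_completed(history, target):
    """True if the newest history entry is a Completed sync to target."""
    latest = history[0]
    return latest["state"] == "Completed" and latest["version"] == target

print(downgrade_completed(history, "4.3.0"))
# prints True
```

Note that this check passing is exactly the ambiguity raised in this bug: CVO can report the 4.3.0 sync Completed even though a 4.4 operator is still present, which is why the e2e run afterwards is proposed as the real pass criterion.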
(In reply to Scott Dodson from comment #17)
> I think we need one bug per operator assigned to the component for those
> operators so that those teams are obligated to write up KCS articles for any
> cleanup in their downgrade. I know the etcd team is already aware of this
> and working on a procedure for the downgrade as noted here
> https://github.com/openshift/enhancements/pull/207

cc @geliu @kewang @chaoyang for further tracking per operator if needed, thx.