Description of problem:
During upgrades the UI reports a Failing state but the upgrade continues. This would be very misleading/confusing for a client.

Version-Release number of selected component (if applicable):
4.4.27 upgrade to 4.5.14

How reproducible:
Very

Steps to Reproduce:
1. Install 4.4.27
2. Run an upgrade from the UI to 4.5.14 (it appears with any version)
3. Monitor the upgrade status on the UI

At times it will go to a Failing state.

Actual results:
Upgrade status goes into Failing status. When the clusterversion output indicates "Unable to apply xxx" the UI shows Failing, but the upgrade continues without issue to the end and testing afterwards is successful.

Expected results:
Upgrade remains in the "Working towards" state.

Additional info:
Attached is the output of oc get clusterversion for the total upgrade time.
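For reference, a hedged CLI equivalent of the UI-driven reproduction above (the target version is the one from this report; any reachable version appears to behave the same):

```
# Trigger the same upgrade from the CLI instead of the console (illustrative;
# the reported reproduction used the web UI):
oc adm upgrade --to=4.5.14

# Watch the reported status flip between "Working towards" and "Unable to apply":
oc get clusterversion -w
```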
Please provide a must-gather from this cluster.
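For completeness, a sketch of the standard collection command (the destination directory name is arbitrary and only used here for illustration):

```
# Collect the must-gather and pack it for sharing:
oc adm must-gather --dest-dir=./must-gather-$(date +%Y%m%d)
tar czf must-gather.tar.gz ./must-gather-*
```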
Since the must-gather can't be uploaded to Bugzilla, the file has been put on a Google Drive share you should have access to. https://drive.google.com/file/d/1wboloqr5Qoela1tGOTo3SXBL7KNBt2G9/view?usp=sharing
I'm moving this to UI for further triage, but from the MCO (and/or any operator) POV we toggle between failing/upgrading as we keep re-trying the loop; that's how we re-sync (in every operator). Differentiating a hard failure from a failure that just needs to be re-tried isn't trivial (or even possible in certain scenarios), so what the UI does is simply query the cluster operator API, which can still toggle its status while it's upgrading (or reconciling). Re-assigning to the UI team to check if this can be closed.
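To make the toggling concrete, here is a minimal sketch (not part of the original comment; the jsonpath filter is standard oc/kubectl syntax) for pulling just the Failing condition that the console surfaces. Re-running it during an upgrade shows the status flip between True and False:

```
# Print the Failing condition's status, reason and message from ClusterVersion:
oc get clusterversion version \
  -o jsonpath='{range .status.conditions[?(@.type=="Failing")]}{.status}{"  "}{.reason}{"  "}{.message}{"\n"}{end}'
```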
Investigated the attached must-gather and there is nothing suspicious in the console or console-operator pod logs. Also checked other operators' logs, but didn't find any errors/failures that would lead to any particular issue. Closing this BZ. Feel free to re-open in case of questions/comments.
I believe this is a valid bug. The UI reports that upgrades are failing when they are in fact continuing (oc get clusterversion also reports "Unable to apply"). This behavior would be confusing and disconcerting to an end user, I think. Was this not able to be reproduced? I see it on every upgrade.
Did some investigation by running the upgrade from 4.4.27 -> 4.5.14. Here is the log from it:
```
oc get clusterversion -w
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.27    True        True          65s     Working towards 4.5.14: 18% complete
version   4.4.27    True        True          70s     Working towards 4.5.14: 27% complete
version   4.4.27    True        True          2m1s    Working towards 4.5.14: 27% complete
version   4.4.27    True        True          3m17s   Working towards 4.5.14: 27% complete
version   4.4.27    True        True          3m35s   Working towards 4.5.14: 29% complete
version   4.4.27    True        True          4m50s   Working towards 4.5.14: 46% complete
version   4.4.27    True        True          5m5s    Working towards 4.5.14: 66% complete
version   4.4.27    True        True          5m20s   Working towards 4.5.14: 68% complete
version   4.4.27    True        True          5m50s   Unable to apply 4.5.14: an unknown error has occurred: MultipleErrors
version   4.4.27    True        True          6m50s   Working towards 4.5.14: 74% complete
version   4.4.27    True        True          6m50s   Working towards 4.5.14: 74% complete
version   4.4.27    True        True          9m42s   Working towards 4.5.14: 76% complete
version   4.4.27    True        True          12m     Working towards 4.5.14: 76% complete, waiting on openshift-samples
version   4.4.27    True        True          12m     Working towards 4.5.14: 76% complete, waiting on openshift-samples
version   4.4.27    True        True          15m     Working towards 4.5.14: 76% complete, waiting on openshift-samples
version   4.4.27    True        True          15m     Working towards 4.5.14: 79% complete
version   4.4.27    True        True          18m     Working towards 4.5.14: 79% complete, waiting on network
version   4.4.27    True        True          18m     Working towards 4.5.14: 79% complete, waiting on network
version   4.4.27    True        True          20m     Working towards 4.5.14: 79% complete
version   4.4.27    True        True          22m     Working towards 4.5.14: 81% complete
version   4.4.27    True        True          22m     Working towards 4.5.14: 81% complete
version   4.4.27    True        True          22m     Working towards 4.5.14: 84% complete
version   4.4.27    True        True          25m     Working towards 4.5.14: 84% complete
version   4.4.27    True        True          26m     Working towards 4.5.14: 84% complete, waiting on machine-config
version   4.4.27    True        True          27m     Working towards 4.5.14
version   4.4.27    True        True          27m     Working towards 4.5.14: downloading update
version   4.4.27    True        True          27m     Working towards 4.5.14
version   4.4.27    True        True          27m     Working towards 4.5.14: 0% complete
version   4.4.27    True        True          28m     Working towards 4.5.14: 36% complete
version   4.4.27    True        True          28m     Working towards 4.5.14: 76% complete
version   4.4.27    True        True          28m     Working towards 4.5.14: 84% complete
version   4.4.27    True        True          32m     Working towards 4.5.14
version   4.4.27    True        True          32m     Working towards 4.5.14: downloading update
version   4.4.27    True        True          32m     Working towards 4.5.14: downloading update
version   4.4.27    True        True          32m     Working towards 4.5.14
version   4.4.27    True        True          32m     Working towards 4.5.14: 0% complete
version   4.4.27    True        True          32m     Working towards 4.5.14: 18% complete
version   4.4.27    True        True          35m     Working towards 4.5.14: 27% complete
version   4.4.27    True        True          37m     Unable to apply 4.5.14: the cluster operator openshift-apiserver is degraded
version   4.4.27    True        True          40m     Unable to apply 4.5.14: the cluster operator openshift-apiserver is degraded
version   4.4.27    True        True          40m     Unable to apply 4.5.14: the cluster operator openshift-apiserver is degraded
version   4.4.27    True        True          41m     Working towards 4.5.14: 39% complete
version   4.4.27    True        True          41m     Working towards 4.5.14: 39% complete
version   4.4.27    True        True          41m     Working towards 4.5.14: 48% complete
version   4.4.27    True        True          41m     Working towards 4.5.14: 77% complete
version   4.4.27    True        True          42m     Working towards 4.5.14: 84% complete
version   4.4.27    True        True          42m     Working towards 4.5.14: 87% complete
version   4.5.14    True        False         0s      Cluster version is 4.5.14
```
From it I can see that there is an issue with the openshift-apiserver operator. Checked the openshift-apiserver operator's logs but haven't found anything suspicious. On the other hand, after checking the logs from the openshift-apiserver pod, I see it's flooded with the following errors:
```
2020-10-16T14:02:14.351597436Z W1016 14:02:14.351511       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:02:25.270434318Z W1016 14:02:25.270305       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:03:21.498404881Z W1016 14:03:21.498328       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:03:31.107598105Z W1016 14:03:31.107521       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:04:21.344546312Z W1016 14:04:21.338265       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:04:27.000020743Z W1016 14:04:26.999923       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:04:58.75845691Z W1016 14:04:58.758327       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:06:05.06361367Z W1016 14:06:05.063540       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:07:43.949542261Z W1016 14:07:43.949478       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
```
Moving the issue to the apiserver team since I don't see any issue with the console or console-operator itself during the upgrade process.
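As a hypothetical follow-up (the namespace is the standard openshift-apiserver one, but this command was not part of the original investigation), the flood could be quantified per replica to see whether it is cluster-wide or limited to one pod:

```
# Count the compaction warnings in each openshift-apiserver pod's logs:
for pod in $(oc get pods -n openshift-apiserver -o name); do
  echo -n "$pod: "
  oc logs -n openshift-apiserver "$pod" --all-containers 2>/dev/null \
    | grep -c 'required revision has been compacted'
done
```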
Hi apiserver team, leaving this bug to you to tag UpcomingSprints in the future.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
This is still present in 4.7. As an example, I ran an update from 4.7.0-0.nightly-s390x-2020-12-01-004903 (fresh clean install) to 4.7.0-0.nightly-s390x-2021-01-07-140745. The following conditions appeared. At least twice, the GUI Update status showed:

Update to 4.7.0-0.nightly-s390x-2021-01-07-140745 in progress
Failing

At those two points oc get clusterversion indicated the following:
```
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-s390x-2020-12-01-004903    True        True          15m     Unable to apply 4.7.0-0.nightly-s390x-2021-01-07-140745: the cluster operator kube-apiserver has not yet successfully rolled out

NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-s390x-2020-12-01-004903    True        True          46m     Unable to apply 4.7.0-0.nightly-s390x-2021-01-07-140745: the cluster operator machine-config has not yet successfully rolled out
```
The upgrade did in fact continue and was successful:
```
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-s390x-2021-01-07-140745    True        False         5m17s   Cluster version is 4.7.0-0.nightly-s390x-2021-01-07-140745

NAME                                        VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                              4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      6m3s
baremetal                                   4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      114m
cloud-credential                            4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      114m
cluster-autoscaler                          4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      112m
config-operator                             4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      114m
console                                     4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      10m
csi-snapshot-controller                     4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      16m
dns                                         4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      112m
etcd                                        4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      112m
image-registry                              4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      107m
ingress                                     4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      105m
insights                                    4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      114m
kube-apiserver                              4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      111m
kube-controller-manager                     4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      111m
kube-scheduler                              4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      111m
kube-storage-version-migrator               4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      8m4s
machine-api                                 4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      113m
machine-approver                            4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      112m
machine-config                              4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      7m2s
marketplace                                 4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      15m
monitoring                                  4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      35m
network                                     4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      27m
node-tuning                                 4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      37m
openshift-apiserver                         4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      45m
openshift-controller-manager                4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      35m
openshift-samples                           4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      37m
operator-lifecycle-manager                  4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      113m
operator-lifecycle-manager-catalog          4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      113m
operator-lifecycle-manager-packageserver    4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      21m
service-ca                                  4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      114m
storage                                     4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False      114m
```
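One simple way, offered here as an assumption rather than something that was run for this report, to capture both views side by side during the next upgrade so the Failing flaps can be correlated with individual operator status:

```
# Refresh the ClusterVersion and ClusterOperator status together every 30s:
watch -n 30 'oc get clusterversion; echo; oc get clusteroperators'
```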
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
Repeating what Antonio wrote: from the MCO (and/or any operator) POV we toggle between failing/upgrading as we keep re-trying the loop; that's how we re-sync (in every operator). Differentiating a hard failure from a failure that just needs to be re-tried isn't trivial (or even possible in certain scenarios), so what the UI does is simply query the cluster operator API, which can still toggle its status while it's upgrading (or reconciling). This is the ask, not finding random upgrade issues.
Stefan, I'm a little unclear on the ask for the console team. We're simply showing the status from the ClusterVersion resource. Is the request not to show it in the UI when the `Failing` condition is `True`? Or to throttle updates so it doesn't appear to flap?
I understand the usability concern here, but I'm not sure the console team is the right owner. Console is simply showing the status from the ClusterVersion resource. You would see the same Failing condition on the CLI with `oc describe clusterversion version`. Moving to the CVO component for consideration. It's possible for console to ignore the `Failing` condition on the ClusterVersion or present it in a different way. But my question would be: if it's confusing or misleading, should the CVO adjust when it sets that condition? If the OTA team has a recommendation for how the console can show upgrade status differently, we can definitely make changes.
[1] is currently in flight for bug 1884334, to fix some CVO logic that had us going Failing=True when we shouldn't have been. Yes, this is just cosmetic, because the CVO will continue to attempt to reconcile/update the cluster while it is complaining. I'm going to close this bug as a dup of 1884334, but if you still see issues with the CVO reporting Failing=True after that (or on releases where that change has been backported), please file bugs like "CVO is reporting Failing=True with $REASON and $MESSAGE, but $REASONS_WHY_YOU_THINK_THE_CLUSTER_IS_ACTUALLY_HAPPY". [1]: https://github.com/openshift/cluster-version-operator/pull/486 *** This bug has been marked as a duplicate of bug 1884334 ***