Bug 1888754 - During upgrades the UI reports a failing state but continues
Summary: During upgrades the UI reports a failing state but continues
Keywords:
Status: CLOSED DUPLICATE of bug 1884334
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.5
Hardware: s390x
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Over the Air Updates
QA Contact: Johnny Liu
URL:
Whiteboard: multi-arch LifecycleReset
Depends On:
Blocks: ocp-42-45-z-tracker
 
Reported: 2020-10-15 17:03 UTC by Christian LaPolt
Modified: 2022-05-06 12:29 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-21 19:24:31 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Christian LaPolt 2020-10-15 17:03:47 UTC
Description of problem:
During upgrades, the UI reports a Failing state even though the upgrade continues.
This would be very misleading/confusing for a client.

Version-Release number of selected component (if applicable):
4.4.27 upgrade to 4.5.14

How reproducible:
Very

Steps to Reproduce:
1. Install 4.4.27
2. Run an upgrade from the UI to 4.5.14 (the issue appears with any version)
3. Monitor the upgrade status in the UI; at times it will go to a Failing state

Actual results:
The upgrade status goes into a Failing state.
When the clusterversion output indicates "Unable to apply xxx", the UI shows Failing, but the upgrade continues without issue to the end and testing after that is successful.

Expected results:
The upgrade remains in the "Working towards" state.

Additional info:
Attached is the output of oc get clusterversion for the total upgrade time.

Comment 1 Kirsten Garrison 2020-10-15 19:41:39 UTC
Please provide a must gather from this cluster.
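
For reference, a must-gather is usually collected roughly like this (the destination directory name is just an example):
```
# Collect cluster diagnostics and archive them for upload (directory name is illustrative)
oc adm must-gather --dest-dir=./must-gather-1888754
tar czf must-gather-1888754.tar.gz must-gather-1888754
```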

Comment 2 Christian LaPolt 2020-10-16 15:07:30 UTC
Since the must-gathers can't be uploaded to Bugzilla, the file has been put on a Google Drive share you should have access to.

https://drive.google.com/file/d/1wboloqr5Qoela1tGOTo3SXBL7KNBt2G9/view?usp=sharing

Comment 4 Antonio Murdaca 2020-11-17 08:28:38 UTC
I'm moving this to UI for further triage, but from the MCO's (and/or any operator's) point of view we will toggle between failing/upgrading, and we keep retrying the loop; that's how we re-sync (in every operator). Differentiating a hard failure from a failure that needs to be retried isn't trivial (or even possible in certain scenarios), so what the UI does is simply query the ClusterOperator API, which can still toggle its status while it's upgrading (or reconciling).
Re-assigning to the UI team to check if this can be closed.
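
To see the toggling described above directly from the CLI, one can snapshot the ClusterOperator statuses while the CVO reports Failing; a minimal illustrative check:
```
# List all cluster operators and their Available/Progressing/Degraded columns
oc get clusteroperators

# Example only: inspect one operator's conditions in detail
oc get clusteroperator machine-config -o yaml
```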

Comment 5 Jakub Hadvig 2020-11-18 10:33:43 UTC
Investigated the attached must-gather and there is nothing suspicious in the console or console-operator pod logs.
I also checked other operators' logs, but didn't find any errors/failures that would point to a particular issue.
Closing this BZ. Feel free to re-open in case of questions/comments.

Comment 6 Christian LaPolt 2020-11-18 17:20:54 UTC
I believe this is a valid bug. The UI (oc get clusterversion also reports it is unable to update) reports that upgrades are failing when they are in fact continuing. This behavior would be confusing and disconcerting to an end user. Was this not able to be reproduced? I see it on every upgrade.

Comment 7 Jakub Hadvig 2020-11-23 11:06:16 UTC
Did some investigation by running the upgrade from 4.4.27 -> 4.5.14.
Here is the log from it:
```
oc get clusterversion -w
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.27    True        True          65s     Working towards 4.5.14: 18% complete
version   4.4.27    True        True          70s     Working towards 4.5.14: 27% complete
version   4.4.27    True        True          2m1s    Working towards 4.5.14: 27% complete
version   4.4.27    True        True          3m17s   Working towards 4.5.14: 27% complete
version   4.4.27    True        True          3m35s   Working towards 4.5.14: 29% complete
version   4.4.27    True        True          4m50s   Working towards 4.5.14: 46% complete
version   4.4.27    True        True          5m5s    Working towards 4.5.14: 66% complete
version   4.4.27    True        True          5m20s   Working towards 4.5.14: 68% complete
version   4.4.27    True        True          5m50s   Unable to apply 4.5.14: an unknown error has occurred: MultipleErrors
version   4.4.27    True        True          6m50s   Working towards 4.5.14: 74% complete
version   4.4.27    True        True          6m50s   Working towards 4.5.14: 74% complete
version   4.4.27    True        True          9m42s   Working towards 4.5.14: 76% complete
version   4.4.27    True        True          12m     Working towards 4.5.14: 76% complete, waiting on openshift-samples
version   4.4.27    True        True          12m     Working towards 4.5.14: 76% complete, waiting on openshift-samples
version   4.4.27    True        True          15m     Working towards 4.5.14: 76% complete, waiting on openshift-samples
version   4.4.27    True        True          15m     Working towards 4.5.14: 79% complete
version   4.4.27    True        True          18m     Working towards 4.5.14: 79% complete, waiting on network
version   4.4.27    True        True          18m     Working towards 4.5.14: 79% complete, waiting on network
version   4.4.27    True        True          20m     Working towards 4.5.14: 79% complete
version   4.4.27    True        True          22m     Working towards 4.5.14: 81% complete
version   4.4.27    True        True          22m     Working towards 4.5.14: 81% complete
version   4.4.27    True        True          22m     Working towards 4.5.14: 84% complete
version   4.4.27    True        True          25m     Working towards 4.5.14: 84% complete
version   4.4.27    True        True          26m     Working towards 4.5.14: 84% complete, waiting on machine-config
version   4.4.27    True        True          27m     Working towards 4.5.14
version   4.4.27    True        True          27m     Working towards 4.5.14: downloading update
version   4.4.27    True        True          27m     Working towards 4.5.14
version   4.4.27    True        True          27m     Working towards 4.5.14: 0% complete
version   4.4.27    True        True          28m     Working towards 4.5.14: 36% complete
version   4.4.27    True        True          28m     Working towards 4.5.14: 76% complete
version   4.4.27    True        True          28m     Working towards 4.5.14: 84% complete
version   4.4.27    True        True          32m     Working towards 4.5.14
version   4.4.27    True        True          32m     Working towards 4.5.14: downloading update
version   4.4.27    True        True          32m     Working towards 4.5.14: downloading update
version   4.4.27    True        True          32m     Working towards 4.5.14
version   4.4.27    True        True          32m     Working towards 4.5.14: 0% complete
version   4.4.27    True        True          32m     Working towards 4.5.14: 18% complete
version   4.4.27    True        True          35m     Working towards 4.5.14: 27% complete
version   4.4.27    True        True          37m     Unable to apply 4.5.14: the cluster operator openshift-apiserver is degraded
version   4.4.27    True        True          40m     Unable to apply 4.5.14: the cluster operator openshift-apiserver is degraded
version   4.4.27    True        True          40m     Unable to apply 4.5.14: the cluster operator openshift-apiserver is degraded
version   4.4.27    True        True          41m     Working towards 4.5.14: 39% complete
version   4.4.27    True        True          41m     Working towards 4.5.14: 39% complete
version   4.4.27    True        True          41m     Working towards 4.5.14: 48% complete
version   4.4.27    True        True          41m     Working towards 4.5.14: 77% complete
version   4.4.27    True        True          42m     Working towards 4.5.14: 84% complete
version   4.4.27    True        True          42m     Working towards 4.5.14: 87% complete
version   4.5.14    True        False         0s      Cluster version is 4.5.14
```

From it I can see that there is an issue with the openshift-apiserver operator.
I checked the openshift-apiserver operator's logs but haven't found anything suspicious.
On the other hand, after checking the logs from the openshift-apiserver pod, I see it's flooded with the following errors:
```
2020-10-16T14:02:14.351597436Z W1016 14:02:14.351511       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:02:25.270434318Z W1016 14:02:25.270305       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:03:21.498404881Z W1016 14:03:21.498328       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:03:31.107598105Z W1016 14:03:31.107521       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:04:21.344546312Z W1016 14:04:21.338265       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:04:27.000020743Z W1016 14:04:26.999923       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:04:58.75845691Z W1016 14:04:58.758327       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:06:05.06361367Z W1016 14:06:05.063540       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
2020-10-16T14:07:43.949542261Z W1016 14:07:43.949478       1 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted
```
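
For anyone triaging this further, the openshift-apiserver ClusterOperator conditions and related data can be pulled out roughly like this (commands are illustrative):
```
# Inspect the conditions the CVO is reacting to
oc get clusteroperator openshift-apiserver -o yaml

# Gather the operator's resources, pods and logs in one pass
oc adm inspect clusteroperator/openshift-apiserver --dest-dir=./inspect-openshift-apiserver
```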

Moving the issue to the apiserver team since I don't see any issue with the console or console-operator itself during the upgrade process.

Comment 8 Dan Li 2020-12-02 15:25:41 UTC
Hi apiserver team, leaving this bug to you to tag UpcomingSprints in the future.

Comment 9 Michal Fojtik 2020-12-23 11:58:26 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 10 Christian LaPolt 2021-01-07 16:23:55 UTC
This is still present in 4.7.

As an example, I ran an update from 4.7.0-0.nightly-s390x-2020-12-01-004903 (fresh clean install) to 4.7.0-0.nightly-s390x-2021-01-07-140745

The following conditions appeared; at least twice the GUI showed:

Update status
Update to 4.7.0-0.nightly-s390x-2021-01-07-140745 in progress
Failing


At those two points oc get clusterversion indicated the following.

NAME      VERSION                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-s390x-2020-12-01-004903   True        True          15m     Unable to apply 4.7.0-0.nightly-s390x-2021-01-07-140745: the cluster operator kube-apiserver has not yet successfully rolled out

NAME      VERSION                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-s390x-2020-12-01-004903   True        True          46m     Unable to apply 4.7.0-0.nightly-s390x-2021-01-07-140745: the cluster operator machine-config has not yet successfully rolled out



The upgrade did in fact continue and was successful:

NAME      VERSION                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         5m17s   Cluster version is 4.7.0-0.nightly-s390x-2021-01-07-140745

NAME                                       VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  6m3s
baremetal                                  4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  114m
cloud-credential                           4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  114m
cluster-autoscaler                         4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  112m
config-operator                            4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  114m
console                                    4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  10m
csi-snapshot-controller                    4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  16m
dns                                        4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  112m
etcd                                       4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  112m
image-registry                             4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  107m
ingress                                    4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  105m
insights                                   4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  114m
kube-apiserver                             4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  111m
kube-controller-manager                    4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  111m
kube-scheduler                             4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  111m
kube-storage-version-migrator              4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  8m4s
machine-api                                4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  113m
machine-approver                           4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  112m
machine-config                             4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  7m2s
marketplace                                4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  15m
monitoring                                 4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  35m
network                                    4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  27m
node-tuning                                4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  37m
openshift-apiserver                        4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  45m
openshift-controller-manager               4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  35m
openshift-samples                          4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  37m
operator-lifecycle-manager                 4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  113m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  113m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  21m
service-ca                                 4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  114m
storage                                    4.7.0-0.nightly-s390x-2021-01-07-140745   True        False         False	  114m
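
A rough way to capture these transient Failing messages over the whole upgrade window is a simple polling loop over the ClusterVersion conditions (illustrative only; assumes the default ClusterVersion object name `version`):
```
# Log ClusterVersion condition transitions with timestamps every 30s during an upgrade
while true; do
  date -u
  oc get clusterversion version \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'
  sleep 30
done
```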

Comment 11 Michal Fojtik 2021-01-07 16:24:40 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 12 Stefan Schimanski 2021-01-12 11:50:25 UTC
Repeating what Antonio wrote:

    from the MCO's (and/or any operator's) point of view we will toggle between failing/upgrading, and we keep retrying the loop; that's how we re-sync (in every operator). Differentiating a hard failure from a failure that needs to be retried isn't trivial (or even possible in certain scenarios), so what the UI does is simply query the ClusterOperator API, which can still toggle its status while it's upgrading (or reconciling).

This is the ask, not finding random upgrade issues.

Comment 14 Samuel Padgett 2021-01-20 14:40:54 UTC
Stefan, I'm a little unclear on the ask for the console team. We're simply showing the status from the ClusterVersion resource. Is the request not to show it in the UI when the `Failing` condition is `True`? Or to throttle updates so it doesn't appear to flap?

Comment 15 Samuel Padgett 2021-01-21 17:51:08 UTC
I understand the usability concern here, but I'm not sure the console team is the right owner. Console is simply showing the status from the ClusterVersion resource. You would see the same Failing condition on the CLI with `oc describe clusterversion version`. Moving to the CVO component for consideration.

It's possible for console to ignore the `Failing` condition on the ClusterVersion or present it in a different way. But my question would be: if it's confusing or misleading, should the CVO adjust when it sets that condition? If the OTA team has a recommendation for how the console can show upgrade status differently, we can definitely make changes.
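
For comparison, the exact condition the console mirrors can be read from the CLI; assuming the default ClusterVersion object name `version`, something like:
```
# Show just the Failing condition status and message from the ClusterVersion resource
oc get clusterversion version \
  -o jsonpath='{.status.conditions[?(@.type=="Failing")].status}{": "}{.status.conditions[?(@.type=="Failing")].message}{"\n"}'
```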

Comment 16 W. Trevor King 2021-01-21 19:24:31 UTC
[1] is currently in flight for bug 1884334, to fix some CVO logic that had us going Failing=True when we shouldn't have been.  Yes, this is just cosmetic, because the CVO will continue to attempt to reconcile/update the cluster while it is complaining.  I'm going to close this bug as a dup of 1884334, but if you still see issues with the CVO reporting Failing=True after that (or on releases where that change has been backported), please file bugs like "CVO is reporting Failing=True with $REASON and $MESSAGE, but $REASONS_WHY_YOU_THINK_THE_CLUSTER_IS_ACTUALLY_HAPPY".

[1]: https://github.com/openshift/cluster-version-operator/pull/486
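
To check whether a given release image already carries that CVO change, something like the following can be used (the release pullspec is a placeholder):
```
# List component repositories/commits in a release image and find the cluster-version-operator commit
oc adm release info <release-pullspec> --commits | grep cluster-version-operator
```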

*** This bug has been marked as a duplicate of bug 1884334 ***

