During cluster upgrades, web console pages that hold watches can stop updating. It appears that the WebSockets are either closing and not getting reopened, or they stay open but stop receiving new messages. Console has logic that will attempt to reopen closed WebSockets, but it is not working during upgrades.
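To make the two suspected failure modes concrete (socket closed and never reopened, versus socket open but silent), here is a minimal sketch of a watchdog decision. It is not the console's actual reconnect logic. The names `nextBackoffMs`, `isStale`, and `watchdogAction`, and all the timing values, are hypothetical and chosen only for illustration.

```typescript
// Hypothetical helpers; names and thresholds are illustrative assumptions,
// not taken from the console code base.

/** Delay before reconnect attempt `attempt` (0-based): doubles each time, up to a cap. */
function nextBackoffMs(attempt: number, baseMs: number = 1000, maxMs: number = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

/** True when no message has arrived for longer than `timeoutMs`. */
function isStale(lastMessageAtMs: number, nowMs: number, timeoutMs: number): boolean {
  return nowMs - lastMessageAtMs > timeoutMs;
}

type SocketState = 'open' | 'closed';

/** Decide what a periodic watchdog tick should do for one watch connection. */
function watchdogAction(
  state: SocketState,
  lastMessageAtMs: number,
  nowMs: number,
  timeoutMs: number,
): 'none' | 'reconnect' {
  // Case 1: the socket closed and nothing reopened it.
  if (state === 'closed') {
    return 'reconnect';
  }
  // Case 2: the socket still reports open but has gone silent.
  return isStale(lastMessageAtMs, nowMs, timeoutMs) ? 'reconnect' : 'none';
}
```

Note that the silence check only makes sense if the watch is expected to carry some periodic traffic. A watch on resources that simply are not changing would otherwise be reconnected unnecessarily.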
This is particularly problematic when users initiate an upgrade through the console cluster settings page. We watch cluster operators and machine config pools to show the progress during upgrades, but the progress bars stop updating midway. We are also adding a pause/resume button for updating machine config pools. The button doesn't change state when clicked since the WebSockets aren't getting new messages.
I'm opening this against the Management Console component to investigate further. It could be an issue with ingress or the API server. We should determine whether the WebSockets are getting closed and whether console tries to reopen them. (We should be printing messages to the JS console when this happens.) It would also be good to understand which cluster operator is being updated when the updates stop occurring. A HAR file collected during the upgrade could be helpful. Another thing to try is to keep the console side-by-side with a terminal window that is watching updates to ClusterOperators, to make sure they're in sync and to track when the UI updates stop.
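As a rough idea of what a useful JS console message could look like when a socket closes, here is a small sketch. The function name `describeClose` and the message format are hypothetical, not what the console prints today. The close codes listed are standard WebSocket status codes from RFC 6455.

```typescript
// Hypothetical formatter for WebSocket close events; the function name and
// message format are assumptions made for illustration.
function describeClose(code: number, wasClean: boolean): string {
  // A few standard close codes defined by RFC 6455.
  const known: { [code: number]: string } = {
    1000: 'normal closure',
    1001: 'endpoint going away',
    1006: 'abnormal closure, no close frame received',
    1011: 'server encountered an unexpected condition',
  };
  const reason = known[code] || 'other';
  return 'WebSocket closed: code=' + code + ' (' + reason + '), clean=' + wasClean;
}
```

In a browser this could be hooked up with something like `ws.addEventListener('close', (e) => console.warn(describeClose(e.code, e.wasClean)))`. A code of 1006 during an upgrade would suggest the connection was dropped underneath the console (for example by a router or API server restart) rather than closed deliberately.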
Based on my cursory investigations that led to this bug, I believe the update of the kube-apiserver ClusterOperator is what causes this bug as it is the second ClusterOperator to be updated after etcd, and the console correctly reports etcd is updated but not kube-apiserver or any resources that follow.
I think the start of the ClusterOperators updating may be a red herring. I was just able to reproduce the bug before any of the ClusterOperators started updating.
Created attachment 1872712
upgrade status bar shows correctly
1. Pull the latest master code and build a local bridge.
2. Trigger an upgrade from 4.10.8 to 4.10.9. ClusterOperator progress updates successfully, and the Pause update/Resume update button changes state correctly when clicked.
The upgrade must be triggered from the console, since the issue seems to be reproducible only there.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.