Bug 2058282 - Websockets stop updating during cluster upgrades
Summary: Websockets stop updating during cluster upgrades
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Yadan Pei
QA Contact: Yadan Pei
URL:
Whiteboard:
Depends On:
Blocks: 2073023
Reported: 2022-02-24 16:11 UTC by Samuel Padgett
Modified: 2022-08-10 10:51 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:51:15 UTC
Target Upstream Version:
Embargoed:


Attachments
upgrade status bar shows correctly (615.01 KB, image/png), 2022-04-15 09:25 UTC, Yadan Pei


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift console pull 11288 | 0 | None | open | Bug 2058282: Fix WebSockets not reconnecting during upgrade | 2022-04-05 17:47:01 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:51:34 UTC

Description Samuel Padgett 2022-02-24 16:11:57 UTC
During cluster upgrades, web console pages that hold watches can stop updating. It appears that the WebSockets are either closing and not getting reopened, or they stay open but stop receiving new messages. Console has logic that will attempt to reopen closed WebSockets, but it is not working during upgrades.
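For readers unfamiliar with the reopen logic mentioned above, here is a minimal sketch of one common approach: reconnect on close with exponential backoff. It is not the console's actual implementation. `SocketLike`, `ReconnectingWatch`, `backoffDelay`, and the 1 s / 30 s delay values are all hypothetical names and numbers chosen for illustration; a real socket factory would wrap the browser `WebSocket`.

```typescript
// Hypothetical sketch; none of these names come from the console source.

// Minimal subset of a WebSocket that the sketch relies on.
interface SocketLike {
  onopen: (() => void) | null;
  onclose: (() => void) | null;
  close(): void;
}

type SocketFactory = () => SocketLike;
type Scheduler = (fn: () => void, delayMs: number) => void;

// Exponential backoff: baseMs * 2^attempt, capped at maxMs (assumed values).
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

class ReconnectingWatch {
  private attempt = 0;
  private stopped = false;
  private socket: SocketLike | null = null;

  constructor(
    private readonly createSocket: SocketFactory,
    // The scheduler is injectable so the logic can be exercised without real timers.
    private readonly schedule: Scheduler = (fn, ms) => {
      setTimeout(fn, ms);
    },
  ) {}

  start(): void {
    this.stopped = false;
    this.connect();
  }

  stop(): void {
    this.stopped = true;
    if (this.socket) {
      this.socket.close();
    }
  }

  private connect(): void {
    const socket = this.createSocket();
    this.socket = socket;

    // A successful open resets the backoff sequence.
    socket.onopen = () => {
      this.attempt = 0;
    };

    // Any close that was not requested through stop() schedules a reopen.
    socket.onclose = () => {
      if (this.stopped) {
        return;
      }
      const delay = backoffDelay(this.attempt);
      this.attempt += 1;
      console.log(`watch socket closed; reconnecting in ${delay} ms`);
      this.schedule(() => {
        if (!this.stopped) {
          this.connect();
        }
      }, delay);
    };
  }
}
```

Note that a scheme like this only recovers from sockets that actually fire a close event. It does nothing for the second failure mode described above, where the socket stays open but stops receiving messages.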

This is particularly problematic when users initiate an upgrade through the console cluster settings page. We watch cluster operators and machine config pools to show the progress during upgrades, but the progress bars stop updating midway. We are also adding a pause/resume button for updating machine config pools. The button doesn't change state when clicked since the WebSockets aren't getting new messages.

I'm opening this against the Management Console component to investigate further. It could be an issue with ingress or the API server. We should determine whether the WebSockets are getting closed and whether console tries to reopen them. (We should be printing messages to the JS console when this happens.) It would also be good to understand which cluster operator is being updated when the updates stop occurring. A HAR file collected during the upgrade could be helpful. Another thing to try is to keep the console side-by-side with a terminal window that is watching updates to ClusterOperators, to make sure they're in sync and to track when the UI updates stop.
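To tell the two suspected failure modes apart during such an investigation, a small monitor could record socket events and the time of the last message. The sketch below is only an illustration under stated assumptions: `WatchMonitor`, its method names, and the silence threshold are hypothetical and do not come from the console code. A quiet watch can also be legitimate when nothing is changing, so the "open-but-silent" result is a hint to compare against a terminal watch, not proof of a fault.

```typescript
// Hypothetical diagnostic helper; not part of the console source.

type WatchState = "connecting" | "open" | "closed";
type Diagnosis = "connecting" | "healthy" | "open-but-silent" | "closed";

class WatchMonitor {
  private state: WatchState = "connecting";
  private lastMessageAt: number;

  constructor(
    private readonly name: string,
    // The clock is injectable so the monitor can be exercised deterministically.
    private readonly now: () => number = () => Date.now(),
  ) {
    this.lastMessageAt = this.now();
  }

  // Call from the socket's open handler.
  onOpen(): void {
    this.state = "open";
    this.lastMessageAt = this.now();
    console.log(`[${this.name}] socket opened`);
  }

  // Call from the socket's message handler.
  onMessage(): void {
    this.lastMessageAt = this.now();
  }

  // Call from the socket's close handler with the close code.
  onClose(code: number): void {
    this.state = "closed";
    console.log(`[${this.name}] socket closed with code ${code}`);
  }

  // Distinguishes "closed and not reopened" from "open but not receiving messages".
  diagnose(silenceThresholdMs: number): Diagnosis {
    if (this.state !== "open") {
      return this.state;
    }
    const silentFor = this.now() - this.lastMessageAt;
    return silentFor > silenceThresholdMs ? "open-but-silent" : "healthy";
  }
}
```

Calling `diagnose` periodically for each watch (for example ClusterOperators and MachineConfigPools) and logging the result alongside timestamps would show when the UI stopped receiving updates relative to the upgrade.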

Comment 1 Robb Hamilton 2022-02-24 17:08:14 UTC
Based on my cursory investigations that led to this bug, I believe the update of the kube-apiserver ClusterOperator is what causes this bug as it is the second ClusterOperator to be updated after etcd, and the console correctly reports etcd is updated but not kube-apiserver or any resources that follow.

Comment 2 Robb Hamilton 2022-02-24 17:55:35 UTC
I think the start of the ClusterOperators updating may be a red herring.  I was just able to reproduce the bug before any of the ClusterOperators started updating.

Comment 5 Yadan Pei 2022-04-15 09:25:09 UTC
Created attachment 1872712: upgrade status bar shows correctly

1. Pull the latest master code and build a local bridge.
2. Trigger an upgrade from 4.10.8 to 4.10.9. The ClusterOperators progress status updates successfully, and the Pause update/Resume update button changes state correctly when clicked.

Comment 6 Yadan Pei 2022-04-15 09:25:53 UTC
The upgrade must be performed from the console, since the issue seems to be reproducible only there.

Comment 8 errata-xmlrpc 2022-08-10 10:51:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

