Bug 2058282

Summary: Websockets stop updating during cluster upgrades
Product: OpenShift Container Platform
Reporter: Samuel Padgett <spadgett>
Component: Management Console
Assignee: Yadan Pei <yapei>
Status: CLOSED ERRATA
QA Contact: Yadan Pei <yapei>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: aos-bugs, jhadvig, rhamilto, yapei
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 10:51:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2073023
Attachments:
  upgrade status bar shows correctly (no flags)

Description Samuel Padgett 2022-02-24 16:11:57 UTC
During cluster upgrades, web console pages that hold watches can stop updating. It appears that the WebSockets are either closing and not getting reopened, or they stay open but stop receiving new messages. Console has logic that will attempt to reopen closed WebSockets, but it is not working during upgrades.

This is particularly problematic when users initiate an upgrade through the console cluster settings page. We watch cluster operators and machine config pools to show the progress during upgrades, but the progress bars stop updating midway. We are also adding a pause/resume button for machine config pool updates. The button doesn't change state when clicked because the WebSockets aren't receiving new messages.
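For context on what the pause/resume control does under the hood, it toggles the spec.paused field on the MachineConfigPool. Below is a minimal TypeScript sketch of that request; the API server URL, token handling, and helper name are illustrative assumptions, and the console itself issues the equivalent request through its bridge proxy rather than a raw fetch:

// Hypothetical sketch: toggle spec.paused on a MachineConfigPool via the Kubernetes API.
// API_SERVER and TOKEN are placeholders; the console goes through its bridge proxy instead.
const API_SERVER = 'https://api.example.cluster:6443';
const TOKEN = '<bearer-token>';

async function setMachineConfigPoolPaused(pool: string, paused: boolean): Promise<void> {
  const url = `${API_SERVER}/apis/machineconfiguration.openshift.io/v1/machineconfigpools/${pool}`;
  const res = await fetch(url, {
    method: 'PATCH',
    headers: {
      'Content-Type': 'application/merge-patch+json',
      Authorization: `Bearer ${TOKEN}`,
    },
    body: JSON.stringify({ spec: { paused } }),
  });
  if (!res.ok) {
    throw new Error(`Failed to patch MachineConfigPool ${pool}: ${res.status}`);
  }
}

// Example: pause worker pool updates, e.g. setMachineConfigPoolPaused('worker', true)

If the watch sockets stop delivering new messages, the button appears stuck even though a PATCH like this succeeds, which matches the behavior described above.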

I'm opening this against the Management Console component to investigate further. It could be an issue with ingress or the API server. We should determine whether the WebSockets are getting closed and whether the console tries to reopen them. (We should be printing messages to the JS console when this happens.) It would also be good to understand which cluster operator is being updated when the UI stops updating. A HAR file collected during the upgrade could be helpful. Another thing to try is keeping the console side-by-side with a terminal window watching ClusterOperator updates, to make sure they stay in sync and to note when the UI updates stop.
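To make the suggested investigation concrete, here is a minimal sketch of the kind of reopen-and-log behavior to look for. This is not the console's actual implementation; the watch URL, backoff values, and log messages are assumptions for illustration only:

// Illustrative reconnecting watch over WebSocket; not the console's real code.
// WATCH_URL and the backoff policy are assumptions.
const WATCH_URL =
  'wss://console.example.cluster/api/kubernetes/apis/config.openshift.io/v1/clusteroperators?watch=true';

let lastResourceVersion: string | undefined;
let retryDelayMs = 1000;

function openWatch(): void {
  const url = lastResourceVersion
    ? `${WATCH_URL}&resourceVersion=${lastResourceVersion}`
    : WATCH_URL;
  const ws = new WebSocket(url);

  ws.onopen = () => {
    console.log('[watch] opened', url);
    retryDelayMs = 1000; // reset backoff after a successful open
  };

  ws.onmessage = (event: MessageEvent) => {
    const { type, object } = JSON.parse(event.data as string);
    lastResourceVersion = object?.metadata?.resourceVersion ?? lastResourceVersion;
    console.log('[watch] event', type, object?.metadata?.name);
  };

  ws.onclose = (event: CloseEvent) => {
    // A message here during the upgrade would confirm the socket is actually
    // closing rather than staying open but going silent.
    console.warn(`[watch] closed (code ${event.code}), reopening in ${retryDelayMs}ms`);
    setTimeout(openWatch, retryDelayMs);
    retryDelayMs = Math.min(retryDelayMs * 2, 30000); // exponential backoff
  };

  ws.onerror = (err) => console.error('[watch] error', err);
}

openWatch();

If the JS console shows no close events at all while the UI is stale, that would point to the "open but silent" case rather than a missed reconnect.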

Comment 1 Robb Hamilton 2022-02-24 17:08:14 UTC
Based on the cursory investigation that led to this bug, I believe the update of the kube-apiserver ClusterOperator is what causes it: kube-apiserver is the second ClusterOperator to be updated after etcd, and the console correctly reports etcd as updated but not kube-apiserver or any ClusterOperator that follows.

Comment 2 Robb Hamilton 2022-02-24 17:55:35 UTC
I think the start of the ClusterOperators updating may be a red herring.  I was just able to reproduce the bug before any of the ClusterOperators started updating.

Comment 5 Yadan Pei 2022-04-15 09:25:09 UTC
Created attachment 1872712 [details]
upgrade status bar shows correctly

1. Pull the latest master code and build a local bridge.
2. Trigger an upgrade from 4.10.8 to 4.10.9. The ClusterOperators progress status updates successfully, and the Pause update/Resume update state changes correctly when clicked.

Comment 6 Yadan Pei 2022-04-15 09:25:53 UTC
The upgrade must be triggered from the console, since the issue only seems to be reproducible there.

Comment 8 errata-xmlrpc 2022-08-10 10:51:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069