2058282 – Websockets stop updating during cluster upgrades


Summary: Websockets stop updating during cluster upgrades

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Management Console
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Yadan Pei
QA Contact:	Yadan Pei
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2073023

Reported:	2022-02-24 16:11 UTC by Samuel Padgett
Modified:	2022-08-10 10:51 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-10 10:51:15 UTC
Target Upstream Version:
Embargoed:

Attachments
upgrade status bar shows correctly (615.01 KB, image/png) 2022-04-15 09:25 UTC, Yadan Pei	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift console pull 11288	0	None	open	Bug 2058282: Fix WebSockets not reconnecting during upgrade	2022-04-05 17:47:01 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 10:51:34 UTC

Description Samuel Padgett 2022-02-24 16:11:57 UTC

During cluster upgrades, web console pages that hold watches can stop updating. It appears that the WebSockets are either closing and not being reopened, or staying open but no longer receiving new messages. The console has logic that attempts to reopen closed WebSockets, but it is not working during upgrades.
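The console's actual reconnect code is not shown in this report. As a rough illustration of the general technique (reopen a WebSocket on close, with capped exponential backoff), here is a minimal TypeScript sketch. `SocketLike`, `reconnectDelayMs`, and `ReconnectingSocket` are hypothetical names, and the 1 s base / 30 s cap are assumed values; the socket factory and scheduler are injected so the logic can run outside a browser.

```typescript
// Hypothetical minimal socket shape (a browser WebSocket would be wrapped to fit it).
interface SocketLike {
  onopen: (() => void) | null;
  onmessage: ((data: string) => void) | null;
  onclose: ((code: number, reason: string) => void) | null;
  close(): void;
}

type SocketFactory = () => SocketLike;
type Scheduler = (fn: () => void, delayMs: number) => void;

// Capped exponential backoff: baseMs * 2^attempt, never more than maxMs (assumed defaults).
function reconnectDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Hypothetical wrapper: reopens the socket whenever it closes, until stop() is called.
class ReconnectingSocket {
  private attempt = 0;
  private stopped = false;
  private socket: SocketLike | null = null;
  private readonly factory: SocketFactory;
  private readonly onMessage: (data: string) => void;
  private readonly schedule: Scheduler;
  private readonly log: (msg: string) => void;

  constructor(
    factory: SocketFactory,
    onMessage: (data: string) => void,
    schedule: Scheduler = (fn, ms) => { setTimeout(fn, ms); },
    log: (msg: string) => void = (m) => console.warn(m),
  ) {
    this.factory = factory;
    this.onMessage = onMessage;
    this.schedule = schedule;
    this.log = log;
  }

  connect(): void {
    if (this.stopped) return;
    const ws = this.factory();
    this.socket = ws;
    // A successful open resets the backoff sequence.
    ws.onopen = () => { this.attempt = 0; };
    ws.onmessage = (data) => this.onMessage(data);
    ws.onclose = (code, reason) => {
      if (this.stopped) return;
      const delay = reconnectDelayMs(this.attempt++);
      // Log every close so a failure to reconnect is visible in the JS console.
      this.log(`WebSocket closed (code=${code}, reason=${reason}); reconnecting in ${delay} ms`);
      this.schedule(() => this.connect(), delay);
    };
  }

  stop(): void {
    this.stopped = true;
    if (this.socket) this.socket.close();
  }
}
```

Note that a reconnect loop like this only helps in the first failure mode described above (the socket closes). If the socket stays open but stops delivering messages, `onclose` never fires and no reconnect is attempted.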

This is particularly problematic when users initiate an upgrade through the console cluster settings page. We watch cluster operators and machine config pools to show the progress during upgrades, but the progress bars stop updating midway. We are also adding a pause/resume button for updating machine config pools. The button doesn't change state when clicked since the WebSockets aren't getting new messages.

I'm opening this against the Management Console component to investigate further. It could be an issue with ingress or the API server. We should determine whether the WebSockets are getting closed and whether the console tries to reopen them. (We should be printing messages to the JS console when this happens.) It would also be good to know which ClusterOperator is being updated when the UI stops updating. A HAR file collected during an upgrade could help. Another thing to try is keeping the console side by side with a terminal window that watches ClusterOperators, to confirm the two stay in sync and to note when the UI updates stop.
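To tell the "closed and not reopened" case apart from the "open but silent" case, one option is an idle watchdog that logs to the JS console when a socket has gone too long without a message. The following TypeScript sketch is hypothetical (`IdleWatchdog` is not a console API) and the threshold is an assumption: a watch on resources that genuinely are not changing can also be quiet, so a stale flag is a hint to compare against the terminal watch, not proof of a stalled socket. The clock is injected so the logic is testable.

```typescript
// Hypothetical helper: flags a connection that is still open but has been
// silent for longer than idleMs.
class IdleWatchdog {
  private lastMessageAt: number;
  private readonly idleMs: number;
  private readonly now: () => number;

  constructor(idleMs: number, now: () => number = Date.now) {
    this.idleMs = idleMs;
    this.now = now;
    this.lastMessageAt = now();
  }

  // Call this from the socket's message handler.
  recordMessage(): void {
    this.lastMessageAt = this.now();
  }

  // True if no message has arrived for more than idleMs.
  isStale(): boolean {
    return this.now() - this.lastMessageAt > this.idleMs;
  }

  // Logs (to the JS console by default) and returns true when the socket looks stale.
  check(label: string, log: (msg: string) => void = (m) => console.warn(m)): boolean {
    const stale = this.isStale();
    if (stale) {
      const silentMs = this.now() - this.lastMessageAt;
      log(`${label}: no messages for ${silentMs} ms; socket may be open but stalled`);
    }
    return stale;
  }
}
```

In a page, `check` would typically be called on a timer (for example with `setInterval`), and the log timestamps lined up against the terminal watch of ClusterOperators.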

Comment 1 Robb Hamilton 2022-02-24 17:08:14 UTC

Based on the cursory investigation that led to this bug, I believe the update of the kube-apiserver ClusterOperator is what triggers it: kube-apiserver is the second ClusterOperator to be updated (after etcd), and the console correctly reports that etcd is updated but not kube-apiserver or any resources that follow.

Comment 2 Robb Hamilton 2022-02-24 17:55:35 UTC

I think the start of the ClusterOperator updates may be a red herring. I was just able to reproduce the bug before any of the ClusterOperators started updating.

Comment 5 Yadan Pei 2022-04-15 09:25:09 UTC

Created attachment 1872712 [details]
upgrade status bar shows correctly

1. Pull the latest master code and build a local bridge.
2. Trigger an upgrade from 4.10.8 to 4.10.9. ClusterOperator progress updates successfully, and the Pause update/Resume update state changes correctly when clicked.

Comment 6 Yadan Pei 2022-04-15 09:25:53 UTC

The upgrade must be performed from the console, since the issue seems to be reproducible only there.

Comment 8 errata-xmlrpc 2022-08-10 10:51:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
