Bug 2058282

Summary: Websockets stop updating during cluster upgrades
Product: OpenShift Container Platform
Reporter: Samuel Padgett <spadgett>
Component: Management Console
Assignee: Yadan Pei <yapei>
Status: CLOSED ERRATA
QA Contact: Yadan Pei <yapei>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: aos-bugs, jhadvig, rhamilto, yapei
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 10:51:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2073023
Attachments:
  upgrade status bar shows correctly (no flags)

Description Samuel Padgett 2022-02-24 16:11:57 UTC
During cluster upgrades, web console pages that hold watches can stop updating. It appears that the WebSockets are either closing and not getting reopened, or they stay open but stop receiving new messages. Console has logic that will attempt to reopen closed WebSockets, but it is not working during upgrades.

This is particularly problematic when users initiate an upgrade through the console cluster settings page. We watch cluster operators and machine config pools to show the progress during upgrades, but the progress bars stop updating midway. We are also adding a pause/resume button for machine config pool updates. The button doesn't change state when clicked because the WebSockets aren't receiving new messages.
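For context on what the pause/resume control does under the hood, it toggles the spec.paused field on the MachineConfigPool. Below is a minimal TypeScript sketch of that request; the API server URL, token handling, and helper name are illustrative assumptions, and the console itself issues the equivalent request through its bridge proxy rather than a raw fetch:

// Hypothetical sketch: toggle spec.paused on a MachineConfigPool via the Kubernetes API.
// API_SERVER and TOKEN are placeholders; the console goes through its bridge proxy instead.
const API_SERVER = 'https://api.example.cluster:6443';
const TOKEN = '<bearer-token>';

async function setMachineConfigPoolPaused(pool: string, paused: boolean): Promise<void> {
  const url = `${API_SERVER}/apis/machineconfiguration.openshift.io/v1/machineconfigpools/${pool}`;
  const res = await fetch(url, {
    method: 'PATCH',
    headers: {
      'Content-Type': 'application/merge-patch+json',
      Authorization: `Bearer ${TOKEN}`,
    },
    body: JSON.stringify({ spec: { paused } }),
  });
  if (!res.ok) {
    throw new Error(`Failed to patch MachineConfigPool ${pool}: ${res.status}`);
  }
}

// Example: pause worker pool updates, e.g. setMachineConfigPoolPaused('worker', true)

If the watch sockets stop delivering new messages, the button appears stuck even though a PATCH like this succeeds, which matches the behavior described above.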

I'm opening this against the Management Console component to investigate further. It could be an issue with ingress or the API server. We should determine whether the WebSockets are getting closed and whether the console tries to reopen them. (We should be printing messages to the JS console when this happens.) It would also be good to understand which cluster operator is being updated when the UI stops updating. A HAR file collected during the upgrade could be helpful. Another thing to try is keeping the console side-by-side with a terminal window watching ClusterOperator updates, to make sure they stay in sync and to note when the UI updates stop.
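To make the suggested investigation concrete, here is a minimal sketch of the kind of reopen-and-log behavior to look for. This is not the console's actual implementation; the watch URL, backoff values, and log messages are assumptions for illustration only:

// Illustrative reconnecting watch over WebSocket; not the console's real code.
// WATCH_URL and the backoff policy are assumptions.
const WATCH_URL =
  'wss://console.example.cluster/api/kubernetes/apis/config.openshift.io/v1/clusteroperators?watch=true';

let lastResourceVersion: string | undefined;
let retryDelayMs = 1000;

function openWatch(): void {
  const url = lastResourceVersion
    ? `${WATCH_URL}&resourceVersion=${lastResourceVersion}`
    : WATCH_URL;
  const ws = new WebSocket(url);

  ws.onopen = () => {
    console.log('[watch] opened', url);
    retryDelayMs = 1000; // reset backoff after a successful open
  };

  ws.onmessage = (event: MessageEvent) => {
    const { type, object } = JSON.parse(event.data as string);
    lastResourceVersion = object?.metadata?.resourceVersion ?? lastResourceVersion;
    console.log('[watch] event', type, object?.metadata?.name);
  };

  ws.onclose = (event: CloseEvent) => {
    // A message here during the upgrade would confirm the socket is actually
    // closing rather than staying open but going silent.
    console.warn(`[watch] closed (code ${event.code}), reopening in ${retryDelayMs}ms`);
    setTimeout(openWatch, retryDelayMs);
    retryDelayMs = Math.min(retryDelayMs * 2, 30000); // exponential backoff
  };

  ws.onerror = (err) => console.error('[watch] error', err);
}

openWatch();

If the JS console shows no close events at all while the UI is stale, that would point to the "open but silent" case rather than a missed reconnect.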

Comment 1 Robb Hamilton 2022-02-24 17:08:14 UTC
Based on the cursory investigation that led to this bug, I believe the update of the kube-apiserver ClusterOperator is what causes it: kube-apiserver is the second ClusterOperator to be updated after etcd, and the console correctly reports etcd as updated but not kube-apiserver or any ClusterOperator that follows.

Comment 2 Robb Hamilton 2022-02-24 17:55:35 UTC
I think the start of the ClusterOperators updating may be a red herring.  I was just able to reproduce the bug before any of the ClusterOperators started updating.

Comment 5 Yadan Pei 2022-04-15 09:25:09 UTC
Created attachment 1872712 [details]
upgrade status bar shows correctly

1. Pull the latest master code and build a local bridge.
2. Trigger an upgrade from 4.10.8 to 4.10.9. The ClusterOperators progress status updates successfully, and the Pause update/Resume update state changes correctly when clicked.

Comment 6 Yadan Pei 2022-04-15 09:25:53 UTC
The upgrade must be triggered from the console, since the issue only seems to be reproducible there.

Comment 8 errata-xmlrpc 2022-08-10 10:51:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069