Bug 1809253 - [4.3.z] 502 error for Prometheus API after the cluster running overnight
Summary: [4.3.z] 502 error for Prometheus API after the cluster running overnight
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.3.z
Assignee: Maru Newby
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1801573
Blocks: 1809258
Reported: 2020-03-02 17:42 UTC by Maru Newby
Modified: 2020-03-24 14:34 UTC (History)
CC: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1801573
Clones: 1809258
Environment:
Last Closed: 2020-03-24 14:34:23 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift oauth-proxy pull 159 0 None closed [release-4.3] Bug 1809253: Reload serving certs 2020-05-16 14:37:19 UTC
Github openshift oauth-proxy pull 160 0 None closed [release-4.3] Bug 1809253: Reload serving certs 2020-05-16 14:37:19 UTC
Red Hat Product Errata RHBA-2020:0858 0 None None None 2020-03-24 14:34:44 UTC

Comment 1 Scott Dodson 2020-03-18 15:22:14 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this; we just never noticed
  Yes, from 4.2.z and 4.3.1

Depending on the answers to the above questions, we can remove the UpgradeBlocker keyword.

Comment 2 Maru Newby 2020-03-18 16:16:24 UTC
Who is impacted?
  All customers that upgrade to 4.3.5.
What is the impact?
  The service CA will be rotated on upgrade, which is intended to guard against CA expiry. Without the fix for this bz, though, oauth-proxy will not automatically reload its serving certificate to pick up the new key material. If oauth-proxy is not restarted before the pre-rotation CA expires, any attempt to communicate via it will result in TLS validation errors, which will break many of the monitoring components (see [1]). For 4.1 clusters upgraded to 4.3.5, this could occur as soon as May 14th, 2020.
How involved is remediation?
  Manual restart of the monitoring components that use oauth-proxy.
Is this a regression?
  No. Without automated rotation, manual rotation (including pod restart) would be required anyway.

1: https://docs.google.com/document/d/1NB2wUf9e8XScfVM6jFBl8VuLYG6-3uV63eUpqmYE8Ts/edit

Comment 3 Junqi Zhao 2020-03-19 11:51:30 UTC
Tested with 4.3.0-0.nightly-2020-03-19-052824 following test case OCP-27992; the issue no longer occurs.

Comment 6 errata-xmlrpc 2020-03-24 14:34:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0858

