Bug 1951639 - Bootstrap API server unclean shutdown causes reconcile delay
Summary: Bootstrap API server unclean shutdown causes reconcile delay
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-20 15:49 UTC by Michael Gugino
Modified: 2021-07-27 23:02 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:02:11 UTC
Target Upstream Version:




Links
Github openshift/cluster-bootstrap pull 58 (open): Bug 1951639: Add --tear-down-delay and --tear-down-termination-timeout (last updated 2021-04-21 07:39:07 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 23:02:26 UTC)

Description Michael Gugino 2021-04-20 15:49:33 UTC
Description of problem:

During installation, the bootstrap API server is shut down in such a way that it appears to accept new connections for a time while shutting down. At some point, connections are accepted but never terminated. This delays reconciliation for various operators, increasing installation time and causing CI flakes.

Here are the details of one run for diagnosis:
https://bugzilla.redhat.com/show_bug.cgi?id=1940972#c5

That bug is about CSR approval, but the underlying cause, an unclean bootstrap API server shutdown, could apply to any operator.
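
For context, the fix (PR 58 linked above) adds a delay before tear-down and a bounded wait for graceful termination. Below is a minimal illustrative Go sketch of that general pattern; it is not the cluster-bootstrap implementation, and the server address, durations, and function names are made up for illustration:

package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// tearDown sketches the pattern: wait so load balancers can observe the
// permanent control plane, then drain the bootstrap endpoint with a bound
// on how long graceful termination may take.
func tearDown(srv *http.Server, delay, terminationTimeout time.Duration) error {
	time.Sleep(delay) // give load balancers time to switch over

	ctx, cancel := context.WithTimeout(context.Background(), terminationTimeout)
	defer cancel()
	// Shutdown stops accepting new connections and waits for in-flight
	// requests to finish, but no longer than terminationTimeout.
	return srv.Shutdown(ctx)
}

func main() {
	srv := &http.Server{Addr: ":6443"}
	go func() { _ = srv.ListenAndServe() }()
	if err := tearDown(srv, 30*time.Second, 2*time.Minute); err != nil {
		log.Printf("tear-down did not complete cleanly: %v", err)
	}
}

Without the bounded wait, a hung connection on the bootstrap endpoint can stall tear-down indefinitely, which is the behavior described above.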

Comment 2 Ke Wang 2021-05-31 05:19:37 UTC
Talked to Dev: the PR's code changes involve flags that are not exposed or visible, so it is hard for QE to verify them directly; the two flags need to be passed by the installer on the bootstrap node. Recent team installation tests have not shown any regression related to this PR, so moving the bug to VERIFIED.

Comment 3 Ke Wang 2021-05-31 08:14:04 UTC
Did a check with the OpenShift cluster-bootstrap image and confirmed the two flags are already in the payload.

$ oc adm release info registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-05-29-114625 --pullspecs | grep cluster-bootstrap
  cluster-bootstrap                              quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05446c3cf72252eeef8dbfa76d6d28da5e6d6a932b0e2ec2d66146e3f21b924e

$ docker run -it --entrypoint bash quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05446c3cf72252eeef8dbfa76d6d28da5e6d6a932b0e2ec2d66146e3f21b924e

[root@7773a3b248e3 /]# ./cluster-bootstrap start -h | grep -E "\-\-tear-down-delay|\-\-tear-down-termination-timeout"
      --tear-down-delay duration                 duration to delay the bootstrap control-plane tear-down before bootstrap-success event is created, in order to give load-balancers time to observe the self-hosted control-plane. This even applies in case of --tear-down-early.
      --tear-down-termination-timeout duration   wait of (graceful) termination of the bootstrap control-plane before reporting success. Set to zero to disable.
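
To make the second flag's help text concrete, here is a small illustrative Go sketch of the described semantics (wait for graceful termination, bounded by the timeout, with zero disabling the wait); the names and values are hypothetical and this is not the actual cluster-bootstrap code:

package main

import (
	"fmt"
	"time"
)

// waitForTermination blocks until done is closed or timeout elapses.
// A zero timeout disables the wait entirely, mirroring the flag's help text.
func waitForTermination(done <-chan struct{}, timeout time.Duration) bool {
	if timeout == 0 {
		return true // wait disabled; report success immediately
	}
	select {
	case <-done:
		return true // control plane terminated gracefully in time
	case <-time.After(timeout):
		return false // gave up waiting rather than blocking success forever
	}
}

func main() {
	done := make(chan struct{})
	go func() { time.Sleep(time.Second); close(done) }()
	fmt.Println("terminated gracefully:", waitForTermination(done, 5*time.Second))
}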

Comment 6 errata-xmlrpc 2021-07-27 23:02:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

