Bug 1951639

Summary:	Bootstrap API server unclean shutdown causes reconcile delay
Product:	OpenShift Container Platform	Reporter:	Michael Gugino <mgugino>
Component:	kube-apiserver	Assignee:	Stefan Schimanski <sttts>
Status:	CLOSED ERRATA	QA Contact:	Ke Wang <kewang>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.8	CC:	aos-bugs, kewang, mfojtik, wking, xxia
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 23:02:11 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Michael Gugino 2021-04-20 15:49:33 UTC

Description of problem:

During installation, the bootstrap API server is shutdown in such a way that it appears to accept new connections for a time while shutting down.  At some point, connections are accepted and never terminated.  This leads to a delay in reconciliation for various operators delaying installation time and causing CI flakes.

Here are the details of one run for diagnosis:
https://bugzilla.redhat.com/show_bug.cgi?id=1940972#c5

That bug is about CSR approval, but the underlying case of unclean bootstrap API server shutdown could apply to any operator.

Comment 2 Ke Wang 2021-05-31 05:19:37 UTC

Talked to Dev, the PR code changes involved flags that are not exposed and not visible, it is hard to verify them from QE, need to call the two flags with installer in bootstrap node. Recently team installation tests didn't see regression issue related to this PR, so move the bug VERIFIED.

Comment 3 Ke Wang 2021-05-31 08:14:04 UTC

Did a check with cluster-bootstrap image of openshift, confirmed the two flags already in payload.

$ oc adm release info registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-05-29-114625 --pullspecs | grep cluster-bootstrap
  cluster-bootstrap                              quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05446c3cf72252eeef8dbfa76d6d28da5e6d6a932b0e2ec2d66146e3f21b924e

$ docker run -it --entrypoint bash quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05446c3cf72252eeef8dbfa76d6d28da5e6d6a932b0e2ec2d66146e3f21b924e

[root@7773a3b248e3 /]# ./cluster-bootstrap start -h | grep -E "\-\-tear-down-delay|\-\-tear-down-termination-timeout"
      --tear-down-delay duration                 duration to delay the bootstrap control-plane tear-down before bootstrap-success event is created, in order to give load-balancers time to observe the self-hosted control-plane. This even applies in case of --tear-down-early.
      --tear-down-termination-timeout duration   wait of (graceful) termination of the bootstrap control-plane before reporting success. Set to zero to disable.

Comment 6 errata-xmlrpc 2021-07-27 23:02:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438