Bug 1951639

Summary: Bootstrap API server unclean shutdown causes reconcile delay
Product: OpenShift Container Platform Reporter: Michael Gugino <mgugino>
Component: kube-apiserverAssignee: Stefan Schimanski <sttts>
Status: CLOSED ERRATA QA Contact: Ke Wang <kewang>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.8CC: aos-bugs, kewang, mfojtik, wking, xxia
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 23:02:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Michael Gugino 2021-04-20 15:49:33 UTC
Description of problem:

During installation, the bootstrap API server is shutdown in such a way that it appears to accept new connections for a time while shutting down.  At some point, connections are accepted and never terminated.  This leads to a delay in reconciliation for various operators delaying installation time and causing CI flakes.

Here are the details of one run for diagnosis:
https://bugzilla.redhat.com/show_bug.cgi?id=1940972#c5

That bug is about CSR approval, but the underlying case of unclean bootstrap API server shutdown could apply to any operator.

Comment 2 Ke Wang 2021-05-31 05:19:37 UTC
Talked to Dev, the PR code changes involved flags that are not exposed and not visible, it is hard to verify them from QE, need to call the two flags with installer in bootstrap node. Recently team installation tests didn't see regression issue related to this PR, so move the bug VERIFIED.

Comment 3 Ke Wang 2021-05-31 08:14:04 UTC
Did a check with cluster-bootstrap image of openshift, confirmed the two flags already in payload.

$ oc adm release info registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-05-29-114625 --pullspecs | grep cluster-bootstrap
  cluster-bootstrap                              quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05446c3cf72252eeef8dbfa76d6d28da5e6d6a932b0e2ec2d66146e3f21b924e

$ docker run -it --entrypoint bash quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05446c3cf72252eeef8dbfa76d6d28da5e6d6a932b0e2ec2d66146e3f21b924e

[root@7773a3b248e3 /]# ./cluster-bootstrap start -h | grep -E "\-\-tear-down-delay|\-\-tear-down-termination-timeout"
      --tear-down-delay duration                 duration to delay the bootstrap control-plane tear-down before bootstrap-success event is created, in order to give load-balancers time to observe the self-hosted control-plane. This even applies in case of --tear-down-early.
      --tear-down-termination-timeout duration   wait of (graceful) termination of the bootstrap control-plane before reporting success. Set to zero to disable.

Comment 6 errata-xmlrpc 2021-07-27 23:02:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438