Bug 1951639 - Bootstrap API server unclean shutdown causes reconcile delay
Summary: Bootstrap API server unclean shutdown causes reconcile delay
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-20 15:49 UTC by Michael Gugino
Modified: 2021-07-27 23:02 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:02:11 UTC
Target Upstream Version:




Links
Github openshift/cluster-bootstrap pull 58 (open): Bug 1951639: Add --tear-down-delay and --tear-down-termination-timeout (last updated 2021-04-21 07:39:07 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 23:02:26 UTC)

Description Michael Gugino 2021-04-20 15:49:33 UTC
Description of problem:

During installation, the bootstrap API server is shut down in such a way that it appears to accept new connections for a time while shutting down. At some point, connections are accepted but never terminated. This delays reconciliation for various operators, increasing installation time and causing CI flakes.

Here are the details of one run for diagnosis:
https://bugzilla.redhat.com/show_bug.cgi?id=1940972#c5

That bug is about CSR approval, but the underlying cause, an unclean bootstrap API server shutdown, could apply to any operator.
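
For context, the fix (PR 58 linked above) adds a delay before tear-down and a bounded wait for graceful termination. Below is a minimal illustrative Go sketch of that general pattern; it is not the cluster-bootstrap implementation, and the server address, durations, and function names are made up for illustration:

package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// tearDown sketches the pattern: wait so load balancers can observe the
// permanent control plane, then drain the bootstrap endpoint with a bound
// on how long graceful termination may take.
func tearDown(srv *http.Server, delay, terminationTimeout time.Duration) error {
	time.Sleep(delay) // give load balancers time to switch over

	ctx, cancel := context.WithTimeout(context.Background(), terminationTimeout)
	defer cancel()
	// Shutdown stops accepting new connections and waits for in-flight
	// requests to finish, but no longer than terminationTimeout.
	return srv.Shutdown(ctx)
}

func main() {
	srv := &http.Server{Addr: ":6443"}
	go func() { _ = srv.ListenAndServe() }()
	if err := tearDown(srv, 30*time.Second, 2*time.Minute); err != nil {
		log.Printf("tear-down did not complete cleanly: %v", err)
	}
}

Without the bounded wait, a hung connection on the bootstrap endpoint can stall tear-down indefinitely, which is the behavior described above.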

Comment 2 Ke Wang 2021-05-31 05:19:37 UTC
Talked to Dev: the PR's code changes involve flags that are not exposed or visible, so it is hard for QE to verify them directly; the two flags need to be passed by the installer on the bootstrap node. Recent team installation tests have not shown any regression related to this PR, so moving the bug to VERIFIED.

Comment 3 Ke Wang 2021-05-31 08:14:04 UTC
Did a check with the OpenShift cluster-bootstrap image and confirmed the two flags are already in the payload.

$ oc adm release info registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-05-29-114625 --pullspecs | grep cluster-bootstrap
  cluster-bootstrap                              quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05446c3cf72252eeef8dbfa76d6d28da5e6d6a932b0e2ec2d66146e3f21b924e

$ docker run -it --entrypoint bash quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05446c3cf72252eeef8dbfa76d6d28da5e6d6a932b0e2ec2d66146e3f21b924e

[root@7773a3b248e3 /]# ./cluster-bootstrap start -h | grep -E "\-\-tear-down-delay|\-\-tear-down-termination-timeout"
      --tear-down-delay duration                 duration to delay the bootstrap control-plane tear-down before bootstrap-success event is created, in order to give load-balancers time to observe the self-hosted control-plane. This even applies in case of --tear-down-early.
      --tear-down-termination-timeout duration   wait of (graceful) termination of the bootstrap control-plane before reporting success. Set to zero to disable.
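
To make the second flag's help text concrete, here is a small illustrative Go sketch of the described semantics (wait for graceful termination, bounded by the timeout, with zero disabling the wait); the names and values are hypothetical and this is not the actual cluster-bootstrap code:

package main

import (
	"fmt"
	"time"
)

// waitForTermination blocks until done is closed or timeout elapses.
// A zero timeout disables the wait entirely, mirroring the flag's help text.
func waitForTermination(done <-chan struct{}, timeout time.Duration) bool {
	if timeout == 0 {
		return true // wait disabled; report success immediately
	}
	select {
	case <-done:
		return true // control plane terminated gracefully in time
	case <-time.After(timeout):
		return false // gave up waiting rather than blocking success forever
	}
}

func main() {
	done := make(chan struct{})
	go func() { time.Sleep(time.Second); close(done) }()
	fmt.Println("terminated gracefully:", waitForTermination(done, 5*time.Second))
}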

Comment 6 errata-xmlrpc 2021-07-27 23:02:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

