Bug 2007324 - race condition in cluster-bootstrap can cause crashlooping bootstrap kube-apiserver
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.9.0
Assignee: aos-install
QA Contact: Yunfei Jiang
Depends On: 2006945
Reported: 2021-09-23 14:54 UTC by Devan Goodwin
Modified: 2021-10-18 17:52 UTC

Clone Of: 2006945
Last Closed: 2021-10-18 17:51:56 UTC


System | ID | Private | Priority | Status | Summary | Last Updated
GitHub | openshift cluster-bootstrap pull 66 | 0 | None | Merged | [release-4.9] bug 2007324: update library-go for hardcoded restmapper | 2021-10-05 22:13:17 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:52:20 UTC

Description Devan Goodwin 2021-09-23 14:54:59 UTC
+++ This bug was initially created as a clone of Bug #2006945 +++


4.9 installs are failing in CI. Examples:

OVN: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1439917064897695744

SDN: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade/1440456016905900032

Platform: aws IPI

In the artifacts for these failed installs, a log-bundle.tar is generated because the bootstrap failed.

In the bootkube.log you can see a large number of messages such as: 

bootstrap/journals/bootkube.log:Sep 21 23:44:37 ip-10-0-15-48 bootkube.sh[2293]: "0000_20_kube-apiserver-operator_00_cr-scc-hostaccess.yaml": unable to get REST mapping for "0000_20_kube-apiserver-operator_00_cr-scc-hostaccess.yaml": no matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1"

ClusterRole is a core type that should always be present.
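
To gauge how widespread these messages are in a log bundle, something like the following Go sketch could be used. It only assumes the message format quoted above; the type and function names (MappingFailure, ParseMappingFailures) are made up for illustration and are not part of any OpenShift tooling.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// mappingErr matches the bootkube.sh message quoted above and captures
// the manifest file name, the kind, and the group/version.
var mappingErr = regexp.MustCompile(
	`unable to get REST mapping for "([^"]+)": no matches for kind "([^"]+)" in version "([^"]+)"`)

// MappingFailure is a hypothetical record for one matched log line.
type MappingFailure struct {
	Manifest, Kind, Version string
}

// ParseMappingFailures scans log text line by line and returns every
// REST-mapping failure it finds, in order of appearance.
func ParseMappingFailures(log string) []MappingFailure {
	var out []MappingFailure
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if m := mappingErr.FindStringSubmatch(sc.Text()); m != nil {
			out = append(out, MappingFailure{m[1], m[2], m[3]})
		}
	}
	return out
}

func main() {
	// Sample line copied from the bootkube.log excerpt in this report.
	sample := `Sep 21 23:44:37 ip-10-0-15-48 bootkube.sh[2293]: "0000_20_kube-apiserver-operator_00_cr-scc-hostaccess.yaml": unable to get REST mapping for "0000_20_kube-apiserver-operator_00_cr-scc-hostaccess.yaml": no matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1"`
	for _, f := range ParseMappingFailures(sample) {
		fmt.Printf("%s: %s (%s)\n", f.Manifest, f.Kind, f.Version)
	}
}
```

Grouping the results by Kind would show whether only core types such as ClusterRole are affected or whether every manifest fails to map.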

deads has prepared the following:

proposed fix: https://github.com/openshift/library-go/pull/1219
proof PR: https://github.com/openshift/cluster-bootstrap/pull/64

Slack thread with deads and stts: https://coreos.slack.com/archives/C02FFM5PNSG/p1632319918039000

We are still trying to determine whether these errors are why the installs are failing, but either way there is a problem to fix here.

Comment 5 Scott Dodson 2021-09-29 13:50:55 UTC
Yes, both of those tests (mentioned in private comments, why private?) used rc3, whereas the fixes for bootstrap went into 4.9.0-rc4.

Comment 6 Scott Dodson 2021-09-29 14:30:59 UTC
This needs to be tested with 4.9.0-rc.4 as the starting point for upgrading to 4.10. rc.4 was promoted on Monday.

Comment 7 Amogh Rameshappa Devapura 2021-10-05 20:44:08 UTC
Tried an installation with 4.9.0-rc.4 and did not encounter the bug. https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/46177/console

Did not observe any install failures with the same bug for the above version from October 1st.

@yunjiang please do have a look when you are back and reopen if needed.

Comment 8 Yunfei Jiang 2021-10-08 02:40:22 UTC
Thanks Amogh. A single installation may not hit this issue, so I will monitor the CI jobs for a while to see if it is still there.

Comment 10 Yunfei Jiang 2021-10-09 05:36:34 UTC
Searched the CI logs and did not encounter this issue in the last 7 days. Setting to VERIFIED.

Comment 12 errata-xmlrpc 2021-10-18 17:51:56 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

