Bug 2007324

Summary: race condition in cluster-bootstrap can cause crashlooping bootstrap kube-apiserver
Product: OpenShift Container Platform Reporter: Devan Goodwin <dgoodwin>
Component: Installer Assignee: aos-install
Installer sub component: openshift-installer QA Contact: Yunfei Jiang <yunjiang>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: aos-install, aramesha, deads, jhixson, rpattath, sdodson, wking, yunjiang
Version: 4.9   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2006945 Environment:
Last Closed: 2021-10-18 17:51:56 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2006945    
Bug Blocks:    

Description Devan Goodwin 2021-09-23 14:54:59 UTC
+++ This bug was initially created as a clone of Bug #2006945 +++

Version:

failed 4.9 installs in CI, examples: 

OVN: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1439917064897695744

SDN: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade/1440456016905900032

Platform: aws IPI

In the artifacts for these failed installs a log-bundle.tar is generated due to bootstrap failure.

In the bootkube.log you can see a large number of messages such as: 

bootstrap/journals/bootkube.log:Sep 21 23:44:37 ip-10-0-15-48 bootkube.sh[2293]: "0000_20_kube-apiserver-operator_00_cr-scc-hostaccess.yaml": unable to get REST mapping for "0000_20_kube-apiserver-operator_00_cr-scc-hostaccess.yaml": no matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1"

This is for a core type that should be present.
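
To gauge how widespread these messages are in a given log bundle, something like the following could be used. This is a minimal sketch, not part of the proposed fix: the function names are hypothetical, and the regular expression is modeled only on the single log line quoted above, so other variants of the message may not match.

```python
import re
from collections import Counter

# Pattern modeled on the one bootkube.log line quoted in this bug;
# this is an assumption about the message format, not a documented one.
REST_MAPPING_ERROR = re.compile(
    r'unable to get REST mapping for "(?P<manifest>[^"]+)": '
    r'no matches for kind "(?P<kind>[^"]+)" in version "(?P<version>[^"]+)"'
)


def count_rest_mapping_errors(lines):
    """Count REST-mapping failures per (kind, version) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = REST_MAPPING_ERROR.search(line)
        if match:
            counts[(match.group("kind"), match.group("version"))] += 1
    return counts


def summarize_log(path):
    """Hypothetical helper: read a log file and return the failure counts."""
    with open(path, errors="replace") as log:
        return count_rest_mapping_errors(log)
```

For example, calling summarize_log("bootstrap/journals/bootkube.log") on an extracted log bundle and printing the result's most_common() would show which kinds were reported missing most often. A high count for a core type such as ClusterRole would match the symptom described here.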

deads has prepared a:

proposed fix: https://github.com/openshift/library-go/pull/1219
proof pr: https://github.com/openshift/cluster-bootstrap/pull/64

Slack thread with deads and stts: https://coreos.slack.com/archives/C02FFM5PNSG/p1632319918039000

We are still trying to determine if these are why the installs are failing, but there's still a problem to be fixed here regardless.

Comment 5 Scott Dodson 2021-09-29 13:50:55 UTC
Yes, both of those tests (mentioned in private comments, why private?) used rc3, whereas the fixes for bootstrap went into 4.9.0-rc4.

Comment 6 Scott Dodson 2021-09-29 14:30:59 UTC
Needs to be tested with 4.9.0-rc.4 as the starting point for upgrading to 4.10. rc.4 was promoted on Monday.

Comment 7 Amogh Rameshappa Devapura 2021-10-05 20:44:08 UTC
Tried installation with 4.9.0-rc.4, did not encounter the bug. https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/46177/console

Did not observe any install failures with the same bug for the above version from October 1st

@yunjiang please do have a look when you are back and reopen if needed.

Comment 8 Yunfei Jiang 2021-10-08 02:40:22 UTC
Thanks Amogh. A one-time installation may not encounter this issue, so I will monitor the CI jobs for a while to see if it's still there.

Comment 10 Yunfei Jiang 2021-10-09 05:36:34 UTC
Searched CI logs and didn't encounter this issue in the last 7 days; setting to VERIFIED.

Comment 12 errata-xmlrpc 2021-10-18 17:51:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759