Bug 1748452 - race condition on bootstrap node between openshift and bootkube services
Summary: race condition on bootstrap node between openshift and bootkube services
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.3.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 1731242
 
Reported: 2019-09-03 15:55 UTC by Marius Cornea
Modified: 2020-01-23 11:06 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:05:32 UTC
Target Upstream Version:
Embargoed:


Attachments
bootkube logs (1.52 MB, text/plain)
2019-09-13 15:31 UTC, Doug Hellmann


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift-kni install-scripts issues 55 | 0 | None | closed | 03_create_cluster.sh fails with msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed o... | 2021-01-21 20:53:42 UTC
Github openshift installer pull 1381 | 0 | None | closed | Bug 1748452: data/bootstrap: Replace openshift.sh with cluster-bootstrap | 2021-01-21 20:53:41 UTC
Red Hat Product Errata RHBA-2020:0062 | 0 | None | None | None | 2020-01-23 11:06:00 UTC

Description Marius Cornea 2019-09-03 15:55:28 UTC
Description of problem:

This issue was filed to keep track of https://github.com/openshift-kni/install-scripts/issues/55

Comment 1 Doug Hellmann 2019-09-09 15:45:10 UTC
I ran the installer again and captured the output of bootkube. I see:

Sep 06 18:20:42 localhost systemd[1]: Started Bootstrap a Kubernetes cluster.
Sep 06 18:20:43 localhost podman[2075]: 2019-09-06 18:20:43.861216889 +0000 UTC m=+0.239315627 container create beb1374fdac38de9ff24326a837f1ffe46f7b0ad63f677c931fc428416d26bd1 (image=registry.svc.ci.openshift.org/ocp/release sha256:9abcba8184b1221cf91438fcfa3c50da2f6813a2855ee793e5a419b6df15dc1f, name=adoring_driscoll)

This is followed by a bunch of additional messages about creating containers, then a series of messages describing types of manifests being generated and the lists of the manifest filenames, then:

Sep 06 18:31:27 localhost bootkube.sh[1955]: Starting temporary bootstrap control plane...

Then a series of messages about the things created by those manifests, along with some failures because kinds are not defined:

Sep 06 18:31:44 localhost bootkube.sh[1955]: [#38] failed to create some manifests:
Sep 06 18:31:44 localhost bootkube.sh[1955]: "0000_00_cluster-version-operator_01_clusteroperator.crd.yaml": unable to get REST mapping for "0000_00_cluster-version-operator_01_clusteroperator.crd.yaml": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"

Then near the end of the log I see:

Sep 06 18:35:02 localhost bootkube.sh[1955]: All self-hosted control plane components successfully started
Sep 06 18:35:02 localhost bootkube.sh[1955]: Sending bootstrap-success event.Waiting for remaining assets to be created.

Then there are a bunch of messages about skipping YAML files because things already exist, like:

Sep 06 18:35:32 localhost bootkube.sh[1955]: Skipped "0000_00_cluster-version-operator_00_namespace.yaml" namespaces.v1./openshift-cluster-version -n  as it already exists
Sep 06 18:35:32 localhost bootkube.sh[1955]: Skipped "0000_00_cluster-version-operator_01_clusteroperator.crd.yaml" customresourcedefinitions.v1beta1.apiextensions.k8s.io/clusteroperators.config.openshift.io -n  as it already exists

Then the logs end with:

Sep 06 18:36:00 localhost bootkube.sh[1955]: Tearing down temporary bootstrap control plane...
Sep 06 18:36:00 localhost bootkube.sh[1955]: bootkube.service complete

Meanwhile, the openshift service running on the bootstrap node is still trying to create the host resources.

So, it seems that bootkube is shutting down before we're done trying to use it.
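
To make the race concrete, here is a minimal sketch of the openshift.sh side (a simplification for illustration, not the actual installer script): it keeps retrying `kubectl create` for each generated manifest against the bootstrap API server, so once bootkube tears that server down, the remaining retries can never succeed.

```
# Simplified, hypothetical sketch of the openshift.sh retry loop; the real
# script lives in the installer's bootstrap assets and differs in detail.
for manifest in ./*.yaml; do
    echo "Creating object from file: ${manifest} ..."
    until kubectl create --filename "${manifest}"; do
        # This retries forever if bootkube.service has already torn down
        # the bootstrap API server.
        sleep 5
    done
    echo "Done creating object from file: ${manifest} ..."
done
```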

Comment 3 Doug Hellmann 2019-09-09 15:51:45 UTC
The underlying problem here is the way the installer works, so I have moved it into that component. I have a fix in process already.

Comment 5 W. Trevor King 2019-09-10 19:11:58 UTC
After Doug explained the issue to me, I've revived [1] to resolve the race.  And although my PR is old, the race was introduced by the new-in-4.2 shift to loopback kubeconfigs [2], so I think we may want to target 4.2 after all.  Although for some reason I can't find the Target Release drop-down...

[1]: https://github.com/openshift/installer/pull/1381
[2]: https://github.com/openshift/installer/pull/2086
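
Roughly, the idea of [1] is to drop the separate openshift.service and have bootkube.sh hand the openshift/ manifests to cluster-bootstrap itself, so the temporary control plane is only torn down after every asset has been pushed. A hedged sketch of the invocation follows; the image variable, mount paths, and required-pods list are assumptions for illustration, not the exact diff:

```
# Hedged sketch (not the PR's exact contents): run cluster-bootstrap from
# bootkube.sh so manifest creation happens before the tear-down step.
# CLUSTER_BOOTSTRAP_IMAGE, the /opt/openshift asset directory, and the
# --required-pods value are assumptions for illustration only.
podman run --rm --network host \
    --volume /opt/openshift:/assets:z \
    --volume /etc/kubernetes:/etc/kubernetes:z \
    "${CLUSTER_BOOTSTRAP_IMAGE}" \
    start --asset-dir=/assets \
    --required-pods=openshift-kube-apiserver/kube-apiserver
```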

Comment 6 Doug Hellmann 2019-09-10 20:04:03 UTC
Testing with PR #1381, it's not clear it fixes the issue. I get messages like this in the logs from bootkube.sh:

Sep 10 19:30:26 localhost bootkube.sh[24699]: Starting temporary bootstrap control plane...
Sep 10 19:30:26 localhost bootkube.sh[24699]: E0910 19:30:26.103835       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods: dial tcp [::1]:6443: connect: connection refused
Sep 10 19:30:26 localhost bootkube.sh[24699]: Assert creation failed: failed to load some manifests:
Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_master-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_master-user-data-secret.yaml" to unstructed
Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_worker-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_worker-user-data-secret.yaml" to unstructed
Sep 10 19:30:26 localhost bootkube.sh[24699]: Error: error while checking pod status: timed out waiting for the condition
Sep 10 19:30:26 localhost bootkube.sh[24699]: Tearing down temporary bootstrap control plane...
Sep 10 19:30:26 localhost bootkube.sh[24699]: Error: error while checking pod status: timed out waiting for the condition

Comment 9 liujia 2019-09-12 06:12:09 UTC
Following our official install process (UPI/vSphere), QE cannot reproduce the issue on an older version, 4.2.0-0.nightly-2019-09-05-234433, since the build 4.2.0-0.nightly-2019-08-29-062233 mentioned in the description is no longer available. During the bootstrap stage we monitored the logs of the openshift and bootkube services.
...
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: Creating object from file: ./99_role-cloud-creds-secret-reader.yaml ...
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: Executing kubectl create --filename ./99_role-cloud-creds-secret-reader.yaml
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: role.rbac.authorization.k8s.io/vsphere-creds-secret-reader created
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: Done creating object from file: ./99_role-cloud-creds-secret-reader.yaml ...
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: OpenShift installation is done

...
Sep 12 03:08:07 bootstrap-0 bootkube.sh[1272]: Tearing down temporary bootstrap control plane...
Sep 12 03:08:08 bootstrap-0 bootkube.sh[1272]: bootkube.service complete
The above shows that the openshift service finishes before bootkube completes.

Also tried the latest 4.2.0-0.nightly-2019-09-11-202233 and did not hit the issue either.

Going through the email in comment 2, it seems some manifest files were changed before installation, which likely caused the services to behave abnormally. That sounds like a requirement for the KNI team, but for users following the official documentation it will not block setup on vSphere.
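
For anyone repeating this check, one way to tail both units on the bootstrap node during the bootstrap stage (the hostname and user below are illustrative):

```
# Follow both services' journals; journalctl accepts multiple -u flags.
ssh core@bootstrap-0 'journalctl -b -f -u bootkube.service -u openshift.service'
```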

Comment 13 W. Trevor King 2019-09-13 12:55:06 UTC
> Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_master-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_master-user-data-secret.yaml" to unstructed

This is now fixed in my PR.

Comment 14 Doug Hellmann 2019-09-13 14:42:27 UTC
(In reply to W. Trevor King from comment #13)
> > Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_master-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_master-user-data-secret.yaml" to unstructed
> 
> This is now fixed in my PR.

Testing locally with the latest version of the PR, I see messages like:

Sep 13 14:37:45 localhost bootkube.sh[1954]: "99_openshift-cluster-api_hosts-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-0.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"
Sep 13 14:37:45 localhost bootkube.sh[1954]: "99_openshift-cluster-api_hosts-1.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-1.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"
Sep 13 14:37:45 localhost bootkube.sh[1954]: "99_openshift-cluster-api_hosts-2.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-2.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"

These are repeated many, many times in the bootkube.sh logs.

The BareMetalHost CRD is defined by the machine-api-operator, so I will try adding that to the list of services that cluster-bootstrap waits for.
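
To illustrate the ordering problem (this is not the actual cluster-bootstrap change), the hosts manifests can only get a REST mapping once the BareMetalHost CRD has been registered and established, so any consumer would have to wait for the CRD first, for example:

```
# Wait until the BareMetalHost CRD reports the Established condition, then
# create one of the host manifests; the filename is taken from the logs above.
oc wait --for=condition=Established --timeout=300s crd/baremetalhosts.metal3.io
oc create -f 99_openshift-cluster-api_hosts-0.yaml
```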

Comment 15 W. Trevor King 2019-09-13 15:25:41 UTC
Attach your bootkube.service log?  I expected bootstrap-service would wait until it had pushed all its manifests before shutting down the bootstrap API server, but maybe I'm wrong.

Comment 16 Doug Hellmann 2019-09-13 15:25:41 UTC
It looks like the issue with the host files may be a bug in the machine-api-operator. https://github.com/openshift/machine-api-operator/issues/397

Comment 17 Doug Hellmann 2019-09-13 15:31:51 UTC
Created attachment 1614912 [details]
bootkube logs

Comment 18 W. Trevor King 2019-09-13 15:48:39 UTC
Looks like the temporary bootstrap control plane was still running:

```
$ grep 'temporary bootstrap' /tmp/bootkube.log 
Sep 13 15:00:33 localhost bootkube.sh[1959]: Starting temporary bootstrap control plane...
```

So yeah, might just be the machine-api-operator issue.

Comment 26 liujia 2019-10-31 02:40:03 UTC
Checked the latest 4.3 builds; there are no available nightly builds that pass the CI test now, so this bug is still blocked.

Comment 27 W. Trevor King 2019-10-31 16:13:05 UTC
> Checked the latest 4.3 builds; there are no available nightly builds that pass the CI test now.

There aren't?  [1] has lots of green 4.3 nightlies, most recently [2,3].  Which has:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release-nightly@sha256:d1800917f0e983fc617afd78622e2851eff9cfb4d15e9482a3b3ca14da756504 | grep installer
  baremetal-installer                           https://github.com/openshift/installer                                     f2ac89df630f6ae6efb91e8a6f01c5a120389942
  installer                                     https://github.com/openshift/installer                                     f2ac89df630f6ae6efb91e8a6f01c5a120389942
  installer-artifacts                           https://github.com/openshift/installer                                     f2ac89df630f6ae6efb91e8a6f01c5a120389942

and that includes the change for this bug:

$ git log --oneline f2ac89df630f6 | grep 'data/bootstrap: Replace openshift.sh with cluster-bootstrap'
108a45bdb data/bootstrap: Replace openshift.sh with cluster-bootstrap

[1]: https://openshift-release.svc.ci.openshift.org/#4.3.0-0.nightly
[2]: https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-10-31-085222
[3]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.3.0-0.nightly-2019-10-31-085222/

Comment 28 liujia 2019-11-04 01:41:11 UTC
(In reply to W. Trevor King from comment #27)
> > Checked the latest 4.3 builds; there are no available nightly builds that pass the CI test now.
> 
> There aren't?  [1] has lots of green 4.3 nightlies, most recently [2,3]. 
> Which has:
> 
I mean an available nightly build that passes the CI test against the UPI/vSphere platform. According to comment 9 and comment 24, we need to do verification on UPI/vSphere, right?

Comment 29 liujia 2019-11-25 03:09:04 UTC
According to comment 24, ran a regression test against 4.3.0-0.nightly-2019-11-22-050018. Installation on UPI/vSphere succeeded.

Comment 31 errata-xmlrpc 2020-01-23 11:05:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

