Bug 1748452 - race condition on bootstrap node between openshift and bootkube services
Summary: race condition on bootstrap node between openshift and bootkube services
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.3.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 1731242
 
Reported: 2019-09-03 15:55 UTC by Marius Cornea
Modified: 2020-01-23 11:06 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:05:32 UTC
Target Upstream Version:
Embargoed:


Attachments
bootkube logs (1.52 MB, text/plain)
2019-09-13 15:31 UTC, Doug Hellmann


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift-kni install-scripts issues 55 | 0 | None | closed | 03_create_cluster.sh fails with msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed o... | 2021-01-21 20:53:42 UTC
Github openshift installer pull 1381 | 0 | None | closed | Bug 1748452: data/bootstrap: Replace openshift.sh with cluster-bootstrap | 2021-01-21 20:53:41 UTC
Red Hat Product Errata RHBA-2020:0062 | 0 | None | None | None | 2020-01-23 11:06:00 UTC

Description Marius Cornea 2019-09-03 15:55:28 UTC
Description of problem:

This issue was filed to keep track of https://github.com/openshift-kni/install-scripts/issues/55

Comment 1 Doug Hellmann 2019-09-09 15:45:10 UTC
I ran the installer again and captured the output of bootkube. I see:

Sep 06 18:20:42 localhost systemd[1]: Started Bootstrap a Kubernetes cluster.
Sep 06 18:20:43 localhost podman[2075]: 2019-09-06 18:20:43.861216889 +0000 UTC m=+0.239315627 container create beb1374fdac38de9ff24326a837f1ffe46f7b0ad63f677c931fc428416d26bd1 (image=registry.svc.ci.openshift.org/ocp/release sha256:9abcba8184b1221cf91438fcfa3c50da2f6813a2855ee793e5a419b6df15dc1f, name=adoring_driscoll)

This is followed by a bunch of additional messages about creating containers, then a series of messages describing types of manifests being generated and the lists of the manifest filenames, then:

Sep 06 18:31:27 localhost bootkube.sh[1955]: Starting temporary bootstrap control plane...

Then a series of messages about the things created by those manifests, along with some failures because kinds are not defined:

Sep 06 18:31:44 localhost bootkube.sh[1955]: [#38] failed to create some manifests:
Sep 06 18:31:44 localhost bootkube.sh[1955]: "0000_00_cluster-version-operator_01_clusteroperator.crd.yaml": unable to get REST mapping for "0000_00_cluster-version-operator_01_clusteroperator.crd.yaml": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"

Then near the end of the log I see:

Sep 06 18:35:02 localhost bootkube.sh[1955]: All self-hosted control plane components successfully started
Sep 06 18:35:02 localhost bootkube.sh[1955]: Sending bootstrap-success event.Waiting for remaining assets to be created.

Then there are a bunch of messages about skipping YAML files because things already exist, like:

Sep 06 18:35:32 localhost bootkube.sh[1955]: Skipped "0000_00_cluster-version-operator_00_namespace.yaml" namespaces.v1./openshift-cluster-version -n  as it already exists
Sep 06 18:35:32 localhost bootkube.sh[1955]: Skipped "0000_00_cluster-version-operator_01_clusteroperator.crd.yaml" customresourcedefinitions.v1beta1.apiextensions.k8s.io/clusteroperators.config.openshift.io -n  as it already exists

Then the logs end with:

Sep 06 18:36:00 localhost bootkube.sh[1955]: Tearing down temporary bootstrap control plane...
Sep 06 18:36:00 localhost bootkube.sh[1955]: bootkube.service complete

Meanwhile, the openshift service running on the bootstrap node is still trying to create the host resources.

So, it seems that bootkube is shutting down before we're done trying to use it.
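
To make the race concrete, here is a minimal sketch of the openshift.sh side (a simplification for illustration, not the actual installer script): it keeps retrying `kubectl create` for each generated manifest against the bootstrap API server, so once bootkube tears that server down, the remaining retries can never succeed.

```
# Simplified, hypothetical sketch of the openshift.sh retry loop; the real
# script lives in the installer's bootstrap assets and differs in detail.
for manifest in ./*.yaml; do
    echo "Creating object from file: ${manifest} ..."
    until kubectl create --filename "${manifest}"; do
        # This retries forever if bootkube.service has already torn down
        # the bootstrap API server.
        sleep 5
    done
    echo "Done creating object from file: ${manifest} ..."
done
```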

Comment 3 Doug Hellmann 2019-09-09 15:51:45 UTC
The underlying problem here is the way the installer works, so I have moved it into that component. I have a fix in process already.

Comment 5 W. Trevor King 2019-09-10 19:11:58 UTC
After Doug explained the issue to me, I've revived [1] to resolve the race.  And although my PR is old, the race was introduced by the new-in-4.2 shift to loopback kubeconfigs [2], so I think we may want to target 4.2 after all.  Although for some reason I can't find the Target Release drop-down...

[1]: https://github.com/openshift/installer/pull/1381
[2]: https://github.com/openshift/installer/pull/2086
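
Roughly, the idea of [1] is to drop the separate openshift.service and have bootkube.sh hand the openshift/ manifests to cluster-bootstrap itself, so the temporary control plane is only torn down after every asset has been pushed. A hedged sketch of the invocation follows; the image variable, mount paths, and required-pods list are assumptions for illustration, not the exact diff:

```
# Hedged sketch (not the PR's exact contents): run cluster-bootstrap from
# bootkube.sh so manifest creation happens before the tear-down step.
# CLUSTER_BOOTSTRAP_IMAGE, the /opt/openshift asset directory, and the
# --required-pods value are assumptions for illustration only.
podman run --rm --network host \
    --volume /opt/openshift:/assets:z \
    --volume /etc/kubernetes:/etc/kubernetes:z \
    "${CLUSTER_BOOTSTRAP_IMAGE}" \
    start --asset-dir=/assets \
    --required-pods=openshift-kube-apiserver/kube-apiserver
```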

Comment 6 Doug Hellmann 2019-09-10 20:04:03 UTC
Testing with PR #1381, it's not clear it fixes the issue. I get messages like this in the logs from bootkube.sh:

Sep 10 19:30:26 localhost bootkube.sh[24699]: Starting temporary bootstrap control plane...
Sep 10 19:30:26 localhost bootkube.sh[24699]: E0910 19:30:26.103835       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods: dial tcp [::1]:6443: connect: connection refused
Sep 10 19:30:26 localhost bootkube.sh[24699]: Assert creation failed: failed to load some manifests:
Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_master-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_master-user-data-secret.yaml" to unstructed
Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_worker-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_worker-user-data-secret.yaml" to unstructed
Sep 10 19:30:26 localhost bootkube.sh[24699]: Error: error while checking pod status: timed out waiting for the condition
Sep 10 19:30:26 localhost bootkube.sh[24699]: Tearing down temporary bootstrap control plane...
Sep 10 19:30:26 localhost bootkube.sh[24699]: Error: error while checking pod status: timed out waiting for the condition

Comment 9 liujia 2019-09-12 06:12:09 UTC
Following our official install process (UPI/vSphere), QE cannot reproduce the issue on an older version, 4.2.0-0.nightly-2019-09-05-234433, since the build 4.2.0-0.nightly-2019-08-29-062233 mentioned in the description is no longer available. During the bootstrap stage we monitored the logs of the openshift and bootkube services.
...
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: Creating object from file: ./99_role-cloud-creds-secret-reader.yaml ...
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: Executing kubectl create --filename ./99_role-cloud-creds-secret-reader.yaml
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: role.rbac.authorization.k8s.io/vsphere-creds-secret-reader created
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: Done creating object from file: ./99_role-cloud-creds-secret-reader.yaml ...
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: OpenShift installation is done

...
Sep 12 03:08:07 bootstrap-0 bootkube.sh[1272]: Tearing down temporary bootstrap control plane...
Sep 12 03:08:08 bootstrap-0 bootkube.sh[1272]: bootkube.service complete
The above shows that the openshift service finishes before bootkube completes.

Also tried the latest 4.2.0-0.nightly-2019-09-11-202233 and did not hit the issue either.

Going through the email in comment 2, it seems some manifest files were changed before installation, which likely caused the services to behave abnormally. That sounds like a requirement for the KNI team, but for users following the official documentation it will not block setup on vSphere.
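
For anyone repeating this check, one way to tail both units on the bootstrap node during the bootstrap stage (the hostname and user below are illustrative):

```
# Follow both services' journals; journalctl accepts multiple -u flags.
ssh core@bootstrap-0 'journalctl -b -f -u bootkube.service -u openshift.service'
```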

Comment 13 W. Trevor King 2019-09-13 12:55:06 UTC
> Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_master-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_master-user-data-secret.yaml" to unstructed

This is now fixed in my PR.

Comment 14 Doug Hellmann 2019-09-13 14:42:27 UTC
(In reply to W. Trevor King from comment #13)
> > Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_master-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_master-user-data-secret.yaml" to unstructed
> 
> This is now fixed in my PR.

Testing locally with the latest version of the PR, I see messages like:

Sep 13 14:37:45 localhost bootkube.sh[1954]: "99_openshift-cluster-api_hosts-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-0.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"
Sep 13 14:37:45 localhost bootkube.sh[1954]: "99_openshift-cluster-api_hosts-1.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-1.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"
Sep 13 14:37:45 localhost bootkube.sh[1954]: "99_openshift-cluster-api_hosts-2.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-2.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"

These are repeated many, many times in the bootkube.sh logs.

The BareMetalHost CRD is defined by the machine-api-operator, so I will try adding that to the list of services that cluster-bootstrap waits for.
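
To illustrate the ordering problem (this is not the actual cluster-bootstrap change), the hosts manifests can only get a REST mapping once the BareMetalHost CRD has been registered and established, so any consumer would have to wait for the CRD first, for example:

```
# Wait until the BareMetalHost CRD reports the Established condition, then
# create one of the host manifests; the filename is taken from the logs above.
oc wait --for=condition=Established --timeout=300s crd/baremetalhosts.metal3.io
oc create -f 99_openshift-cluster-api_hosts-0.yaml
```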

Comment 15 W. Trevor King 2019-09-13 15:25:41 UTC
Attach your bootkube.service log?  I expected bootstrap-service would wait until it had pushed all its manifests before shutting down the bootstrap API server, but maybe I'm wrong.

Comment 16 Doug Hellmann 2019-09-13 15:25:41 UTC
It looks like the issue with the host files may be a bug in the machine-api-operator. https://github.com/openshift/machine-api-operator/issues/397

Comment 17 Doug Hellmann 2019-09-13 15:31:51 UTC
Created attachment 1614912 [details]
bootkube logs

Comment 18 W. Trevor King 2019-09-13 15:48:39 UTC
Looks like the temporary bootstrap control plane was still running:

```
$ grep 'temporary bootstrap' /tmp/bootkube.log 
Sep 13 15:00:33 localhost bootkube.sh[1959]: Starting temporary bootstrap control plane...
```

So yeah, might just be the machine-api-operator issue.

Comment 26 liujia 2019-10-31 02:40:03 UTC
Checked the latest 4.3 builds; there are no available nightly builds that pass the CI test now, so this bug is still blocked.

Comment 27 W. Trevor King 2019-10-31 16:13:05 UTC
> Checked the latest 4.3 builds; there are no available nightly builds that pass the CI test now.

There aren't?  [1] has lots of green 4.3 nightlies, most recently [2,3].  Which has:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release-nightly@sha256:d1800917f0e983fc617afd78622e2851eff9cfb4d15e9482a3b3ca14da756504 | grep installer
  baremetal-installer                           https://github.com/openshift/installer                                     f2ac89df630f6ae6efb91e8a6f01c5a120389942
  installer                                     https://github.com/openshift/installer                                     f2ac89df630f6ae6efb91e8a6f01c5a120389942
  installer-artifacts                           https://github.com/openshift/installer                                     f2ac89df630f6ae6efb91e8a6f01c5a120389942

and that includes the change for this bug:

$ git log --oneline f2ac89df630f6 | grep 'data/bootstrap: Replace openshift.sh with cluster-bootstrap'
108a45bdb data/bootstrap: Replace openshift.sh with cluster-bootstrap

[1]: https://openshift-release.svc.ci.openshift.org/#4.3.0-0.nightly
[2]: https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-10-31-085222
[3]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.3.0-0.nightly-2019-10-31-085222/

Comment 28 liujia 2019-11-04 01:41:11 UTC
(In reply to W. Trevor King from comment #27)
> > Checked the latest 4.3 builds; there are no available nightly builds that pass the CI test now.
> 
> There aren't?  [1] has lots of green 4.3 nightlies, most recently [2,3]. 
> Which has:
> 
I mean an available nightly build that passes the CI test against the UPI/vSphere platform. According to comment 9 and comment 24, we need to do verification on UPI/vSphere, right?

Comment 29 liujia 2019-11-25 03:09:04 UTC
According to comment 24, ran a regression test against 4.3.0-0.nightly-2019-11-22-050018. Installation on UPI/vSphere succeeded.

Comment 31 errata-xmlrpc 2020-01-23 11:05:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

