Description of problem: This issue was filed to keep track of https://github.com/openshift-kni/install-scripts/issues/55
I ran the installer again and captured the output of bootkube. I see:

Sep 06 18:20:42 localhost systemd[1]: Started Bootstrap a Kubernetes cluster.
Sep 06 18:20:43 localhost podman[2075]: 2019-09-06 18:20:43.861216889 +0000 UTC m=+0.239315627 container create beb1374fdac38de9ff24326a837f1ffe46f7b0ad63f677c931fc428416d26bd1 (image=registry.svc.ci.openshift.org/ocp/release sha256:9abcba8184b1221cf91438fcfa3c50da2f6813a2855ee793e5a419b6df15dc1f, name=adoring_driscoll)

This is followed by a bunch of additional messages about creating containers, then a series of messages describing the types of manifests being generated and the lists of manifest filenames, then:

Sep 06 18:31:27 localhost bootkube.sh[1955]: Starting temporary bootstrap control plane...

Then a series of messages about the things created by those manifests, along with some failures because kinds are not defined:

Sep 06 18:31:44 localhost bootkube.sh[1955]: [#38] failed to create some manifests:
Sep 06 18:31:44 localhost bootkube.sh[1955]: "0000_00_cluster-version-operator_01_clusteroperator.crd.yaml": unable to get REST mapping for "0000_00_cluster-version-operator_01_clusteroperator.crd.yaml": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"

Then near the end of the log I see:

Sep 06 18:35:02 localhost bootkube.sh[1955]: All self-hosted control plane components successfully started
Sep 06 18:35:02 localhost bootkube.sh[1955]: Sending bootstrap-success event.Waiting for remaining assets to be created.
Then there are a bunch of messages about skipping YAML files because things already exist, like:

Sep 06 18:35:32 localhost bootkube.sh[1955]: Skipped "0000_00_cluster-version-operator_00_namespace.yaml" namespaces.v1./openshift-cluster-version -n as it already exists
Sep 06 18:35:32 localhost bootkube.sh[1955]: Skipped "0000_00_cluster-version-operator_01_clusteroperator.crd.yaml" customresourcedefinitions.v1beta1.apiextensions.k8s.io/clusteroperators.config.openshift.io -n as it already exists

Then the logs end with:

Sep 06 18:36:00 localhost bootkube.sh[1955]: Tearing down temporary bootstrap control plane...
Sep 06 18:36:00 localhost bootkube.sh[1955]: bootkube.service complete

Meanwhile, the openshift service running on the bootstrap node is still trying to create the host resources. So it seems that bootkube is shutting down before we are done trying to use it.
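Given journal output saved off the bootstrap node, the ordering can be checked mechanically. A minimal sketch of that check — the milestone strings come from the logs above, but the script itself (and its embedded sample lines) is illustrative, not part of any OpenShift tooling:

```shell
#!/bin/sh
# Detect the race: did bootkube tear down before openshift.sh reported
# "OpenShift installation is done"? The inline sample stands in for real
# journal output captured on the bootstrap node.
log='Sep 06 18:35:40 localhost openshift.sh[1960]: Creating object from file: ./99_example.yaml
Sep 06 18:36:00 localhost bootkube.sh[1955]: Tearing down temporary bootstrap control plane...'

# Journal line number of each milestone; empty if it never happened.
done_line=$(printf '%s\n' "$log" | grep -n 'OpenShift installation is done' | cut -d: -f1)
teardown_line=$(printf '%s\n' "$log" | grep -n 'Tearing down temporary bootstrap' | cut -d: -f1)

# If the "done" milestone is missing, or teardown came first, we hit the race.
if [ -z "$done_line" ] || [ "$teardown_line" -lt "$done_line" ]; then
    verdict="race: bootkube tore down before openshift.sh finished"
else
    verdict="ok: openshift.sh finished before teardown"
fi
echo "$verdict"
```

Against a real node, one would feed this something like `journalctl -u bootkube.service` output plus the openshift service's journal instead of the inline sample (exact unit names may differ per deployment).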
The underlying problem here is in how the installer works, so I have moved this bug into that component. I have a fix in progress already.
After Doug explained the issue to me, I've revived [1] to resolve the race. And although my PR is old, the race was introduced by the new-in-4.2 shift to loopback kubeconfigs [2], so I think we may want to target 4.2 after all. Although for some reason I can't find the Target Release drop-down...

[1]: https://github.com/openshift/installer/pull/1381
[2]: https://github.com/openshift/installer/pull/2086
Testing with PR #1381, it's not clear it fixes the issue. I get messages like this in the logs from bootkube.sh:

Sep 10 19:30:26 localhost bootkube.sh[24699]: Starting temporary bootstrap control plane...
Sep 10 19:30:26 localhost bootkube.sh[24699]: E0910 19:30:26.103835 1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods: dial tcp [::1]:6443: connect: connection refused
Sep 10 19:30:26 localhost bootkube.sh[24699]: Assert creation failed: failed to load some manifests:
Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_master-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_master-user-data-secret.yaml" to unstructed
Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_worker-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_worker-user-data-secret.yaml" to unstructed
Sep 10 19:30:26 localhost bootkube.sh[24699]: Error: error while checking pod status: timed out waiting for the condition
Sep 10 19:30:26 localhost bootkube.sh[24699]: Tearing down temporary bootstrap control plane...
Sep 10 19:30:26 localhost bootkube.sh[24699]: Error: error while checking pod status: timed out waiting for the condition
Following our official install process (upi/vsphere), QE cannot reproduce the issue on an older version, 4.2.0-0.nightly-2019-09-05-234433, since the build 4.2.0-0.nightly-2019-08-29-062233 mentioned in the description is no longer available. During the bootstrap stage, we monitored the logs of the openshift and bootkube services:

...
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: Creating object from file: ./99_role-cloud-creds-secret-reader.yaml
...
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: Executing kubectl create --filename ./99_role-cloud-creds-secret-reader.yaml
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: role.rbac.authorization.k8s.io/vsphere-creds-secret-reader created
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: Done creating object from file: ./99_role-cloud-creds-secret-reader.yaml
...
Sep 12 03:03:55 bootstrap-0 openshift.sh[1273]: OpenShift installation is done
...
Sep 12 03:08:07 bootstrap-0 bootkube.sh[1272]: Tearing down temporary bootstrap control plane...
Sep 12 03:08:08 bootstrap-0 bootkube.sh[1272]: bootkube.service complete

The above shows that openshift finishes before bootkube completes. We also tried the latest 4.2.0-0.nightly-2019-09-11-202233 and did not hit the issue. Going through the email in comment 2, it seems some changes were made to the manifest files before installation, which could cause the services to behave abnormally. That sounds like a must-fix for the KNI team, but for users following the official doc, it will not block setup on vSphere.
> Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_master-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_master-user-data-secret.yaml" to unstructed

This is now fixed in my PR.
(In reply to W. Trevor King from comment #13)
> > Sep 10 19:30:26 localhost bootkube.sh[24699]: "99_openshift-cluster-api_master-user-data-secret.yaml": unable to convert asset "99_openshift-cluster-api_master-user-data-secret.yaml" to unstructed
>
> This is now fixed in my PR.

Testing locally with the latest version of the PR, I see messages like:

Sep 13 14:37:45 localhost bootkube.sh[1954]: "99_openshift-cluster-api_hosts-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-0.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"
Sep 13 14:37:45 localhost bootkube.sh[1954]: "99_openshift-cluster-api_hosts-1.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-1.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"
Sep 13 14:37:45 localhost bootkube.sh[1954]: "99_openshift-cluster-api_hosts-2.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-2.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"

These are repeated many, many times in the bootkube.sh logs. The BareMetalHost CRD is defined by the machine-api-operator, so I will try adding that to the list of services that cluster-bootstrap waits for.
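Those repeats are the manifest-creation pass being retried until the kind's REST mapping appears, i.e. until something registers the CRD. A hand-rolled sketch of that retry pattern — this is illustrative shell, not the actual cluster-bootstrap code, and the file-based "CRD registered" signal is a simulation:

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempt budget runs out --
# the pattern behind the repeated "no matches for kind" messages.
retry_until() {
    budget=$1; shift
    n=0
    until "$@"; do
        n=$((n + 1))
        if [ "$n" -ge "$budget" ]; then
            return 1
        fi
        sleep 1  # the real retry interval is longer
    done
}

marker=/tmp/crd_registered.$$
rm -f "$marker"

# Stand-in for "create the BareMetalHost manifests": fails until the
# (simulated) CRD has been registered.
create_host_manifests() {
    [ -f "$marker" ]
}

# Simulate an operator registering the CRD a couple of seconds later.
(sleep 2; touch "$marker") &

if retry_until 10 create_host_manifests; then
    result="manifests created"
else
    result="gave up: no matches for kind BareMetalHost"
fi
rm -f "$marker"
echo "$result"
```

The retries only converge if whatever registers the CRD is still able to run, which is why making cluster-bootstrap wait for the machine-api-operator (rather than tearing down first) matters here.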
Can you attach your bootkube.service log? I expected the bootstrap service would wait until it had pushed all its manifests before shutting down the bootstrap API server, but maybe I'm wrong.
It looks like the issue with the host files may be a bug in the machine-api-operator. https://github.com/openshift/machine-api-operator/issues/397
Created attachment 1614912 [details] bootkube logs
Looks like the temporary bootstrap control plane was still running:
```
$ grep 'temporary bootstrap' /tmp/bootkube.log
Sep 13 15:00:33 localhost bootkube.sh[1959]: Starting temporary bootstrap control plane...
```
So yeah, might just be the machine-api-operator issue.
Checked the latest 4.3 builds; there are no available nightly builds that pass the CI tests right now, so this bug is still blocked.
> Check latest 4.3 builds, there are not available nightly builds which pass the ci-test now.

There aren't? [1] has lots of green 4.3 nightlies, most recently [2,3]. Which has:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release-nightly@sha256:d1800917f0e983fc617afd78622e2851eff9cfb4d15e9482a3b3ca14da756504 | grep installer
baremetal-installer   https://github.com/openshift/installer   f2ac89df630f6ae6efb91e8a6f01c5a120389942
installer             https://github.com/openshift/installer   f2ac89df630f6ae6efb91e8a6f01c5a120389942
installer-artifacts   https://github.com/openshift/installer   f2ac89df630f6ae6efb91e8a6f01c5a120389942

and that includes the change for this bug:

$ git log --oneline f2ac89df630f6 | grep 'data/bootstrap: Replace openshift.sh with cluster-bootstrap'
108a45bdb data/bootstrap: Replace openshift.sh with cluster-bootstrap

[1]: https://openshift-release.svc.ci.openshift.org/#4.3.0-0.nightly
[2]: https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-10-31-085222
[3]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.3.0-0.nightly-2019-10-31-085222/
(In reply to W. Trevor King from comment #27)
> > Check latest 4.3 builds, there are not available nightly builds which pass the ci-test now.
>
> There aren't? [1] has lots of green 4.3 nightlies, most recently [2,3].
> Which has:

I mean available nightly builds which pass the CI test against the upi/vsphere platform. According to comment 9 and comment 24, we need to do verification on upi/vsphere, right?
According to comment 24, we ran a regression test against 4.3.0-0.nightly-2019-11-22-050018. Installation on upi/vsphere succeeded.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062