Description of problem:
UPI/vSphere installation fails at the "wait for bootstrapping to complete" stage.

level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.jliu-44.qe.devcluster.openshift.com:6443..."
level=info msg="API v1.17.1 up"
level=info msg="Waiting up to 40m0s for bootstrapping to complete..."
level=info msg="Use the following commands to gather logs from the cluster"
level=info msg="openshift-install gather bootstrap --help"
level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"

The bootkube.sh log shows:

1. The etcd endpoint uses the bootstrap IP instead of the control plane hostname, so etcd is not actually healthy before the "Starting cluster-bootstrap..." stage.
...
Feb 03 07:47:21 bootstrap-0 bootkube.sh[1867]: {"level":"warn","ts":"2020-02-03T07:47:21.021Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-7636cee0-75e2-4933-8a3b-e6f150e8fd7b/139.178.76.18:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 139.178.76.18:2379: connect: connection refused\""}
Feb 03 07:47:21 bootstrap-0 bootkube.sh[1867]: https://139.178.76.18:2379 is unhealthy: failed to commit proposal: context deadline exceeded
...
Feb 03 07:47:26 bootstrap-0 bootkube.sh[1867]: https://139.178.76.18:2379 is healthy: successfully committed proposal: took = 7.997972ms
Feb 03 07:47:26 bootstrap-0 podman[5111]: 2020-02-03 07:47:26.563591121 +0000 UTC m=+0.425265151 container died 73b37234d156b127807da9d0862d7addab9f81a3cae7d814e48801081d5b8df6 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3d0208b957da5f116933fe90df4fbe3682968289ac668bebe812d74ac8fe9c0, name=etcdctl)
Feb 03 07:47:26 bootstrap-0 podman[5111]: 2020-02-03 07:47:26.592853689 +0000 UTC m=+0.454527719 container remove 73b37234d156b127807da9d0862d7addab9f81a3cae7d814e48801081d5b8df6 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3d0208b957da5f116933fe90df4fbe3682968289ac668bebe812d74ac8fe9c0, name=etcdctl)
Feb 03 07:47:26 bootstrap-0 bootkube.sh[1867]: etcd cluster up. Killing etcd certificate signer...
...

There are many TCP connections on port 2379 on the bootstrap node. At the same time, the control plane hosts do not come up because :22623 is not available from the bootstrap node.

[root@bootstrap-0 core]# netstat -na|grep 2379|wc -l
269

2. Some manifests fail to be created during the "Starting cluster-bootstrap..." stage.
...
Feb 03 07:48:38 bootstrap-0 bootkube.sh[1867]: [#115] failed to create some manifests:
Feb 03 07:48:38 bootstrap-0 bootkube.sh[1867]: "99_openshift-cluster-api_worker-machineset-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_worker-machineset-0.yaml": no matches for kind "MachineSet" in version "machine.openshift.io/v1beta1"
Feb 03 07:48:38 bootstrap-0 bootkube.sh[1867]: "99_openshift-machineconfig_99-master-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-master-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
Feb 03 07:48:38 bootstrap-0 bootkube.sh[1867]: "99_openshift-machineconfig_99-worker-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-worker-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
...

Version-Release number of the following components:
4.4.0-0.nightly-2020-02-03-043955

How reproducible:
always

Steps to Reproduce:
1. Trigger a UPI/vSphere installation with 4.4.0-0.nightly-2020-02-03-043955

Actual results:
The installation fails.

Expected results:
The installation succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
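
For triage, the two symptoms in the bootkube.sh log above (the etcd health probes against the bootstrap IP, and the manifests rejected for missing kinds) can be pulled out of a saved journal with a short script. The following is only a minimal sketch: the regular expressions are assumptions based on the log lines quoted in this report, the names `summarize` and `SAMPLE` are hypothetical, and other releases may word these messages differently.

```python
import re

# Plain-text health lines as quoted in this report, e.g.
# "https://139.178.76.18:2379 is unhealthy: failed to commit proposal: ..."
HEALTH_RE = re.compile(r"(https://\S+?:\d+) is (healthy|unhealthy):")

# REST mapping failures as quoted in this report, e.g.
# no matches for kind "MachineSet" in version "machine.openshift.io/v1beta1"
MISSING_KIND_RE = re.compile(r'no matches for kind "([^"]+)" in version "([^"]+)"')


def summarize(lines):
    """Return (health probes in log order, sorted unique missing kinds)."""
    health = []      # (endpoint, "healthy" | "unhealthy")
    missing = set()  # (kind, apiVersion)
    for line in lines:
        m = HEALTH_RE.search(line)
        if m:
            health.append((m.group(1), m.group(2)))
        m = MISSING_KIND_RE.search(line)
        if m:
            missing.add((m.group(1), m.group(2)))
    return health, sorted(missing)


# Lines shortened from the log excerpts in this report.
SAMPLE = [
    'bootkube.sh[1867]: https://139.178.76.18:2379 is unhealthy: failed to commit proposal: context deadline exceeded',
    'bootkube.sh[1867]: https://139.178.76.18:2379 is healthy: successfully committed proposal: took = 7.997972ms',
    'bootkube.sh[1867]: "99_openshift-cluster-api_worker-machineset-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_worker-machineset-0.yaml": no matches for kind "MachineSet" in version "machine.openshift.io/v1beta1"',
    'bootkube.sh[1867]: "99_openshift-machineconfig_99-master-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-master-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"',
    'bootkube.sh[1867]: "99_openshift-machineconfig_99-worker-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-worker-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"',
]

health, missing = summarize(SAMPLE)
for endpoint, status in health:
    print(endpoint, status)
for kind, version in missing:
    print("missing kind:", kind, version)
```

Collapsing the repeated errors into unique (kind, version) pairs makes it easier to see that only the MachineSet and MachineConfig kinds are missing, rather than reading every retry of the same failure.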
Adding testblocker since this blocks v4.4 testing on vSphere.
I am not sure the root cause is in etcd, since the MCO never brings up port 22623 at all, which prevents the control plane hosts from launching successfully.

# crictl ps -a|grep -E 'api|machine'
7baaa719aed88 29028117595da69c5128adf7816f45a5f35e988df3f37b2ac71ee353572665c5 2 minutes ago Exited machine-config-controller 6 faa1d50063168
72447f205806e 363d404df555581648b531c988675f95cb38ae424c25c44cbee9f6c846ba2364 7 minutes ago Running kube-apiserver-insecure-readyz 0 10766ac2d1da5
240949b21f42a quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66514d986e20ef7abd5e348513973cb9c5fc6b75a9aa5b3bfac5219fb5a7b51f 7 minutes ago Running kube-apiserver 0 10766ac2d1da5

[root@bootstrap-0 core]# crictl logs 7baaa719aed88
I0206 10:01:53.692150 1 bootstrap.go:40] Version: v4.4.0-202002030016-dirty (09fe53e2e47bc6f8129376dfe389e98fc151ff48)
F0206 10:01:53.751233 1 bootstrap.go:47] error running MCC[BOOTSTRAP]: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-coredns-db.yaml:14:26: executing "/etc/mcc/templates/common/vsphere/files/vsphere-coredns-db.yaml" at <.Infra.Status.PlatformStatus.VSphere.APIServerInternalIP>: nil pointer evaluating *v1.VSpherePlatformStatus.APIServerInternalIP

So I am changing this back to the installer component to get the above issue fixed first.

BTW, according to QE's testing, this issue only happens on vSphere (that is, bootstrap cannot finish and is stuck on the "failed to create some manifests" error without the control plane launched).
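
To confirm from another host whether port 22623 on the bootstrap node is accepting connections, a plain TCP probe is enough. This is a minimal sketch using only the Python standard library. The function name `port_open` is hypothetical, and the host passed to it has to be whatever address the control plane machines actually use to reach the bootstrap node, which this report does not state.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Hypothetical usage; replace the host with the real bootstrap address:
# print(port_open("bootstrap-0", 22623))
```

A result of False only shows that nothing accepted the connection. It does not distinguish a crashed machine-config-controller from a firewall or routing problem, so the crictl logs above remain the more specific evidence.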
The error message makes this look like a dup of bug 1794824, which was fixed by https://github.com/openshift/machine-config-operator/pull/1425

*** This bug has been marked as a duplicate of bug 1794824 ***