Bug 1797508 - bootstrap cannot complete because it fails to create some manifests
Summary: bootstrap cannot complete because it fails to create some manifests
Keywords:
Status: CLOSED DUPLICATE of bug 1794824
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Abhinav Dahiya
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-03 10:24 UTC by liujia
Modified: 2020-02-07 23:22 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-07 23:22:16 UTC
Target Upstream Version:
Embargoed:



Description liujia 2020-02-03 10:24:47 UTC
Description of problem:
UPI/vSphere installs fail at the "wait for bootstrapping to complete" stage.
level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.jliu-44.qe.devcluster.openshift.com:6443..."
level=info msg="API v1.17.1 up"
level=info msg="Waiting up to 40m0s for bootstrapping to complete..."
level=info msg="Use the following commands to gather logs from the cluster"
level=info msg="openshift-install gather bootstrap --help"
level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"

The bootkube.sh log shows:
1. The etcd endpoint uses the bootstrap IP instead of the control-plane hostnames, so etcd is not actually healthy before the "Starting cluster-bootstrap..." stage.

...
Feb 03 07:47:21 bootstrap-0 bootkube.sh[1867]: {"level":"warn","ts":"2020-02-03T07:47:21.021Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-7636cee0-75e2-4933-8a3b-e6f150e8fd7b/139.178.76.18:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 139.178.76.18:2379: connect: connection refused\""}
Feb 03 07:47:21 bootstrap-0 bootkube.sh[1867]: https://139.178.76.18:2379 is unhealthy: failed to commit proposal: context deadline exceeded
...
Feb 03 07:47:26 bootstrap-0 bootkube.sh[1867]: https://139.178.76.18:2379 is healthy: successfully committed proposal: took = 7.997972ms
Feb 03 07:47:26 bootstrap-0 podman[5111]: 2020-02-03 07:47:26.563591121 +0000 UTC m=+0.425265151 container died 73b37234d156b127807da9d0862d7addab9f81a3cae7d814e48801081d5b8df6 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3d0208b957da5f116933fe90df4fbe3682968289ac668bebe812d74ac8fe9c0, name=etcdctl)
Feb 03 07:47:26 bootstrap-0 podman[5111]: 2020-02-03 07:47:26.592853689 +0000 UTC m=+0.454527719 container remove 73b37234d156b127807da9d0862d7addab9f81a3cae7d814e48801081d5b8df6 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3d0208b957da5f116933fe90df4fbe3682968289ac668bebe812d74ac8fe9c0, name=etcdctl)
Feb 03 07:47:26 bootstrap-0 bootkube.sh[1867]: etcd cluster up. Killing etcd certificate signer...
...
There are many TCP connections on port 2379 on the bootstrap node. At the same time, the control plane hosts do not come up because port 22623 is not reachable from the bootstrap node.
[root@bootstrap-0 core]# netstat -na|grep 2379|wc -l
269

2. Some manifests fail to be created during the "Starting cluster-bootstrap..." stage.
...
Feb 03 07:48:38 bootstrap-0 bootkube.sh[1867]: [#115] failed to create some manifests:
Feb 03 07:48:38 bootstrap-0 bootkube.sh[1867]: "99_openshift-cluster-api_worker-machineset-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_worker-machineset-0.yaml": no matches for kind "MachineSet" in version "machine.openshift.io/v1beta1"
Feb 03 07:48:38 bootstrap-0 bootkube.sh[1867]: "99_openshift-machineconfig_99-master-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-master-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
Feb 03 07:48:38 bootstrap-0 bootkube.sh[1867]: "99_openshift-machineconfig_99-worker-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-worker-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
...

Version-Release number of the following components:
4.4.0-0.nightly-2020-02-03-043955

How reproducible:
always

Steps to Reproduce:
1. Trigger upi/vsphere installation with 4.4.0-0.nightly-2020-02-03-043955

Actual results:
The installation fails.

Expected results:
The installation succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 2 liujia 2020-02-03 10:26:56 UTC
Adding the TestBlocker keyword, since this blocks v4.4 testing on vSphere.

Comment 4 liujia 2020-02-06 10:16:47 UTC
I am not quite sure the root cause is etcd, since the MCO never serves port 22623 at all, which prevents the control plane hosts from launching successfully.
# crictl ps -a|grep -E 'api|machine'
7baaa719aed88       29028117595da69c5128adf7816f45a5f35e988df3f37b2ac71ee353572665c5                                                         2 minutes ago       Exited              machine-config-controller        6                   faa1d50063168
72447f205806e       363d404df555581648b531c988675f95cb38ae424c25c44cbee9f6c846ba2364                                                         7 minutes ago       Running             kube-apiserver-insecure-readyz   0                   10766ac2d1da5
240949b21f42a       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:66514d986e20ef7abd5e348513973cb9c5fc6b75a9aa5b3bfac5219fb5a7b51f   7 minutes ago       Running             kube-apiserver                   0                   10766ac2d1da5

[root@bootstrap-0 core]# crictl logs 7baaa719aed88
I0206 10:01:53.692150       1 bootstrap.go:40] Version: v4.4.0-202002030016-dirty (09fe53e2e47bc6f8129376dfe389e98fc151ff48)
F0206 10:01:53.751233       1 bootstrap.go:47] error running MCC[BOOTSTRAP]: failed to create MachineConfig for role master: failed to execute template: template: /etc/mcc/templates/common/vsphere/files/vsphere-coredns-db.yaml:14:26: executing "/etc/mcc/templates/common/vsphere/files/vsphere-coredns-db.yaml" at <.Infra.Status.PlatformStatus.VSphere.APIServerInternalIP>: nil pointer evaluating *v1.VSpherePlatformStatus.APIServerInternalIP

So, moving this back to the Installer component to get the above issue fixed first. Also, according to QE's testing, this issue only happens on vSphere (i.e., bootstrap cannot finish and gets stuck on the "failed to create some manifests" error with the control plane never launched).
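The MCC failure above is a Go text/template execution error: the template dereferences .Infra.Status.PlatformStatus.VSphere.APIServerInternalIP, but the vSphere platform status pointer is nil because the installer never populated it. A minimal sketch of that failure mode, using a hypothetical stand-in type and helper (the real types live in openshift/api, not shown here):

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// VSpherePlatformStatus is a hypothetical stand-in for the vSphere
// platform-status type carried in the infrastructure object.
type VSpherePlatformStatus struct {
	APIServerInternalIP string
}

// renderIP executes a template that dereferences the same field path the
// vsphere-coredns-db.yaml template uses.
func renderIP(vs *VSpherePlatformStatus) (string, error) {
	tmpl := template.Must(template.New("db").Parse(
		"{{.Infra.Status.PlatformStatus.VSphere.APIServerInternalIP}}"))

	// Nested maps stand in for the infrastructure status object; only the
	// leaf pointer type matters for reproducing the error.
	data := map[string]interface{}{
		"Infra": map[string]interface{}{
			"Status": map[string]interface{}{
				"PlatformStatus": map[string]interface{}{
					"VSphere": vs,
				},
			},
		},
	}

	var b strings.Builder
	err := tmpl.Execute(&b, data)
	return b.String(), err
}

func main() {
	// A populated status renders cleanly.
	ip, _ := renderIP(&VSpherePlatformStatus{APIServerInternalIP: "10.0.0.1"})
	fmt.Println("rendered:", ip)

	// A nil pointer reproduces the class of error seen in the MCC log:
	// "nil pointer evaluating *VSpherePlatformStatus.APIServerInternalIP".
	if _, err := renderIP(nil); err != nil {
		fmt.Println("template error:", err)
	}
}
```

This is why the fix belongs in whichever component populates PlatformStatus (or guards the template), rather than in etcd.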

Comment 5 Abhinav Dahiya 2020-02-07 23:22:16 UTC
The error message makes this look like a duplicate of bug 1794824, which was fixed by https://github.com/openshift/machine-config-operator/pull/1425

*** This bug has been marked as a duplicate of bug 1794824 ***

