Description of problem:
OpenShift 4.2 UPI install fails on CoreOS/VMware
- The bootstrap API server is up to an extent, but it's returning a 404
- The bootstrap is unable to approve CSRs

Version-Release number of the following components:
4.2

How reproducible:

Steps to Reproduce:
1. Customer is following instructions per: https://docs.openshift.com/container-platform/4.2/installing/installing_vsphere/installing-vsphere.html

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:
Successful install of 4.2

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
control-plane/10.123.13.102/journals/kubelet.log:Oct 31 14:49:37 etcd-1.o4.dr3.demo.sk hyperkube[1154]: E1031 14:49:37.907026 1154 certificate_manager.go:385] Failed while requesting a signed certificate from the master: cannot create certificate signing request: the server rejected our request for an unknown reason (post certificatesigningrequests.certificates.k8s.io)
Created attachment 1631710 [details] Log bundle
> log-bundle-20191031145106/bootstrap/journals/bootkube.log

```
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-0.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.101:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-2.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.103:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-1.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.102:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: Error: unhealthy cluster
Oct 31 14:45:34 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: etcdctl failed. Retrying in 5 seconds...
```

The bootstrap-host is waiting for etcd-cluster formation on control-plane hosts.

> log-bundle-20191031145106/bootstrap/containers/machine-config-server-ae8426373114ed617b03030a747589d9a38efc8a7aa38b07849219995bb86a86.log

```
I1031 14:04:26.885488 1 api.go:97] Pool master requested by 10.123.13.80:46260
I1031 14:04:26.885538 1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I1031 14:04:26.887600 1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-b38dadd973a9c0be0f894d3cd69ee8e8.yaml"
I1031 14:05:47.131961 1 api.go:97] Pool master requested by 10.123.13.80:46890
I1031 14:05:47.132943 1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I1031 14:05:47.133592 1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-b38dadd973a9c0be0f894d3cd69ee8e8.yaml"
I1031 14:07:35.453662 1 api.go:97] Pool master requested by 10.123.13.80:47728
I1031 14:07:35.454651 1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
```

The control-plane hosts have requested the ignition from bootstrap-host.
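For anyone triaging a similar log bundle, the unhealthy etcd endpoints can be pulled out of bootkube.log mechanically. The following is only a rough sketch: the regular expression is an assumption based on the three "is unhealthy" lines quoted above, and the function name `unhealthy_endpoints` is made up for illustration, not part of any OpenShift tooling.

```python
import re

# Pattern inferred from the bootkube.log lines quoted in this bug (an assumption,
# not a documented log format):
#   "<endpoint> is unhealthy: failed to connect: dial tcp <ip>:<port>: <reason>"
UNHEALTHY = re.compile(
    r"(?P<endpoint>https://\S+) is unhealthy: failed to connect: "
    r"dial tcp (?P<addr>[\d.]+):(?P<port>\d+): (?P<reason>.+)$"
)


def unhealthy_endpoints(lines):
    """Return {endpoint: (ip, reason)} for every endpoint reported unhealthy."""
    found = {}
    for line in lines:
        match = UNHEALTHY.search(line.strip())
        if match:
            found[match.group("endpoint")] = (match.group("addr"), match.group("reason"))
    return found


# Lines copied from the bootkube.log excerpt in this bug.
SAMPLE = [
    "Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-0.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.101:2379: connect: connection refused",
    "Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-2.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.103:2379: connect: connection refused",
    "Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-1.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.102:2379: connect: connection refused",
    "Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: Error: unhealthy cluster",
]

if __name__ == "__main__":
    for endpoint, (ip, reason) in sorted(unhealthy_endpoints(SAMPLE).items()):
        print(f"{endpoint} -> {ip}: {reason}")
```

Here all three endpoints report "connection refused", which suggests nothing is listening on port 2379 on any control-plane host yet, rather than a DNS or routing failure for a single member.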
So looking at
> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.101/
> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.102/
there are no containers running on those hosts (the `containers` directory is empty), and the kubelet is also not showing errors that explain why the etcd static pods are not running.

> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.103/
the init containers for etcd have completed, but the etcd-member pods are failing or haven't started yet; the kubelet logs nothing about it.

Moving to node team to help debug.
Thank you, Abhinav, for the update.
Hi Abhinav - Did you have any updates from the node team? Were they able to help with debugging this? Thank you, Aja
bootstrap/containers/machine-config-controller-7b89f76874a18448df276b8ecf7a14cef4ad2911a6f1b9f062a20e12dc4ddbaf.log:

```
I1031 14:04:22.198338 1 bootstrap.go:40] Version: v4.2.0-201910101614-dirty (62b0b6d2a751a5f364f2e6d5c9cfe63419668777)
W1031 14:04:22.426844 1 render.go:137] Warning: the controller config referenced an unsupported platform: vsphere
W1031 14:04:22.466008 1 render.go:137] Warning: the controller config referenced an unsupported platform: vsphere
```

Looks like the MCO is reporting vsphere is an unsupported platform. This is strange because the docs show vsphere should be supported [1]. Going to reassign to the MCO team for more input.

1. https://docs.openshift.com/container-platform/4.2/installing/installing_vsphere/installing-vsphere.html#installation-vsphere-config-yaml_installing-vsphere
MCO has been waiting for verification on the status of vsphere : https://github.com/openshift/machine-config-operator/pull/998#discussion_r318568006 We are happy to merge (and make any other changes) but were told there were issues with kubelet on vsphere with no update to the contrary. Please let us know..
I have not heard of any Kubelet issues on Vsphere. Is there something the Node team should look into?
Antonio, is there anything Ryan/Node needs outside of your comment here?: https://github.com/openshift/machine-config-operator/pull/998#discussion_r318568006
(In reply to Kirsten Garrison from comment #9)
> Antonio, is there anything Ryan/Node needs outside of your comment here?:
> https://github.com/openshift/machine-config-operator/pull/998#discussion_r318568006

I don't think so. Also, that's just a warning; how is it causing any issue here?
As https://github.com/openshift/machine-config-operator/pull/998 has merged, is there a further problem we need to investigate in this BZ?
Hi Team,

The problem we were investigating is the failed 4.2 installation on vSphere (see Comment 1 and Comment 3), and not just removing the warning message that stated vSphere was unsupported (Comment 6). Please let me know if there is more I need to gather to help solve this.

Log Errors, from Comment 1:

control-plane/10.123.13.102/journals/kubelet.log:Oct 31 14:49:37 etcd-1.o4.dr3.demo.sk hyperkube[1154]: E1031 14:49:37.907026 1154 certificate_manager.go:385] Failed while requesting a signed certificate from the master: cannot create certificate signing request: the server rejected our request for an unknown reason (post certificatesigningrequests.certificates.k8s.io)

Log Errors, from Comment 3:

> log-bundle-20191031145106/bootstrap/journals/bootkube.log

```
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-0.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.101:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-2.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.103:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-1.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.102:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: Error: unhealthy cluster
Oct 31 14:45:34 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: etcdctl failed. Retrying in 5 seconds...
```

The bootstrap-host is waiting for etcd-cluster formation on control-plane hosts.

> log-bundle-20191031145106/bootstrap/containers/machine-config-server-ae8426373114ed617b03030a747589d9a38efc8a7aa38b07849219995bb86a86.log

```
I1031 14:04:26.885488 1 api.go:97] Pool master requested by 10.123.13.80:46260
I1031 14:04:26.885538 1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I1031 14:04:26.887600 1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-b38dadd973a9c0be0f894d3cd69ee8e8.yaml"
I1031 14:05:47.131961 1 api.go:97] Pool master requested by 10.123.13.80:46890
I1031 14:05:47.132943 1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I1031 14:05:47.133592 1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-b38dadd973a9c0be0f894d3cd69ee8e8.yaml"
I1031 14:07:35.453662 1 api.go:97] Pool master requested by 10.123.13.80:47728
I1031 14:07:35.454651 1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
```

The control-plane hosts have requested the ignition from bootstrap-host.

So looking at
> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.101/
> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.102/
there are no containers running on those hosts (the `containers` directory is empty), and the kubelet is also not showing errors that explain why the etcd static pods are not running.

> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.103/
the init containers for etcd have completed, but the etcd-member pods are failing or haven't started yet; the kubelet logs nothing about it.
Please double-check:
- NTP/time on the ESXi hosts, and confirm the guests have the correct time as well
- All DNS records, and confirm the RHCOS guests are resolving them correctly
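As a starting point for the DNS check, a minimal sketch like the one below compares forward lookups against the addresses expected for each host. The hostname-to-IP pairs are taken from the bootkube.log excerpt in this bug; the helper names `resolve_ipv4` and `check_records` are hypothetical. It only covers A records via the local resolver, so it should be run from a machine using the same DNS servers as the RHCOS guests, and any SRV records or API records the install docs require would still need to be checked separately (for example with `dig`).

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except socket.gaierror:
        # Name did not resolve at all.
        return []
    return sorted({info[4][0] for info in infos})


def check_records(expected, resolver=resolve_ipv4):
    """Return (hostname, expected_ip, resolved_ips) for every mismatch.

    `resolver` is injectable so the comparison can be exercised without
    touching real DNS.
    """
    problems = []
    for hostname, ip in expected.items():
        resolved = resolver(hostname)
        if ip not in resolved:
            problems.append((hostname, ip, resolved))
    return problems


# Pairs taken from the bootkube.log lines in this bug; adjust for other clusters.
EXPECTED = {
    "etcd-0.o4.dr3.demo.sk": "10.123.13.101",
    "etcd-1.o4.dr3.demo.sk": "10.123.13.102",
    "etcd-2.o4.dr3.demo.sk": "10.123.13.103",
}

if __name__ == "__main__":
    issues = check_records(EXPECTED)
    for hostname, ip, resolved in issues:
        print(f"{hostname}: expected {ip}, resolver returned {resolved or 'nothing'}")
    if not issues:
        print("All listed A records resolve as expected.")
```

An empty result only shows that forward resolution matches from the machine where the script runs; it says nothing about reverse records or clock skew, which need their own checks.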
I am asking the customer to confirm these items now. Thanks Joseph.
Any updates on this?
Aja, Any updates?