Created attachment 1701935 [details]
Log bundle collected during the problem

Description of problem:
During bootstrapping of the master nodes, etcd fails to come up on them. This might be the root cause of the other problems, or it might be just one of several. Because etcd never comes up on the masters, the API server does not come up there either, and the API endpoint is never moved from the bootstrap node to a master.

Version-Release number of the following components:
./openshift-install 4.4.11
built from commit db69e0456f2f7d6b937a8e88fc1ee6be32bf61fd
release image quay.io/openshift-release-dev/ocp-release@sha256:bf373a678979c1bf09069eb34f51e8c8180ef3488a8f6d915a99047320e81c24

ansible 2.9.11
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/mimazure/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]

How reproducible:
Always. The same happens with OCP 4.5.1 and OKD 4.5 using IPI, and also with a UPI installation on 4.5.1. Tried with 1 master and with 3 masters; the problem is always the same.

Steps to Reproduce:
./openshift-install create cluster --dir=. --log-level=debug

Actual results:
Etcd on the master nodes does not come up during bootstrapping (after api_fip starts pointing to the selected master node).

**Bootstrap node**
```console
Jul 18 22:14:45 ocp-uc-dzmv6-bootstrap bootkube.sh[2162]: "99_openshift-machineconfig_99-worker-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-worker-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
```

**Master**
```console
[core@ocp-uc-tnfdt-master-0 ~]$ sudo crictl logs $(sudo crictl ps --pod=$(sudo crictl pods --name=etcd-member --quiet) --quiet)
2020-07-18T21:42:25+0000 Entrypoint skipped copying Multus binary.
2020-07-18T21:42:25+0000 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2020-07-18T21:42:25+0000 Attemping to find master plugin configuration, attempt 0
2020-07-18T21:42:30+0000 Attemping to find master plugin configuration, attempt 5
2020-07-18T21:42:35+0000 Attemping to find master plugin configuration, attempt 10
2020-07-18T21:42:40+0000 Attemping to find master plugin configuration, attempt 15
2020-07-18T21:42:45+0000 Attemping to find master plugin configuration, attempt 20
[core@ocp-uc-tnfdt-master-0 ~]$ sudo crictl pods --name=etcd-member
POD ID CREATED STATE NAME NAMESPACE ATTEMPT
[core@ocp-uc-tnfdt-master-0 ~]$
```

```console
Jul 18 21:48:07 ocp-uc-tnfdt-master-0 hyperkube[1367]: E0718 21:48:07.929542 1367 openstack_instances.go:71] cannot initialize cloud provider, only limited functionality is available : cloud provider is not initialized
```

```console
./journals/kubelet.log:Jul 18 08:52:20 ocp-uc-bwv9j-master-0 hyperkube[1374]: I0718 08:52:20.764876 1374 flags.go:33] FLAG: --cni-conf-dir="/etc/cni/net.d"
./journals/kubelet.log:Jul 18 08:52:21 ocp-uc-bwv9j-master-0 hyperkube[1374]: E0718 08:52:21.091181 1374 kubelet.go:2194] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
```

There is no etcd directory under /etc/kubernetes.

```console
/var/log/pods/openshift-openstack-infra_coredns-ocp-uc-tnfdt-master-0_10d55529df7d2cca408ef148939eb0e9/render-config/0.log:2020-07-18T21:41:13.074240077+00:00 stderr F time="2020-07-18T21:41:13Z" level=info msg="Failed to get Etcd SRV members" err="lookup _etcd-server-ssl._tcp.ocp.uc.nelab on 10.0.0.6:53: no such host"
/var/log/pods/openshift-openstack-infra_coredns-ocp-uc-tnfdt-master-0_10d55529df7d2cca408ef148939eb0e9/render-config/0.log:2020-07-18T21:41:13.075370581+00:00 stderr F time="2020-07-18T21:41:13Z" level=info msg="Failed to get Etcd SRV members" err="lookup _etcd-server-ssl._tcp.ocp.uc.nelab on 10.0.0.6:53: no such host"
/var/log/pods/openshift-network-operator_network-operator-58554b89d-j9bfq_914b8fda-4f26-42af-a32c-7e353aeca869/network-operator/0.log:2020-07-18T21:40:59.514887672+00:00 stderr F 2020/07/18 21:40:59 configmap 'openshift-config/initial-etcd-ca' name differs from trustedCA of proxy 'cluster' or trustedCA not set; reconciliation will be skipped
```

Expected results:
Bootstrapping should continue and the API server should be accessible on the master nodes.

Additional info:
I tried adding an api-int entry to /etc/hosts on the master so that it points to API_FIP rather than to the IP address of the master VM (which is the default). The process appeared to continue further, but etcd never came up and the API server was never started on the master node.
The logs seem to refer to the failed install with a single master, which is currently not a supported setup. Can you please post the log bundle of the failed three-master install? It's also worth mentioning that etcd requires fast disks in order to avoid repeated leader elections.
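For a rough idea of whether a disk is fast enough, one option is to time small synchronous writes, since that is the pattern etcd's write-ahead log depends on. The sketch below is only an illustration under stated assumptions: the function names `measure_fsync_latency_ms` and `percentile` are made up for this example, the write count and block size are arbitrary, and it uses `os.fsync` rather than a dedicated benchmarking tool. etcd's own guidance is commonly quoted as a 99th-percentile sync latency below about 10 ms, so treat the number it prints as a coarse indicator, not a verdict.

```python
import os
import tempfile
import time


def measure_fsync_latency_ms(directory, writes=200, block_size=2048):
    """Time os.fsync after each small write to a temp file in `directory`.

    Returns the per-write sync latencies in milliseconds, sorted ascending.
    `writes` and `block_size` are arbitrary illustrative defaults.
    """
    latencies = []
    payload = b"\0" * block_size
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(writes):
            f.write(payload)
            f.flush()  # push Python's buffer to the OS before timing the sync
            start = time.perf_counter()
            os.fsync(f.fileno())
            latencies.append((time.perf_counter() - start) * 1000.0)
    return sorted(latencies)


def percentile(sorted_values, pct):
    """Return a simple nearest-rank style percentile from a sorted list."""
    index = min(len(sorted_values) - 1, int(len(sorted_values) * pct / 100))
    return sorted_values[index]


if __name__ == "__main__":
    # Run this from a directory on the disk that will back etcd.
    lat = measure_fsync_latency_ms(".")
    print(f"p99 fsync latency: {percentile(lat, 99):.2f} ms")
```

Run it on the volume that would hold the etcd data; a result far above the 10 ms range would point to storage as a likely contributor to leader elections.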
Yes, I can repeat it with 3 masters (on 4.5.1, which I have just tried). Let me rerun it and get back to you. Good catch on disk speed; however, I think we should be good with the following:

*-disk
     description: ATA Disk
     product: INTEL SSDSC2BB30
Created attachment 1702251 [details] Log bundle from try with 3 masters
Comment on attachment 1702251 [details] Log bundle from try with 3 masters 4.5.1 & IPI
I am very sorry, I mistakenly set the number of controller replicas to 1. I need to rerun the install.
Created attachment 1702260 [details] Log bundle from try with 3 masters (CORRECT ONE) 4.5.1/IPI with 3 masters
This is caused by your cluster name having a dot in it. We've added a check to prevent dots in cluster names. Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1857158

*** This bug has been marked as a duplicate of bug 1857158 ***
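To illustrate the kind of check described above, here is a minimal sketch of a pre-flight validation that rejects dots in a cluster name. It is not the installer's actual code: the function name `validate_cluster_name`, the error strings, and the choice to validate against a single RFC 1123 DNS label are assumptions made for this example. The sample name `ocp.uc` is likewise an assumption, inferred from the `_etcd-server-ssl._tcp.ocp.uc.nelab` lookup in the logs rather than stated in the report.

```python
import re

# A single RFC 1123 DNS label: lowercase alphanumerics and hyphens,
# starting and ending with an alphanumeric. Dots are not part of a label.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def validate_cluster_name(name):
    """Return a list of problems with a proposed cluster name (empty if OK).

    Hypothetical helper for illustration; not taken from openshift-install.
    """
    problems = []
    if "." in name:
        problems.append("cluster name must not contain dots")
    elif not _DNS_LABEL.fullmatch(name):
        problems.append("cluster name must be a valid DNS label")
    if len(name) > 63:
        problems.append("cluster name must be at most 63 characters")
    return problems


if __name__ == "__main__":
    for candidate in ("ocp.uc", "ocp-uc"):
        issues = validate_cluster_name(candidate)
        print(candidate, "->", issues or "ok")
```

Running a check like this before `openshift-install create cluster` would flag a dotted name up front instead of leaving it to surface later as failed etcd SRV lookups during bootstrapping.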