Bug 1859290 - OCP 4.4.11 - etcd does not come up on master nodes during bootstrapping (API server not available)
Keywords:
Status: CLOSED DUPLICATE of bug 1857158
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Eric Duen
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-21 15:59 UTC by michal_mazurek
Modified: 2020-07-30 14:52 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-30 14:52:57 UTC
Target Upstream Version:
Embargoed:


Attachments
Log bundle collected during the problem (2.43 MB, application/gzip)
2020-07-21 15:59 UTC, michal_mazurek
Log bundle from try with 3 masters (2.37 MB, application/gzip)
2020-07-23 16:03 UTC, michal_mazurek
Log bundle from try with 3 masters (CORRECT ONE) (5.63 MB, application/gzip)
2020-07-23 17:15 UTC, michal_mazurek


Links
| System | ID | Private | Priority | Status | Summary | Last Updated |
|---|---|---|---|---|---|---|
| Github | openshift installer issues 3926 | 0 | None | closed | OCP 4.4.11 - etcd does not come up on master nodes during bootstrapping (API server not available) | 2021-01-16 01:51:16 UTC |

Description michal_mazurek 2020-07-21 15:59:27 UTC
Created attachment 1701935 [details]
Log bundle collected during the problem

Description of problem:

During bootstrapping of the master nodes, etcd fails to come up on them. This may be the root cause of the other problems, or it may be just one of several. Because etcd never comes up on the masters, the API server does not come up there either, and the API endpoint is never moved from the bootstrap node to a master.

Version-Release number of the following components:
./openshift-install 4.4.11
built from commit db69e0456f2f7d6b937a8e88fc1ee6be32bf61fd
release image quay.io/openshift-release-dev/ocp-release@sha256:bf373a678979c1bf09069eb34f51e8c8180ef3488a8f6d915a99047320e81c24

ansible 2.9.11
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/mimazure/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]


How reproducible:

Always. The same happens with OCP 4.5.1 and OKD 4.5 (IPI), and even with a UPI installation of 4.5.1. Tried with 1 master and with 3 masters; the problem is always the same.

Steps to Reproduce:

./openshift-install create cluster --dir=. --log-level=debug

Actual results:

Etcd on the master nodes doesn't come up during bootstrapping, after api_fip starts pointing to the selected master node.

**Bootstrap node**
```console
Jul 18 22:14:45 ocp-uc-dzmv6-bootstrap bootkube.sh[2162]: "99_openshift-machineconfig_99-worker-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-worker-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
```

**Master**
```console
[core@ocp-uc-tnfdt-master-0 ~]$ sudo crictl logs $(sudo crictl ps --pod=$(sudo crictl pods --name=etcd-member --quiet) --quiet)
2020-07-18T21:42:25+0000 Entrypoint skipped copying Multus binary.
2020-07-18T21:42:25+0000 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2020-07-18T21:42:25+0000 Attemping to find master plugin configuration, attempt 0
2020-07-18T21:42:30+0000 Attemping to find master plugin configuration, attempt 5
2020-07-18T21:42:35+0000 Attemping to find master plugin configuration, attempt 10
2020-07-18T21:42:40+0000 Attemping to find master plugin configuration, attempt 15
2020-07-18T21:42:45+0000 Attemping to find master plugin configuration, attempt 20
[core@ocp-uc-tnfdt-master-0 ~]$ sudo crictl pods --name=etcd-member
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT
[core@ocp-uc-tnfdt-master-0 ~]$
```

```console
Jul 18 21:48:07 ocp-uc-tnfdt-master-0 hyperkube[1367]: E0718 21:48:07.929542    1367 openstack_instances.go:71] cannot initialize cloud provider, only limited functionality is available : cloud provider is not initialized
```
```console
./journals/kubelet.log:Jul 18 08:52:20 ocp-uc-bwv9j-master-0 hyperkube[1374]: I0718 08:52:20.764876    1374 flags.go:33] FLAG: --cni-conf-dir="/etc/cni/net.d"
./journals/kubelet.log:Jul 18 08:52:21 ocp-uc-bwv9j-master-0 hyperkube[1374]: E0718 08:52:21.091181    1374 kubelet.go:2194] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
```
There’s no etcd directory under /etc/kubernetes.
```console
/var/log/pods/openshift-openstack-infra_coredns-ocp-uc-tnfdt-master-0_10d55529df7d2cca408ef148939eb0e9/render-config/0.log:2020-07-18T21:41:13.074240077+00:00 stderr F time="2020-07-18T21:41:13Z" level=info msg="Failed to get Etcd SRV members" err="lookup _etcd-server-ssl._tcp.ocp.uc.nelab on 10.0.0.6:53: no such host"
/var/log/pods/openshift-openstack-infra_coredns-ocp-uc-tnfdt-master-0_10d55529df7d2cca408ef148939eb0e9/render-config/0.log:2020-07-18T21:41:13.075370581+00:00 stderr F time="2020-07-18T21:41:13Z" level=info msg="Failed to get Etcd SRV members" err="lookup _etcd-server-ssl._tcp.ocp.uc.nelab on 10.0.0.6:53: no such host"
/var/log/pods/openshift-network-operator_network-operator-58554b89d-j9bfq_914b8fda-4f26-42af-a32c-7e353aeca869/network-operator/0.log:2020-07-18T21:40:59.514887672+00:00 stderr F 2020/07/18 21:40:59 configmap 'openshift-config/initial-etcd-ca' name differs from trustedCA of proxy 'cluster' or trustedCA not set; reconciliation will be skipped
```

Expected results:

Bootstrapping should continue and the API server should be accessible on the master nodes.

Additional info:

On the master, I tried adding an /etc/hosts entry so that the api-int endpoint points to API_FIP rather than to the IP address of the master VM (which is the default). It looked as though the process continued further, but etcd never came up and the API server never started on the master node.

Comment 1 Pierre Prinetti 2020-07-23 14:11:09 UTC
The logs seem to refer to the failed install with 1 single master, which is currently not a supported setup.

Can you please post the log-bundle of the three-masters failed install?

It's also worth mentioning that etcd requires fast disks in order to prevent serial leader elections.

Comment 2 michal_mazurek 2020-07-23 14:49:16 UTC
Yes - I can repeat it with 3 masters (on 4.5.1 which I have just tried).
Let me repeat it and get back to you. 
Good catch on disk speed; however, I think we should be good with the following:

  *-disk
       description: ATA Disk
       product: INTEL SSDSC2BB30

Comment 3 michal_mazurek 2020-07-23 16:03:06 UTC
Created attachment 1702251 [details]
Log bundle from try with 3 masters

Comment 4 michal_mazurek 2020-07-23 16:04:05 UTC
Comment on attachment 1702251 [details]
Log bundle from try with 3 masters

4.5.1 & IPI

Comment 5 michal_mazurek 2020-07-23 16:14:35 UTC
I am very sorry, I mistakenly set the number of controller replicas to 1. I need to rerun it.

Comment 6 michal_mazurek 2020-07-23 17:15:32 UTC
Created attachment 1702260 [details]
Log bundle from try with 3 masters (CORRECT ONE)

4.5.1/IPI with 3 masters

Comment 8 Martin André 2020-07-30 14:52:57 UTC
This is caused by your cluster name having a dot in it. We've added a check to prevent dots in cluster names.
Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1857158

*** This bug has been marked as a duplicate of bug 1857158 ***
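
For anyone hitting this on an installer build that predates the check, the idea can be approximated with a small pre-flight test before running `openshift-install`. This is only a sketch, not the installer's actual validation code: the function name `cluster_name_problems` is hypothetical, and the rule that a cluster name must be a single DNS label (lowercase alphanumerics and `-`, at most 63 characters) is an assumption that may be stricter or looser than what the installer enforces.

```python
import re

# Assumption: treat the cluster name as one DNS label (RFC 1123 style).
# The real installer check may use a different rule.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def cluster_name_problems(name):
    """Return a list of problems found in a cluster name; empty means none found."""
    problems = []
    if "." in name:
        # The failure mode discussed in this bug: a dot in the cluster name.
        problems.append("cluster name must not contain a dot")
    elif len(name) > 63 or not _DNS_LABEL.fullmatch(name):
        problems.append("cluster name is not a valid DNS label")
    return problems


if __name__ == "__main__":
    # "ocp.uc" and "ocp-uc" are illustrative values only.
    for candidate in ("ocp.uc", "ocp-uc"):
        print(candidate, cluster_name_problems(candidate))
```

With the illustrative values above, a dotted name such as `ocp.uc` is reported as a problem, while `ocp-uc` passes.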

