Bug 1859290

Summary:

OCP 4.4.11 - etcd does not come up on master nodes during bootstrapping (API server not available)

Product:

OpenShift Container Platform

Reporter:

michal_mazurek

Component:

Installer

Assignee:

Eric Duen <eduen>

Installer sub component:

OpenShift on OpenStack

QA Contact:

David Sanz <dsanzmor>

Status:

CLOSED DUPLICATE

Docs Contact:

Severity:

medium

Priority:

unspecified

CC:

m.andre, michal_mazurek, pprinett

Version:

4.4

Keywords:

UpcomingSprint

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2020-07-30 14:52:57 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Log bundle collected during the problem	none
Log bundle from try with 3 masters	none
Log bundle from try with 3 masters (CORRECT ONE)	none

Description michal_mazurek 2020-07-21 15:59:27 UTC

Created attachment 1701935 [details]
Log bundle collected during the problem

Description of problem:

During bootstrapping, etcd fails to come up on the master nodes. This may be the root cause of the other problems, or it may be just one of several. Because etcd never comes up on the masters, the API server does not come up there either, and the API endpoint is never moved from the bootstrap node to a master.

Version-Release number of the following components:
./openshift-install 4.4.11
built from commit db69e0456f2f7d6b937a8e88fc1ee6be32bf61fd
release image quay.io/openshift-release-dev/ocp-release@sha256:bf373a678979c1bf09069eb34f51e8c8180ef3488a8f6d915a99047320e81c24

ansible 2.9.11
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/mimazure/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]


How reproducible:

Always. The same happens with OCP 4.5.1 and OKD 4.5 (both IPI), and even with a UPI installation on 4.5.1. Tried with 1 master and with 3 masters; the problem is always the same.

Steps to Reproduce:

./openshift-install create cluster --dir=. --log-level=debug

Actual results:

etcd on the master nodes does not come up during bootstrapping (after api_fip starts pointing to the selected master node).

**Bootstrap node**
```console
Jul 18 22:14:45 ocp-uc-dzmv6-bootstrap bootkube.sh[2162]: "99_openshift-machineconfig_99-worker-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-worker-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
```

**Master**
```console
[core@ocp-uc-tnfdt-master-0 ~]$ sudo crictl logs $(sudo crictl ps --pod=$(sudo crictl pods --name=etcd-member --quiet) --quiet)
2020-07-18T21:42:25+0000 Entrypoint skipped copying Multus binary.
2020-07-18T21:42:25+0000 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2020-07-18T21:42:25+0000 Attemping to find master plugin configuration, attempt 0
2020-07-18T21:42:30+0000 Attemping to find master plugin configuration, attempt 5
2020-07-18T21:42:35+0000 Attemping to find master plugin configuration, attempt 10
2020-07-18T21:42:40+0000 Attemping to find master plugin configuration, attempt 15
2020-07-18T21:42:45+0000 Attemping to find master plugin configuration, attempt 20
[core@ocp-uc-tnfdt-master-0 ~]$ sudo crictl pods --name=etcd-member
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT
[core@ocp-uc-tnfdt-master-0 ~]$
```

```console
Jul 18 21:48:07 ocp-uc-tnfdt-master-0 hyperkube[1367]: E0718 21:48:07.929542    1367 openstack_instances.go:71] cannot initialize cloud provider, only limited functionality is available : cloud provider is not initialized
```
```console
./journals/kubelet.log:Jul 18 08:52:20 ocp-uc-bwv9j-master-0 hyperkube[1374]: I0718 08:52:20.764876    1374 flags.go:33] FLAG: --cni-conf-dir="/etc/cni/net.d"
./journals/kubelet.log:Jul 18 08:52:21 ocp-uc-bwv9j-master-0 hyperkube[1374]: E0718 08:52:21.091181    1374 kubelet.go:2194] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
```
There is no etcd directory under /etc/kubernetes.
```console
/var/log/pods/openshift-openstack-infra_coredns-ocp-uc-tnfdt-master-0_10d55529df7d2cca408ef148939eb0e9/render-config/0.log:2020-07-18T21:41:13.074240077+00:00 stderr F time="2020-07-18T21:41:13Z" level=info msg="Failed to get Etcd SRV members" err="lookup _etcd-server-ssl._tcp.ocp.uc.nelab on 10.0.0.6:53: no such host"
/var/log/pods/openshift-openstack-infra_coredns-ocp-uc-tnfdt-master-0_10d55529df7d2cca408ef148939eb0e9/render-config/0.log:2020-07-18T21:41:13.075370581+00:00 stderr F time="2020-07-18T21:41:13Z" level=info msg="Failed to get Etcd SRV members" err="lookup _etcd-server-ssl._tcp.ocp.uc.nelab on 10.0.0.6:53: no such host"
/var/log/pods/openshift-network-operator_network-operator-58554b89d-j9bfq_914b8fda-4f26-42af-a32c-7e353aeca869/network-operator/0.log:2020-07-18T21:40:59.514887672+00:00 stderr F 2020/07/18 21:40:59 configmap 'openshift-config/initial-etcd-ca' name differs from trustedCA of proxy 'cluster' or trustedCA not set; reconciliation will be skipped
```

Expected results:

Bootstrapping should continue and the API server should be accessible on the master nodes.

Additional info:

Tried adding an /etc/hosts entry on the master so that the api-int endpoint points to API_FIP rather than to the IP address of the master VM (which is the default). It looked as if the process continued further, but etcd never came up and the API server was never started on the master node.

Comment 1 Pierre Prinetti 2020-07-23 14:11:09 UTC

The logs seem to refer to the failed install with 1 single master, which is currently not a supported setup.

Can you please post the log-bundle of the three-masters failed install?

It's also worth mentioning that etcd requires fast disks in order to prevent repeated leader elections.

Comment 2 michal_mazurek 2020-07-23 14:49:16 UTC

Yes - I can repeat it with 3 masters (on 4.5.1 which I have just tried).
Let me repeat it and get back to you. 
Good catch on disk speed. However, I think we should be good with the following:

  *-disk
       description: ATA Disk
       product: INTEL SSDSC2BB30
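
The disk model alone does not show sync-write latency, which is what etcd is sensitive to. A rough probe, offered only as a sketch: it times `fsync` on small writes in a given directory and reports a nearest-rank percentile. The function names, the 2 KiB write size, and the 200-write count are assumptions made for illustration, and this is not a substitute for a proper benchmark tool. etcd's guidance is commonly quoted as a 99th percentile WAL fsync under roughly 10 ms.

```python
import math
import os
import tempfile
import time
from typing import List


def fsync_latencies_ms(directory: str, writes: int = 200, size: int = 2048) -> List[float]:
    """Write small blocks into a temp file in `directory` and time each fsync (ms)."""
    payload = b"\0" * size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the write to stable storage
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Assumption: point this at a directory on the same disk etcd would use.
    samples = fsync_latencies_ms(tempfile.gettempdir())
    print(f"p99 fsync latency: {percentile(samples, 99):.2f} ms")
```

To be meaningful, the probe would have to run on a master node against the disk that backs the etcd data directory; results from a workstation say nothing about the cluster.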

Comment 3 michal_mazurek 2020-07-23 16:03:06 UTC

Created attachment 1702251 [details]
Log bundle from try with 3 masters

Comment 4 michal_mazurek 2020-07-23 16:04:05 UTC

Comment on attachment 1702251 [details]
Log bundle from try with 3 masters

4.5.1 & IPI

Comment 5 michal_mazurek 2020-07-23 16:14:35 UTC

I am very sorry, I mistakenly set the number of controller replicas to 1. I need to run it again.

Comment 6 michal_mazurek 2020-07-23 17:15:32 UTC

Created attachment 1702260 [details]
Log bundle from try with 3 masters (CORRECT ONE)

4.5.1/IPI with 3 masters

Comment 8 Martin André 2020-07-30 14:52:57 UTC

This is caused by your cluster name having a dot in it. We've added a check to prevent dots in cluster names.
Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1857158

*** This bug has been marked as a duplicate of bug 1857158 ***
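
For anyone landing here with the same symptoms, a pre-flight check along the lines of the one mentioned in the closing comment can be sketched as follows. This is not the installer's actual validation code: the function name, the messages, and the choice to treat the cluster name as a single DNS-1123 label (at most 63 characters, lowercase alphanumerics and hyphens) are assumptions made for illustration.

```python
import re
from typing import List

# Assumption: a cluster name should be a single DNS-1123 label, which
# excludes dots. The real installer check may be stricter or looser.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def cluster_name_problems(name: str) -> List[str]:
    """Return a list of problems with a cluster name (empty if none found)."""
    problems = []
    if "." in name:
        problems.append("contains a dot")
    if len(name) > 63:
        problems.append("longer than 63 characters")
    # Only run the general label check when nothing more specific was found.
    if not problems and not _DNS_LABEL.fullmatch(name):
        problems.append("not a valid DNS label")
    return problems


if __name__ == "__main__":
    # Hypothetical names used purely as examples.
    for candidate in ("ocp.uc", "ocp-uc"):
        print(candidate, cluster_name_problems(candidate) or "ok")
```

Running such a check against the name in `install-config.yaml` before `openshift-install create cluster` would flag a dotted name up front, before the bootstrap phase starts.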