Bug 1917115
| Summary: | Kube-api failing to connect to etcd while installing single node on vm with 2 nics |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Etcd |
| Status: | CLOSED DUPLICATE |
| Severity: | high |
| Priority: | high |
| Version: | 4.8 |
| Target Milestone: | --- |
| Target Release: | 4.8.0 |
| Reporter: | Igal Tsoiref <itsoiref> |
| Assignee: | Sam Batschelet <sbatsche> |
| QA Contact: | ge liu <geliu> |
| CC: | aos-bugs, mfojtik, skolicha, xxia |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| Last Closed: | 2021-03-05 19:34:34 UTC |
The kube-apiserver gets the etcd endpoints from the openshift-etcd/etcd-endpoints config map. Please verify its contents. Meanwhile, moving to etcd.

Igal, can you provide some logs for this issue? To understand the problem we would like to see the install-config and log bundles. Generally speaking, I would expect the machineCidr defined in install-config to allow render to understand which IP to utilize.

I think in general we have an issue with bootstrap member management. We have logic that performs certain actions when we scale down the bootstrap node. In this case we never take that step, so I believe we should review the full ramifications of not removing the bootstrap node.

The problem is that in the case of the none platform / single node, the user is not asked to provide a machine CIDR, but it looks like the installer sets some default. Maybe we must require that a CIDR be provided?

Yeah, if we have a node with multiple IPs, render needs a reasonable way to understand which IP it should use. I believe if machineCidr is not passed we take the first IP we find, which is not very durable IMO. The failure to make the correct decision here results in an apiserver with an invalid backend.

To clarify:
> Yeah if we have a node

That should read "bootstrap node".
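The IP-selection behavior described above can be sketched in a few lines. This is an illustration only, not the actual render code in the operator (which is Go); the function name is hypothetical, and the CIDR and addresses are taken from this bug's environment:

```python
import ipaddress

def pick_node_ip(node_ips, machine_cidr=None):
    """Pick the node IP to advertise for etcd.

    If a machine CIDR is given, prefer the first IP inside it;
    otherwise fall back to the first IP found -- the fragile
    behavior described in this bug.
    """
    if machine_cidr:
        net = ipaddress.ip_network(machine_cidr)
        for ip in node_ips:
            if ipaddress.ip_address(ip) in net:
                return ip
    # No CIDR (or no match): first IP wins, which may be the wrong NIC.
    return node_ips[0]

# A node with two NICs, as in this bug:
ips = ["192.168.144.10", "192.168.126.10"]
print(pick_node_ip(ips))                      # no CIDR: wrong NIC chosen
print(pick_node_ip(ips, "192.168.126.0/24"))  # with CIDR: correct NIC
```

Without a machine CIDR the choice degenerates to "first IP found", which is exactly the non-durable behavior complained about above.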
But etcd picks the right one. Why can't we just set the same IP in the config map?

Created attachment 1748501 [details]
must-gather-sno
Attached is a must-gather for reference from an SNO cluster that pivoted correctly.
Here is an example of the issues I referred to above. Here we see the annotation `alpha.installer.openshift.io/etcd-bootstrap: 192.168.126.10` still populated. This tells the system that the bootstrap node still exists. The apiserver uses this as well to populate the backend [1]. You can see the problem: the bootstrap IP was invalid, so one subconn in the etcd client for the apiserver would always be failing for each request.
```
- apiVersion: v1
data:
MTkyLjE2OC4xMjYuMTA: 192.168.126.10
kind: ConfigMap
metadata:
annotations:
alpha.installer.openshift.io/etcd-bootstrap: 192.168.126.10
name: etcd-endpoints
namespace: openshift-etcd
resourceVersion: "7459"
uid: 0546f04c-6de3-432b-b264-d602791f7e80
```
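The data key in that ConfigMap appears to be the node IP encoded as base64 with the padding stripped (ConfigMap keys cannot contain `=`), which can be verified with a couple of lines of Python:

```python
import base64

key = "MTkyLjE2OC4xMjYuMTA"
# Restore the stripped base64 padding before decoding.
padded = key + "=" * (-len(key) % 4)
print(base64.b64decode(padded).decode())  # 192.168.126.10
```

So the key and the value in the ConfigMap above both resolve to the same bootstrap IP, 192.168.126.10.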
Another issue is that we define at least one env var incorrectly. This basically says we have two etcd members and they are both the same. This should get cleaned up.
etcd-pod.yaml
```
- name: "ALL_ETCD_ENDPOINTS"
value: "https://192.168.126.10:2379,https://192.168.126.10:2379"
```
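Cleaning up that duplicated endpoint list amounts to deduplicating while preserving order. A minimal sketch (the function name is illustrative, not from the operator code; the input is the `ALL_ETCD_ENDPOINTS` value above):

```python
def dedupe_endpoints(value):
    """Split a comma-separated endpoint list and drop duplicates,
    keeping the original order."""
    seen = []
    for ep in value.split(","):
        if ep not in seen:
            seen.append(ep)
    return ",".join(seen)

all_etcd_endpoints = "https://192.168.126.10:2379,https://192.168.126.10:2379"
print(dedupe_endpoints(all_etcd_endpoints))  # https://192.168.126.10:2379
```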
[1] https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.7/pkg/operator/configobservation/etcdendpoints/observe_etcd_endpoints_test.go#L211
*** This bug has been marked as a duplicate of bug 1931658 ***
While installing single node (none platform) on VMs with 2 NICs, we saw that kube-api tries to talk to etcd with the wrong IP and gets an SSL error:

```
2021-01-05T18:44:05.497047195+00:00 stderr F I0105 18:44:05.496965 20 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc01188de00, {TRANSIENT_FAILURE connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, 192.168.126.10, ::1, not 192.168.144.10"}
```
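The handshake failure in that log line is a plain subject-alternative-name mismatch: the serving certificate lists the IPs of the first NIC, but the client dialed the second. A sketch of the check the TLS library effectively performs (the SAN list and the dialed address are taken from the log line; real x509 verification does more than this, of course):

```python
import ipaddress

# IP SANs from the error message, and the address the apiserver dialed.
cert_ip_sans = ["::1", "127.0.0.1", "192.168.126.10"]
dialed = "192.168.144.10"

valid = any(ipaddress.ip_address(dialed) == ipaddress.ip_address(san)
            for san in cert_ip_sans)
print(valid)  # False: x509 verification fails and the subconn
              # stays in TRANSIENT_FAILURE
```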