Bug 1917115
| Summary: | Kube-api failing to connect to etcd while installing single node on vm with 2 nics |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Etcd |
| Status: | CLOSED DUPLICATE |
| Severity: | high |
| Priority: | high |
| Version: | 4.8 |
| Target Milestone: | --- |
| Target Release: | 4.8.0 |
| Reporter: | Igal Tsoiref <itsoiref> |
| Assignee: | Sam Batschelet <sbatsche> |
| QA Contact: | ge liu <geliu> |
| CC: | aos-bugs, mfojtik, skolicha, xxia |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| Last Closed: | 2021-03-05 19:34:34 UTC |
The kube-apiserver gets the etcd endpoints from the openshift-etcd/etcd-endpoints config map. Please verify its contents. Meanwhile, moving to etcd.

Igal, can you provide some logs for this issue? To understand the problem we would like to see the install-config and log bundles. Generally speaking, I would expect the machineCidr defined in install-config to allow render to understand which IP to utilize.

I think in general we have an issue with bootstrap member management. We have logic that performs certain actions when we scale down the bootstrap node. In this case we never take that step, so I believe we should review the full ramifications of not removing the bootstrap node.

The problem is that in the case of the none platform / single node, the user is not asked to provide a machine CIDR, but it looks like the installer sets some default. Maybe we must require that a CIDR be provided?

Yeah, if we have a node with multiple IPs, render needs a reasonable way to understand which IP it should use. I believe if machineCidr is not passed we take the first IP we find, which is not very durable IMO. The failure to make the correct decision here results in an apiserver with an invalid backend.

To clarify:
> Yeah if we have a node

That should read "bootstrap node".
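The IP-selection behavior described above can be sketched in a few lines. This is an illustration only, not the actual render code in the operator (which is Go); the function name is hypothetical, and the CIDR and addresses are taken from this bug's environment:

```python
import ipaddress

def pick_node_ip(node_ips, machine_cidr=None):
    """Pick the node IP to advertise for etcd.

    If a machine CIDR is given, prefer the first IP inside it;
    otherwise fall back to the first IP found -- the fragile
    behavior described in this bug.
    """
    if machine_cidr:
        net = ipaddress.ip_network(machine_cidr)
        for ip in node_ips:
            if ipaddress.ip_address(ip) in net:
                return ip
    # No CIDR (or no match): first IP wins, which may be the wrong NIC.
    return node_ips[0]

# A node with two NICs, as in this bug:
ips = ["192.168.144.10", "192.168.126.10"]
print(pick_node_ip(ips))                      # no CIDR: wrong NIC chosen
print(pick_node_ip(ips, "192.168.126.0/24"))  # with CIDR: correct NIC
```

Without a machine CIDR the choice degenerates to "first IP found", which is exactly the non-durable behavior complained about above.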
But etcd picks the right one. Why can't we just set the same IP in the config map?

Created attachment 1748501 [details]
must-gather-sno
Attached is a must-gather for reference from an SNO cluster that pivoted correctly.
Here is an example of the issues I referred to above. Here we see the annotation `alpha.installer.openshift.io/etcd-bootstrap: 192.168.126.10` still populated. This tells the system that the bootstrap node still exists. The apiserver uses this as well to populate the backend [1]. You can see the problem: the bootstrap IP was invalid, so one subconn in the etcd client for the apiserver would always be failing for each request.
```
- apiVersion: v1
data:
MTkyLjE2OC4xMjYuMTA: 192.168.126.10
kind: ConfigMap
metadata:
annotations:
alpha.installer.openshift.io/etcd-bootstrap: 192.168.126.10
name: etcd-endpoints
namespace: openshift-etcd
resourceVersion: "7459"
uid: 0546f04c-6de3-432b-b264-d602791f7e80
```
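The data key in that ConfigMap appears to be the node IP encoded as base64 with the padding stripped (ConfigMap keys cannot contain `=`), which can be verified with a couple of lines of Python:

```python
import base64

key = "MTkyLjE2OC4xMjYuMTA"
# Restore the stripped base64 padding before decoding.
padded = key + "=" * (-len(key) % 4)
print(base64.b64decode(padded).decode())  # 192.168.126.10
```

So the key and the value in the ConfigMap above both resolve to the same bootstrap IP, 192.168.126.10.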
Another issue is that we define at least one env var incorrectly. This basically says we have two etcd members and they are both the same. This should get cleaned up.
etcd-pod.yaml
```
- name: "ALL_ETCD_ENDPOINTS"
value: "https://192.168.126.10:2379,https://192.168.126.10:2379"
```
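Cleaning up that duplicated endpoint list amounts to deduplicating while preserving order. A minimal sketch (the function name is illustrative, not from the operator code; the input is the `ALL_ETCD_ENDPOINTS` value above):

```python
def dedupe_endpoints(value):
    """Split a comma-separated endpoint list and drop duplicates,
    keeping the original order."""
    seen = []
    for ep in value.split(","):
        if ep not in seen:
            seen.append(ep)
    return ",".join(seen)

all_etcd_endpoints = "https://192.168.126.10:2379,https://192.168.126.10:2379"
print(dedupe_endpoints(all_etcd_endpoints))  # https://192.168.126.10:2379
```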
[1] https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.7/pkg/operator/configobservation/etcdendpoints/observe_etcd_endpoints_test.go#L211
*** This bug has been marked as a duplicate of bug 1931658 ***
While installing single node (none platform) on VMs with 2 NICs, we saw that kube-api tries to talk to etcd with the wrong IP and gets an SSL error:

```
2021-01-05T18:44:05.497047195+00:00 stderr F I0105 18:44:05.496965 20 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc01188de00, {TRANSIENT_FAILURE connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, 192.168.126.10, ::1, not 192.168.144.10"}
```
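The handshake failure in that log line is a plain subject-alternative-name mismatch: the serving certificate lists the IPs of the first NIC, but the client dialed the second. A sketch of the check the TLS library effectively performs (the SAN list and the dialed address are taken from the log line; real x509 verification does more than this, of course):

```python
import ipaddress

# IP SANs from the error message, and the address the apiserver dialed.
cert_ip_sans = ["::1", "127.0.0.1", "192.168.126.10"]
dialed = "192.168.144.10"

valid = any(ipaddress.ip_address(dialed) == ipaddress.ip_address(san)
            for san in cert_ip_sans)
print(valid)  # False: x509 verification fails and the subconn
              # stays in TRANSIENT_FAILURE
```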