Bug 1950471
| Summary: | e2e-metal-ipi-ovn-dualstack is failing with etcd unable to bootstrap | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Stephen Benjamin <stbenjam> |
| Component: | Networking | Assignee: | Federico Paolinelli <fpaoline> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | achernet, aconstan, astoycos, augol, danw, keyoung, pablo.iranzo, pibanezr, rfreiman, saledort, sgreene, ydalal, ykashtan, yprokule, ysegev, zzhao |
| Version: | 4.8 | Keywords: | AutomationBlocker, TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:01:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1831748 | | |
cluster-etcd-operator claims it cannot connect to any etcds. Trying to connect to etcd on .20 gives "connection refused":
2021-04-17T19:03:50.542258872Z WARNING: 2021/04/17 19:03:50 grpc: addrConn.createTransport failed to connect to {https://192.168.111.20:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.20:2379: connect: connection refused". Reconnecting...
which seems to be because etcd on .20 is failing to start:
2021-04-17T19:23:50.055477614Z #### attempt 9
2021-04-17T19:23:50.057339651Z member={name="master-2.ostest.test.metalkube.org", peerURLs=[https://192.168.111.22:2380}, clientURLs=[https://192.168.111.22:2379]
2021-04-17T19:23:50.057339651Z member={name="etcd-bootstrap", peerURLs=[https://192.168.111.45:2380}, clientURLs=[https://192.168.111.45:2379]
2021-04-17T19:23:50.057339651Z member "https://192.168.111.20:2380" not found in member list, check operator logs for possible scaling problems
2021-04-17T19:23:50.057339651Z #### sleeping...
2021-04-17T19:23:51.057782760Z timed out
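One way to cross-check this on a live cluster is to ask the bootstrap etcd directly which members it has registered. A minimal sketch, assuming the bootstrap client endpoint from these logs (192.168.111.45:2379), the etcdctl sidecar container that the OpenShift etcd static pod normally carries, and that the ETCDCTL_* environment variables in that container already point at the serving certificates (all assumptions; pod and container names may differ in your environment):

# List the members known to the bootstrap etcd; master-0's peer URL
# (https://192.168.111.20:2380) should appear here once it has been added.
oc -n openshift-etcd exec -c etcdctl etcd-master-2.ostest.test.metalkube.org -- \
  etcdctl --endpoints=https://192.168.111.45:2379 member list -w table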
Trying to connect to etcd on .21 also gives connection refused
2021-04-17T19:03:50.390584234Z WARNING: 2021/04/17 19:03:50 grpc: addrConn.createTransport failed to connect to {https://192.168.111.21:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.21:2379: connect: connection refused". Reconnecting...
which makes sense since the must-gather does not even show there being any etcd on .21 (so why does cluster-etcd-operator think there is one?)
Trying to connect to etcd on .22 gives "operation was canceled"
2021-04-17T19:04:00.291207509Z WARNING: 2021/04/17 19:04:00 grpc: addrConn.createTransport failed to connect to {https://192.168.111.22:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.22:2379: operation was canceled". Reconnecting...
while the etcd logs on .22 show:
2021-04-17T19:04:00.399636356Z {"level":"warn","ts":"2021-04-17T19:04:00.399Z","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"192.168.111.22:41532","server-name":"","error":"EOF"}
So it seems to be intentionally refusing to talk to cluster-etcd-operator, not having a network problem.
The bootstrap etcd logs show lots of
{"level":"warn","ts":"2021-04-17T19:00:15.045Z","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"192.168.111.22:45472","server-name":"","error":"remote error: tls: bad certificate"}
for a while, and then later
{"level":"info","ts":"2021-04-17T19:00:26.547Z","caller":"membership/cluster.go:392","msg":"added member","cluster-id":"388387a7db949890","local-member-id":"85b84316b387be60","added-peer-id":"3c1dba9ae16d7132","added-peer-peer-urls":["https://192.168.111.22:2380"]}
...
{"level":"info","ts":"2021-04-17T19:00:26.568Z","caller":"rafthttp/transport.go:327","msg":"added remote peer","local-member-id":"85b84316b387be60","remote-peer-id":"3c1dba9ae16d7132","remote-peer-urls":["https://192.168.111.22:2380"]}
...
{"level":"warn","ts":"2021-04-17T19:00:27.734Z","caller":"etcdserver/cluster_util.go:315","msg":"failed to reach the peer URL","address":"https://192.168.111.22:2380/version","remote-member-id":"3c1dba9ae16d7132","error":"Get https://192.168.111.22:2380/version: dial tcp 192.168.111.22:2380: connect: connection refused"}
but the last error mentioning 192.168.111.22 is at 19:00:31, which is exactly when the current etcd pod on that node started. So it looks like master-2 _has_ successfully added itself to the etcd cluster.
So there does not appear to be any network-level problem here; master-0's etcd cannot start because the bootstrap etcd does not believe it should be part of the cluster, master-1 is not currently running etcd, and master-2's etcd has started and is in communication with the bootstrap etcd, but is choosing to reject connections from cluster-etcd-operator.
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack/1384044504545234944

In testing, I found that with dualstack IPv6 OVN the pod network could not communicate with hostNetwork on the same node. So if etcd-operator was scheduled on control-plane-0, it could not communicate with etcd on control-plane-0, but at the same time the etcd-operator pod on control-plane-0 could communicate with control-plane-{1,2}. I tested this by rshing into the etcd-operator container and performing curl GET requests against the etcd operand health endpoint (curl -vvv https://$IP/health -k); success for this test was an invalid-cert error, failure was a hanging timeout (see the sketch below). I then verified by killing etcd-operator so it rescheduled to another node, and the exact same pattern was repeatable on that node's local member.

This affects scaling because the operator fails health checks against the etcd member on its local node, resulting in a degraded status. If the operator is degraded, we will not scale up.

# etcd-operator on master-2 (hostIP: 192.168.111.22) failing healthcheck to etcd on master-2.
> 2021-04-19T08:22:10.406451158Z I0419 08:22:10.406419 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-etcd-operator", Name:"openshift-etcd-operator", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: master-2.ostest.test.metalkube.org

# master-2 etcd reports a successful healthcheck from quorum-guard a second later.
> 2021-04-19T08:22:11.157518868Z 2021-04-19 08:22:11.157469 I | etcdserver/api/etcdhttp: /health OK (status code 200)

# etcd-operator log showing the failed health check. Note this was canceled from a timeout rather than refused; refused would suggest a possible TLS issue or an unstarted etcd.
> 2021-04-19T08:22:42.241879843Z WARNING: 2021/04/19 08:22:42 grpc: addrConn.createTransport failed to connect to {https://192.168.111.22:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.22:2379: operation was canceled". Reconnecting...

This proves etcd is running fine on master-2, but the operator cannot communicate with it over hostNetwork. It is strange that quorum-guard, also scheduled on that node, does not appear to have a problem. So perhaps reviewing the routes between the two pods can help resolve the issue?

# NOTE: these are unauthenticated TCP probes from kubelet; we are working on fixing this, but the logs do not reflect an actual issue.
> 2021-04-19T08:22:19.953771624Z {"level":"warn","ts":"2021-04-19T08:22:19.953Z","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"192.168.111.22:43888","server-name":"","error":"EOF"}

So summing up, it looks like on dual-stack nodes, a pod cannot make a connection to its IPv4 node IP.
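A minimal sketch of the reachability test described above, assuming the etcd client port 2379 behind the /health endpoint and the etcd-operator deployment in the openshift-etcd-operator namespace (both assumptions beyond the text); the node IPs are the ones from this job. A TLS or certificate error means the TCP path to that node works; a hang or timeout reproduces the pod-to-hostNetwork failure:

# Open a shell in the etcd-operator pod.
oc -n openshift-etcd-operator rsh deployment/etcd-operator

# Inside the container, probe each control-plane node's etcd client endpoint.
for ip in 192.168.111.20 192.168.111.21 192.168.111.22; do
  echo "== ${ip} =="
  curl -kvs --max-time 5 "https://${ip}:2379/health" || true
done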
Facing the same issue installing a metal IPI cluster in dualstack mode.
From the etcd operator logs:
WARNING: 2021/04/26 08:43:54 grpc: addrConn.createTransport failed to connect to {https://10.19.17.204:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.19.17.204:2379: connect: connection refused". Reconnecting...
WARNING: 2021/04/26 08:43:54 grpc: addrConn.createTransport failed to connect to {https://10.19.17.162:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.19.17.162:2379: connect: connection refused". Reconnecting...
Any objection to adding the tag (IPI) to the whiteboard so we can easily filter all bugs that affect IPI installation on virt/BM?

Verified this bug on build 4.8.0-0.nightly-2021-05-10-151546:
# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.8.0-0.nightly-2021-05-10-151546 True False False 49m
baremetal 4.8.0-0.nightly-2021-05-10-151546 True False False 95m
cloud-credential 4.8.0-0.nightly-2021-05-10-151546 True False False 113m
cluster-autoscaler 4.8.0-0.nightly-2021-05-10-151546 True False False 97m
config-operator 4.8.0-0.nightly-2021-05-10-151546 True False False 99m
console 4.8.0-0.nightly-2021-05-10-151546 True False False 57m
csi-snapshot-controller 4.8.0-0.nightly-2021-05-10-151546 True False False 87m
dns 4.8.0-0.nightly-2021-05-10-151546 True False False 89m
etcd 4.8.0-0.nightly-2021-05-10-151546 True False False 93m
image-registry 4.8.0-0.nightly-2021-05-10-151546 True False False 46m
ingress 4.8.0-0.nightly-2021-05-10-151546 True False False 64m
insights 4.8.0-0.nightly-2021-05-10-151546 True False False 86m
kube-apiserver 4.8.0-0.nightly-2021-05-10-151546 True False False 87m
kube-controller-manager 4.8.0-0.nightly-2021-05-10-151546 True False False 91m
kube-scheduler 4.8.0-0.nightly-2021-05-10-151546 True False False 88m
kube-storage-version-migrator 4.8.0-0.nightly-2021-05-10-151546 True False False 91m
machine-api 4.8.0-0.nightly-2021-05-10-151546 True False False 86m
machine-approver 4.8.0-0.nightly-2021-05-10-151546 True False False 95m
machine-config 4.8.0-0.nightly-2021-05-10-151546 True False False 87m
marketplace 4.8.0-0.nightly-2021-05-10-151546 True False False 97m
monitoring 4.8.0-0.nightly-2021-05-10-151546 True False False 63m
network 4.8.0-0.nightly-2021-05-10-151546 True False False 99m
node-tuning 4.8.0-0.nightly-2021-05-10-151546 True False False 90m
openshift-apiserver 4.8.0-0.nightly-2021-05-10-151546 True False False 86m
openshift-controller-manager 4.8.0-0.nightly-2021-05-10-151546 True False False 88m
openshift-samples 4.8.0-0.nightly-2021-05-10-151546 True False False 86m
operator-lifecycle-manager 4.8.0-0.nightly-2021-05-10-151546 True False False 98m
operator-lifecycle-manager-catalog 4.8.0-0.nightly-2021-05-10-151546 True False False 96m
operator-lifecycle-manager-packageserver 4.8.0-0.nightly-2021-05-10-151546 True False False 88m
service-ca 4.8.0-0.nightly-2021-05-10-151546 True False False 99m
storage 4.8.0-0.nightly-2021-05-10-151546 True False False 99m
# oc get network -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
kind: Network
metadata:
creationTimestamp: "2021-05-11T00:53:05Z"
generation: 2
name: cluster
resourceVersion: "5074"
uid: 181ba4ca-8bb5-493f-affb-3985819768c4
spec:
clusterNetwork:
- cidr: 10.128.0.0/14
hostPrefix: 23
- cidr: fd01::/48
hostPrefix: 64
externalIP:
policy: {}
networkType: OVNKubernetes
serviceNetwork:
- 172.30.0.0/16
- fd02::/112
status:
clusterNetwork:
- cidr: 10.128.0.0/14
hostPrefix: 23
- cidr: fd01::/48
hostPrefix: 64
clusterNetworkMTU: 1400
networkType: OVNKubernetes
serviceNetwork:
- 172.30.0.0/16
- fd02::/112
kind: List
metadata:
resourceVersion: ""
selfLink: ""
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
Description of problem:

Dualstack is fairly reliably failing with etcd unable to come up:

$ oc logs -n openshift-etcd etcd-master-2.ostest.test.metalkube.org -c etcd
#### attempt 9
member={name="etcd-bootstrap", peerURLs=[https://192.168.111.27:2380}, clientURLs=[https://192.168.111.27:2379]
member={name="master-1.ostest.test.metalkube.org", peerURLs=[https://192.168.111.21:2380}, clientURLs=[https://192.168.111.21:2379]
member={name="master-0.ostest.test.metalkube.org", peerURLs=[https://192.168.111.20:2380}, clientURLs=[https://192.168.111.20:2379]
member "https://192.168.111.22:2380" not found in member list, check operator logs for possible scaling problems

Jobs here: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-16-110300

How reproducible:
Always or close to it

Steps to Reproduce:
1. Install a metal IPI environment in dualstack mode (IP_STACK=v4v6 with dev-scripts)

Actual results:
Installation fails with etcd errors

Additional info:
I also can provide a live reproducer environment.
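For step 1, a minimal sketch of the dev-scripts configuration that selects dual stack. IP_STACK=v4v6 comes from this report; the config file name (config_$USER.sh) and the other variables follow openshift-metal3/dev-scripts conventions and should be treated as assumptions:

# config_$USER.sh fragment for openshift-metal3/dev-scripts
export IP_STACK=v4v6              # deploy the cluster in dual-stack (IPv4 + IPv6) mode
export CI_TOKEN='<your-ci-token>' # placeholder; used by dev-scripts to pull release images
# then build the environment:
make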