Description of problem:

Dualstack is fairly reliably failing with etcd unable to come up:

$ oc logs -n openshift-etcd etcd-master-2.ostest.test.metalkube.org -c etcd
#### attempt 9
member={name="etcd-bootstrap", peerURLs=[https://192.168.111.27:2380], clientURLs=[https://192.168.111.27:2379]}
member={name="master-1.ostest.test.metalkube.org", peerURLs=[https://192.168.111.21:2380], clientURLs=[https://192.168.111.21:2379]}
member={name="master-0.ostest.test.metalkube.org", peerURLs=[https://192.168.111.20:2380], clientURLs=[https://192.168.111.20:2379]}
member "https://192.168.111.22:2380" not found in member list, check operator logs for possible scaling problems

Jobs here: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-16-110300

How reproducible:
Always, or close to it

Steps to Reproduce:
1. Install a metal IPI environment in dualstack mode (IP_STACK=v4v6 with dev-scripts)

Actual results:
Installation fails with etcd errors

Additional info:
I can also provide a live reproducer environment.
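For reference, the dev-scripts knob named in the reproduction step lives in the per-user config file. A minimal fragment (only the dual-stack setting is shown; every other required dev-scripts variable, such as the pull secret, is omitted from this sketch):

```shell
# Fragment of a dev-scripts config_$USER.sh; all other required settings
# (pull secret, working directory, etc.) are omitted from this sketch.
export IP_STACK=v4v6   # deploy the cluster dual-stack (IPv4 + IPv6)
```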
cluster-etcd-operator claims it cannot connect to any etcds.

Trying to connect to etcd on .20 gives "connection refused":

2021-04-17T19:03:50.542258872Z WARNING: 2021/04/17 19:03:50 grpc: addrConn.createTransport failed to connect to {https://192.168.111.20:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.20:2379: connect: connection refused". Reconnecting...

which seems to be because etcd on .20 is failing to start:

2021-04-17T19:23:50.055477614Z #### attempt 9
2021-04-17T19:23:50.057339651Z member={name="master-2.ostest.test.metalkube.org", peerURLs=[https://192.168.111.22:2380], clientURLs=[https://192.168.111.22:2379]}
2021-04-17T19:23:50.057339651Z member={name="etcd-bootstrap", peerURLs=[https://192.168.111.45:2380], clientURLs=[https://192.168.111.45:2379]}
2021-04-17T19:23:50.057339651Z member "https://192.168.111.20:2380" not found in member list, check operator logs for possible scaling problems
2021-04-17T19:23:50.057339651Z #### sleeping...
2021-04-17T19:23:51.057782760Z timed out

Trying to connect to etcd on .21 also gives "connection refused":

2021-04-17T19:03:50.390584234Z WARNING: 2021/04/17 19:03:50 grpc: addrConn.createTransport failed to connect to {https://192.168.111.21:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.21:2379: connect: connection refused". Reconnecting...

which makes sense, since the must-gather does not even show there being any etcd on .21 (so why does cluster-etcd-operator think there is one?).

Trying to connect to etcd on .22 gives "operation was canceled":

2021-04-17T19:04:00.291207509Z WARNING: 2021/04/17 19:04:00 grpc: addrConn.createTransport failed to connect to {https://192.168.111.22:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.22:2379: operation was canceled". Reconnecting...
while the etcd logs on .22 show:

2021-04-17T19:04:00.399636356Z {"level":"warn","ts":"2021-04-17T19:04:00.399Z","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"192.168.111.22:41532","server-name":"","error":"EOF"}

So it seems to be intentionally refusing to talk to cluster-etcd-operator, not having a network problem.

The bootstrap etcd logs show lots of

{"level":"warn","ts":"2021-04-17T19:00:15.045Z","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"192.168.111.22:45472","server-name":"","error":"remote error: tls: bad certificate"}

for a while, and then later

{"level":"info","ts":"2021-04-17T19:00:26.547Z","caller":"membership/cluster.go:392","msg":"added member","cluster-id":"388387a7db949890","local-member-id":"85b84316b387be60","added-peer-id":"3c1dba9ae16d7132","added-peer-peer-urls":["https://192.168.111.22:2380"]}
...
{"level":"info","ts":"2021-04-17T19:00:26.568Z","caller":"rafthttp/transport.go:327","msg":"added remote peer","local-member-id":"85b84316b387be60","remote-peer-id":"3c1dba9ae16d7132","remote-peer-urls":["https://192.168.111.22:2380"]}
...
{"level":"warn","ts":"2021-04-17T19:00:27.734Z","caller":"etcdserver/cluster_util.go:315","msg":"failed to reach the peer URL","address":"https://192.168.111.22:2380/version","remote-member-id":"3c1dba9ae16d7132","error":"Get https://192.168.111.22:2380/version: dial tcp 192.168.111.22:2380: connect: connection refused"}

but the last error mentioning 192.168.111.22 is at 19:00:31, which is exactly when the current etcd pod on that node started. So it looks like master-2 _has_ successfully added itself to the etcd cluster.
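Incidentally, these structured etcd log lines are easy to tally when combing a must-gather. A quick sketch (the sample file and its path are made up; the log lines are copied from the excerpts above — in practice you would point at the member log extracted from a must-gather):

```shell
# Write a few sample etcd log lines so the sketch is self-contained.
cat > /tmp/etcd-sample.log <<'EOF'
{"level":"warn","ts":"2021-04-17T19:00:15.045Z","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"192.168.111.22:45472","server-name":"","error":"remote error: tls: bad certificate"}
{"level":"warn","ts":"2021-04-19T08:22:19.953Z","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"192.168.111.22:43888","server-name":"","error":"EOF"}
{"level":"info","ts":"2021-04-17T19:00:26.547Z","caller":"membership/cluster.go:392","msg":"added member","cluster-id":"388387a7db949890","local-member-id":"85b84316b387be60","added-peer-id":"3c1dba9ae16d7132"}
EOF

# Tally "rejected connection" events by error string: plain-TCP probe EOFs
# are usually harmless, while "bad certificate" rejections are the
# interesting ones for this bug.
grep '"msg":"rejected connection"' /tmp/etcd-sample.log \
  | grep -o '"error":"[^"]*"' \
  | sort | uniq -c
```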
So there does not appear to be any network-level problem here: master-0's etcd cannot start because the bootstrap etcd does not believe it should be part of the cluster; master-1 is not currently running etcd; and master-2's etcd has started and is in communication with the bootstrap etcd, but is choosing to reject connections from cluster-etcd-operator.
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack/1384044504545234944

In testing, I found that with dualstack IPv6 OVN, the pod network could not communicate with hostNetwork on the same node. So if etcd-operator was scheduled on control-plane-0, it could not communicate with etcd on control-plane-0, yet the same etcd-operator pod could communicate with control-plane-{1,2}.

I tested this by rsh-ing into the etcd-operator container and performing curl GET requests against the etcd operand health endpoint: curl -vvv https://$IP/health -k. Success for this test was an invalid-certificate error; failure was a hanging timeout. I then verified by killing etcd-operator so it rescheduled to another node, and the exact same pattern was repeatable on that node.

This affects scaling because the operator will fail health checks to the etcd member on its local node, resulting in a Degraded status. If the operator is degraded, we will not scale up.

# etcd-operator on master-2 (hostIP: 192.168.111.22) failing a healthcheck to etcd on master-2:
> 2021-04-19T08:22:10.406451158Z I0419 08:22:10.406419 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-etcd-operator", Name:"openshift-etcd-operator", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: master-2.ostest.test.metalkube.org

# master-2 etcd reports a successful healthcheck from quorum-guard a second later:
> 2021-04-19T08:22:11.157518868Z 2021-04-19 08:22:11.157469 I | etcdserver/api/etcdhttp: /health OK (status code 200)

# etcd-operator log showing the failed health check; note the connection was canceled by a timeout rather than refused (refused would point to a TLS issue or an unstarted etcd):
> 2021-04-19T08:22:42.241879843Z WARNING: 2021/04/19 08:22:42 grpc: addrConn.createTransport failed to connect to {https://192.168.111.22:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.22:2379: operation was canceled". Reconnecting...

This proves etcd is running fine on master-2 but the operator cannot communicate with it over hostNetwork. Strangely, quorum-guard, also scheduled on that node, does not appear to have a problem, so perhaps reviewing the routes between the two pods can help resolve the issue?

# NOTE: these are unauthenticated TCP probes from kubelet; we are working on fixing this, but the log lines do not reflect an actual issue.
> 2021-04-19T08:22:19.953771624Z {"level":"warn","ts":"2021-04-19T08:22:19.953Z","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"192.168.111.22:43888","server-name":"","error":"EOF"}
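The manual curl check described above can be wrapped into a small helper that distinguishes the three outcomes seen in this bug: refused (nothing listening), timeout (the pod-to-local-hostIP symptom), and reachable-but-TLS-error (the "success" case). This is only an illustrative sketch — probe_member and its exit-code mapping are mine, not part of any OpenShift tooling; the exit codes are curl's documented ones (7 couldn't connect, 28 timed out, 35/56/60 TLS-layer failures after the TCP connection succeeded):

```shell
#!/bin/sh
# Hypothetical helper mirroring the manual test above: probe an etcd
# member's health endpoint and classify the outcome by curl's exit code.
# A certificate/TLS error still counts as "reachable", since TCP (and the
# TLS handshake attempt) got through to the server.
probe_member() {
  ip="$1"; port="${2:-2379}"
  curl -s -o /dev/null --max-time 5 "https://${ip}:${port}/health" 2>/dev/null
  case "$?" in
    0|35|56|60) echo "reachable" ;;  # server answered (possibly with a TLS error)
    7)          echo "refused" ;;    # nothing listening (etcd not started)
    28)         echo "timeout" ;;    # the hanging pod->local-hostIP symptom
    *)          echo "other" ;;
  esac
}

# Example: probing a local port with no listener reports "refused".
probe_member 127.0.0.1 1
```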
So, summing up: it looks like on dual-stack nodes, a pod on the cluster network cannot make a connection to its own node's IPv4 IP.
Facing the same issue installing a metal IPI cluster in dualstack mode. From the etcd-operator logs:

WARNING: 2021/04/26 08:43:54 grpc: addrConn.createTransport failed to connect to {https://10.19.17.204:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.19.17.204:2379: connect: connection refused". Reconnecting...
WARNING: 2021/04/26 08:43:54 grpc: addrConn.createTransport failed to connect to {https://10.19.17.162:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.19.17.162:2379: connect: connection refused". Reconnecting...
Any objection to adding the tag (IPI) to the whiteboard, so we can easily filter all bugs that affect IPI installation on virt/BM?
Verified this bug on build 4.8.0-0.nightly-2021-05-10-151546

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-2021-05-10-151546   True        False         False      49m
baremetal                                  4.8.0-0.nightly-2021-05-10-151546   True        False         False      95m
cloud-credential                           4.8.0-0.nightly-2021-05-10-151546   True        False         False      113m
cluster-autoscaler                         4.8.0-0.nightly-2021-05-10-151546   True        False         False      97m
config-operator                            4.8.0-0.nightly-2021-05-10-151546   True        False         False      99m
console                                    4.8.0-0.nightly-2021-05-10-151546   True        False         False      57m
csi-snapshot-controller                    4.8.0-0.nightly-2021-05-10-151546   True        False         False      87m
dns                                        4.8.0-0.nightly-2021-05-10-151546   True        False         False      89m
etcd                                       4.8.0-0.nightly-2021-05-10-151546   True        False         False      93m
image-registry                             4.8.0-0.nightly-2021-05-10-151546   True        False         False      46m
ingress                                    4.8.0-0.nightly-2021-05-10-151546   True        False         False      64m
insights                                   4.8.0-0.nightly-2021-05-10-151546   True        False         False      86m
kube-apiserver                             4.8.0-0.nightly-2021-05-10-151546   True        False         False      87m
kube-controller-manager                    4.8.0-0.nightly-2021-05-10-151546   True        False         False      91m
kube-scheduler                             4.8.0-0.nightly-2021-05-10-151546   True        False         False      88m
kube-storage-version-migrator              4.8.0-0.nightly-2021-05-10-151546   True        False         False      91m
machine-api                                4.8.0-0.nightly-2021-05-10-151546   True        False         False      86m
machine-approver                           4.8.0-0.nightly-2021-05-10-151546   True        False         False      95m
machine-config                             4.8.0-0.nightly-2021-05-10-151546   True        False         False      87m
marketplace                                4.8.0-0.nightly-2021-05-10-151546   True        False         False      97m
monitoring                                 4.8.0-0.nightly-2021-05-10-151546   True        False         False      63m
network                                    4.8.0-0.nightly-2021-05-10-151546   True        False         False      99m
node-tuning                                4.8.0-0.nightly-2021-05-10-151546   True        False         False      90m
openshift-apiserver                        4.8.0-0.nightly-2021-05-10-151546   True        False         False      86m
openshift-controller-manager               4.8.0-0.nightly-2021-05-10-151546   True        False         False      88m
openshift-samples                          4.8.0-0.nightly-2021-05-10-151546   True        False         False      86m
operator-lifecycle-manager                 4.8.0-0.nightly-2021-05-10-151546   True        False         False      98m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-05-10-151546   True        False         False      96m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-05-10-151546   True        False         False      88m
service-ca                                 4.8.0-0.nightly-2021-05-10-151546   True        False         False      99m
storage                                    4.8.0-0.nightly-2021-05-10-151546   True        False         False      99m

# oc get network -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Network
  metadata:
    creationTimestamp: "2021-05-11T00:53:05Z"
    generation: 2
    name: cluster
    resourceVersion: "5074"
    uid: 181ba4ca-8bb5-493f-affb-3985819768c4
  spec:
    clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 23
    - cidr: fd01::/48
      hostPrefix: 64
    externalIP:
      policy: {}
    networkType: OVNKubernetes
    serviceNetwork:
    - 172.30.0.0/16
    - fd02::/112
  status:
    clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 23
    - cidr: fd01::/48
      hostPrefix: 64
    clusterNetworkMTU: 1400
    networkType: OVNKubernetes
    serviceNetwork:
    - 172.30.0.0/16
    - fd02::/112
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438