When trying to restore IPv6 support for baremetal IPI, etcd has a pod CrashLooping:

  "message": "StaticPodsDegraded: pod/etcd-master-2.ostest.test.metalkube.org container "etcd" is not ready: CrashLoopBackOff"

It seems to have difficulty connecting to another host:

  = "transport: Error while dialing dial tcp [fd2e:6f44:5dd8:c956::16]:2379: connect: connection refused". Reconnecting...
  W0909 21:07:23.362759       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://[fd2e:6f44:5dd8:c956::14]:2379 <nil> 0 <nil>}.

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/11177/rehearse-11177-pull-ci-openshift-baremetal-operator-master-e2e-metal-ipi/1303784256455053312/artifacts/e2e-metal-ipi/baremetalds-devscripts-gather/ has a log-bundle tar with control plane and bootstrap logs.

https://github.com/openshift/release/pull/11177 is the PR where we're restoring IPv6 support. It's reliably failing with etcd issues.
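To make the symptom concrete: the failing operation underneath the grpc warning is a plain TCP dial to another member's IPv6 client port. A minimal sketch of the same check, written in Go only because that's what etcd/grpc happen to be; the address is copied from the log above and is purely illustrative:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// One of the etcd client endpoints from the log above; substitute any member.
	// IPv6 literals must be bracketed in host:port form.
	addr := "[fd2e:6f44:5dd8:c956::14]:2379"

	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		// "connection refused" means an RST came back (nothing listening
		// on the port, or something actively rejecting); a timeout would
		// instead suggest packets are being dropped in the network layer.
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}

Run from another master against each member endpoint, this distinguishes "reachable but refusing" from "unreachable", which matters for the etcd-vs-networking question below.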
etcd appears to be running on the masters and bound to the following addresses:

  https://[fd2e:6f44:5dd8:c956::14]:2379
  https://[fd2e:6f44:5dd8:c956::15]:2379
  https://[fd2e:6f44:5dd8:c956::16]:2379

I agree that there also appears to be some sort of connectivity issue between the members, and possibly between other processes and the members, although the process -> member problems could be a side effect of the etcd members' connectivity issues. What I'm not clear on is why this implies an issue with etcd. Assuming the etcd processes are bound to the correct interface (please help me verify that), the next place I'd go looking for trouble is the networking layer. Is there some other piece of evidence I'm missing or misunderstanding that supports the suspicion that etcd itself is the cause, especially in the context of a PR introducing significant networking changes? I'll keep digging a little to see if I can learn more about OVN.
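For the "bound to the correct interface" question, `ss -tlnp | grep 2379` on each master is the quickest check. As a self-contained alternative, here's a rough sketch one could run on a master (this is not anything the etcd operator does itself; the address is the first member endpoint above, illustrative only) that checks both whether the host owns the advertised address and whether the local client port answers:

package main

import (
	"fmt"
	"net"
	"strings"
	"time"
)

func main() {
	// The address this member advertises, taken from the endpoints above.
	advertised := "fd2e:6f44:5dd8:c956::14"

	// 1. Is the advertised address actually configured on a local interface?
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		fmt.Println("listing interface addresses:", err)
		return
	}
	owned := false
	for _, a := range addrs {
		if strings.HasPrefix(a.String(), advertised) {
			fmt.Println("address is configured locally:", a.String())
			owned = true
		}
	}
	if !owned {
		fmt.Println("advertised address is NOT configured on this host")
	}

	// 2. Does anything answer on the etcd client port at that address?
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(advertised, "2379"), 3*time.Second)
	if err != nil {
		fmt.Println("local dial failed:", err)
		return
	}
	conn.Close()
	fmt.Println("etcd client port answers locally")
}

If step 1 fails, the advertised address never got configured on the node; if step 2 fails while step 1 passes, etcd isn't listening where it claims to, and only then would I suspect etcd itself rather than the network.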
Since this bug was opened as "urgent" for the 4.6 release, I want to be explicit: if this is important to solve, I think some networking folks need to take a look in parallel, given the plausibility that this is a networking issue I won't be able to diagnose any time soon with my current knowledge of OVN.
Indeed it is; broken IPv6 needs to block the release, IMO. @Tim, could someone from OVN have a look at this? I'll try to get a host for you in the morning for live troubleshooting.
Yeah, this is probably OVN's fault
I'm seeing a similar issue with the etcd operator being in a degraded state on an OCP 4.6 IPv6 baremetal deployment. The log message in the crash-looping etcd master pod is different from what I see in this BZ (although it seems similar), so I have opened https://bugzilla.redhat.com/show_bug.cgi?id=1878215 in case the issue is different.
We're in the process of changing some stuff in ovn-kubernetes to fix various problems. One side effect of this should be fixing IPv6.
*** Bug 1878215 has been marked as a duplicate of this bug. ***
(In reply to Dan Winship from comment #7)
> We're in the process of changing some stuff in ovn-kubernetes to fix various
> problems. One side effect of this should be fixing IPv6.

Hey Dan - are there any Jira cards / GitHub PRs you can provide for the ovn-kubernetes fixes so that we can track them? Thanks
@arik, could you have your team help us verify this, since you have a PI BM env being deployed? Please re-assign to the right QA contact if this is not the case. Thanks
We have verified this issue is fixed in our BM environment. The cluster has deployed and the etcd operator comes up. We are seeing other issues in testing, but as this issue appears clear, we will open separate bugs.

OCP image: registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-27-075304

[root@sealusa10 tmp]# oc get pods -n openshift-etcd
NAME                                                                READY   STATUS      RESTARTS   AGE
etcd-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com                3/3     Running     0          29m
etcd-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com                3/3     Running     0          29m
etcd-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com                3/3     Running     0          28m
etcd-quorum-guard-99d56d7b8-wd4gv                                   1/1     Running     0          49m
etcd-quorum-guard-99d56d7b8-x8j6w                                   1/1     Running     0          49m
etcd-quorum-guard-99d56d7b8-z52q8                                   1/1     Running     0          49m
installer-2-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          50m
installer-2-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          51m
installer-2-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          50m
installer-3-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          29m
installer-3-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          45m
installer-3-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          29m
revision-pruner-2-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          50m
revision-pruner-2-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          50m
revision-pruner-2-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          49m
revision-pruner-3-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          29m
revision-pruner-3-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          29m
revision-pruner-3-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          28m

NAME   VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.6.0-0.nightly-2020-09-27-075304   True        False         False      51m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196