Bug 1877833

Summary: etcd is degraded on 4.6 baremetal IPv6 deployments
Product: OpenShift Container Platform
Reporter: Stephen Benjamin <stbenjam>
Component: Networking
Assignee: Dan Winship <danw>
Networking sub component: ovn-kubernetes
QA Contact: Chad Crum <ccrum>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: athomas, bjacot, ccrum, danw, mcornea, mvirgil, sbatsche, trozet, yprokule
Version: 4.6
Keywords: TestBlocker
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-10-27 16:39:36 UTC
Type: Bug

Description Stephen Benjamin 2020-09-10 14:26:11 UTC
When trying to restore IPv6 support for baremetal IPI, one of the etcd pods is CrashLooping:


"message": "StaticPodsDegraded: pod/etcd-master-2.ostest.test.metalkube.org container \"etcd\" is not ready: CrashLoopBackOff: 

It seems to have difficulty connecting to another host:

"message": "= \"transport: Error while dialing dial tcp [fd2e:6f44:5dd8:c956::16]:2379: connect: connection refused\". Reconnecting...\nW0909 21:07:23.362759       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://[fd2e:6f44:5dd8:c956::14]:2379  \u003cnil\u003e 0 \u003cnil\u003e}. 


https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/11177/rehearse-11177-pull-ci-openshift-baremetal-operator-master-e2e-metal-ipi/1303784256455053312/artifacts/e2e-metal-ipi/baremetalds-devscripts-gather/ has a log-bundle tar with control plane and bootstrap logs.

https://github.com/openshift/release/pull/11177 is the PR where we're restoring IPv6 support. It's reliably failing with etcd issues.

Comment 1 Dan Mace 2020-09-10 19:35:09 UTC
etcd appears to be running on the masters and bound to the following addresses:

https://[fd2e:6f44:5dd8:c956::14]:2379
https://[fd2e:6f44:5dd8:c956::15]:2379
https://[fd2e:6f44:5dd8:c956::16]:2379

I agree that there also appears to be some sort of connectivity issue between the members, and possibly between other processes and the members; although the process -> member problems could be a side effect of the etcd members' connectivity issues.

What I'm not clear on is why this implies an issue with etcd. Assuming the etcd processes are bound to the correct interface (please help me verify that), the next place I'd go looking for trouble is in the networking layer. Is there some other piece of evidence I'm missing or misunderstanding which supports a suspicion that etcd itself is the cause, especially in the context of a PR introducing significant networking changes?
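As a sketch of that verification, assuming shell access to a master (etcd runs as a static pod on the host network, so its listener is visible from the host):

[root@master-2 ~]# ss -tlnp | grep 2379

If etcd shows up listening on the expected IPv6 address (or the wildcard), the bind is fine and the suspicion shifts to the path between the nodes.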

I'll keep digging a little to see if I can learn more about OVN.

Comment 2 Dan Mace 2020-09-10 20:15:30 UTC
Since this bug was opened as "urgent" for the 4.6 release, I want to be explicitly clear: if this is important to solve, some networking folks need to take a look in parallel, given the plausibility that this is a networking issue that I won't be able to diagnose anytime soon with my current knowledge of OVN.

Comment 3 Stephen Benjamin 2020-09-10 23:31:49 UTC
Indeed it is; broken IPv6 needs to block the release, IMO. @Tim, could someone from OVN have a look at this?

I'll try to get a host for you in the morning for live troubleshooting.

Comment 4 Dan Winship 2020-09-11 10:56:38 UTC
Yeah, this is probably OVN's fault.

Comment 6 Chad Crum 2020-09-11 15:51:43 UTC
I'm seeing a similar issue with the etcd operator being in a degraded state on an OCP 4.6 IPv6 baremetal deployment.

The log message in the crash-looping etcd master pod is different from what I see in this BZ (although it seems similar), so I have opened https://bugzilla.redhat.com/show_bug.cgi?id=1878215 in case the issue is different.

Comment 7 Dan Winship 2020-09-14 11:37:56 UTC
We're in the process of changing some stuff in ovn-kubernetes to fix various problems. One side effect of this should be fixing IPv6.

Comment 8 Ben Bennett 2020-09-14 14:33:20 UTC
*** Bug 1878215 has been marked as a duplicate of this bug. ***

Comment 9 Chad Crum 2020-09-15 13:39:09 UTC
(In reply to Dan Winship from comment #7)
> We're in the process of changing some stuff in ovn-kubernetes to fix various
> problems. One side effect of this should be fixing IPv6.

Hey Dan - are there any Jira cards / GitHub PRs you can provide for the ovn-kubernetes fixes so that we can track them? Thanks

Comment 16 Anurag saxena 2020-09-28 14:25:01 UTC
@arik, could you help us have your team verify this, since you have an IPI BM env being deployed? Please re-assign to the right QA contact if this is not the case. Thanks

Comment 19 Chad Crum 2020-09-30 15:14:27 UTC
We have verified that this issue is fixed in our BM environment. The cluster deploys and the etcd operator comes up.

We are seeing other issues in testing, but as this issue appears resolved, we will open separate bugs for those.


OCP image: registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-27-075304


[root@sealusa10 tmp]# oc get pods -n openshift-etcd
NAME                                                                READY   STATUS      RESTARTS   AGE
etcd-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com                3/3     Running     0          29m
etcd-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com                3/3     Running     0          29m
etcd-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com                3/3     Running     0          28m
etcd-quorum-guard-99d56d7b8-wd4gv                                   1/1     Running     0          49m
etcd-quorum-guard-99d56d7b8-x8j6w                                   1/1     Running     0          49m
etcd-quorum-guard-99d56d7b8-z52q8                                   1/1     Running     0          49m
installer-2-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          50m
installer-2-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          51m
installer-2-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          50m
installer-3-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          29m
installer-3-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          45m
installer-3-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          29m
revision-pruner-2-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          50m
revision-pruner-2-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          50m
revision-pruner-2-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          49m
revision-pruner-3-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          29m
revision-pruner-3-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          29m
revision-pruner-3-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          28m


And the etcd cluster operator status:

NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd                                       4.6.0-0.nightly-2020-09-27-075304   True        False         False      51m
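For completeness, the same check can be run across all operators with the standard client command (a healthy cluster shows DEGRADED=False everywhere):

[root@sealusa10 tmp]# oc get clusteroperators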

Comment 22 errata-xmlrpc 2020-10-27 16:39:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196