Bug 1877833 - etcd is degraded on 4.6 baremetal IPv6 deployments
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Dan Winship
QA Contact: Chad Crum
URL:
Whiteboard:
Duplicates: 1878215
Depends On:
Blocks:
 
Reported: 2020-09-10 14:26 UTC by Stephen Benjamin
Modified: 2020-10-27 16:39 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:39:36 UTC
Target Upstream Version:




Links
- GitHub openshift/ovn-kubernetes pull 279 (closed): Bug 1880974: 9-21-2020 merge (last updated 2021-02-07 09:19:24 UTC)
- Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:39:52 UTC)

Description Stephen Benjamin 2020-09-10 14:26:11 UTC
When trying to restore IPv6 support for baremetal IPI, an etcd pod is crash-looping:


"message": "StaticPodsDegraded: pod/etcd-master-2.ostest.test.metalkube.org container \"etcd\" is not ready: CrashLoopBackOff: 

It seems to have difficulty connecting to another host:

"message": "= \"transport: Error while dialing dial tcp [fd2e:6f44:5dd8:c956::16]:2379: connect: connection refused\". Reconnecting...\nW0909 21:07:23.362759       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://[fd2e:6f44:5dd8:c956::14]:2379  \u003cnil\u003e 0 \u003cnil\u003e}. 
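As an aside for readers less familiar with IPv6 dial targets: the address in the grpc error uses the standard bracketed IPv6 literal form, where the brackets separate the address from the port. A minimal Python sketch (using only the standard library) shows how that endpoint breaks down:

```python
from urllib.parse import urlsplit

# The grpc dial target from the log above, written as an https URL.
# Bracketed IPv6 literals keep the address's colons distinct from the
# colon that introduces the port.
target = "https://[fd2e:6f44:5dd8:c956::14]:2379"
parts = urlsplit(target)
print(parts.hostname)  # fd2e:6f44:5dd8:c956::14 (brackets stripped)
print(parts.port)      # 2379
```

So the client is trying to reach port 2379 (the etcd client port) on the master's IPv6 address, and the connection is being refused at the TCP layer.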


https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/11177/rehearse-11177-pull-ci-openshift-baremetal-operator-master-e2e-metal-ipi/1303784256455053312/artifacts/e2e-metal-ipi/baremetalds-devscripts-gather/ has log-bundle tar with control plane and bootstrap logs.

https://github.com/openshift/release/pull/11177 is the PR where we're restoring IPv6 support. It's reliably failing with etcd issues.

Comment 1 Dan Mace 2020-09-10 19:35:09 UTC
etcd appears to be running on the masters and bound to the following addresses:

https://[fd2e:6f44:5dd8:c956::14]:2379
https://[fd2e:6f44:5dd8:c956::15]:2379
https://[fd2e:6f44:5dd8:c956::16]:2379

I agree that there also appears to be some sort of connectivity issue between the members, and possibly between other processes and the members; although the process -> member problems could be a side effect of the etcd members' connectivity issues.

What I'm not clear on is why this implies an issue with etcd. Assuming the etcd processes are bound to the correct interface (please help me verify that), the next place I'd go looking for trouble is the networking layer. Is there some other piece of evidence I'm missing or misunderstanding which supports the suspicion that etcd itself is the cause, especially in the context of a PR introducing significant networking changes?

I'll keep digging a little to see if I can learn more about OVN.
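[Editorial note: "connect: connection refused" in the grpc log is a TCP-layer RST, meaning nothing was listening (or reachable) at that address/port, rather than an etcd application error. A hedged, self-contained sketch of that failure mode, using the local IPv6 loopback (::1) as a stand-in for the master addresses:]

```python
import socket

# Reserve a free IPv6 port, then close the listener so nothing is bound there.
lst = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
lst.bind(("::1", 0))
port = lst.getsockname()[1]
lst.close()

# Dialing the now-unbound port fails the same way the grpc client did:
# the TCP SYN is answered with RST, i.e. "connect: connection refused".
try:
    socket.create_connection(("::1", port), timeout=2)
except ConnectionRefusedError as e:
    print("connection refused:", e)
```

If etcd were up and bound to that address, the connect would succeed even if etcd later rejected the request, which is why a refused connection points at binding or the network path rather than etcd internals.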

Comment 2 Dan Mace 2020-09-10 20:15:30 UTC
Since this bug was opened as "urgent" for the 4.6 release, I want to be explicitly clear: if this is important to solve, some networking folks need to take a look in parallel, given the plausibility that this is a networking issue I won't be able to diagnose anytime soon with my current knowledge of OVN.

Comment 3 Stephen Benjamin 2020-09-10 23:31:49 UTC
Indeed it is; broken IPv6 needs to block the release, IMO. @Tim, could someone from OVN have a look at this?

I'll try to get a host for you in the morning for live troubleshooting.

Comment 4 Dan Winship 2020-09-11 10:56:38 UTC
Yeah, this is probably OVN's fault.

Comment 6 Chad Crum 2020-09-11 15:51:43 UTC
I'm seeing a similar issue with the etcd operator being in a degraded state with ocp 4.6 ipv6 baremetal deployment. 

The log message in the crash-looping etcd master pod is different from what I see in this bz (although it seems similar), so I have opened https://bugzilla.redhat.com/show_bug.cgi?id=1878215 in case the issue is different.

Comment 7 Dan Winship 2020-09-14 11:37:56 UTC
We're in the process of changing some stuff in ovn-kubernetes to fix various problems. One side effect of this should be fixing IPv6.

Comment 8 Ben Bennett 2020-09-14 14:33:20 UTC
*** Bug 1878215 has been marked as a duplicate of this bug. ***

Comment 9 Chad Crum 2020-09-15 13:39:09 UTC
(In reply to Dan Winship from comment #7)
> We're in the process of changing some stuff in ovn-kubernetes to fix various
> problems. One side effect of this should be fixing IPv6.

Hey Dan - Are there any Jira cards / GitHub PRs you can provide for the ovn-kubernetes fixes so that we can track them? Thanks

Comment 16 Anurag saxena 2020-09-28 14:25:01 UTC
@arik, could your team help us verify this, since you have an IPI BM env being deployed? Please re-assign to the right QA contact if this is not the case. Thanks

Comment 19 Chad Crum 2020-09-30 15:14:27 UTC
We have verified this issue is fixed in our BM environment. The cluster deploys and the etcd operator comes up.

We are seeing other issues in testing, but as this issue appears resolved, we will open separate bugs for those.


OCP image: registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-27-075304


[root@sealusa10 tmp]# oc get pods -n openshift-etcd
NAME                                                                READY   STATUS      RESTARTS   AGE
etcd-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com                3/3     Running     0          29m
etcd-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com                3/3     Running     0          29m
etcd-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com                3/3     Running     0          28m
etcd-quorum-guard-99d56d7b8-wd4gv                                   1/1     Running     0          49m
etcd-quorum-guard-99d56d7b8-x8j6w                                   1/1     Running     0          49m
etcd-quorum-guard-99d56d7b8-z52q8                                   1/1     Running     0          49m
installer-2-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          50m
installer-2-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          51m
installer-2-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          50m
installer-3-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          29m
installer-3-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          45m
installer-3-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed   0          29m
revision-pruner-2-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          50m
revision-pruner-2-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          50m
revision-pruner-2-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          49m
revision-pruner-3-master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          29m
revision-pruner-3-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          29m
revision-pruner-3-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed   0          28m


NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd                                       4.6.0-0.nightly-2020-09-27-075304   True        False         False      51m

Comment 22 errata-xmlrpc 2020-10-27 16:39:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

