Bug 1849540

Summary:

Tests are failing due to constant etcd leader election changes

Product:

OpenShift Container Platform

Reporter:

OpenShift BugZilla Robot <openshift-bugzilla-robot>

Component:

Networking

Assignee:

Maysa Macedo <mdemaced>

Networking sub component:

kuryr

QA Contact:

GenadiC <gcheresh>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

urgent

CC:

ltomasbo, rlobillo

Version:

4.5

Target Milestone:

---

Target Release:

4.4.z

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

Clones:

1851338

Environment:

Last Closed:

2020-07-06 20:47:17 UTC

Bug Depends On:

1849051

Bug Blocks:

1851338

Attachments:

Description	Flags
NP test results (parallel 3)	none
ETCD metrics during test execution	none

Comment 3 rlobillo 2020-06-29 16:38:49 UTC

Created attachment 1699193
NP test results (parallel 3)

Comment 4 rlobillo 2020-06-29 16:44:31 UTC

Created attachment 1699194
ETCD metrics during test execution

Comment 5 rlobillo 2020-06-29 16:45:50 UTC

Verified on 4.4.0-0.nightly-2020-06-29-071755 with OSP16.1 (RHOS-16.1-RHEL-8-20200625.n.0) with OVN.

NP tests ran with parallelism set to 3 and produced the expected results. The run took 1h 9m (from Jun 29 11:21:54.059 UTC to Jun 29 12:30:44.356 UTC).

No etcd leader change observed during test execution:

(overcloud) [stack@undercloud-0 ~]$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'became leader'; done
Mon Jun 29 12:16:20 EDT 2020
# pod/etcd-ostest-l6xkl-master-0
# pod/etcd-ostest-l6xkl-master-1
# pod/etcd-ostest-l6xkl-master-2
2020-06-29 10:44:57.414342 I | raft: e2f2fc9d46f0eb5c became leader at term 4
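
Note that the grep above scans the whole log, so an election logged before the run (such as the one at 10:44:57) still appears in the output. A minimal sketch of narrowing the match to the test window follows. It assumes the etcd log timestamps are in UTC and keep the fixed `YYYY-MM-DD HH:MM:SS.ffffff` prefix shown above; the helper name `leader_changes_in_window` is hypothetical and not part of oc or etcd.

```shell
# Hypothetical helper: read etcd log lines on stdin and print only the
# 'became leader' lines whose timestamp lies inside [start, end].
# Timestamps are compared as strings, which is valid here because the
# prefix is fixed-width and sorts lexicographically.
#   $1 = window start, e.g. "2020-06-29 11:21:54"
#   $2 = window end,   e.g. "2020-06-29 12:30:44"
leader_changes_in_window() {
    awk -v s="$1" -v e="$2" '
        /became leader/ {
            ts = substr($0, 1, 19)          # "YYYY-MM-DD HH:MM:SS"
            if (ts >= s && ts <= e) print
        }'
}
```

It could be used in place of the plain grep, for example `oc logs $i -n openshift-etcd -c etcd | leader_changes_in_window "2020-06-29 11:21:54" "2020-06-29 12:30:44"`. An empty result for every pod would mean no election was logged inside the window.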

No timeouts on port 2380 during test execution (on day 2020-06-29); the heartbeat warnings listed below all fall outside the test window:

$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'timeout'; done
Mon Jun 29 12:17:02 EDT 2020
# pod/etcd-ostest-l6xkl-master-0
# pod/etcd-ostest-l6xkl-master-1
# pod/etcd-ostest-l6xkl-master-2
2020-06-29 10:45:37.233344 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 39.988079ms, to 45a5708909c764fb)
2020-06-29 10:45:37.233408 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.06606ms, to b417308bc0582b13)
2020-06-29 10:46:21.765507 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 245.167291ms, to 45a5708909c764fb)
2020-06-29 10:46:21.773092 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 252.830754ms, to b417308bc0582b13)
2020-06-29 10:46:22.695692 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 174.289407ms, to b417308bc0582b13)
2020-06-29 10:46:22.695734 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 174.344808ms, to 45a5708909c764fb)
2020-06-29 10:47:45.588674 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 97.406816ms, to b417308bc0582b13)
2020-06-29 10:47:45.588762 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 97.522375ms, to 45a5708909c764fb)
2020-06-29 10:53:26.456351 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 48.278383ms, to 45a5708909c764fb)
2020-06-29 10:53:26.456826 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 48.791253ms, to b417308bc0582b13)
2020-06-29 10:54:28.937621 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 196.725382ms, to 45a5708909c764fb)
2020-06-29 10:54:28.937697 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 196.825453ms, to b417308bc0582b13)
2020-06-29 11:04:32.088895 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.816893ms, to 45a5708909c764fb)
2020-06-29 11:04:32.088974 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.933145ms, to b417308bc0582b13)
2020-06-29 14:04:31.574536 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 6.48505ms, to 45a5708909c764fb)
2020-06-29 14:04:31.574930 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 6.903726ms, to b417308bc0582b13)
2020-06-29 16:04:31.264962 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 28.598692ms, to 45a5708909c764fb)
2020-06-29 16:04:31.265383 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 29.054097ms, to b417308bc0582b13)
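
Each warning above reports by how much a heartbeat exceeded the 100ms timeout. As a rough sketch, the largest overshoot can be pulled out of the log with awk. The helper name `max_heartbeat_overshoot` is hypothetical, and the pattern assumes the overshoot is printed in milliseconds as in the lines above; values printed in other units would be skipped.

```shell
# Hypothetical helper: read etcd log lines on stdin and print the timestamp
# and value of the largest heartbeat overshoot reported in milliseconds.
# Prints nothing if no matching warning is found.
max_heartbeat_overshoot() {
    awk '
        /failed to send out heartbeat on time/ && match($0, /for [0-9.]+ms/) {
            # Strip the leading "for " and the trailing "ms" from the match.
            v = substr($0, RSTART + 4, RLENGTH - 6)
            if (v + 0 > max + 0) { max = v; ts = substr($0, 1, 26) }
        }
        END { if (ts != "") print ts, max "ms" }'
}
```

For example, `oc logs $i -n openshift-etcd -c etcd | max_heartbeat_overshoot` prints one line per pod that logged such warnings, which makes it easy to compare the timestamp against the test window.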

Furthermore, etcd metrics show stable behaviour (report attached).

Comment 7 errata-xmlrpc 2020-07-06 20:47:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2786