1849540 – Tests are failing due to constant etcd leader elections changes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.4.z
Assignee:	Maysa Macedo
QA Contact:	GenadiC
Docs Contact:
URL:
Whiteboard:
Depends On:	1849051
Blocks:	1851338

Reported:	2020-06-22 07:42 UTC by OpenShift BugZilla Robot
Modified:	2020-07-06 20:47 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1851338
Environment:
Last Closed:	2020-07-06 20:47:17 UTC
Target Upstream Version:
Embargoed:

Attachments
NP test results (parallel 3) (913.19 KB, application/gzip) 2020-06-29 16:38 UTC, rlobillo
ETCD metrics during test execution (448.85 KB, application/pdf) 2020-06-29 16:44 UTC, rlobillo

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 678	0	None	closed	[release-4.4] [release-4.5] Bug 1849540: Split etcd sg rule ports range into different sg rules	2020-07-06 22:17:06 UTC
Red Hat Product Errata	RHBA-2020:2786	0	None	None	None	2020-07-06 20:47:39 UTC

Comment 3 rlobillo 2020-06-29 16:38:49 UTC

Created attachment 1699193
NP test results (parallel 3)

Comment 4 rlobillo 2020-06-29 16:44:31 UTC

Created attachment 1699194
ETCD metrics during test execution

Comment 5 rlobillo 2020-06-29 16:45:50 UTC

Verified on 4.4.0-0.nightly-2020-06-29-071755 with OSP16.1 (RHOS-16.1-RHEL-8-20200625.n.0) with OVN.

NP tests ran with parallelism set to 3 and produced the expected results. The run took 1 h 9 min (from Jun 29 11:21:54.059 UTC to Jun 29 12:30:44.356 UTC).

No etcd leader change observed during test execution:

(overcloud) [stack@undercloud-0 ~]$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'became leader'; done
Mon Jun 29 12:16:20 EDT 2020
# pod/etcd-ostest-l6xkl-master-0
# pod/etcd-ostest-l6xkl-master-1
# pod/etcd-ostest-l6xkl-master-2
2020-06-29 10:44:57.414342 I | raft: e2f2fc9d46f0eb5c became leader at term 4

No timeouts on port 2380 during test execution (on 2020-06-29); the heartbeat warnings below were all logged outside the 11:21:54 to 12:30:44 test window:

$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'timeout'; done
Mon Jun 29 12:17:02 EDT 2020
# pod/etcd-ostest-l6xkl-master-0
# pod/etcd-ostest-l6xkl-master-1
# pod/etcd-ostest-l6xkl-master-2
2020-06-29 10:45:37.233344 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 39.988079ms, to 45a5708909c764fb)
2020-06-29 10:45:37.233408 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.06606ms, to b417308bc0582b13)
2020-06-29 10:46:21.765507 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 245.167291ms, to 45a5708909c764fb)
2020-06-29 10:46:21.773092 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 252.830754ms, to b417308bc0582b13)
2020-06-29 10:46:22.695692 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 174.289407ms, to b417308bc0582b13)
2020-06-29 10:46:22.695734 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 174.344808ms, to 45a5708909c764fb)
2020-06-29 10:47:45.588674 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 97.406816ms, to b417308bc0582b13)
2020-06-29 10:47:45.588762 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 97.522375ms, to 45a5708909c764fb)
2020-06-29 10:53:26.456351 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 48.278383ms, to 45a5708909c764fb)
2020-06-29 10:53:26.456826 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 48.791253ms, to b417308bc0582b13)
2020-06-29 10:54:28.937621 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 196.725382ms, to 45a5708909c764fb)
2020-06-29 10:54:28.937697 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 196.825453ms, to b417308bc0582b13)
2020-06-29 11:04:32.088895 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.816893ms, to 45a5708909c764fb)
2020-06-29 11:04:32.088974 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.933145ms, to b417308bc0582b13)
2020-06-29 14:04:31.574536 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 6.48505ms, to 45a5708909c764fb)
2020-06-29 14:04:31.574930 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 6.903726ms, to b417308bc0582b13)
2020-06-29 16:04:31.264962 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 28.598692ms, to 45a5708909c764fb)
2020-06-29 16:04:31.265383 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 29.054097ms, to b417308bc0582b13)
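
To check mechanically that none of the grep hits fall inside the test run, the log lines can be filtered by timestamp. The following is only a sketch: `in_window` is a hypothetical helper (not part of `oc` or etcd), and it assumes the etcd log timestamps use the same time zone as the reported test window (UTC) and always have the "YYYY-MM-DD HH:MM:SS.ffffff" layout seen above.

```shell
# Hypothetical helper: keep only log lines whose timestamp lies in [start, end].
# Timestamps such as "2020-06-29 10:45:37.233344" sort correctly as plain
# strings, so no date parsing is needed. Write both bounds with the same
# six-digit fractional precision as the log, otherwise lines that sit
# exactly on a bound may be dropped.
in_window() {
  local start="$1" end="$2"
  awk -v s="$start" -v e="$end" '
    { ts = $1 " " $2 }      # date and time are the first two fields
    ts >= s && ts <= e      # print the line only when it is inside the window
  '
}

# Usage against a live cluster (same pipeline as above, plus the filter):
#   oc logs "$pod" -n openshift-etcd -c etcd | grep 'timeout' \
#     | in_window "2020-06-29 11:21:54.059000" "2020-06-29 12:30:44.356000"
```

With the window bounds from the test run, empty output means no matching log line was written while the tests were executing. The same filter can be put behind the 'became leader' grep.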

Furthermore, the etcd metrics show stable behaviour (report attached).

Comment 7 errata-xmlrpc 2020-07-06 20:47:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2786
