Bug 1849540 - Tests are failing due to constant etcd leader elections changes
Summary: Tests are failing due to constant etcd leader elections changes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.4.z
Assignee: Maysa Macedo
QA Contact: GenadiC
URL:
Whiteboard:
Depends On: 1849051
Blocks: 1851338
Reported: 2020-06-22 07:42 UTC by OpenShift BugZilla Robot
Modified: 2020-07-06 20:47 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1851338
Environment:
Last Closed: 2020-07-06 20:47:17 UTC
Target Upstream Version:
Embargoed:


Attachments
NP test results (parallel 3) (913.19 KB, application/gzip)
2020-06-29 16:38 UTC, rlobillo
ETCD metrics during test execution (448.85 KB, application/pdf)
2020-06-29 16:44 UTC, rlobillo


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 678 0 None closed [release-4.4] [release-4.5] Bug 1849540: Split etcd sg rule ports range into different sg rules 2020-07-06 22:17:06 UTC
Red Hat Product Errata RHBA-2020:2786 0 None None None 2020-07-06 20:47:39 UTC

Comment 3 rlobillo 2020-06-29 16:38:49 UTC
Created attachment 1699193
NP test results (parallel 3)

Comment 4 rlobillo 2020-06-29 16:44:31 UTC
Created attachment 1699194
ETCD metrics during test execution

Comment 5 rlobillo 2020-06-29 16:45:50 UTC
Verified on 4.4.0-0.nightly-2020-06-29-071755 with OSP16.1 (RHOS-16.1-RHEL-8-20200625.n.0) with OVN.

NP tests ran with parallelism set to 3 and gave the expected results. The run took 1h 9 minutes (from Jun 29 11:21:54.059 UTC to Jun 29 12:30:44.356 UTC).

No etcd leader change observed during test execution. The only election logged (10:44:57, term 4) predates the test run:

(overcloud) [stack@undercloud-0 ~]$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'became leader'; done
Mon Jun 29 12:16:20 EDT 2020
# pod/etcd-ostest-l6xkl-master-0
# pod/etcd-ostest-l6xkl-master-1
# pod/etcd-ostest-l6xkl-master-2
2020-06-29 10:44:57.414342 I | raft: e2f2fc9d46f0eb5c became leader at term 4
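
The check above can be narrowed to the test window. What follows is a minimal sketch, assuming the etcd log timestamps are UTC and that each line starts with the "YYYY-MM-DD HH:MM:SS.ffffff" form shown above. in_window is a hypothetical helper name, not part of oc or etcd.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oc or etcd): read log lines on stdin and
# print only those whose leading "YYYY-MM-DD HH:MM:SS.ffffff" timestamp lies
# inside [start, end]. Timestamps in this form sort correctly as plain strings.
in_window() {
  local start="$1" end="$2"
  awk -v s="$start" -v e="$end" '
    {
      ts = $1 " " $2                       # date and time fields of the log line
      if ((ts >= (s "")) && (ts <= (e ""))) print
    }'
}
```

For example, piping the output of oc logs $i -n openshift-etcd -c etcd | grep 'became leader' into in_window "2020-06-29 11:21:54.059" "2020-06-29 12:30:44.356" should print nothing if no election happened during the run.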

No timeouts on port 2380 during test execution (2020-06-29). The heartbeat warnings below all fall outside the test window (11:21:54 to 12:30:44 UTC):

$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'timeout'; done
Mon Jun 29 12:17:02 EDT 2020
# pod/etcd-ostest-l6xkl-master-0
# pod/etcd-ostest-l6xkl-master-1
# pod/etcd-ostest-l6xkl-master-2
2020-06-29 10:45:37.233344 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 39.988079ms, to 45a5708909c764fb)
2020-06-29 10:45:37.233408 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.06606ms, to b417308bc0582b13)
2020-06-29 10:46:21.765507 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 245.167291ms, to 45a5708909c764fb)
2020-06-29 10:46:21.773092 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 252.830754ms, to b417308bc0582b13)
2020-06-29 10:46:22.695692 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 174.289407ms, to b417308bc0582b13)
2020-06-29 10:46:22.695734 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 174.344808ms, to 45a5708909c764fb)
2020-06-29 10:47:45.588674 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 97.406816ms, to b417308bc0582b13)
2020-06-29 10:47:45.588762 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 97.522375ms, to 45a5708909c764fb)
2020-06-29 10:53:26.456351 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 48.278383ms, to 45a5708909c764fb)
2020-06-29 10:53:26.456826 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 48.791253ms, to b417308bc0582b13)
2020-06-29 10:54:28.937621 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 196.725382ms, to 45a5708909c764fb)
2020-06-29 10:54:28.937697 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 196.825453ms, to b417308bc0582b13)
2020-06-29 11:04:32.088895 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.816893ms, to 45a5708909c764fb)
2020-06-29 11:04:32.088974 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.933145ms, to b417308bc0582b13)
2020-06-29 14:04:31.574536 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 6.48505ms, to 45a5708909c764fb)
2020-06-29 14:04:31.574930 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 6.903726ms, to b417308bc0582b13)
2020-06-29 16:04:31.264962 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 28.598692ms, to 45a5708909c764fb)
2020-06-29 16:04:31.265383 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 29.054097ms, to b417308bc0582b13)
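
To see how large the heartbeat delays were, the warnings can be summarized per peer. This is a rough sketch, assuming the message format shown above ("... timeout for <N>ms, to <peer id>)"). summarize_overshoot is a hypothetical helper name, not an etcd or oc command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read etcd log lines on stdin and, for each peer ID,
# print how many late-heartbeat warnings were logged and the largest
# reported overshoot in milliseconds.
summarize_overshoot() {
  awk '
    /failed to send out heartbeat on time/ {
      # Isolate the fragment "for 245.167291ms, to 45a5708909c764fb"
      if (match($0, /for [0-9.]+ms, to [0-9a-f]+/)) {
        split(substr($0, RSTART, RLENGTH), f, " ")
        ms = f[2]
        sub(/ms,$/, "", ms)                # drop the trailing "ms,"
        peer = f[4]
        count[peer]++
        if (ms + 0 > max[peer] + 0) max[peer] = ms + 0
      }
    }
    END {
      for (p in count) printf "%s count=%d max_ms=%.3f\n", p, count[p], max[p]
    }' | sort
}
```

It can be fed directly from the command above, for instance oc logs $i -n openshift-etcd -c etcd | summarize_overshoot. Lines in other formats are ignored.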

Furthermore, etcd metrics show stable behaviour (report attached).

Comment 7 errata-xmlrpc 2020-07-06 20:47:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2786

