Bug 1849540

Summary: Tests are failing due to constant etcd leader election changes
Product: OpenShift Container Platform Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Networking Assignee: Maysa Macedo <mdemaced>
Networking sub component: kuryr QA Contact: GenadiC <gcheresh>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: urgent CC: ltomasbo, rlobillo
Version: 4.5   
Target Milestone: ---   
Target Release: 4.4.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1851338 (view as bug list) Environment:
Last Closed: 2020-07-06 20:47:17 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1849051    
Bug Blocks: 1851338    
Attachments:
Description Flags
NP test results (parallel 3) none
ETCD metrics during test execution none

Comment 3 rlobillo 2020-06-29 16:38:49 UTC
Created attachment 1699193 [details]
NP test results (parallel 3)

Comment 4 rlobillo 2020-06-29 16:44:31 UTC
Created attachment 1699194 [details]
ETCD metrics during test execution

Comment 5 rlobillo 2020-06-29 16:45:50 UTC
Verified on 4.4.0-0.nightly-2020-06-29-071755 with OSP16.1 (RHOS-16.1-RHEL-8-20200625.n.0) with OVN.

NP tests were run with parallelism set to 3 and produced the expected results. The run took 1 h 9 min (from Jun 29 11:21:54.059 UTC to Jun 29 12:30:44.356 UTC).

No etcd leader change observed during test execution (the only election logged, at 10:44:57, predates the test run):

(overcloud) [stack@undercloud-0 ~]$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'became leader'; done
Mon Jun 29 12:16:20 EDT 2020
# pod/etcd-ostest-l6xkl-master-0
# pod/etcd-ostest-l6xkl-master-1
# pod/etcd-ostest-l6xkl-master-2
2020-06-29 10:44:57.414342 I | raft: e2f2fc9d46f0eb5c became leader at term 4
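
The check above depends on comparing timestamps by eye. A minimal sketch of restricting the grep output to the test window, assuming the etcd log timestamps are in UTC and keep the `YYYY-MM-DD HH:MM:SS` prefix shown above. `in_window` is a hypothetical helper introduced here for illustration, not part of `oc` or etcd:

```shell
# in_window START END
# Hypothetical helper: reads log lines on stdin and prints only those whose
# leading "YYYY-MM-DD HH:MM:SS" timestamp lies within [START, END].
# Assumes START and END use the same format and time zone as the log lines.
in_window() {
  awk -v start="$1" -v end="$2" '
    # Only consider lines that begin with a date; pod headers are skipped.
    $1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
      ts = $1 " " substr($2, 1, 8)   # drop the fractional seconds
      if (ts >= start && ts <= end) print
    }
  '
}
```

For example, `oc logs $i -n openshift-etcd -c etcd | grep 'became leader' | in_window '2020-06-29 11:21:54' '2020-06-29 12:30:44'` should print nothing for a pod if no election happened during the run.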

No timeouts on port 2380 during test execution (on 2020-06-29); the heartbeat warnings logged below all fall outside the 11:21:54 to 12:30:44 test window:

$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'timeout'; done
Mon Jun 29 12:17:02 EDT 2020
# pod/etcd-ostest-l6xkl-master-0
# pod/etcd-ostest-l6xkl-master-1
# pod/etcd-ostest-l6xkl-master-2
2020-06-29 10:45:37.233344 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 39.988079ms, to 45a5708909c764fb)
2020-06-29 10:45:37.233408 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.06606ms, to b417308bc0582b13)
2020-06-29 10:46:21.765507 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 245.167291ms, to 45a5708909c764fb)
2020-06-29 10:46:21.773092 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 252.830754ms, to b417308bc0582b13)
2020-06-29 10:46:22.695692 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 174.289407ms, to b417308bc0582b13)
2020-06-29 10:46:22.695734 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 174.344808ms, to 45a5708909c764fb)
2020-06-29 10:47:45.588674 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 97.406816ms, to b417308bc0582b13)
2020-06-29 10:47:45.588762 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 97.522375ms, to 45a5708909c764fb)
2020-06-29 10:53:26.456351 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 48.278383ms, to 45a5708909c764fb)
2020-06-29 10:53:26.456826 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 48.791253ms, to b417308bc0582b13)
2020-06-29 10:54:28.937621 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 196.725382ms, to 45a5708909c764fb)
2020-06-29 10:54:28.937697 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 196.825453ms, to b417308bc0582b13)
2020-06-29 11:04:32.088895 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.816893ms, to 45a5708909c764fb)
2020-06-29 11:04:32.088974 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.933145ms, to b417308bc0582b13)
2020-06-29 14:04:31.574536 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 6.48505ms, to 45a5708909c764fb)
2020-06-29 14:04:31.574930 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 6.903726ms, to b417308bc0582b13)
2020-06-29 16:04:31.264962 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 28.598692ms, to 45a5708909c764fb)
2020-06-29 16:04:31.265383 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 29.054097ms, to b417308bc0582b13)

Furthermore, etcd metrics show stable behaviour (report attached).

Comment 7 errata-xmlrpc 2020-07-06 20:47:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2786