Bug 1849540
Summary: | Tests are failing due to constant etcd leader elections changes | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: | Networking | Assignee: | Maysa Macedo <mdemaced>
Networking sub component: | kuryr | QA Contact: | GenadiC <gcheresh>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | urgent | CC: | ltomasbo, rlobillo
Version: | 4.5 | |
Target Milestone: | --- | |
Target Release: | 4.4.z | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1851338 | Environment: |
Last Closed: | 2020-07-06 20:47:17 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1849051 | |
Bug Blocks: | 1851338 | |
Attachments: | | |
Created attachment 1699194 [details]
ETCD metrics during test execution
Verified on 4.4.0-0.nightly-2020-06-29-071755 with OSP16.1 (RHOS-16.1-RHEL-8-20200625.n.0) with OVN.

NP tests ran with parallelism set to 3 and gave the expected results. The run took 1h 9 minutes (from Jun 29 11:21:54.059 UTC to Jun 29 12:30:44.356 UTC).

No etcd leader change observed during test execution:

(overcloud) [stack@undercloud-0 ~]$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'became leader'; done
Mon Jun 29 12:16:20 EDT 2020
# pod/etcd-ostest-l6xkl-master-0
# pod/etcd-ostest-l6xkl-master-1
# pod/etcd-ostest-l6xkl-master-2
2020-06-29 10:44:57.414342 I | raft: e2f2fc9d46f0eb5c became leader at term 4

No timeouts on port 2380 during test execution (on day 2020-06-29):

$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'timeout'; done
Mon Jun 29 12:17:02 EDT 2020
# pod/etcd-ostest-l6xkl-master-0
# pod/etcd-ostest-l6xkl-master-1
# pod/etcd-ostest-l6xkl-master-2
2020-06-29 10:45:37.233344 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 39.988079ms, to 45a5708909c764fb)
2020-06-29 10:45:37.233408 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.06606ms, to b417308bc0582b13)
2020-06-29 10:46:21.765507 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 245.167291ms, to 45a5708909c764fb)
2020-06-29 10:46:21.773092 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 252.830754ms, to b417308bc0582b13)
2020-06-29 10:46:22.695692 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 174.289407ms, to b417308bc0582b13)
2020-06-29 10:46:22.695734 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 174.344808ms, to 45a5708909c764fb)
2020-06-29 10:47:45.588674 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 97.406816ms, to b417308bc0582b13)
2020-06-29 10:47:45.588762 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 97.522375ms, to 45a5708909c764fb)
2020-06-29 10:53:26.456351 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 48.278383ms, to 45a5708909c764fb)
2020-06-29 10:53:26.456826 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 48.791253ms, to b417308bc0582b13)
2020-06-29 10:54:28.937621 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 196.725382ms, to 45a5708909c764fb)
2020-06-29 10:54:28.937697 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 196.825453ms, to b417308bc0582b13)
2020-06-29 11:04:32.088895 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.816893ms, to 45a5708909c764fb)
2020-06-29 11:04:32.088974 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.933145ms, to b417308bc0582b13)
2020-06-29 14:04:31.574536 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 6.48505ms, to 45a5708909c764fb)
2020-06-29 14:04:31.574930 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 6.903726ms, to b417308bc0582b13)
2020-06-29 16:04:31.264962 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 28.598692ms, to 45a5708909c764fb)
2020-06-29 16:04:31.265383 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 29.054097ms, to b417308bc0582b13)

Furthermore, etcd metrics show stable behaviour (report attached).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2786
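
The verification above depends on comparing etcd log timestamps against the test window (Jun 29 11:21:54 UTC to 12:30:44 UTC) by eye. A minimal sketch of how that check could be automated follows. The helper name in_window is hypothetical and not part of any tool, and the sketch assumes the etcd log timestamps are in UTC and use the fixed "YYYY-MM-DD HH:MM:SS" prefix seen in the output above, so that plain string comparison orders them correctly.

```shell
#!/bin/sh
# in_window START END (hypothetical helper, not part of oc or etcd)
# Reads log lines on stdin and prints only those whose leading
# "YYYY-MM-DD HH:MM:SS" timestamp lies within [START, END].
# Assumption: timestamps are fixed-width and in the same timezone as
# START and END, so lexicographic comparison matches time order.
in_window() {
    awk -v start="$1" -v end="$2" '
        {
            # Join date and time fields, then drop fractional seconds.
            ts = substr($1 " " $2, 1, 19)
            if (ts >= start && ts <= end) print
        }'
}

# Demo with two warnings copied from the log output in this report.
hits=$(in_window '2020-06-29 11:21:54' '2020-06-29 12:30:44' <<'EOF' | awk 'END { print NR }'
2020-06-29 11:04:32.088895 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 40.816893ms, to 45a5708909c764fb)
2020-06-29 14:04:31.574536 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 6.48505ms, to 45a5708909c764fb)
EOF
)
echo "warnings inside the test window: $hits"
```

Against a live cluster, the same filter could be appended to the commands used above, for example: oc logs $i -n openshift-etcd -c etcd | grep 'became leader' | in_window '2020-06-29 11:21:54' '2020-06-29 12:30:44'. Empty output for every pod would then correspond to "no leader change during test execution". Lines without a leading timestamp, such as the "# pod/..." headers, are dropped by the filter.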
Created attachment 1699193 [details] NP test results (parallel 3)