Bug 1851338

Summary: Tests are failing due to constant etcd leader elections changes
Product: OpenShift Container Platform
Reporter: Maysa Macedo <mdemaced>
Component: Networking
Assignee: Maysa Macedo <mdemaced>
Networking sub component: kuryr
QA Contact: GenadiC <gcheresh>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: urgent
CC: cdaley, gcheresh, ltomasbo, openshift-bugzilla-robot, rlobillo
Version: 4.5
Target Milestone: ---
Target Release: 4.3.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1849540
Environment:
Last Closed: 2020-07-14 16:11:52 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1849540
Bug Blocks:
Attachments:
Description                           Flags
NP test results                       none
ETCD metrics during test execution    none

Comment 3 Jan Safranek 2020-07-02 07:59:59 UTC
*** Bug 1852990 has been marked as a duplicate of this bug. ***

Comment 6 rlobillo 2020-07-07 10:08:58 UTC
Created attachment 1700134 [details]
NP test results

Comment 7 rlobillo 2020-07-07 10:11:58 UTC
Created attachment 1700135 [details]
ETCD metrics during test execution

Comment 8 rlobillo 2020-07-07 10:13:00 UTC
Verified on OCP4.3.0-0.nightly-2020-07-06-074036 with OSP16.1
(RHOS-16.1-RHEL-8-20200701.n.0) with OVN.

Ingress rules to etcd are now split into two single-port rules instead of one port range:

(shiftstack) [stack@undercloud-0 ~]$ openstack security group show ostest-h5nsm-master |
grep 10.196.0.0 | grep -e 2379 -e 2380
| | created_at='2020-07-06T14:19:41Z', direction='ingress', ethertype='IPv4',
id='45689162-6486-4a62-988e-7fc75f3b9178', port_range_max='2379', port_range_min='2379',
protocol='tcp', remote_ip_prefix='10.196.0.0/16', updated_at='2020-07-06T14:19:41Z' |
| | created_at='2020-07-06T14:19:41Z', direction='ingress', ethertype='IPv4',
id='b7230eda-b467-4ea7-8b1e-1aa48fae8818', port_range_max='2380',
port_range_min='2380', protocol='tcp', remote_ip_prefix='10.196.0.0/16',
updated_at='2020-07-06T14:19:41Z' |
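
The check above could be scripted. What follows is only a minimal sketch, assuming the rule text is piped in using the same key='value' form shown in the output. The function name single_port_rules is hypothetical and is not part of the openstack CLI.

```shell
# Print the port of every rule whose port_range_min equals port_range_max.
# Reads `openstack security group show` rule text on stdin (hypothetical helper).
single_port_rules() {
  { tr '\n' ' '; echo; } |            # undo the line wrapping of the table cell
    awk '{
      # Each rule starts with its created_at field.
      n = split($0, rules, "created_at=")
      for (i = 2; i <= n; i++) {
        min = ""; max = ""
        # "port_range_min=" is 15 characters, plus the opening quote.
        if (match(rules[i], /port_range_min=.[0-9]+/))
          min = substr(rules[i], RSTART + 16, RLENGTH - 16)
        if (match(rules[i], /port_range_max=.[0-9]+/))
          max = substr(rules[i], RSTART + 16, RLENGTH - 16)
        if (min != "" && min == max) print min
      }
    }'
}
```

With the output above piped in after the two grep filters, this would print 2379 and 2380 on separate lines. A single rule covering the range 2379 to 2380 would print nothing.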

NP tests were run with parallelism set to 2 and produced the expected results.

No etcd leader change was observed during test execution (on 2020-07-06 from 17:00 onwards):

(overcloud) [stack@undercloud-0 ~]$ for i in $(oc get pods -n openshift-etcd -l k8s-app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd-member | grep 'became leader'; done
# pod/etcd-member-ostest-h5nsm-master-0
2020-07-06 14:17:22.082454 I | raft: 7e92ed1f2b132c63 became leader at term 8
# pod/etcd-member-ostest-h5nsm-master-1
# pod/etcd-member-ostest-h5nsm-master-2
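
The single 'became leader' line above is stamped 14:17:22, before the test window. To make that filtering explicit, here is a minimal sketch, assuming the log timestamps and the cutoff are in the same time zone and that every line starts with "YYYY-MM-DD HH:MM:SS" as in the output above. The function name leader_changes_since is hypothetical.

```shell
# Print "became leader" log lines stamped at or after a cutoff.
# Reads etcd-member log text on stdin; $1 is the cutoff, "YYYY-MM-DD HH:MM:SS".
leader_changes_since() {
  # Date and time are the first two fields; ISO-style timestamps
  # sort correctly under plain string comparison.
  awk -v since="$1" '/became leader/ && ($1 " " $2) >= since'
}
```

For example, `oc logs $i -n openshift-etcd -c etcd-member | leader_changes_since '2020-07-06 17:00:00'` would print nothing for the logs shown here, since the only election predates the cutoff.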

No timeouts on port 2380 during test execution:

(overcloud) [stack@undercloud-0 ~]$ for i in $(oc get pods -n openshift-etcd -l k8s-app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd-member | grep 'timeout'; done
# pod/etcd-member-ostest-h5nsm-master-0
# pod/etcd-member-ostest-h5nsm-master-1
# pod/etcd-member-ostest-h5nsm-master-2

Furthermore, the etcd metrics (attached) show stable behaviour during the same period.

Comment 10 errata-xmlrpc 2020-07-14 16:11:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2872