Bug 1851338

Summary:

Tests are failing due to constant etcd leader election changes

Product:

OpenShift Container Platform

Reporter:

Maysa Macedo <mdemaced>

Component:

Networking

Assignee:

Maysa Macedo <mdemaced>

Networking sub component:

kuryr

QA Contact:

GenadiC <gcheresh>

Status:

CLOSED ERRATA

Severity:

high

Priority:

urgent

CC:

cdaley, gcheresh, ltomasbo, openshift-bugzilla-robot, rlobillo

Version:

4.5

Target Milestone:

---

Target Release:

4.3.z

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

No Doc Update

Clone Of:

1849540

Last Closed:

2020-07-14 16:11:52 UTC

Bug Depends On:

1849540

Bug Blocks:

Attachments:

Description	Flags
NP test results	none
ETCD metrics during test execution	none

Comment 3 Jan Safranek 2020-07-02 07:59:59 UTC

*** Bug 1852990 has been marked as a duplicate of this bug. ***

Comment 6 rlobillo 2020-07-07 10:08:58 UTC

Created attachment 1700134 [details]
NP test results

Comment 7 rlobillo 2020-07-07 10:11:58 UTC

Created attachment 1700135 [details]
ETCD metrics during test execution

Comment 8 rlobillo 2020-07-07 10:13:00 UTC

Verified on OCP 4.3.0-0.nightly-2020-07-06-074036 with OSP 16.1
(RHOS-16.1-RHEL-8-20200701.n.0) and OVN.

Ingress to etcd is now allowed by two separate single-port rules (2379 and 2380) instead of one rule covering a port range:

(shiftstack) [stack@undercloud-0 ~]$ openstack security group show ostest-h5nsm-master | grep 10.196.0.0 | grep -e 2379 -e 2380
| | created_at='2020-07-06T14:19:41Z', direction='ingress', ethertype='IPv4', id='45689162-6486-4a62-988e-7fc75f3b9178', port_range_max='2379', port_range_min='2379', protocol='tcp', remote_ip_prefix='10.196.0.0/16', updated_at='2020-07-06T14:19:41Z' |
| | created_at='2020-07-06T14:19:41Z', direction='ingress', ethertype='IPv4', id='b7230eda-b467-4ea7-8b1e-1aa48fae8818', port_range_max='2380', port_range_min='2380', protocol='tcp', remote_ip_prefix='10.196.0.0/16', updated_at='2020-07-06T14:19:41Z' |
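
The check above can be scripted. The following is only a sketch, not part of the verification that was actually run: the function name etcd_rules_are_split is hypothetical, and it assumes the input is the text printed by "openstack security group show", already filtered by remote prefix as in the command above, with the attribute order shown there (port_range_max before port_range_min). It rejects only the specific 2379-2380 range rule and does not look for wider ranges.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the bug report): reads the text printed by
# `openstack security group show` on stdin and succeeds only if etcd ports
# 2379 and 2380 each have their own single-port rule.
etcd_rules_are_split() {
  local flat port
  # Join lines so a rule whose attributes were wrapped when pasted still matches.
  flat=$(tr '\n' ' ')
  # A rule with max=2380 and min=2379 is the port range this check rejects.
  if printf '%s\n' "$flat" |
    grep -Eq "port_range_max='2380', +port_range_min='2379'"; then
    return 1
  fi
  # Each port must appear in a rule whose min and max are both that port.
  for port in 2379 2380; do
    printf '%s\n' "$flat" |
      grep -Eq "port_range_max='${port}', +port_range_min='${port}'" || return 1
  done
  return 0
}
```

Possible usage, reusing the command from this comment: openstack security group show ostest-h5nsm-master | grep 10.196.0.0 | etcd_rules_are_split && echo "etcd rules are split"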

NP tests ran with parallelism set to 2 and produced the expected results.

No etcd leader change was observed during test execution (on 2020-07-06, from 17:00 onwards):

(overcloud) [stack@undercloud-0 ~]$ for i in $(oc get pods -n openshift-etcd -l k8s-app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd-member | grep 'became leader'; done
# pod/etcd-member-ostest-h5nsm-master-0
2020-07-06 14:17:22.082454 I | raft: 7e92ed1f2b132c63 became leader at term 8
# pod/etcd-member-ostest-h5nsm-master-1
# pod/etcd-member-ostest-h5nsm-master-2

No timeouts on port 2380 during test execution:

(overcloud) [stack@undercloud-0 ~]$ for i in $(oc get pods -n openshift-etcd -l k8s-app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd-member | grep 'timeout'; done
# pod/etcd-member-ostest-h5nsm-master-0
# pod/etcd-member-ostest-h5nsm-master-1
# pod/etcd-member-ostest-h5nsm-master-2

Furthermore, the etcd metrics show stable behaviour during the same period (see the attached metrics).
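
The two log checks in this comment match on a pattern only, so the election logged at 14:17:22, before the test window, still shows up in the output. A sketch of a helper that counts only matches at or after a given start time follows. The function name count_since is hypothetical, and it assumes every log line begins with a "YYYY-MM-DD HH:MM:SS" timestamp as in the output above, and that the start time is given in the same time zone as the logs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the bug report): counts log lines on stdin
# that contain a fixed string and are timestamped at or after a start time.
# Usage: count_since 'YYYY-MM-DD HH:MM:SS' 'text to look for'
count_since() {
  # Timestamps in this layout sort correctly as plain strings, so a string
  # comparison of "date time" against the start value is enough.
  awk -v start="$1" -v pat="$2" '
    index($0, pat) && ($1 " " $2) >= start { n++ }
    END { print n + 0 }   # print 0 rather than an empty line when nothing matched
  '
}
```

Possible usage for one pod, reusing the commands from this comment: oc logs pod/etcd-member-ostest-h5nsm-master-0 -n openshift-etcd -c etcd-member | count_since '2020-07-06 17:00:00' 'became leader'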

Comment 10 errata-xmlrpc 2020-07-14 16:11:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2872