1847313 – Tests are failing due to constant etcd leader elections changes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Maysa Macedo
QA Contact:	GenadiC
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1849051

Reported:	2020-06-16 08:00 UTC by Maysa Macedo
Modified:	2020-10-27 16:07 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-27 16:07:13 UTC
Target Upstream Version:
Embargoed:

Attachments
NP test results (847.29 KB, application/gzip) 2020-07-22 09:46 UTC, rlobillo
ETCD metrics during test execution (456.00 KB, application/pdf) 2020-07-22 09:47 UTC, rlobillo

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 671	0	None	closed	Bug 1847313: Split etcd sg rule ports range into different sg rules	2021-02-17 12:11:32 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:07:38 UTC

Description Maysa Macedo 2020-06-16 08:00:13 UTC

Description of problem:

Depending on the load running on the cluster, etcd leader changes happen more frequently, causing Network Policy and Tempest tests to fail. The tests failed at different stages, but with the following errors:

should enforce multiple, stacked policies with overlapping podSelectors [Feature:NetworkPolicy-10] [BeforeEach]
    /home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:488

    Jun  1 22:12:08.856: Pod did not finish as expected.
    Unexpected error:
        <*errors.StatusError | 0xc0013c4c80>: {
            ErrStatus: {
                TypeMeta: {Kind: "", APIVersion: ""},
                ListMeta: {
                    SelfLink: "",
                    ResourceVersion: "",
                    Continue: "",
                    RemainingItemCount: nil,
                },
                Status: "Failure",
                Message: "rpc error: code = Unavailable desc = etcdserver: leader changed",
                Reason: "",
                Details: nil,
                Code: 500,
            },
        }
        rpc error: code = Unavailable desc = etcdserver: leader changed
    occurred

 should enforce policy to allow traffic only from a pod in a different namespace based on PodSelector and NamespaceSelector [Feature:NetworkPolicy-08] [BeforeEach]
    /home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:382

    Jun  1 22:16:30.619: Pod did not finish as expected.
    Unexpected error:
        <*url.Error | 0xc002f52360>: {
            Op: "Get",
            URL: "https://api.ostest.shiftstack.com:6443/api/v1/namespaces/network-policy-7642/pods/client-can-connect-80-4gp4f",
            Err: {s: "EOF"},
        }
        Get https://api.ostest.shiftstack.com:6443/api/v1/namespaces/network-policy-7642/pods/client-can-connect-80-4gp4f: EOF
    occurred


Version-Release number of selected component (if applicable):


How reproducible:
Red Hat OpenStack Platform release 16.1.0 Beta (Train)
OVN

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 rlobillo 2020-07-22 09:46:33 UTC

Created attachment 1702042
NP test results

Comment 4 rlobillo 2020-07-22 09:47:44 UTC

Created attachment 1702043
ETCD metrics during test execution

Comment 5 rlobillo 2020-07-22 09:50:25 UTC

Verified on OCP4.6.0-0.nightly-2020-07-21-004949 with OSP16.1 (RHOS-16.1-RHEL-8-20200714.n.0) with OVN.

NP tests run with parallelism set to 2 with expected results.

- No etcd leader change observed during test execution (on day 2020-07-22):

	[stack@undercloud-0 ~]$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'became leader'; done
	Wed Jul 22 04:41:33 EDT 2020
	# pod/etcd-ostest-tzdfc-master-0
	# pod/etcd-ostest-tzdfc-master-1
	raft2020/07/21 16:00:39 INFO: f56b8ef5cf671236 became leader at term 8
	# pod/etcd-ostest-tzdfc-master-2
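
For anyone repeating this check on saved logs, the grep above can be approximated offline. The following is a minimal sketch, not part of the original verification: it assumes the raft line format shown in the output above, and the helper names `leader_elections` and `elections_on_day` are hypothetical.

```python
import re

# Matches raft lines of the form shown in the output above, e.g.
#   raft2020/07/21 16:00:39 INFO: f56b8ef5cf671236 became leader at term 8
# Assumption: other etcd versions may log this event in a different format.
LEADER_RE = re.compile(
    r"raft(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO: "
    r"(?P<member>[0-9a-f]+) became leader at term (?P<term>\d+)"
)


def leader_elections(log_text):
    """Return a (timestamp, member_id, term) tuple per 'became leader' line."""
    return [
        (m.group("ts"), m.group("member"), int(m.group("term")))
        for m in LEADER_RE.finditer(log_text)
    ]


def elections_on_day(log_text, day):
    """Keep only elections whose timestamp starts with day (YYYY/MM/DD)."""
    return [e for e in leader_elections(log_text) if e[0].startswith(day)]


# The single matching line from the master-1 output above.
sample = "raft2020/07/21 16:00:39 INFO: f56b8ef5cf671236 became leader at term 8\n"

print(leader_elections(sample))
print(elections_on_day(sample, "2020/07/22"))
```

With the log line above, the only election found is the one at term 8 on 2020-07-21, and filtering for the test day (2020/07/22) yields nothing, which matches the conclusion that no leader change occurred during the run.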

- 4 timeouts on port 2380 during test execution on master-1, all of which recovered successfully (on day 2020-07-22):

	[stack@undercloud-0 ~]$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'timeout'; done
	Wed Jul 22 04:42:04 EDT 2020
	# pod/etcd-ostest-tzdfc-master-0
	# pod/etcd-ostest-tzdfc-master-1
	2020-07-22 05:46:36.727875 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 2.559533ms, to fbb05cfa50510a87)
	2020-07-22 05:46:36.727982 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 2.697912ms, to c0e6832f3d3c32b7)
	2020-07-22 07:51:15.350022 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 3.185218ms, to fbb05cfa50510a87)
	2020-07-22 07:51:15.350080 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 3.255695ms, to c0e6832f3d3c32b7)
	# pod/etcd-ostest-tzdfc-master-2
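
To judge how far each heartbeat overshot its deadline, the warnings above can be summarised with a short script. This is a sketch under stated assumptions rather than the method used for verification: it assumes the warning format shown above, handles only overshoots reported in milliseconds, and the helper name `heartbeat_delays` is hypothetical.

```python
import re

# Matches warnings of the form shown above, e.g.
#   2020-07-22 05:46:36.727875 W | etcdserver: failed to send out heartbeat
#   on time (exceeded the 100ms timeout for 2.559533ms, to fbb05cfa50510a87)
# Assumption: the overshoot is reported in ms, as in every line of this report.
HEARTBEAT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) W \| etcdserver: "
    r"failed to send out heartbeat on time "
    r"\(exceeded the (?P<limit>\S+) timeout for (?P<over>[\d.]+)ms, "
    r"to (?P<peer>[0-9a-f]+)\)"
)


def heartbeat_delays(log_text):
    """Return (timestamp, peer_id, overshoot_ms) for each late heartbeat."""
    # Collapse whitespace runs so that log lines wrapped by a terminal or
    # a paste still match the single-line pattern.
    flat = re.sub(r"\s+", " ", log_text)
    return [
        (m["ts"], m["peer"], float(m["over"]))
        for m in HEARTBEAT_RE.finditer(flat)
    ]


# The four warnings from master-1, deliberately left wrapped.
sample = """
2020-07-22 05:46:36.727875 W | etcdserver: failed to send out heartbeat on time (exceeded
the 100ms timeout for 2.559533ms, to fbb05cfa50510a87)
2020-07-22 05:46:36.727982 W | etcdserver: failed to send out heartbeat on time (exceeded
the 100ms timeout for 2.697912ms, to c0e6832f3d3c32b7)
2020-07-22 07:51:15.350022 W | etcdserver: failed to send out heartbeat on time (exceeded
the 100ms timeout for 3.185218ms, to fbb05cfa50510a87)
2020-07-22 07:51:15.350080 W | etcdserver: failed to send out heartbeat on time (exceeded
the 100ms timeout for 3.255695ms, to c0e6832f3d3c32b7)
"""

delays = heartbeat_delays(sample)
print(len(delays), "late heartbeats, worst overshoot (ms):",
      max(d[2] for d in delays))
```

On the four warnings above, every overshoot is under 4 ms beyond the 100ms timeout, and the two affected peers are the other two cluster members.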

Furthermore, etcd metrics show stable behaviour. Test logs and metrics are attached: attachment 1702042 and attachment 1702043.

Comment 7 errata-xmlrpc 2020-10-27 16:07:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
