Bug 1849051

Summary: Tests are failing due to constant etcd leader elections changes
Product: OpenShift Container Platform Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Networking Assignee: Maysa Macedo <mdemaced>
Networking sub component: kuryr QA Contact: GenadiC <gcheresh>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: urgent CC: ltomasbo, rlobillo
Version: 4.5   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:44:38 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1847313    
Bug Blocks: 1849540    
Attachments:
Description Flags
NP test results none
ETCD metrics during test execution none

Description OpenShift BugZilla Robot 2020-06-19 14:15:55 UTC
+++ This bug was initially created as a clone of Bug #1847313 +++

Description of problem:

Depending on the load on the cluster, etcd leader changes happen more frequently, causing Network Policy and Tempest tests to fail. The tests failed at different stages, but with the following errors:

should enforce multiple, stacked policies with overlapping podSelectors [Feature:NetworkPolicy-10] [BeforeEach]
    /home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:488

    Jun  1 22:12:08.856: Pod did not finish as expected.
    Unexpected error:
        <*errors.StatusError | 0xc0013c4c80>: {
            ErrStatus: {
                TypeMeta: {Kind: "", APIVersion: ""},
                ListMeta: {
                    SelfLink: "",
                    ResourceVersion: "",
                    Continue: "",
                    RemainingItemCount: nil,
                },
                Status: "Failure",
                Message: "rpc error: code = Unavailable desc = etcdserver: leader changed",
                Reason: "",
                Details: nil,
                Code: 500,
            },
        }
        rpc error: code = Unavailable desc = etcdserver: leader changed
    occurred

should enforce policy to allow traffic only from a pod in a different namespace based on PodSelector and NamespaceSelector [Feature:NetworkPolicy-08] [BeforeEach]
    /home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:382

    Jun  1 22:16:30.619: Pod did not finish as expected.
    Unexpected error:
        <*url.Error | 0xc002f52360>: {
            Op: "Get",
            URL: "https://api.ostest.shiftstack.com:6443/api/v1/namespaces/network-policy-7642/pods/client-can-connect-80-4gp4f",
            Err: {s: "EOF"},
        }
        Get https://api.ostest.shiftstack.com:6443/api/v1/namespaces/network-policy-7642/pods/client-can-connect-80-4gp4f: EOF
    occurred


Version-Release number of selected component (if applicable):


How reproducible:
Red Hat OpenStack Platform release 16.1.0 Beta (Train)
OVN

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 rlobillo 2020-06-24 13:47:55 UTC
Created attachment 1698598 [details]
NP test results

Comment 4 rlobillo 2020-06-24 13:48:54 UTC
Created attachment 1698599 [details]
ETCD metrics during test execution

Comment 5 rlobillo 2020-06-24 13:50:09 UTC
Verified on OCP4.5.0-0.nightly-2020-06-23-075004 with OSP16.1 (RHOS-16.1-RHEL-8-20200623.n.0) with OVN.

NP tests were run with parallelism set to 3 and produced the expected results.

No etcd leader change observed during test execution (on day 2020-06-24):

[stack@undercloud-0 ~]$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'became leader'; done
Wed Jun 24 08:30:46 EDT 2020
# pod/etcd-ostest-rl79c-master-0
# pod/etcd-ostest-rl79c-master-1
raft2020/06/23 19:20:32 INFO: 95db74b7d4920873 became leader at term 4
# pod/etcd-ostest-rl79c-master-2
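
The check above can also be done on saved log text. The following is a minimal sketch, assuming the raft line format shown in the output above; the function names `leader_elections` and `elections_on` are hypothetical helpers, not part of etcd or `oc`.

```python
import re
from datetime import date, datetime

# Matches etcd raft lines such as:
#   raft2020/06/23 19:20:32 INFO: 95db74b7d4920873 became leader at term 4
# Assumption: the line format is the one shown in the output above.
LEADER_RE = re.compile(
    r"raft(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO: "
    r"(?P<member>[0-9a-f]+) became leader at term (?P<term>\d+)"
)


def leader_elections(log_text):
    """Return (timestamp, member_id, term) for every 'became leader' line."""
    events = []
    for line in log_text.splitlines():
        match = LEADER_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S")
            events.append((ts, match.group("member"), int(match.group("term"))))
    return events


def elections_on(log_text, day):
    """Keep only the elections whose timestamp falls on `day` (a datetime.date)."""
    return [event for event in leader_elections(log_text) if event[0].date() == day]


if __name__ == "__main__":
    sample = "raft2020/06/23 19:20:32 INFO: 95db74b7d4920873 became leader at term 4"
    # The only election in the sample is dated 2020-06-23, so this prints [].
    print(elections_on(sample, date(2020, 6, 24)))
```

Filtering by date rather than only grepping makes it explicit that the single election seen here predates the test day.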

No timeouts on port 2380 during test execution (on day 2020-06-24):

[stack@undercloud-0 ~]$ date && for i in $(oc get pods -n openshift-etcd -l app=etcd -o NAME); do echo "# $i"; oc logs $i -n openshift-etcd -c etcd |grep 'timeout'; done
Wed Jun 24 08:32:24 EDT 2020
# pod/etcd-ostest-rl79c-master-0
# pod/etcd-ostest-rl79c-master-1
2020-06-23 19:22:07.725074 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 99.146027ms, to 669c7d0c57a3d244)
2020-06-23 19:22:07.725270 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 99.345401ms, to 498ed5c98fdb1ab8)
2020-06-23 19:36:22.023038 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 5.072443ms, to 669c7d0c57a3d244)
2020-06-23 19:36:22.023217 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 5.25579ms, to 498ed5c98fdb1ab8)
2020-06-23 19:36:39.306596 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 225.467102ms, to 669c7d0c57a3d244)
2020-06-23 19:36:39.306625 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 225.499258ms, to 498ed5c98fdb1ab8)
2020-06-23 19:37:21.550861 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 276.735925ms, to 669c7d0c57a3d244)
2020-06-23 19:37:21.551163 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 277.040716ms, to 498ed5c98fdb1ab8)
2020-06-24 01:50:08.688456 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 13.412946ms, to 669c7d0c57a3d244)
2020-06-24 01:50:08.688518 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 13.482507ms, to 498ed5c98fdb1ab8)
# pod/etcd-ostest-rl79c-master-2
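
To compare these warnings against a test window, they can be summarised per day. This is a rough sketch, assuming the warning format shown above and that the overrun is printed in µs, ms, or s; `heartbeat_overruns` and `worst_overrun_per_day` are hypothetical names introduced only for illustration.

```python
import re
from collections import defaultdict

# Matches etcd warnings such as:
#   2020-06-23 19:22:07.725074 W | etcdserver: failed to send out heartbeat
#   on time (exceeded the 100ms timeout for 99.146027ms, to 669c7d0c57a3d244)
# Assumption: the overrun uses a single unit (µs, ms or s); longer compound
# durations are not handled by this sketch.
HEARTBEAT_RE = re.compile(
    r"(?P<day>\d{4}-\d{2}-\d{2}) (?P<time>[\d:.]+) W \| etcdserver: "
    r"failed to send out heartbeat on time \(exceeded the (?P<limit>\S+) timeout "
    r"for (?P<over>[\d.]+)(?P<unit>µs|ms|s), to (?P<peer>[0-9a-f]+)\)"
)

UNIT_TO_MS = {"µs": 0.001, "ms": 1.0, "s": 1000.0}


def heartbeat_overruns(log_text):
    """Return one (day, peer_id, overrun_ms) tuple per heartbeat warning."""
    rows = []
    for line in log_text.splitlines():
        match = HEARTBEAT_RE.search(line)
        if match:
            overrun_ms = float(match.group("over")) * UNIT_TO_MS[match.group("unit")]
            rows.append((match.group("day"), match.group("peer"), overrun_ms))
    return rows


def worst_overrun_per_day(log_text):
    """Map each day to (number of warnings, largest overrun in milliseconds)."""
    summary = defaultdict(lambda: [0, 0.0])
    for day, _peer, overrun_ms in heartbeat_overruns(log_text):
        summary[day][0] += 1
        summary[day][1] = max(summary[day][1], overrun_ms)
    return {day: (count, worst) for day, (count, worst) in summary.items()}
```

Passing the text returned by `oc logs` for each etcd pod to `worst_overrun_per_day` gives a per-day count and worst case that can be checked against the time the tests ran.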

Furthermore, etcd metrics show stable behaviour. Test logs and metrics are attached: attachment 1698598 [details] and attachment 1698599 [details].

Comment 6 errata-xmlrpc 2020-07-13 17:44:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409