Description of problem:

Etcd leader changes are happening frequently, causing the Network Policy tests to fail. The tests failed at different points, with the following errors:

should enforce multiple, stacked policies with overlapping podSelectors [Feature:NetworkPolicy-10] [BeforeEach]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:488
Jun  1 22:12:08.856: Pod did not finish as expected.
Unexpected error:
    <*errors.StatusError | 0xc0013c4c80>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "rpc error: code = Unavailable desc = etcdserver: leader changed",
            Reason: "",
            Details: nil,
            Code: 500,
        },
    }
    rpc error: code = Unavailable desc = etcdserver: leader changed
occurred

should enforce policy to allow traffic only from a pod in a different namespace based on PodSelector and NamespaceSelector [Feature:NetworkPolicy-08] [BeforeEach]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:382
Jun  1 22:16:30.619: Pod did not finish as expected.
Unexpected error:
    <*url.Error | 0xc002f52360>: {
        Op: "Get",
        URL: "https://api.ostest.shiftstack.com:6443/api/v1/namespaces/network-policy-7642/pods/client-can-connect-80-4gp4f",
        Err: {s: "EOF"},
    }
    Get https://api.ostest.shiftstack.com:6443/api/v1/namespaces/network-policy-7642/pods/client-can-connect-80-4gp4f: EOF
occurred

The list of failed tests is:

[Fail] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [BeforeEach] should allow ingress access on one named port [Feature:NetworkPolicy-12]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:62

[Fail] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [BeforeEach] should enforce policy based on NamespaceSelector with MatchExpressions [Feature:NetworkPolicy-05]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:62

[Fail] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [BeforeEach] should support a 'default-deny' policy [Feature:NetworkPolicy-01]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:62

[Fail] [sig-network] NetworkPolicy [LinuxOnly] [BeforeEach] NetworkPolicy between server and client should allow egress access to server in CIDR block [Feature:NetworkPolicy-22]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:210

[Fail] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [BeforeEach] should enforce policy to allow traffic from pods within server namespace based on PodSelector [Feature:NetworkPolicy-02]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:62

[Fail] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [It] should allow ingress access from updated pod [Feature:NetworkPolicy-17]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:1427

[Fail] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [BeforeEach] should enforce multiple, stacked policies with overlapping podSelectors [Feature:NetworkPolicy-10]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:1427

[Fail] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [BeforeEach] should enforce egress policy allowing traffic to a server in a different namespace based on PodSelector and NamespaceSelector [Feature:NetworkPolicy-18]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:62

[Fail] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [BeforeEach] should enforce policy to allow traffic only from a pod in a different namespace based on PodSelector and NamespaceSelector [Feature:NetworkPolicy-08]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:1427

[Fail] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [BeforeEach] should support allow-all policy [Feature:NetworkPolicy-11]
/home/stack/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/network_policy.go:62

Ran 23 of 4843 Specs in 8205.748 seconds
FAIL! -- 13 Passed | 10 Failed | 0 Pending | 4820 Skipped

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.0.2 (Train)
4.3.0-0.nightly-2020-06-01-043839
Octavia Amphoras + Ceph + OVN are used.

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
*** Bug 1843053 has been marked as a duplicate of this bug. ***
*** Bug 1843061 has been marked as a duplicate of this bug. ***
*** Bug 1843062 has been marked as a duplicate of this bug. ***
*** Bug 1843063 has been marked as a duplicate of this bug. ***
*** Bug 1843066 has been marked as a duplicate of this bug. ***
*** Bug 1843067 has been marked as a duplicate of this bug. ***
*** Bug 1843068 has been marked as a duplicate of this bug. ***
*** Bug 1843069 has been marked as a duplicate of this bug. ***
*** Bug 1843070 has been marked as a duplicate of this bug. ***
*** Bug 1843071 has been marked as a duplicate of this bug. ***
We had a separate bugzilla for each of the failed tests. Marking them as duplicates since they all seem to have the same root cause, and this is a much saner way to track the fix.
We are open to assisting, but we need a must-gather or links to CI runs to investigate. A few things before we begin:

- Ceph: etcd requires fast disks in order to properly facilitate its serial workloads (fsync). Ceph has generally not been a good match for etcd because the actual storage layer is generally not SSD. We now document that explicitly [1].

> Message: "rpc error: code = Unavailable desc = etcdserver: leader changed",

As leader elections can be the direct result of poor disk I/O, I would like to direct all focus on storage.

[1] https://github.com/openshift/openshift-docs/pull/20939/files
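As a side note, here is a hedged sketch of how fsync latency on the etcd disk can be sanity-checked with fio; the benchmark parameters are the commonly cited etcd-style ones and the target directory is an assumption, not taken from this cluster:

  # Measure fdatasync latency on the disk that backs the etcd data directory
  # (adjust the path; 22m / 2300-byte writes mimic etcd WAL behaviour).
  mkdir -p /var/lib/etcd/fio-test
  fio --rw=write --ioengine=sync --fdatasync=1 \
      --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-fsync-check
  rm -rf /var/lib/etcd/fio-test

etcd's own guidance is roughly a 99th percentile fdatasync latency under 10ms; anything much above that tends to show up as election churn.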
*** Bug 1843802 has been marked as a duplicate of this bug. ***
I'm going to run the following queries after running the tests again:

histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
max(etcd_server_leader_changes_seen_total)

This will be with a setup of: OSP13 + OVN + OCP4.3, without Ceph.
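For the record, a hedged sketch of how one of those queries can be run against the in-cluster Prometheus over its HTTP API (the route and namespace names assume the default openshift-monitoring deployment):

  TOKEN=$(oc whoami -t)
  PROM=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
  # POST to the standard Prometheus /api/v1/query endpoint.
  curl -sk -H "Authorization: Bearer ${TOKEN}" \
      "https://${PROM}/api/v1/query" \
      --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'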
This sounds good, but the bug was reported against a cluster using Ceph, right?
Yes, it was against a cluster using Ceph. We'll spin up a cluster with that configuration and provide the needed info.
From an initial review of the leader election changes, this looks like a performance issue, as expected. Because I could not immediately isolate the issue from isolated screenshots of queries, I have asked for a full Prometheus DB dump for further review.
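(For reference, a hedged sketch of one way such a dump can be grabbed, assuming the default openshift-monitoring pod and container names; this streams the whole TSDB directory, so the archive can be large:)

  oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
      tar czf - /prometheus > prometheus-k8s-0-db.tar.gz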
@Itzik / @Maysa, is the issue here that:

1) etcd is seeing slow storage and thus triggering leader elections
2) etcd storage is using Ceph, a network-based storage provider
3) OpenShift networking for the cluster is provided by Kuryr on OpenStack
4) OpenStack networking itself is using the ml2/ovn network plugin

Therefore, the current thought is that ml2/ovn is not providing sufficient network performance to support etcd's latency requirements? Is that correct?
Hi Dan,

Yes for all points, with the exception of 2. Ceph is present on the cluster, but it is not backing the etcd storage.

While running the Network Policy tests we saw the following error every now and then: (read tcp 10.196.1.147:47230->10.196.2.105:2380: i/o timeout). This led us to believe that under certain load the connection can time out.

Traffic between nodes on the etcd ports is currently allowed by a security group rule that uses a port range (2379-2380), which does not seem to always be enforced:

direction='ingress', ethertype='IPv4', id='a0d98acc-8065-41d9-9686-d12265a3ed9c', port_range_max='2380', port_range_min='2379', protocol='tcp', remote_ip_prefix='10.196.0.0/16'

However, once a security group rule is created for each etcd port (2379 and 2380), the issue stops happening:

direction='ingress', ethertype='IPv4', id='a0d98acc-8065-41d9-9686-d12265a3ed9c', port_range_max='2379', port_range_min='2379', protocol='tcp', remote_ip_prefix='10.196.0.0/16'
direction='ingress', ethertype='IPv4', id='27301a08-2ca8-45d8-aa46-4261440be72c', port_range_max='2380', port_range_min='2380', protocol='tcp', remote_ip_prefix='10.196.0.0/16'

So, this seems to be an issue related to OVN rather than to the use of Ceph.
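For completeness, a hedged sketch of how the per-port rules above can be created with the openstack CLI (the security group name <master-sg> is a placeholder, not taken from this environment):

  openstack security group rule create --ingress --ethertype IPv4 --protocol tcp \
      --dst-port 2379 --remote-ip 10.196.0.0/16 <master-sg>
  openstack security group rule create --ingress --ethertype IPv4 --protocol tcp \
      --dst-port 2380 --remote-ip 10.196.0.0/16 <master-sg>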
> So, this seems to be an issue related to OVN rather than to the use of Ceph.

Moving to OVN for review.
Changed the BZ title to reflect the OVN issue. The issue has been detected in ShiftOnStack, specifically on the etcd ports, causing the etcd leader to change and the Network Policy tests to fail. The existing test coverage for port-range filtering is passing, so it's not 100% reproducible and still requires RCA.
Some more info that could be useful to narrow down where the problem may be:

- When the problem appears, restarting ovn-controller does not help; the problem is still there.
- We have seen this in both OSP13 and OSP16 environments, but only with the OVN backend. We have not seen the issue with ml2/ovs.
- The first time we saw this problem (on the etcd side) was a month ago (May 19th). It took us some time to realise it was the SG port range not always being enforced, and the bugzilla was moved across different groups (Kuryr, etcd, and now OVN).
We finally found the root cause. The currently used OVN version recalculates conjunction IDs on changes such as port-group or port-binding updates. Port ranges in ACLs are implemented using conjunctions, which means these rules change their conjunction ID even when the change is not related to the given ACL. This causes a brief network disruption in the data plane that triggers a leader election in the etcd cluster. I tested a newer OVN version (ovn2.13-20.06.1-2.el8fdp) and we're no longer able to reproduce the hiccup with the minimal reproducer we had. Now we're running the full OCP tests that found the issue. Worth noting that this OVN version is not tested with OSP yet, so we may hit some regressions.
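As an aside, a hedged sketch of how the range ACL and its conjunction flows can be inspected (the port-group name is a placeholder; with ML2/OVN the security group is represented as an OVN port group):

  # The 2379-2380 range shows up in the ACL match roughly as
  # "tcp && tcp.dst >= 2379 && tcp.dst <= 2380".
  ovn-nbctl acl-list <security_group_port_group>

  # On the node, ovn-controller renders that range as OpenFlow conjunction
  # flows; their conjunction IDs are what the affected OVN version recalculates.
  ovs-ofctl dump-flows br-int | grep conjunction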
This is fixed in ovn2.13-20.06.2-11, which should be part of the current compose. Moving to ON_QA to test it.
Ran the NP tests with OCP 4.5 and OCP 4.6, once with the Kuryr W/A and once without (using Maysa's release image). The NP tests were run using the same seed. Also ran the OCP 4.6 tempest and NP tests without using the seed option.

Versions:
OSP16 - RHOS-16.1-RHEL-8-20201007.n.0
OCP - 4.5.0-0.nightly-2020-10-25-174204
OCP - 4.6.0-0.nightly-2020-10-22-034051
For future reference, we used seed 1594215440. The Ginkgo seed option ensures we run the tests in the same order each time. We used this particular seed because it's the one with which we saw most of the issues caused by this bug.
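A hedged sketch of how the seed can be passed when invoking the compiled Kubernetes e2e binary directly (the binary path, focus expression and provider value are assumptions; --ginkgo.seed and --ginkgo.focus are standard Ginkgo flags):

  _output/local/go/bin/e2e.test \
      --kubeconfig="${KUBECONFIG}" \
      --provider=skeleton \
      --ginkgo.focus='NetworkPolicy' \
      --ginkgo.seed=1594215440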