Bug 1842876
Summary: | [OVN] Port range filtering sometimes does not allow traffic to the entire range | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Maysa Macedo <mdemaced>
Component: | python-networking-ovn | Assignee: | Jakub Libosvar <jlibosva>
Status: | CLOSED CURRENTRELEASE | QA Contact: | GenadiC <gcheresh>
Severity: | high | Docs Contact: |
Priority: | medium | |
Version: | 16.1 (Train) | CC: | apevec, awels, dalvarez, dcbw, eduen, gcheresh, itbrown, jlibosva, lhh, ltomasbo, majopela, njohnston, nunnatsa, nusiddiq, oblaut, racedoro, scohen, spower, tsmetana, wking
Target Milestone: | z2 | Keywords: | AutomationBlocker, TestBlockerForLayeredProduct, TestOnly, Triaged
Target Release: | 16.1 (Train on RHEL 8.2) | Flags: | dmellado: needinfo-
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-05-16 14:48:13 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1858878 | |
Bug Blocks: | | |
Description (Maysa Macedo, 2020-06-02 09:51:47 UTC)
*** Bug 1843053 has been marked as a duplicate of this bug. ***
*** Bug 1843061 has been marked as a duplicate of this bug. ***
*** Bug 1843062 has been marked as a duplicate of this bug. ***
*** Bug 1843063 has been marked as a duplicate of this bug. ***
*** Bug 1843066 has been marked as a duplicate of this bug. ***
*** Bug 1843067 has been marked as a duplicate of this bug. ***
*** Bug 1843068 has been marked as a duplicate of this bug. ***
*** Bug 1843069 has been marked as a duplicate of this bug. ***
*** Bug 1843070 has been marked as a duplicate of this bug. ***
*** Bug 1843071 has been marked as a duplicate of this bug. ***

We've had each of the failed tests in a separate bugzilla. Marking them as duplicates since they all seem to have the same root cause and this is a much saner way to track the fix.

We are open to assisting, but we need a must-gather or links to CI runs to investigate. A few things before we begin:

- Ceph: etcd requires fast disks in order to properly facilitate its serial workloads (fsync). Ceph has generally not been a good match for etcd because the actual storage layer is generally not SSD. We actually document that now explicitly [1].

> Message: "rpc error: code = Unavailable desc = etcdserver: leader changed"

As leader elections can be the direct result of poor disk I/O, I would like to direct all focus on storage.

[1] https://github.com/openshift/openshift-docs/pull/20939/files

*** Bug 1843802 has been marked as a duplicate of this bug. ***

I'm going to run the following metrics after running the tests again:

    histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
    histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
    histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
    max(etcd_server_leader_changes_seen_total)
    histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

This will be with a setup of OSP13 + OVN + OCP4.3, without Ceph.

This sounds good, but the bug was against a cluster using Ceph, right?

Yes, it was against a cluster using Ceph. We'll spin up a cluster with that config and provide the needed info.

From an initial review of the leader election changes, this is a performance issue as expected. Because I could not immediately isolate the issue from isolated screenshots of queries, I have asked for a full prom db dump for further review.

@Itzik / @Maysa, is the issue here that:

1) etcd is seeing slow storage and thus triggering leader elections
2) etcd storage is using Ceph, a network-based storage provider
3) OpenShift networking for the cluster is provided by Kuryr on OpenStack
4) OpenStack networking itself is using the ml2/ovn network plugin

Therefore, the current thought is that ml2/ovn is not providing sufficient network performance to support etcd's latency requirements? Is that correct?

Hi Dan,

Yes for all points, with the exception of 2. Ceph is present on the cluster, but is not backing the etcd storage. While running the Network Policy tests, every now and then we saw the following error: (read tcp 10.196.1.147:47230->10.196.2.105:2380: i/o timeout). This led us to believe that under certain load the connection can time out.
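As a rough sketch (not taken from this bug), queries like the ones above can be evaluated against the cluster's Prometheus HTTP API from a shell; PROM_HOST and TOKEN below are placeholders for the monitoring route hostname and a bearer token:

    # Evaluate one of the etcd latency queries via the standard Prometheus
    # HTTP API endpoint /api/v1/query.
    # PROM_HOST and TOKEN are placeholders, not values from this bug.
    QUERY='histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
    curl -sk -G "https://${PROM_HOST}/api/v1/query" \
         -H "Authorization: Bearer ${TOKEN}" \
         --data-urlencode "query=${QUERY}"

Sustained p99 WAL fsync latencies above the commonly cited ~10 ms guideline would point at storage rather than the network.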
Traffic between nodes on that port is currently allowed by a security group rule covering the port range (2379-2380), which does not seem to always be enforced:

    direction='ingress', ethertype='IPv4', id='a0d98acc-8065-41d9-9686-d12265a3ed9c', port_range_max='2380', port_range_min='2379', protocol='tcp', remote_ip_prefix='10.196.0.0/16'

However, once a security group rule is created for each etcd port (2379 and 2380), the issue stops happening:

    direction='ingress', ethertype='IPv4', id='a0d98acc-8065-41d9-9686-d12265a3ed9c', port_range_max='2379', port_range_min='2379', protocol='tcp', remote_ip_prefix='10.196.0.0/16'
    direction='ingress', ethertype='IPv4', id='27301a08-2ca8-45d8-aa46-4261440be72c', port_range_max='2380', port_range_min='2380', protocol='tcp', remote_ip_prefix='10.196.0.0/16'

So this seems to be an issue related more to OVN than to the use of Ceph.

> So this seems to be an issue related more to OVN than to the use of Ceph.
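As a hedged sketch of that per-port workaround (the security group name etcd-sg is a placeholder, not taken from this bug), the two rules could be created with the standard openstack CLI:

    # Workaround sketch: one ingress rule per etcd port instead of a single
    # 2379-2380 range rule. "etcd-sg" is a placeholder security group name.
    for port in 2379 2380; do
        openstack security group rule create \
            --ingress --ethertype IPv4 --protocol tcp \
            --remote-ip 10.196.0.0/16 \
            --dst-port "${port}:${port}" \
            etcd-sg
    done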
Moving to OVN for review.
Changed the BZ title to reflect the OVN issue. The issue has been detected in ShiftOnStack, specifically on the etcd ports, causing the leader to change and Network Policy tests to fail. The existing test coverage for port range filtering is passing, so it's not 100% reproducible and still requires RCA.

Some more info that could be useful to narrow down where the problem may be:

- When the problem appears, restarting ovn-controller does not help; the problem is still there.
- We have seen this in both OSP13 and OSP16 environments, but only with the OVN backend. We have not seen the issue with ml2/ovs.
- The first time we saw this problem (on the etcd side) was a month ago (May 19th). It took us some time until we realised it was the SG range not always being enforced, and the bugzilla was moved across different groups (kuryr, etcd, and now OVN).

We finally found the root cause. The currently used OVN version recalculates conjunction IDs on changes such as port groups or port bindings. Port ranges in ACLs are implemented using conjunctions, meaning that these rules change their conjunction ID even when the change is not related to the given ACL. This causes a brief network disruption in the data plane, which triggers a leader election in the etcd cluster. I tested a newer OVN version, ovn2.13-20.06.1-2.el8fdp, and we're no longer able to reproduce the hiccup with the minimal reproducer we had. Now we're running the full OCP tests that found the issue. Worth noting that this OVN version is not tested with OSP yet, so we may hit some regressions.

This is fixed in ovn2.13-20.06.2-11, which should be part of the current compose. Thus moving to ON_QA to test it.

Ran NP tests with OCP4.5 and OCP4.6, once with the Kuryr W/A and once without (using Maysa's release image). The NP tests were run using the same seed. Also ran OCP4.6 tempest and NP tests without using the seed option.

Versions:
- OSP16 - RHOS-16.1-RHEL-8-20201007.n.0
- OCP 4.5.0-0.nightly-2020-10-25-174204
- OCP 4.6.0-0.nightly-2020-10-22-034051

For future reference, we used seed 1594215440. The Ginkgo seed option ensures we are running the tests in the same order each time. We used that particular seed because it's the one with which we saw most of the issues caused by this bug.
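As a hedged illustration of pinning the test order mentioned above (assuming the suite were driven directly by the ginkgo CLI, which may not match how the OCP test images actually invoke it; the package path is a placeholder):

    # Re-run the suite in the same order by fixing Ginkgo's randomization seed.
    # The package path is a placeholder; 1594215440 is the seed quoted above.
    ginkgo --seed=1594215440 ./test/networkpolicy/...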
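For anyone digging into the root cause described earlier in this thread, here is a hedged sketch of how the conjunction flows installed by ovn-controller for a port-range ACL might be inspected on a compute node (the options and grep patterns are illustrative; the required OpenFlow version depends on how br-int is configured):

    # Port-range ACLs are compiled into OpenFlow conjunction flows on br-int.
    # Watching these while unrelated port groups or port bindings change should
    # show the conjunction IDs being recalculated on the affected OVN version.
    ovs-ofctl -O OpenFlow13 dump-flows br-int | grep -E 'conjunction|conj_id'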