TL;DR: @hewang and @jmaxwell - My testbed can reproduce the drops. I do not know enough about the customer's testbed to replicate the environment exactly, but on my testbed, with the same iperf3 params (iperf3 -6 -u -b200m -ll256), I see packet drops. I applied various tunings incrementally. The test with a 30-pod deployment churn shows packet drops regardless of the tuning.

With OCP 4.10.18 (https://docs.google.com/spreadsheets/d/1v9-pc2cu25DXbbaHAytplWs7lLM68OTz6V4LdCMEyI4/edit#gid=1826112686):
- Intra-node: DROP
- Inter-node: DROP

With OCP 4.11.2 (https://docs.google.com/spreadsheets/d/1v9-pc2cu25DXbbaHAytplWs7lLM68OTz6V4LdCMEyI4/edit#gid=2064496500):
- Intra-node: NO drop <====================== Only scenario with NO drop
- Inter-node: DROP

Tunings applied (a rough sketch follows this list):
- Disable Prometheus
- Increase the RX NIC ring size
- Apply sysctl socket memory and backlog params
- Change ovs-vswitchd to sched_rt
- Other scheduling/renice tweaks not documented in the above Google Sheets
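For reference, a minimal sketch of how tunings like these could be applied on a node. The interface name, ring size, sysctl values, and RT priority below are placeholders for illustration, not the exact settings used in the tests or the sheets above:

    # Placeholder NIC name; substitute the node's actual interface.
    IFACE=ens1f0

    # Increase the RX ring size (check the hardware maximum with -g first).
    ethtool -g "$IFACE"
    ethtool -G "$IFACE" rx 4096

    # Bump socket receive memory and the per-CPU backlog queue (example values).
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.rmem_default=16777216
    sysctl -w net.core.netdev_max_backlog=8192

    # Move ovs-vswitchd to a real-time scheduling class (SCHED_RR here);
    # pidof may return more than one PID (monitor process), so loop over them.
    for pid in $(pidof ovs-vswitchd); do
        chrt -r -p 10 "$pid"
    done

    # (Disabling Prometheus is an OCP cluster-monitoring config change and is
    # not shown here.)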
Patch backported: https://gitlab.cee.redhat.com/nst/openvswitch/openvswitch2.16/-/commit/ce553c99e2f9b8b3784ac66a759c363528c67c8b
* Tue Oct 11 2022 Aaron Conole <aconole> - 2.16.0-103
- netdev-linux: Skip some internal kernel stats gathering. [RH git: ce553c99e2] (#2118848)

  For netdev_linux_update_via_netlink(), hint to the kernel that we do not
  need it to gather netlink internal stats when we want to update the netlink
  flags, as those stats are not rendered within OVS.

  Background: ovs-vswitchd can spend quite a bit of time blocked by the kernel
  during netlink calls, especially on systems with many cores. This time is
  dominated by the kernel-side internal stats gathering mechanism in netlink,
  specifically:
    inet6_fill_link_af
      inet6_fill_ifla6_attrs
        __snmp6_fill_stats64

  In Linux 4.4+, there exists a hint for netlink requests not to trigger the
  ipv6 stats gathering mechanism, which greatly reduces the amount of time
  that ovs-vswitchd is on CPU.

  Testing and Results: Tested booting 320 VMs and measuring OVS utilization
  with perf record, then visualized into a flamegraph, using a patched version
  of OVS 2.14.2. Calls under bridge_run() seem to get hit the worst by this
  issue.
    Before: bridge_run() == 11.3% of samples
    After:  bridge_run() == 3.4% of samples

  Note that there are at least two observed netlink calls under bridge_run()
  that are still kernel-stats heavy after this patch:

  Call 1:
    bridge_run -> netdev_run -> route_table_run -> route_table_reset ->
    ovs_router_insert -> ovs_router_insert__ -> get_src_addr ->
    netdev_get_addr_list -> netdev_linux_get_addr_list -> getifaddrs
  Since the actual netlink call is coming from getifaddrs() in glibc, fixing
  this would likely involve either duplicating glibc code in the OVS source or
  patching glibc.

  Call 2:
    bridge_run -> iface_refresh_stats -> netdev_get_stats ->
    netdev_linux_get_stats -> get_stats_via_netlink
  This does use netlink-based stats; however, it isn't immediately clear
  whether just dropping the stats from inet6_fill_link_af would impact
  anything or not. Given that this call is more intermittent, it's of lesser
  concern.

  Acked-by: Greg Smith <gasmith>
  Signed-off-by: Jon Kohler <jon>
  Signed-off-by: Ilya Maximets <i.maximets>
  Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2118848
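For anyone who wants to repeat the measurement described above, a rough sketch of the perf record / flamegraph workflow. The sampling duration, output file names, and the FlameGraph script location are assumptions, not details taken from the original test:

    # Sample ovs-vswitchd call stacks for 60 seconds (duration is arbitrary).
    perf record -g -p "$(pidof -s ovs-vswitchd)" -o ovs.perf.data -- sleep 60

    # Fold the stacks and render an SVG flamegraph with the FlameGraph scripts
    # (assumed to be cloned from https://github.com/brendangregg/FlameGraph).
    perf script -i ovs.perf.data \
        | ./FlameGraph/stackcollapse-perf.pl \
        | ./FlameGraph/flamegraph.pl > ovs-flamegraph.svg

    # Compare the width of bridge_run() and the kernel stats paths
    # (inet6_fill_link_af / __snmp6_fill_stats64) before and after the patch.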
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (openvswitch2.16 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:7390