Bug 2118848

Summary: Backport: [ovs-dev] netdev-linux: skip some internal kernel stats gathering
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Jonathan Maxwell <jmaxwell>
Component: openvswitch2.16
Assignee: Aaron Conole <aconole>
Status: CLOSED ERRATA
QA Contact: Hekai Wang <hewang>
Severity: high
Docs Contact:
Priority: unspecified
Version: RHEL 8.0
CC: aconole, ctrautma, eglottma, fbaudin, fleitner, hnhan, jhsiao, ovs-qe, ralongi, tredaelli, xzhou
Target Milestone: ---
Flags: hewang: needinfo-
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: openvswitch2.16-2.16.0-103.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-11-03 00:30:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 10 hnhan 2022-09-29 12:38:41 UTC
TL;DR: @hewang and @jmaxwell - My testbed can reproduce the drops. 

I do not know enough about the customer's testbed to replicate the environment exactly, but on my testbed, with the same iperf3 params (iperf3 -6 -u -b200m -ll256), I see packet drops. I applied various tunings incrementally. The test with 30-pod deployment churn shows packet drops regardless of the tuning.

With OCP version 4.10.18: (https://docs.google.com/spreadsheets/d/1v9-pc2cu25DXbbaHAytplWs7lLM68OTz6V4LdCMEyI4/edit#gid=1826112686)
  Intra-node: DROP
  Inter-node: DROP

With OCP 4.11.2: (https://docs.google.com/spreadsheets/d/1v9-pc2cu25DXbbaHAytplWs7lLM68OTz6V4LdCMEyI4/edit#gid=2064496500)
  Intra-node: NO drop  <====================== Only scenario with NO drop
  Inter-node: DROP

Tunings:
  - disable prometheus
  - increase rx NIC ring size
  - apply sysctl socket mem and backlog params
  - change ovs-vswitchd to sched_rt
  - other scheduling and renice tweaks not documented in the above Google Sheets

Comment 12 OvS team 2022-10-11 19:47:56 UTC
* Tue Oct 11 2022 Aaron Conole <aconole> - 2.16.0-103
- netdev-linux: Skip some internal kernel stats gathering. [RH git: ce553c99e2] (#2118848)
    For netdev_linux_update_via_netlink(), hint to the kernel that
    we do not need it to gather netlink internal stats when we want
    to update the netlink flags, as those stats are not rendered
    within OVS.
    
    Background:
    ovs-vswitchd can spend quite a bit of time blocked by the kernel
    during netlink calls, especially on systems with many cores. This
    time is dominated by the kernel-side internal stats gathering
    mechanism in netlink, specifically:
      inet6_fill_link_af
        inet6_fill_ifla6_attrs
          __snmp6_fill_stats64
    
    In Linux 4.4+, there exists a hint for netlink requests to not
    trigger the ipv6 stats gathering mechanism, which greatly reduces
    the amount of time that ovs-vswitchd is on CPU.
    
    Testing and Results:
    Tested booting 320 VMs and measuring OVS utilization with perf
    record, then visualized into a flamegraph using a patched version
    of ovs 2.14.2. Calls under bridge_run() seem to get hit the worst
    by this issue.
    
    Before bridge_run() == 11.3% of samples
    After bridge_run() == 3.4% of samples
    
    Note that there are at least two observed netlink calls under
    bridge_run that are still kernel stats heavy after this patch:
    
    Call 1:
      bridge_run -> netdev_run -> route_table_run -> route_table_reset ->
        ovs_router_insert -> ovs_router_insert__ -> get_src_addr ->
          netdev_get_addr_list -> netdev_linux_get_addr_list -> getifaddrs
    
    Since the actual netlink call is coming from getifaddrs() in glibc,
    fixing would likely involve either duplicating glibc code in ovs
    source or patching glibc.
    
    Call 2:
      bridge_run -> iface_refresh_stats -> netdev_get_stats ->
        netdev_linux_get_stats -> get_stats_via_netlink
    
    This does use netlink based stats; however, it isn't immediately
    clear if just dropping the stats from inet6_fill_link_af would
    impact anything or not. Given this call is more intermittent, it's
    of lesser concern.
    
    Acked-by: Greg Smith <gasmith>
    Signed-off-by: Jon Kohler <jon>
    Signed-off-by: Ilya Maximets <i.maximets>
    
    Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2118848
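
For illustration only, here is a minimal sketch of what such a hinted netlink request might look like on the wire. The changelog above does not name the hint; this sketch assumes it is the RTEXT_FILTER_SKIP_STATS bit carried in an IFLA_EXT_MASK attribute of an RTM_GETLINK request, which is the rtnetlink filter available since Linux 4.4. The helper names (_attr, build_getlink_request) are hypothetical and are not OVS functions; the actual fix is C code inside netdev_linux_update_via_netlink().

```python
import struct

# Constants from the Linux UAPI headers (linux/netlink.h, linux/rtnetlink.h,
# linux/if_link.h).
NLM_F_REQUEST = 0x01
RTM_GETLINK = 18
IFLA_IFNAME = 3
IFLA_EXT_MASK = 29
RTEXT_FILTER_SKIP_STATS = 1 << 3
AF_UNSPEC = 0


def _attr(attr_type, payload):
    """Encode one rtattr: 4-byte header, payload, zero padding to 4 bytes."""
    length = 4 + len(payload)
    padding = b"\0" * (-length % 4)
    return struct.pack("=HH", length, attr_type) + payload + padding


def build_getlink_request(ifname, seq=1, skip_stats=True):
    """Build an RTM_GETLINK request for ifname (hypothetical helper).

    With skip_stats=True the request carries IFLA_EXT_MASK with
    RTEXT_FILTER_SKIP_STATS set, asking the kernel to omit link statistics
    from the reply (assumed to be the hint the changelog refers to).
    """
    # struct ifinfomsg: family, pad, type, index, flags, change (16 bytes).
    body = struct.pack("=BxHiII", AF_UNSPEC, 0, 0, 0, 0)
    body += _attr(IFLA_IFNAME, ifname.encode() + b"\0")
    if skip_stats:
        body += _attr(IFLA_EXT_MASK, struct.pack("=I", RTEXT_FILTER_SKIP_STATS))
    # struct nlmsghdr: len, type, flags, seq, pid (16 bytes).
    header = struct.pack(
        "=IHHII", 16 + len(body), RTM_GETLINK, NLM_F_REQUEST, seq, 0
    )
    return header + body
```

On a Linux host the resulting bytes could be sent over a socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE) socket; comparing the reply size with and without skip_stats is one rough way to see how much per-link statistics data the kernel leaves out.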

Comment 18 errata-xmlrpc 2022-11-03 00:30:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (openvswitch2.16 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7390