Bug 2078986

Summary: [OVN SCALE] Scalability issues due to arp responder logical flows
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Dumitru Ceara <dceara>
Component: OVN
Assignee: Ales Musil <amusil>
Status: CLOSED WONTFIX
QA Contact: Jianlin Shi <jishi>
Severity: high
Priority: high
Docs Contact:
Version: FDP 22.C
CC: amusil, ctrautma, dcbw, jiji, mmichels, surya
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-08-04 14:14:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2084668
Bug Blocks:
Attachments: density-light-120node NB DB

Description Dumitru Ceara 2022-04-26 16:09:41 UTC
Created attachment 1875125
density-light-120node NB DB

Description of problem:

In a large scale deployment, e.g., during a density-light OpenShift
scale test running a cluster of 120 nodes and 13K pods, northd spends
a large amount of time processing and generating logical flows that
are used to reply to ARP requests.

With the attached database, focusing on a single logical port that
corresponds to an OCP POD (13b39b78-node-density-20220329_node-density-8311):

    port 13b39b78-node-density-20220329_node-density-8311
        addresses: ["0a:58:0a:a8:00:4f 10.168.0.79"]

There are two types of ARP responder flows:

1. In the logical switch pipeline:

  table=18(ls_in_arp_rsp      ), priority=100  , match=(arp.tpa == 10.168.0.79 && arp.op == 1 && inport == "13b39b78-node-density-20220329_node-density-8311"), action=(next;)
  table=18(ls_in_arp_rsp      ), priority=50   , match=(arp.tpa == 10.168.0.79 && arp.op == 1), action=(eth.dst = eth.src; eth.src = 0a:58:0a:a8:00:4f; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = 0a:58:0a:a8:00:4f; arp.tpa = arp.spa; arp.spa = 10.168.0.79; outport = inport; flags.loopback = 1; output;)

The flows above can probably be skipped if all the VIF logical ports
on that logical switch are claimed by the same chassis.  In that case
ARP requests never leave br-int, so there is no point in optimizing the
packet path with an explicit ARP responder flow; we can just as easily
let the VIF that owns the IP answer the ARP itself.
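The skip condition above can be sketched as a simple predicate.  This is
not ovn-northd's actual code (northd is written in C); `port_chassis` is
a hypothetical mapping from VIF logical port name to the chassis that
claims it, information that in practice comes from the SB Port_Binding
table:

```python
def arp_responder_skippable(port_chassis):
    """Return True if the priority-50 ARP responder flows for a logical
    switch can be skipped, i.e. every claimed VIF port on the switch is
    bound to the same chassis, so ARP requests never leave br-int.

    port_chassis: dict mapping VIF logical port name -> chassis name
    (None for ports that are not yet claimed).
    """
    chassis = {c for c in port_chassis.values() if c is not None}
    return len(chassis) <= 1

# Example: all pods of the switch on one node vs. spread across two.
same_node = {"pod-a": "node-1", "pod-b": "node-1"}
spread = {"pod-a": "node-1", "pod-b": "node-2"}
print(arp_responder_skippable(same_node))  # True
print(arp_responder_skippable(spread))     # False
```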

2. In the logical router pipeline:

  table=15(lr_in_arp_resolve  ), priority=100  , match=(outport == "rtos-ip-10-0-177-133.us-west-2.compute.internal" && reg0 == 10.168.0.79), action=(eth.dst = 0a:58:0a:a8:00:4f; next;)

These flows can probably be skipped if the logical router is configured
to dynamically resolve unknown next-hops, i.e., if the logical router
is configured with NB.Logical_Router.options:dynamic_neigh_routers=true.

In ovn-kubernetes the ovn_cluster_router does *not* have
dynamic_neigh_routers=true set, but there should be no reason not to
enable it.
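A minimal way to try this on a running deployment (command sketch;
`ovn_cluster_router` is the router name used by ovn-kubernetes, adjust
for other setups):

```shell
# Enable dynamic neighbor resolution on the cluster router so that
# per-next-hop lr_in_arp_resolve flows are no longer generated.
ovn-nbctl set Logical_Router ovn_cluster_router options:dynamic_neigh_routers=true

# Verify the option took effect.
ovn-nbctl get Logical_Router ovn_cluster_router options:dynamic_neigh_routers
```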

All in all, measuring the impact of not generating these two types of
logical flows in ovn-northd when running with the attached database,
one ovn-northd event processing loop iteration is reduced by ~300ms
(from ~1500ms to ~1200ms, a ~20% reduction).

Comment 3 Dan Williams 2023-08-04 13:50:20 UTC
Upstream patchset for MAC binding aging: http://patchwork.ozlabs.org/project/ovn/list/?series=366554&state=%2A&archive=both