Created attachment 1875125
density-light-120node NB DB

Description of problem:

In a large-scale deployment, e.g., during a density-light OpenShift scale test running a cluster of 120 nodes and 13K pods, northd spends a large amount of time processing and generating the logical flows that are used to reply to ARP requests.

With the attached database, focusing on a single logical port that corresponds to an OCP pod (13b39b78-node-density-20220329_node-density-8311):

    port 13b39b78-node-density-20220329_node-density-8311
        addresses: ["0a:58:0a:a8:00:4f 10.168.0.79"]

There are two types of ARP responder flows:

1. In the logical switch pipeline:

    table=18(ls_in_arp_rsp ), priority=100 , match=(arp.tpa == 10.168.0.79 && arp.op == 1 && inport == "13b39b78-node-density-20220329_node-density-8311"), action=(next;)
    table=18(ls_in_arp_rsp ), priority=50 , match=(arp.tpa == 10.168.0.79 && arp.op == 1), action=(eth.dst = eth.src; eth.src = 0a:58:0a:a8:00:4f; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = 0a:58:0a:a8:00:4f; arp.tpa = arp.spa; arp.spa = 10.168.0.79; outport = inport; flags.loopback = 1; output;)

The flows above can probably be skipped if all the VIF logical ports that are part of that logical switch are claimed by the same chassis. In such cases ARP requests never leave br-int, so there is no point in trying to optimize the packet flow with an explicit ARP responder flow. We can just as easily let the VIF that owns the IP reply to the ARP request itself.

2. In the logical router pipeline:

    table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "rtos-ip-10-0-177-133.us-west-2.compute.internal" && reg0 == 10.168.0.79), action=(eth.dst = 0a:58:0a:a8:00:4f; next;)

These flows can probably be skipped if the logical router is configured to dynamically resolve unknown next-hops, i.e., if the logical router is configured with NB.Logical_Router.options:dynamic_neigh_routers=true.
In ovn-kubernetes the ovn_cluster_router does *not* have dynamic_neigh_routers=true, but there should be no reason not to enable it.

All in all, measuring the impact of not generating these two types of logical flows in ovn-northd when running with the attached database, we see that one ovn-northd event processing loop iteration is reduced by ~300ms (from ~1500ms to ~1200ms), i.e., by roughly 20%.
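
To make the two proposed skip conditions concrete, here is a minimal Python sketch. It is not ovn-northd code (northd is written in C), and the names VifPort, can_skip_ls_arp_responder and can_skip_lr_arp_resolve are hypothetical, introduced only for illustration. It assumes each VIF port records the chassis that claimed it and that router options are a plain string-to-string map, as in NB.Logical_Router.options.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional


@dataclass(frozen=True)
class VifPort:
    """Hypothetical, simplified view of a VIF logical switch port."""
    name: str
    chassis: Optional[str]  # chassis that claimed the port, None if unclaimed


def can_skip_ls_arp_responder(vif_ports: Iterable[VifPort]) -> bool:
    """Condition 1: every VIF port on the switch is claimed by one chassis.

    Conservative on purpose: an empty switch or any unclaimed port
    means the responder flows are kept.
    """
    chassis = {port.chassis for port in vif_ports}
    return len(chassis) == 1 and None not in chassis


def can_skip_lr_arp_resolve(router_options: Mapping[str, str]) -> bool:
    """Condition 2: the router resolves unknown next-hops dynamically."""
    return router_options.get("dynamic_neigh_routers", "false") == "true"


if __name__ == "__main__":
    ports = [VifPort("pod-a", "node-1"), VifPort("pod-b", "node-1")]
    print(can_skip_ls_arp_responder(ports))
    print(can_skip_lr_arp_resolve({"dynamic_neigh_routers": "true"}))
```

Treating unclaimed ports as a reason to keep the flows is a design choice of this sketch, not something stated above; a real implementation would also have to handle ports being claimed or moved after the flows were skipped.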
Upstream patchset for MAC binding aging: http://patchwork.ozlabs.org/project/ovn/list/?series=366554&state=%2A&archive=both