Bug 2097782

Summary: Revisit revalidator flow-size reduction algorithm
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Adrián Moreno <amorenoz>
Component: openvswitch
Sub component: daemons and tools
Assignee: Timothy Redaelli <tredaelli>
QA Contact: qding
Status: NEW
Severity: unspecified
Priority: unspecified
CC: ctrautma, jhsiao
Version: FDP 20.E   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Type: Bug

Description Adrián Moreno 2022-06-16 14:31:03 UTC
Currently, the revalidator has the following logic:

            duration = MAX(time_msec() - start_time, 1);
            if (duration > 2000) {
                flow_limit /= duration / 1000;
            } else if (duration > 1300) {
                flow_limit = flow_limit * 3 / 4;
            } else if (duration < 1000 &&
                       flow_limit < n_flows * 1000 / duration) {
                flow_limit += 1000;
            }

The goal of this mechanism is to guarantee that changes are applied to the datapath within a "reasonable time": 2 seconds.
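
To make the arithmetic easier to reason about, here is a minimal standalone sketch of one adjustment round. It only mirrors the lines quoted above; next_flow_limit() is a hypothetical helper name (not an OVS function), the parameter types are assumptions, and anything the surrounding OVS code does outside the quoted snippet is not modelled.

```c
/* Hypothetical helper, not part of OVS: applies one round of the
 * adjustment quoted above and returns the new flow limit.
 * All arithmetic is integer arithmetic, as in the quoted snippet.
 * The parameter types are assumptions made for this sketch. */
static unsigned int
next_flow_limit(unsigned int flow_limit, unsigned long long n_flows,
                long long duration_ms)
{
    if (duration_ms < 1) {
        duration_ms = 1;                    /* Mirrors MAX(..., 1). */
    }

    if (duration_ms > 2000) {
        /* Integer division: 2001..2999 ms divides by 2, 3000..3999 ms
         * divides by 3, and so on. */
        flow_limit /= duration_ms / 1000;
    } else if (duration_ms > 1300) {
        /* Slower than 1.3 s (up to and including 2 s): drop to 75%. */
        flow_limit = flow_limit * 3 / 4;
    } else if (duration_ms < 1000
               && flow_limit < n_flows * 1000 / duration_ms) {
        /* Faster than 1 s, and the limit is below the number of flows
         * that could be processed in 1 s at this rate: grow by 1000. */
        flow_limit += 1000;
    }
    /* Between 1000 ms and 1300 ms the limit is left unchanged. */
    return flow_limit;
}
```

Taking 200000 as an arbitrary starting limit: a 2500 ms round halves it to 100000, a 3500 ms round cuts it to 66666, and a 1500 ms round lowers it to 150000. Because of the integer division, every duration from 2001 ms to 2999 ms halves the limit, while recovery is at most 1000 flows per round. For example, next_flow_limit(10000, 50000, 500) returns 11000.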

In an overloaded system, reducing the number of flows in the cache causes flows to be evicted. This can lead to a higher number of upcalls, which in turn puts more pressure on the upcall handlers (which typically use the same cores as the revalidators) and may result in packet drops.

This task is to revisit this logic, test it under high pressure, and see whether we can make OVS more robust or at least find a good balance between revalidation time and upcalls.

Since we're seeing deployments where ovs-vswitchd is restricted to a small number of CPUs (e.g., PAO), this becomes more relevant.