Bug 1944098 - [OVN-SCALE] ovn-controller: OF rule explosion for ACLs with conjunctive matches applied on multiple datapaths
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: FDP 20.H
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OVN Team
QA Contact: Ehsan Elahi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-03-29 09:41 UTC by Dumitru Ceara
Modified: 2021-07-29 20:05 UTC (History)

Fixed In Version: ovn-2021-21.06.0-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-29 20:05:04 UTC
Target Upstream Version:
Embargoed:


Attachments
OVN NB database. (8.66 MB, text/plain)
2021-03-29 09:41 UTC, Dumitru Ceara


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2021:2969  0        None      None    None     2021-07-29 20:05:13 UTC

Description Dumitru Ceara 2021-03-29 09:41:35 UTC
Created attachment 1767290
OVN NB database.

Description of problem:

In ovn-kubernetes (or similar) deployments, ACLs used for implementing network policies are applied to port groups that include all ports of the namespace. This translates to ACLs being applied independently to all logical switches that have ports included in the port group.

To differentiate between the logical datapaths on which the ACL is applied, ovn-controller generates one flow per datapath, appending an additional "metadata=<datapath-tunnel-key>" match to the match expression parsed from the ACL's logical flow match.

This duplication of OF rules (once for each logical switch) creates an OF rule explosion in ovn-controller/ovs-vswitchd.
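
The duplication can be pictured with a small model. This is only an illustrative sketch in Python, not ovn-controller code: the function name `expand_per_datapath` is hypothetical, and the match string and tunnel keys are sample values.

```python
def expand_per_datapath(acl_match, tunnel_keys):
    """Return one OpenFlow-style match string per logical datapath.

    Simplified model: the ACL match is identical every time; only the
    prepended metadata (datapath tunnel key) value changes.
    """
    return [f"metadata={key:#x},{acl_match}" for key in tunnel_keys]


# Sample tunnel keys for three logical switches that share one port group.
flows = expand_per_datapath("ip,nw_dst=17.143.0.5", [0x3EA, 0x430, 0x297])
for flow in flows:
    print(flow)
```

With N logical switches in the port group, the model emits N rules for every match parsed from the ACL, which is the growth pattern reported here.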

For example, with the attached OVN NB database extracted from a scale test run, and with the following interfaces bound to a single-node OVN deployment:

lports=(lp_17.1.0.9 lp_17.1.0.10 lp_17.1.0.11 lp_17.1.0.12 lp_17.1.0.13 lp_17.1.0.14 lp_17.1.0.15 lp_17.1.0.16 lp_17.1.0.17 lp_17.1.0.18)
# Create one internal OVS port per logical port and bind it via iface-id.
for lp in "${lports[@]}"; do
    ovs-vsctl add-port br-int "$lp" \
        -- set interface "$lp" type=internal \
        -- set interface "$lp" external_ids:iface-id="$lp"
done

To avoid SB/OVS disconnects, also increase the probe timeouts (ovn-openflow-probe-interval is in seconds, the other two are in milliseconds):
ovn-sbctl set connection . inactivity_probe=180000
ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=180
ovs-vsctl set open . external_ids:ovn-remote-probe-interval=180000

We notice in the ovn-controller log:

2021-03-26T22:36:48.436Z|24385|timeval|WARN|Unreasonably long 47246ms poll interval (45010ms user, 1997ms system)
...
2021-03-26T22:51:55.727Z|24839|memory|INFO|peak resident set size grew 53% in last 21677.7 seconds, from 3855720 kB to 5881796 kB
2021-03-26T22:51:55.727Z|24840|memory|INFO|lflow-cache-entries-cache-conj-id:16 lflow-cache-entries-cache-matches:164344 lflow-cache-size-KB:785612
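
To see where the long poll interval is spent, the warning can be split into its components. The following is a rough Python sketch; the helper name `parse_poll_warning` is hypothetical and the regular expression is written only against the log line quoted above.

```python
import re

# Matches the timeval warning format shown in the log excerpt above.
POLL_RE = re.compile(
    r"Unreasonably long (\d+)ms poll interval \((\d+)ms user, (\d+)ms system\)"
)


def parse_poll_warning(line):
    """Return (total_ms, user_ms, system_ms), or None for other log lines."""
    found = POLL_RE.search(line)
    return tuple(int(value) for value in found.groups()) if found else None


line = (
    "2021-03-26T22:36:48.436Z|24385|timeval|WARN|Unreasonably long "
    "47246ms poll interval (45010ms user, 1997ms system)"
)
total_ms, user_ms, system_ms = parse_poll_warning(line)
print(f"user CPU share: {user_ms / total_ms:.0%}")
```

Most of the interval is user CPU time, which suggests ovn-controller is busy computing flows rather than waiting on I/O.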

Focusing on the OF rules generated by ovn-controller from ACLs:
# grep conj /tmp/OF-rules | grep -e 'conjunction(18,' -e 'conj_id=18' | grep "17.143.0.5" | head -10
 cookie=0x0, duration=181.726s, table=45, n_packets=0, n_bytes=0, idle_age=181, priority=2010,ip,reg0=0x80/0x80,metadata=0x3ea,nw_dst=17.143.0.5 actions=conjunction(18,1/2)
 cookie=0x0, duration=181.420s, table=45, n_packets=0, n_bytes=0, idle_age=181, priority=2010,ip,reg0=0x80/0x80,metadata=0x430,nw_dst=17.143.0.5 actions=conjunction(18,1/2)
 cookie=0x0, duration=180.893s, table=45, n_packets=0, n_bytes=0, idle_age=180, priority=2010,ip,reg0=0x80/0x80,metadata=0x297,nw_dst=17.143.0.5 actions=conjunction(18,1/2)
 cookie=0x0, duration=180.777s, table=45, n_packets=0, n_bytes=0, idle_age=180, priority=2010,ip,reg0=0x80/0x80,metadata=0x3cf,nw_dst=17.143.0.5 actions=conjunction(18,1/2)
 cookie=0x0, duration=180.519s, table=45, n_packets=0, n_bytes=0, idle_age=180, priority=2010,ip,reg0=0x80/0x80,metadata=0x43f,nw_dst=17.143.0.5 actions=conjunction(18,1/2)
 cookie=0x0, duration=180.349s, table=45, n_packets=0, n_bytes=0, idle_age=180, priority=2010,ip,reg0=0x80/0x80,metadata=0x3fd,nw_dst=17.143.0.5 actions=conjunction(18,1/2)
 cookie=0x0, duration=179.866s, table=45, n_packets=0, n_bytes=0, idle_age=179, priority=2010,ip,reg0=0x80/0x80,metadata=0x49c,nw_dst=17.143.0.5 actions=conjunction(18,1/2)
 cookie=0x0, duration=179.698s, table=45, n_packets=0, n_bytes=0, idle_age=179, priority=2010,ip,reg0=0x80/0x80,metadata=0x3be,nw_dst=17.143.0.5 actions=conjunction(18,1/2)
 cookie=0x0, duration=179.637s, table=45, n_packets=0, n_bytes=0, idle_age=179, priority=2010,ip,reg0=0x80/0x80,metadata=0x488,nw_dst=17.143.0.5 actions=conjunction(18,1/2)
 cookie=0x0, duration=179.556s, table=45, n_packets=0, n_bytes=0, idle_age=179, priority=2010,ip,reg0=0x80/0x80,metadata=0x40d,nw_dst=17.143.0.5 actions=conjunction(18,1/2)

The only difference between the above flow matches is the metadata value (logical datapath tunnel key).

# grep conj /tmp/OF-rules | grep -e 'conjunction(18,' -e 'conj_id=18' | grep "17.143.0.5" | wc -l
200

The same flow is repeated 200 times because the port group (PG) includes ports from 200 logical switches.
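
The same check can be done for every flow at once by grouping flows that differ only in their metadata value. This is a minimal Python sketch under stated assumptions: `group_by_metadata` is a hypothetical helper, the regular expressions only cover the unmasked `metadata=0x...` form seen in the dump above, and the two sample lines are copied from that dump.

```python
import re
from collections import defaultdict

METADATA_RE = re.compile(r"metadata=(0x[0-9a-fA-F]+),?")
# Fields that change on every dump and say nothing about the match itself.
VOLATILE_RE = re.compile(r"(duration|n_packets|n_bytes|idle_age)=[^,\s]+,?\s*")


def group_by_metadata(lines):
    """Map each flow, with metadata and counters stripped, to its metadata values."""
    groups = defaultdict(set)
    for line in lines:
        found = METADATA_RE.search(line)
        if not found:
            continue  # flow does not match on a logical datapath
        key = VOLATILE_RE.sub("", METADATA_RE.sub("", line)).strip()
        groups[key].add(found.group(1))
    return groups


SAMPLE = [
    " cookie=0x0, duration=181.726s, table=45, n_packets=0, n_bytes=0, idle_age=181, priority=2010,ip,reg0=0x80/0x80,metadata=0x3ea,nw_dst=17.143.0.5 actions=conjunction(18,1/2)",
    " cookie=0x0, duration=181.420s, table=45, n_packets=0, n_bytes=0, idle_age=181, priority=2010,ip,reg0=0x80/0x80,metadata=0x430,nw_dst=17.143.0.5 actions=conjunction(18,1/2)",
]
for flow, datapaths in group_by_metadata(SAMPLE).items():
    print(len(datapaths), flow)
```

Feeding the lines of a full flow dump to the helper instead of `SAMPLE` would list, per distinct flow, how many logical datapaths it was duplicated for.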

As the ACLs are very similar (they differ only in port groups and address sets), this scenario happens for all ACLs.

The total number of conjunctive match OF rules is:

# grep -c conj /tmp/OF-rules
8004048

On this specific setup, if the metadata match were included in the conjunctive match, the number of OF rules would decrease by a factor of 200.
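
As a back-of-the-envelope check of that factor, the arithmetic can be written out in Python. This is an estimate only: it assumes every conjunctive rule counted above is duplicated once per logical switch, and it ignores any extra rules that would be needed to match the datapath itself.

```python
total_conj_rules = 8004048  # from `grep -c conj /tmp/OF-rules` above
datapaths_per_pg = 200      # logical switches with ports in the port group

# Assumption: each conjunctive rule exists once per datapath today.
estimated_rules = total_conj_rules / datapaths_per_pg
print(f"about {estimated_rules:.0f} conjunctive rules after deduplication")
```

Under that assumption the roughly 8 million conjunctive rules would shrink to roughly 40 thousand.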

The same issue was also reported upstream:
https://mail.openvswitch.org/pipermail/ovs-dev/2021-March/381082.html

Comment 2 Dan Williams 2021-05-25 18:23:49 UTC
I think 1-4 in http://patchwork.ozlabs.org/project/ovn/list/?series=241037&state=%2A&archive=both are prereqs of that commit, right?

Comment 4 Dumitru Ceara 2021-06-02 11:39:12 UTC
The following patch needs to be accepted upstream too and also backported along with the aforementioned ones:

http://patchwork.ozlabs.org/project/ovn/patch/20210602070731.3736171-1-hzhou@ovn.org/

Comment 12 errata-xmlrpc 2021-07-29 20:05:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2969

