Bug 1945415 - OVN builds too many lflows for ARP responding for load balancer VIPs
Summary: OVN builds too many lflows for ARP responding for load balancer VIPs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Ilya Maximets
QA Contact: ying xu
URL:
Whiteboard:
Depends On:
Blocks: 1943631 1954122
 
Reported: 2021-03-31 20:19 UTC by Tim Rozet
Modified: 2021-07-28 15:25 UTC
CC List: 8 users

Fixed In Version: ovn2.13-20.12.0-118.el7fdp ovn2.13-20.12.0-118.el8fdp ovn-2021-21.03.0-34.el8fdp ovn-2021-21.03.0-34.el7fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-21 14:44:39 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Product Errata RHBA-2021:2507 (Last Updated: 2021-06-21 14:46:02 UTC)

Description Tim Rozet 2021-03-31 20:19:03 UTC
Description of problem:
We see in https://bugzilla.redhat.com/show_bug.cgi?id=1943631 that 655k logical flows are generated as we scale up k8s services. A Kubernetes service translates to one or more VIPs that exist on logical switches, as well as on gateway routers. Northd generates a logical flow for each one of these. We need to reduce at least the number of logical flows here.
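
(Illustrative arithmetic, not measured data: with N_PORTS router ports answering ARP and N_IPS load-balancer VIPs, one lflow per (port, VIP) pair yields N_PORTS * N_IPS ARP-responder lflows; e.g. 100 ports * 6,550 VIPs = 655,000 lflows for ARP responding alone.)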

Comment 1 Ilya Maximets 2021-04-01 07:10:58 UTC
I prepared a patch for this case:
  https://github.com/igsilya/ovn/commit/122a90c02086221b112789b59ab9abe45ec1ef8c

It will need some polishing and a DDlog implementation before it can be
accepted upstream, but it seems to work fine.

Current OVN works like this:

  for_each_gateway_port(port) {
      for_each_load_balancer_ip(ip) {
          add_arp_flow(datapath, port, ip);
      }
  }

Where add_arp_flow() generates a flow like this:

  match  : inport == **port** && arp.op == 1 && arp.tpa == **ip**
  actions: eth.dst = eth.src;
           eth.src = xreg0[0..47];
           arp.op = 2; /* ARP reply */
           arp.tha = arp.sha;
           arp.sha = xreg0[0..47];
           arp.tpa = arp.spa;
           arp.spa = **ip**;
           outport = inport;
           flags.loopback = 1;
           output;

We can see that this flow matches on 'arp.tpa == **ip**', but it
also has actions 'arp.tpa = arp.spa; arp.spa = **ip**;'
Instead of overwriting arp.tpa with arp.spa, we can just swap them
and get an action like this: 'arp.tpa <-> arp.spa;'

Result:

  match  : inport == **port** && arp.op == 1 && arp.tpa == **ip**
  actions: eth.dst = eth.src;
           eth.src = xreg0[0..47];
           arp.op = 2; /* ARP reply */
           arp.tha = arp.sha;
           arp.sha = xreg0[0..47];
           arp.tpa <-> arp.spa;
           outport = inport;
           flags.loopback = 1;
           output;

Now we can see that the actions are constant, i.e. they don't depend on
the port or the IP.  At this point we can replace the single IP in the match
with an address set or just a list of all the IPs relevant for this port:

  match  : inport == **port** && arp.op == 1 && arp.tpa == **all-ips**

The loop will transform into:

  for_each_gateway_port(port) {
      all_ips = ''
      for_each_load_balancer_ip(ip) {
          all_ips += ip;
      }
      add_arp_flow(datapath, port, all_ips);
  }

So, instead of N_PORTS * N_IPS lflows we will have only N_PORTS.
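
To make the transformation concrete, here is a minimal standalone C sketch
of the grouping step (hypothetical code for illustration only; the actual
ovn-northd patch builds these strings with its own dynamic-string helpers):

  /* Build one ARP-responder match per port from the full VIP list,
   * instead of one match per (port, VIP) pair. */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      const char *ports[] = { "lr1ls1", "lr1ls2", "lr1p" };
      const char *vips[]  = { "192.168.2.1", "30.0.0.1" };
      char all_ips[256] = "";

      /* Join all VIPs into a single OVN set literal: "ip1, ip2, ...". */
      for (size_t i = 0; i < sizeof vips / sizeof *vips; i++) {
          if (i) {
              strcat(all_ips, ", ");
          }
          strcat(all_ips, vips[i]);
      }

      /* Emit N_PORTS matches instead of N_PORTS * N_IPS. */
      for (size_t i = 0; i < sizeof ports / sizeof *ports; i++) {
          printf("match: inport == \"%s\" && arp.op == 1 && "
                 "arp.tpa == { %s }\n", ports[i], all_ips);
      }
      return 0;
  }

Note that this only works because the swap made the actions identical for
every VIP; with the old 'arp.spa = **ip**' action, each VIP would still
need its own flow.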

In the case from BZ1943631 this change reduces the total number of lflows
from ~850K down to ~350K and the DB size from 500 MB to 200 MB.

I have a scratch build with this patch applied:
  https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=35891888

@Raul, could you give it a shot in your test?

Comment 2 Ilya Maximets 2021-04-01 08:10:48 UTC
Oh.  It seems that the base branch (on top of which I prepared the scratch
build) has a bug in the load balancer code.  So, it may be worth holding off
on testing until we figure it out and a new build is prepared.

Comment 3 Ilya Maximets 2021-04-01 09:50:33 UTC
Dumitru figured out the problem on current master and prepared a fix:
  https://patchwork.ozlabs.org/project/ovn/patch/20210401092539.1009-1-dceara@redhat.com/

I refined my 'arp flow' fix, applied the patch above and prepared a
new scratch build that should be OK to test with:

  https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=35901176

Refined version of the 'arp flow' fix:
  https://github.com/igsilya/ovn/commit/732365025682cdb2987d601d3caedee1a94dfcf7
I updated the way the string is joined and added some new tests.
It still needs a DDlog implementation, though.

Comment 4 Raul Sevilla 2021-04-05 15:05:40 UTC
I've done some tests with the OVN build provided by Ilya, and successfully created 2,000 iterations of our cluster-density test suite on a 250-node cluster.
Some interesting data:

250 nodes (with column diff enabled)

sh-4.4# rpm -qa | grep ovn
ovn2.13-20.12.0-99.el8fdp.x86_64
ovn2.13-host-20.12.0-99.el8fdp.x86_64
ovn2.13-central-20.12.0-99.el8fdp.x86_64
ovn2.13-vtep-20.12.0-99.el8fdp.x86_64


Steady state
sh-4.4# ovn-sbctl --no-leader-only lflow-list | wc -l
182212

sh-4.4# ls -lh /etc/openvswitch/
total 66M
-rw-r-----. 1 root root 5.2M Apr  5 09:24 ovnnb_db.db
-rw-r-----. 1 root root  32M Apr  5 09:24 ovnsb_db.db
rsevilla@wonderland ~/Downloads $ oc adm top pods -l app=ovnkube-master --containers
POD                    NAME              CPU(cores)   MEMORY(bytes)   
ovnkube-master-dzshl   ovn-dbchecker     34m          32Mi            
ovnkube-master-dzshl   kube-rbac-proxy   0m           20Mi            
ovnkube-master-dzshl   northd            0m           344Mi           
ovnkube-master-dzshl   ovnkube-master    0m           1149Mi          
ovnkube-master-dzshl   sbdb              6m           1417Mi          
ovnkube-master-dzshl   nbdb              2m           183Mi           
ovnkube-master-m2fsw   ovnkube-master    568m         2396Mi          
ovnkube-master-m2fsw   kube-rbac-proxy   0m           21Mi            
ovnkube-master-m2fsw   nbdb              2m           218Mi           
ovnkube-master-m2fsw   northd            799m         515Mi           
ovnkube-master-m2fsw   ovn-dbchecker     0m           35Mi            
ovnkube-master-m2fsw   sbdb              9m           2269Mi          
ovnkube-master-rl97b   ovn-dbchecker     33m          33Mi            
ovnkube-master-rl97b   northd            0m           353Mi           
ovnkube-master-rl97b   nbdb              1m           184Mi           
ovnkube-master-rl97b   ovnkube-master    0m           1137Mi          
ovnkube-master-rl97b   sbdb              2m           2185Mi          
ovnkube-master-rl97b   kube-rbac-proxy   0m           20Mi 


After creating 2000 cluster-density iterations
sh-4.4# ovn-sbctl --no-leader-only lflow-list | wc -l
6611424

sh-4.4# ls -lh /etc/openvswitch/
total 252M
-rw-r-----. 1 root root  18M Apr  5 14:06 ovnnb_db.db
-rw-r-----. 1 root root 235M Apr  5 14:05 ovnsb_db.db

# Resource usage of the OVN control plane components after DB compaction
$ oc adm top pods -l app=ovnkube-master --containers  
POD                    NAME              CPU(cores)   MEMORY(bytes)   
ovnkube-master-dzshl   ovnkube-master    0m           2473Mi          
ovnkube-master-dzshl   kube-rbac-proxy   0m           20Mi            
ovnkube-master-dzshl   northd            0m           1306Mi          
ovnkube-master-dzshl   sbdb              3m           5156Mi          
ovnkube-master-dzshl   nbdb              1m           543Mi           
ovnkube-master-dzshl   ovn-dbchecker     0m           34Mi            
ovnkube-master-m2fsw   sbdb              3m           11806Mi         
ovnkube-master-m2fsw   nbdb              2m           734Mi           
ovnkube-master-m2fsw   ovnkube-master    19m          4174Mi          
ovnkube-master-m2fsw   northd            0m           1328Mi          
ovnkube-master-m2fsw   ovn-dbchecker     0m           34Mi            
ovnkube-master-m2fsw   kube-rbac-proxy   0m           22Mi            
ovnkube-master-rl97b   ovnkube-master    0m           2492Mi          
ovnkube-master-rl97b   northd            590m         5451Mi          
ovnkube-master-rl97b   kube-rbac-proxy   0m           20Mi            
ovnkube-master-rl97b   sbdb              3m           6105Mi          
ovnkube-master-rl97b   ovn-dbchecker     199m         35Mi            
ovnkube-master-rl97b   nbdb              2m           386Mi

# Flow count w/o datapath group
sh-4.4#  ovsdb-tool query  ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | sed 's/logical_dp_group/\nlogical_dp_group/g' |  wc -l
313618
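
For comparison, the number of flows that do use a datapath group should be obtainable with the same query and the condition inverted ("!=" is a standard OVSDB condition function):

sh-4.4#  ovsdb-tool query  ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "!=", ["set", []]]]}]' | sed 's/logical_dp_group/\nlogical_dp_group/g' |  wc -l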

# Most frequent non-grouped flows (by ovn-northd.c source line)
sh-4.4#  ovsdb-tool query  ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | sed 's/logical_dp_group/\nlogical_dp_group/g' | grep -oE "ovn-northd.c:[0-9]*"  | sort | uniq -c | sort| tail -n 20
    783 ovn-northd.c:8822
    840 ovn-northd.c:6648
    932 ovn-northd.c:10859
    932 ovn-northd.c:9282
   1260 ovn-northd.c:5166
   1260 ovn-northd.c:5169
   1543 ovn-northd.c:9223
   2750 ovn-northd.c:8710
   9605 ovn-northd.c:7186
   9630 ovn-northd.c:7204
  21350 ovn-northd.c:11574
  21350 ovn-northd.c:4570
  21350 ovn-northd.c:4587
  21350 ovn-northd.c:4641
  21560 ovn-northd.c:10294
  21980 ovn-northd.c:7615
  22400 ovn-northd.c:5023
  22400 ovn-northd.c:5118
  42700 ovn-northd.c:4682
  42700 ovn-northd.c:4719

During the test, I didn't see any database leader change. Attaching the ovnsb database to the BZ in case you want to take a look at it.

Comment 6 Ilya Maximets 2021-04-20 19:52:03 UTC
Seems like the old build expired.  Here is a new one (and a more permanent link):

  http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/ovn2.13/20.12.0/99.el8fdp/

Comment 11 ying xu 2021-06-07 14:20:37 UTC
Tested on the old version:
Set a load_balancer on the logical router with more than one VIP:
# ovn-nbctl list load_balancer
_uuid               : d77b4550-46ef-4598-a3a5-ab7d1ccbd8dc
external_ids        : {}
health_check        : [8b5d65df-68ba-4eb8-8e79-91b64e9651c5]
ip_port_mappings    : {"192.168.0.1"="ls1p1:192.168.0.254"}
name                : lb0
options             : {}
protocol            : udp
selection_fields    : []
vips                : {"192.168.2.1:12345"="192.168.0.1:12345,192.168.0.2:12345", "30.0.0.1:8000"="192.168.0.1:12345,192.168.0.2:12345", "[3000::100]:12345"="[3001::1]:12345,[3001::2]:12345"}
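
For reference, a load balancer like this can be created with commands along these lines (a sketch; the router name lr1 and the attached switch names are taken from the flow dumps below):

# ovn-nbctl lb-add lb0 192.168.2.1:12345 192.168.0.1:12345,192.168.0.2:12345 udp
# ovn-nbctl lb-add lb0 30.0.0.1:8000 192.168.0.1:12345,192.168.0.2:12345 udp
# ovn-nbctl lb-add lb0 '[3000::100]:12345' '[3001::1]:12345,[3001::2]:12345' udp
# ovn-nbctl lr-lb-add lr1 lb0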

Check the flows; there are separate flows for 192.168.2.1 and 30.0.0.1:
[root@dell-per730-19 load_balance]# ovn-sbctl dump-flows|grep 192.168.2.1
  table=3 (lr_in_ip_input     ), priority=90   , match=(inport == "lr1ls1" && arp.op == 1 && arp.tpa == 192.168.2.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 192.168.2.1; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input     ), priority=90   , match=(inport == "lr1ls2" && arp.op == 1 && arp.tpa == 192.168.2.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 192.168.2.1; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input     ), priority=90   , match=(inport == "lr1p" && arp.op == 1 && arp.tpa == 192.168.2.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 192.168.2.1; outport = inport; flags.loopback = 1; output;)

[root@dell-per730-19 load_balance]# ovn-sbctl dump-flows|grep 30.0.0.1
  table=3 (lr_in_ip_input     ), priority=90   , match=(inport == "lr1ls1" && arp.op == 1 && arp.tpa == 30.0.0.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 30.0.0.1; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input     ), priority=90   , match=(inport == "lr1ls2" && arp.op == 1 && arp.tpa == 30.0.0.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 30.0.0.1; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input     ), priority=90   , match=(inport == "lr1p" && arp.op == 1 && arp.tpa == 30.0.0.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 30.0.0.1; outport = inport; flags.loopback = 1; output;)

Verified on version:
# rpm -qa|grep ovn
ovn-2021-central-21.03.0-40.el8fdp.x86_64
ovn-2021-host-21.03.0-40.el8fdp.x86_64
ovn-2021-21.03.0-40.el8fdp.x86_64

# ovn-nbctl list load_balancer
_uuid               : 9fcc692e-a252-4a34-b727-8e2c25d8f8cc
external_ids        : {}
health_check        : [5c76f7b7-fc57-4958-9936-cb8c03368205]
ip_port_mappings    : {"192.168.0.1"="ls1p1:192.168.0.254"}
name                : lb0
options             : {}
protocol            : udp
selection_fields    : []
vips                : {"192.168.2.1:12345"="192.168.0.1:12345,192.168.0.2:12345", "30.0.0.1:8000"="192.168.0.1:12345,192.168.0.2:12345", "[3000::100]:12345"="[3001::1]:12345,[3001::2]:12345"}

Check the flows; all the VIPs are now grouped into a single flow per port:
# ovn-sbctl dump-flows|grep 30.0.0.1
  table=3 (lr_in_ip_input     ), priority=90   , match=(inport == "lr1ls1" && arp.op == 1 && arp.tpa == { 192.168.2.1, 30.0.0.1 }), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa <-> arp.spa; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input     ), priority=90   , match=(inport == "lr1ls2" && arp.op == 1 && arp.tpa == { 192.168.2.1, 30.0.0.1 }), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa <-> arp.spa; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input     ), priority=90   , match=(inport == "lr1p" && arp.op == 1 && arp.tpa == { 192.168.2.1, 30.0.0.1 }), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa <-> arp.spa; outport = inport; flags.loopback = 1; output;)
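
A quick way to confirm the reduction (a sketch) is to count the ARP-responder flows in the router ingress pipeline before and after the upgrade:

# ovn-sbctl dump-flows | grep lr_in_ip_input | grep -c 'arp.op == 1'

With the fix, the count drops from one flow per (port, VIP) pair to one flow per port.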

Comment 13 errata-xmlrpc 2021-06-21 14:44:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2507

