Description of problem:

We see in https://bugzilla.redhat.com/show_bug.cgi?id=1943631 that 655k logical flows are generated as we scale up Kubernetes services. A Kubernetes service translates to one or more VIPs that exist on logical switches as well as on gateway routers, and northd generates a logical flow for each one of these. At a minimum, we need to reduce the number of logical flows generated here.
I prepared a patch for this case: https://github.com/igsilya/ovn/commit/122a90c02086221b112789b59ab9abe45ec1ef8c

It will need some polishing and a DDlog implementation before it can be accepted upstream, but it seems to work fine.

Current OVN works like this:

  for_each_gateway_port(port) {
      for_each_load_balancer_ip(ip) {
          add_arp_flow(datapath, port, ip);
      }
  }

where add_arp_flow() generates a flow like this:

  match  : inport == **port** && arp.op == 1 && arp.tpa == **ip**
  actions: eth.dst = eth.src; eth.src = xreg0[0..47];
           arp.op = 2; /* ARP reply */
           arp.tha = arp.sha; arp.sha = xreg0[0..47];
           arp.tpa = arp.spa; arp.spa = **ip**;
           outport = inport; flags.loopback = 1; output;

We can see that this flow matches on 'arp.tpa == **ip**', but it also has the actions 'arp.tpa = arp.spa; arp.spa = **ip**;'. Instead of overwriting arp.tpa with arp.spa, we can just swap them and get an action like this: 'arp.tpa <-> arp.spa;'

Result:

  match  : inport == **port** && arp.op == 1 && arp.tpa == **ip**
  actions: eth.dst = eth.src; eth.src = xreg0[0..47];
           arp.op = 2; /* ARP reply */
           arp.tha = arp.sha; arp.sha = xreg0[0..47];
           arp.tpa <-> arp.spa;
           outport = inport; flags.loopback = 1; output;

Now the actions are constant, i.e. they don't depend on the port or the IP. At this point we can replace the single IP in the match with an address set or just a list of all IPs relevant for this port:

  match  : inport == **port** && arp.op == 1 && arp.tpa == **all-ips**

The loop transforms into the following (see the standalone sketch after this comment):

  for_each_gateway_port(port) {
      all_ips = '';
      for_each_load_balancer_ip(ip) {
          all_ips += ip;
      }
      add_arp_flow(datapath, port, all_ips);
  }

So, instead of N_PORTS * N_IPS lflows we will have N_PORTS lflows. In the case from BZ 1943631 this change reduces the total number of lflows from ~850K down to ~350K and reduces the DB size from 500 MB to 200 MB.

I have a scratch build with this patch applied: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=35891888

@Raul, could you give it a shot in your test?
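For illustration only, here is a minimal standalone C sketch of the per-port aggregation described above. It is not the actual ovn-northd code: the port names, the VIP list, and the emit_flow() helper are made up for the example, and the real patch uses OVN's internal lflow helpers instead. The point is just that all VIPs relevant to a port are joined into one 'arp.tpa == { ... }' set so a single flow with constant actions can be emitted per port.

/*
 * Standalone sketch of the per-port ARP responder aggregation.
 * Everything here (port names, VIPs, emit_flow()) is illustrative only.
 */
#include <stdio.h>
#include <string.h>

/* Stand-in for ovn-northd's lflow insertion; it just prints the flow. */
static void
emit_flow(const char *match, const char *actions)
{
    printf("match  : %s\nactions: %s\n\n", match, actions);
}

int
main(void)
{
    const char *ports[] = { "lr1ls1", "lr1ls2", "lr1p" };   /* gateway ports */
    const char *vips[]  = { "192.168.2.1", "30.0.0.1" };    /* LB VIPs */

    /* Actions are constant: they no longer reference the individual VIP. */
    const char *actions =
        "eth.dst = eth.src; eth.src = xreg0[0..47]; "
        "arp.op = 2; /* ARP reply */ "
        "arp.tha = arp.sha; arp.sha = xreg0[0..47]; "
        "arp.tpa <-> arp.spa; "
        "outport = inport; flags.loopback = 1; output;";

    for (size_t p = 0; p < sizeof ports / sizeof ports[0]; p++) {
        char ip_list[256] = "";
        char match[512];

        /* Join all VIPs relevant for this port: "ip1, ip2, ...". */
        for (size_t i = 0; i < sizeof vips / sizeof vips[0]; i++) {
            if (i) {
                strcat(ip_list, ", ");
            }
            strcat(ip_list, vips[i]);
        }

        snprintf(match, sizeof match,
                 "inport == \"%s\" && arp.op == 1 && arp.tpa == { %s }",
                 ports[p], ip_list);

        /* One flow per port instead of one flow per (port, VIP) pair. */
        emit_flow(match, actions);
    }
    return 0;
}

Compiled with a plain C compiler, this prints one flow per port whose match carries the combined set, i.e. the same shape as the flows shown in the verification later in this bug.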
Oh. It seems that the base branch (on top of which I prepared the scratch build) has a bug in the load-balancer code. So it may be worth holding off on testing until we figure it out and a new build is prepared.
Dumitru figured out the problem on current master and prepared a fix:
https://patchwork.ozlabs.org/project/ovn/patch/20210401092539.1009-1-dceara@redhat.com/

I refined my 'arp flow' fix, applied the patch above, and prepared a new scratch build that should be OK to test with:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=35901176

Refined version of the 'arp flow' fix:
https://github.com/igsilya/ovn/commit/732365025682cdb2987d601d3caedee1a94dfcf7

I updated the way the IP string is joined and added some new tests. It still needs a DDlog implementation, though.
I've done some tests with the OVN build provided by Ilya. I successfully created 2k iterations of our cluster-density test suite on a 250-node cluster. Some interesting data:

250 nodes (with column diff enabled)

sh-4.4# rpm -qa | grep ovn
ovn2.13-20.12.0-99.el8fdp.x86_64
ovn2.13-host-20.12.0-99.el8fdp.x86_64
ovn2.13-central-20.12.0-99.el8fdp.x86_64
ovn2.13-vtep-20.12.0-99.el8fdp.x86_64

Steady state

sh-4.4# ovn-sbctl --no-leader-only lflow-list | wc -l
182212

sh-4.4# ls -lh /etc/openvswitch/
total 66M
-rw-r-----. 1 root root 5.2M Apr  5 09:24 ovnnb_db.db
-rw-r-----. 1 root root  32M Apr  5 09:24 ovnsb_db.db

rsevilla@wonderland ~/Downloads $ oc adm top pods -l app=ovnkube-master --containers
POD                    NAME              CPU(cores)   MEMORY(bytes)
ovnkube-master-dzshl   ovn-dbchecker     34m          32Mi
ovnkube-master-dzshl   kube-rbac-proxy   0m           20Mi
ovnkube-master-dzshl   northd            0m           344Mi
ovnkube-master-dzshl   ovnkube-master    0m           1149Mi
ovnkube-master-dzshl   sbdb              6m           1417Mi
ovnkube-master-dzshl   nbdb              2m           183Mi
ovnkube-master-m2fsw   ovnkube-master    568m         2396Mi
ovnkube-master-m2fsw   kube-rbac-proxy   0m           21Mi
ovnkube-master-m2fsw   nbdb              2m           218Mi
ovnkube-master-m2fsw   northd            799m         515Mi
ovnkube-master-m2fsw   ovn-dbchecker     0m           35Mi
ovnkube-master-m2fsw   sbdb              9m           2269Mi
ovnkube-master-rl97b   ovn-dbchecker     33m          33Mi
ovnkube-master-rl97b   northd            0m           353Mi
ovnkube-master-rl97b   nbdb              1m           184Mi
ovnkube-master-rl97b   ovnkube-master    0m           1137Mi
ovnkube-master-rl97b   sbdb              2m           2185Mi
ovnkube-master-rl97b   kube-rbac-proxy   0m           20Mi

After creating 2000 cluster-density iterations

sh-4.4# ovn-sbctl --no-leader-only lflow-list | wc -l
6611424

sh-4.4# ls -lh /etc/openvswitch/
total 252M
-rw-r-----. 1 root root  18M Apr  5 14:06 ovnnb_db.db
-rw-r-----. 1 root root 235M Apr  5 14:05 ovnsb_db.db

# Resource usage from OVN control plane components after DB compaction
$ oc adm top pods -l app=ovnkube-master --containers
POD                    NAME              CPU(cores)   MEMORY(bytes)
ovnkube-master-dzshl   ovnkube-master    0m           2473Mi
ovnkube-master-dzshl   kube-rbac-proxy   0m           20Mi
ovnkube-master-dzshl   northd            0m           1306Mi
ovnkube-master-dzshl   sbdb              3m           5156Mi
ovnkube-master-dzshl   nbdb              1m           543Mi
ovnkube-master-dzshl   ovn-dbchecker     0m           34Mi
ovnkube-master-m2fsw   sbdb              3m           11806Mi
ovnkube-master-m2fsw   nbdb              2m           734Mi
ovnkube-master-m2fsw   ovnkube-master    19m          4174Mi
ovnkube-master-m2fsw   northd            0m           1328Mi
ovnkube-master-m2fsw   ovn-dbchecker     0m           34Mi
ovnkube-master-m2fsw   kube-rbac-proxy   0m           22Mi
ovnkube-master-rl97b   ovnkube-master    0m           2492Mi
ovnkube-master-rl97b   northd            590m         5451Mi
ovnkube-master-rl97b   kube-rbac-proxy   0m           20Mi
ovnkube-master-rl97b   sbdb              3m           6105Mi
ovnkube-master-rl97b   ovn-dbchecker     199m         35Mi
ovnkube-master-rl97b   nbdb              2m           386Mi

# Flow count w/o datapath group
sh-4.4# ovsdb-tool query ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | sed 's/logical_dp_group/\nlogical_dp_group/g' | wc -l
313618

# Top frequent non-grouped flows
sh-4.4# ovsdb-tool query ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | sed 's/logical_dp_group/\nlogical_dp_group/g' | grep -oE "ovn-northd.c:[0-9]*" | sort | uniq -c | sort | tail -n 20
    783 ovn-northd.c:8822
    840 ovn-northd.c:6648
    932 ovn-northd.c:10859
    932 ovn-northd.c:9282
   1260 ovn-northd.c:5166
   1260 ovn-northd.c:5169
   1543 ovn-northd.c:9223
   2750 ovn-northd.c:8710
   9605 ovn-northd.c:7186
   9630 ovn-northd.c:7204
  21350 ovn-northd.c:11574
  21350 ovn-northd.c:4570
  21350 ovn-northd.c:4587
  21350 ovn-northd.c:4641
  21560 ovn-northd.c:10294
  21980 ovn-northd.c:7615
  22400 ovn-northd.c:5023
  22400 ovn-northd.c:5118
  42700 ovn-northd.c:4682
  42700 ovn-northd.c:4719

During the test, I didn't see any database leader change. Attaching the ovnsb database to the BZ in case you want to take a look at it.
Seems like the old build expired. Here is a new one (and a more permanent link): http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/ovn2.13/20.12.0/99.el8fdp/
v1 sent for review: https://patchwork.ozlabs.org/project/ovn/patch/20210507162256.3661118-1-i.maximets@ovn.org/
Test on the old version: set a load_balancer on the lr and add more than one VIP.

# ovn-nbctl list load_balancer
_uuid               : d77b4550-46ef-4598-a3a5-ab7d1ccbd8dc
external_ids        : {}
health_check        : [8b5d65df-68ba-4eb8-8e79-91b64e9651c5]
ip_port_mappings    : {"192.168.0.1"="ls1p1:192.168.0.254"}
name                : lb0
options             : {}
protocol            : udp
selection_fields    : []
vips                : {"192.168.2.1:12345"="192.168.0.1:12345,192.168.0.2:12345", "30.0.0.1:8000"="192.168.0.1:12345,192.168.0.2:12345", "[3000::100]:12345"="[3001::1]:12345,[3001::2]:12345"}

Check the flows; there are separate flows for 192.168.2.1 and 30.0.0.1:

[root@dell-per730-19 load_balance]# ovn-sbctl dump-flows | grep 192.168.2.1
  table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls1" && arp.op == 1 && arp.tpa == 192.168.2.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 192.168.2.1; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls2" && arp.op == 1 && arp.tpa == 192.168.2.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 192.168.2.1; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1p" && arp.op == 1 && arp.tpa == 192.168.2.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 192.168.2.1; outport = inport; flags.loopback = 1; output;)

[root@dell-per730-19 load_balance]# ovn-sbctl dump-flows | grep 30.0.0.1
  table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls1" && arp.op == 1 && arp.tpa == 30.0.0.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 30.0.0.1; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls2" && arp.op == 1 && arp.tpa == 30.0.0.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 30.0.0.1; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1p" && arp.op == 1 && arp.tpa == 30.0.0.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 30.0.0.1; outport = inport; flags.loopback = 1; output;)

Verified on version:

# rpm -qa | grep ovn
ovn-2021-central-21.03.0-40.el8fdp.x86_64
ovn-2021-host-21.03.0-40.el8fdp.x86_64
ovn-2021-21.03.0-40.el8fdp.x86_64

# ovn-nbctl list load_balancer
_uuid               : 9fcc692e-a252-4a34-b727-8e2c25d8f8cc
external_ids        : {}
health_check        : [5c76f7b7-fc57-4958-9936-cb8c03368205]
ip_port_mappings    : {"192.168.0.1"="ls1p1:192.168.0.254"}
name                : lb0
options             : {}
protocol            : udp
selection_fields    : []
vips                : {"192.168.2.1:12345"="192.168.0.1:12345,192.168.0.2:12345", "30.0.0.1:8000"="192.168.0.1:12345,192.168.0.2:12345", "[3000::100]:12345"="[3001::1]:12345,[3001::2]:12345"}

Check the flows; the VIPs are now combined into a single flow per port:
# ovn-sbctl dump-flows | grep 30.0.0.1
  table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls1" && arp.op == 1 && arp.tpa == { 192.168.2.1, 30.0.0.1 }), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa <-> arp.spa; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls2" && arp.op == 1 && arp.tpa == { 192.168.2.1, 30.0.0.1 }), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa <-> arp.spa; outport = inport; flags.loopback = 1; output;)
  table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1p" && arp.op == 1 && arp.tpa == { 192.168.2.1, 30.0.0.1 }), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa <-> arp.spa; outport = inport; flags.loopback = 1; output;)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2507