Bug 1945415
| Summary: | OVN builds too many lflows for ARP responding for load balancer VIPs | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Tim Rozet <trozet> |
| Component: | OVN | Assignee: | Ilya Maximets <i.maximets> |
| Status: | CLOSED ERRATA | QA Contact: | ying xu <yinxu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | RHEL 8.0 | CC: | ctrautma, dblack, dceara, i.maximets, keyoung, mark.d.gray, rsevilla, smalleni |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ovn2.13-20.12.0-118.el7fdp ovn2.13-20.12.0-118.el8fdp ovn-2021-21.03.0-34.el8fdp ovn-2021-21.03.0-34.el7fdp | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-21 14:44:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1943631, 1954122 | | |
Description
Tim Rozet
2021-03-31 20:19:03 UTC
I prepared a patch for this case: https://github.com/igsilya/ovn/commit/122a90c02086221b112789b59ab9abe45ec1ef8c

It will need some polishing and a DDlog implementation before it can be accepted upstream, but it seems to work fine.

Current OVN works like this:

```
for_each_gateway_port(port) {
    for_each_load_balancer_ip(ip) {
        add_arp_flow(datapath, port, ip);
    }
}
```

where add_arp_flow() generates a flow like this:

```
match  : inport == **port** && arp.op == 1 && arp.tpa == **ip**
actions: eth.dst = eth.src; eth.src = xreg0[0..47];
         arp.op = 2; /* ARP reply */
         arp.tha = arp.sha; arp.sha = xreg0[0..47];
         arp.tpa = arp.spa; arp.spa = **ip**;
         outport = inport; flags.loopback = 1; output;
```

This flow matches on 'arp.tpa == **ip**', but it also has the actions 'arp.tpa = arp.spa; arp.spa = **ip**;'. Instead of overwriting arp.tpa with arp.spa, we can simply swap them, which gives an action like 'arp.tpa <-> arp.spa;'. Result:

```
match  : inport == **port** && arp.op == 1 && arp.tpa == **ip**
actions: eth.dst = eth.src; eth.src = xreg0[0..47];
         arp.op = 2; /* ARP reply */
         arp.tha = arp.sha; arp.sha = xreg0[0..47];
         arp.tpa <-> arp.spa;
         outport = inport; flags.loopback = 1; output;
```

Now the actions are constant, i.e. they no longer depend on the port or the IP. At this point we can replace the single IP in the match with an address set, or simply a list of all the IPs relevant for this port:

```
match  : inport == **port** && arp.op == 1 && arp.tpa == **all-ips**
```

The loop transforms into:

```
for_each_gateway_port(port) {
    all_ips = '';
    for_each_load_balancer_ip(ip) {
        all_ips += ip;
    }
    add_arp_flow(datapath, port, all_ips);
}
```

So, instead of N_PORTS * N_IPs lflows we end up with N_PORTS lflows. In the case from BZ1943631 this change reduces the total number of lflows from ~850K down to ~350K and reduces the DB size from 500 MB to 200 MB.

I have a scratch build with this patch applied: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=35891888

@Raul, could you give it a shot in your test?

Oh, it seems that the base branch (on top of which I prepared the scratch build) has a bug in the load balancer code, so it may be worth holding off on testing until we figure it out and a new build is prepared.

Dumitru figured out the problem on current master and prepared a fix: https://patchwork.ozlabs.org/project/ovn/patch/20210401092539.1009-1-dceara@redhat.com/

I refined my 'arp flow' fix, applied the patch above, and prepared a new scratch build that should be OK to test with: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=35901176

Refined version of the 'arp flow' fix: https://github.com/igsilya/ovn/commit/732365025682cdb2987d601d3caedee1a94dfcf7

I updated how the strings are joined and added some new tests. It still needs a DDlog implementation, though.
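For illustration, here is a minimal self-contained C sketch of the per-port aggregation described above. This is not the actual ovn-northd code: the input arrays and the printed "flow emission" are hypothetical stand-ins (real northd pulls these values from the northbound database and builds the match with its dynamic-string helpers), but the join logic is the same idea.

```c
#include <stdio.h>

/* Hypothetical inputs; real values come from the NB database. */
static const char *gateway_ports[] = { "lr1ls1", "lr1ls2", "lr1p" };
static const char *lb_ips[]        = { "192.168.2.1", "30.0.0.1" };

int main(void)
{
    char match[512];

    for (size_t p = 0; p < sizeof gateway_ports / sizeof *gateway_ports; p++) {
        /* Join all VIPs into one "{ ip1, ip2, ... }" match set instead of
         * emitting a separate lflow for every (port, ip) pair. */
        int n = snprintf(match, sizeof match,
                         "inport == \"%s\" && arp.op == 1 && arp.tpa == { ",
                         gateway_ports[p]);
        for (size_t i = 0; i < sizeof lb_ips / sizeof *lb_ips; i++) {
            n += snprintf(match + n, sizeof match - n, "%s%s",
                          i ? ", " : "", lb_ips[i]);
        }
        snprintf(match + n, sizeof match - n, " }");

        /* The action string is constant thanks to 'arp.tpa <-> arp.spa',
         * so a single lflow per port suffices; here we only print it. */
        printf("match: %s\n", match);
    }
    return 0;
}
```

With these two example VIPs the sketch prints one match per port, e.g. `inport == "lr1ls1" && arp.op == 1 && arp.tpa == { 192.168.2.1, 30.0.0.1 }`, which is exactly the shape of the verified flows shown later in this bug.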
I've done some tests with the OVN build provided by Ilya. I successfully created 2,000 iterations of our cluster-density test suite on a 250-node cluster.
Some interesting data:
250 nodes (with column diff enabled)
sh-4.4# rpm -qa | grep ovn
ovn2.13-20.12.0-99.el8fdp.x86_64
ovn2.13-host-20.12.0-99.el8fdp.x86_64
ovn2.13-central-20.12.0-99.el8fdp.x86_64
ovn2.13-vtep-20.12.0-99.el8fdp.x86_64
Steady state
sh-4.4# ovn-sbctl --no-leader-only lflow-list | wc -l
182212
sh-4.4# ls -lh /etc/openvswitch/
total 66M
-rw-r-----. 1 root root 5.2M Apr 5 09:24 ovnnb_db.db
-rw-r-----. 1 root root 32M Apr 5 09:24 ovnsb_db.db
rsevilla@wonderland ~/Downloads $ oc adm top pods -l app=ovnkube-master --containers
POD NAME CPU(cores) MEMORY(bytes)
ovnkube-master-dzshl ovn-dbchecker 34m 32Mi
ovnkube-master-dzshl kube-rbac-proxy 0m 20Mi
ovnkube-master-dzshl northd 0m 344Mi
ovnkube-master-dzshl ovnkube-master 0m 1149Mi
ovnkube-master-dzshl sbdb 6m 1417Mi
ovnkube-master-dzshl nbdb 2m 183Mi
ovnkube-master-m2fsw ovnkube-master 568m 2396Mi
ovnkube-master-m2fsw kube-rbac-proxy 0m 21Mi
ovnkube-master-m2fsw nbdb 2m 218Mi
ovnkube-master-m2fsw northd 799m 515Mi
ovnkube-master-m2fsw ovn-dbchecker 0m 35Mi
ovnkube-master-m2fsw sbdb 9m 2269Mi
ovnkube-master-rl97b ovn-dbchecker 33m 33Mi
ovnkube-master-rl97b northd 0m 353Mi
ovnkube-master-rl97b nbdb 1m 184Mi
ovnkube-master-rl97b ovnkube-master 0m 1137Mi
ovnkube-master-rl97b sbdb 2m 2185Mi
ovnkube-master-rl97b kube-rbac-proxy 0m 20Mi
After creating 2000 cluster-density iterations
sh-4.4# ovn-sbctl --no-leader-only lflow-list | wc -l
6611424
sh-4.4# ls -lh /etc/openvswitch/
total 252M
-rw-r-----. 1 root root 18M Apr 5 14:06 ovnnb_db.db
-rw-r-----. 1 root root 235M Apr 5 14:05 ovnsb_db.db
# Resource usage from OVN control plane components after DB compaction
$ oc adm top pods -l app=ovnkube-master --containers
POD NAME CPU(cores) MEMORY(bytes)
ovnkube-master-dzshl ovnkube-master 0m 2473Mi
ovnkube-master-dzshl kube-rbac-proxy 0m 20Mi
ovnkube-master-dzshl northd 0m 1306Mi
ovnkube-master-dzshl sbdb 3m 5156Mi
ovnkube-master-dzshl nbdb 1m 543Mi
ovnkube-master-dzshl ovn-dbchecker 0m 34Mi
ovnkube-master-m2fsw sbdb 3m 11806Mi
ovnkube-master-m2fsw nbdb 2m 734Mi
ovnkube-master-m2fsw ovnkube-master 19m 4174Mi
ovnkube-master-m2fsw northd 0m 1328Mi
ovnkube-master-m2fsw ovn-dbchecker 0m 34Mi
ovnkube-master-m2fsw kube-rbac-proxy 0m 22Mi
ovnkube-master-rl97b ovnkube-master 0m 2492Mi
ovnkube-master-rl97b northd 590m 5451Mi
ovnkube-master-rl97b kube-rbac-proxy 0m 20Mi
ovnkube-master-rl97b sbdb 3m 6105Mi
ovnkube-master-rl97b ovn-dbchecker 199m 35Mi
ovnkube-master-rl97b nbdb 2m 386Mi
# Flow count w/o datapath group
sh-4.4# ovsdb-tool query ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | sed 's/logical_dp_group/\nlogical_dp_group/g' | wc -l
313618
# Top 20 most frequent non-grouped flows
sh-4.4# ovsdb-tool query ovnsb_db-standalone.db '["OVN_Southbound",{"op":"select","table":"Logical_Flow","where":[["logical_dp_group", "==", ["set", []]]]}]' | sed 's/logical_dp_group/\nlogical_dp_group/g' | grep -oE "ovn-northd.c:[0-9]*" | sort | uniq -c | sort| tail -n 20
783 ovn-northd.c:8822
840 ovn-northd.c:6648
932 ovn-northd.c:10859
932 ovn-northd.c:9282
1260 ovn-northd.c:5166
1260 ovn-northd.c:5169
1543 ovn-northd.c:9223
2750 ovn-northd.c:8710
9605 ovn-northd.c:7186
9630 ovn-northd.c:7204
21350 ovn-northd.c:11574
21350 ovn-northd.c:4570
21350 ovn-northd.c:4587
21350 ovn-northd.c:4641
21560 ovn-northd.c:10294
21980 ovn-northd.c:7615
22400 ovn-northd.c:5023
22400 ovn-northd.c:5118
42700 ovn-northd.c:4682
42700 ovn-northd.c:4719
During the test, I didn't see any database leader change. Attaching ovnsb database to the BZ in case you want to take a look at it.
Seems like the old build expired. Here is a new one (and a more permanent link): http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/ovn2.13/20.12.0/99.el8fdp/

v1 sent for review: https://patchwork.ozlabs.org/project/ovn/patch/20210507162256.3661118-1-i.maximets@ovn.org/

Test on the old version:
Set a load_balancer on the logical router and add more than one VIP:
# ovn-nbctl list load_balancer
_uuid : d77b4550-46ef-4598-a3a5-ab7d1ccbd8dc
external_ids : {}
health_check : [8b5d65df-68ba-4eb8-8e79-91b64e9651c5]
ip_port_mappings : {"192.168.0.1"="ls1p1:192.168.0.254"}
name : lb0
options : {}
protocol : udp
selection_fields : []
vips : {"192.168.2.1:12345"="192.168.0.1:12345,192.168.0.2:12345", "30.0.0.1:8000"="192.168.0.1:12345,192.168.0.2:12345", "[3000::100]:12345"="[3001::1]:12345,[3001::2]:12345"}
Check the flows; there are separate flows for 192.168.2.1 and 30.0.0.1:
[root@dell-per730-19 load_balance]# ovn-sbctl dump-flows|grep 192.168.2.1
table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls1" && arp.op == 1 && arp.tpa == 192.168.2.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 192.168.2.1; outport = inport; flags.loopback = 1; output;)
table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls2" && arp.op == 1 && arp.tpa == 192.168.2.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 192.168.2.1; outport = inport; flags.loopback = 1; output;)
table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1p" && arp.op == 1 && arp.tpa == 192.168.2.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 192.168.2.1; outport = inport; flags.loopback = 1; output;)
[root@dell-per730-19 load_balance]# ovn-sbctl dump-flows|grep 30.0.0.1
table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls1" && arp.op == 1 && arp.tpa == 30.0.0.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 30.0.0.1; outport = inport; flags.loopback = 1; output;)
table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls2" && arp.op == 1 && arp.tpa == 30.0.0.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 30.0.0.1; outport = inport; flags.loopback = 1; output;)
table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1p" && arp.op == 1 && arp.tpa == 30.0.0.1), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa = arp.spa; arp.spa = 30.0.0.1; outport = inport; flags.loopback = 1; output;)
Verified on version:
# rpm -qa|grep ovn
ovn-2021-central-21.03.0-40.el8fdp.x86_64
ovn-2021-host-21.03.0-40.el8fdp.x86_64
ovn-2021-21.03.0-40.el8fdp.x86_64
# ovn-nbctl list load_balancer
_uuid : 9fcc692e-a252-4a34-b727-8e2c25d8f8cc
external_ids : {}
health_check : [5c76f7b7-fc57-4958-9936-cb8c03368205]
ip_port_mappings : {"192.168.0.1"="ls1p1:192.168.0.254"}
name : lb0
options : {}
protocol : udp
selection_fields : []
vips : {"192.168.2.1:12345"="192.168.0.1:12345,192.168.0.2:12345", "30.0.0.1:8000"="192.168.0.1:12345,192.168.0.2:12345", "[3000::100]:12345"="[3001::1]:12345,[3001::2]:12345"}
Check the flows; all VIPs now share a single flow per port:
# ovn-sbctl dump-flows|grep 30.0.0.1
table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls1" && arp.op == 1 && arp.tpa == { 192.168.2.1, 30.0.0.1 }), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa <-> arp.spa; outport = inport; flags.loopback = 1; output;)
table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1ls2" && arp.op == 1 && arp.tpa == { 192.168.2.1, 30.0.0.1 }), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa <-> arp.spa; outport = inport; flags.loopback = 1; output;)
table=3 (lr_in_ip_input ), priority=90 , match=(inport == "lr1p" && arp.op == 1 && arp.tpa == { 192.168.2.1, 30.0.0.1 }), action=(eth.dst = eth.src; eth.src = xreg0[0..47]; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = xreg0[0..47]; arp.tpa <-> arp.spa; outport = inport; flags.loopback = 1; output;)
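The reason all the VIPs can be folded into a single match set is that the `arp.tpa <-> arp.spa` swap makes the action string identical for every VIP. A minimal C illustration of the equivalence follows; the struct layout is simplified and this is not OVN code:

```c
#include <stdint.h>
#include <stdio.h>

struct arp_hdr { uint32_t spa, tpa; };  /* simplified; real ARP has more fields */

/* Old per-VIP action: arp.tpa = arp.spa; arp.spa = VIP; */
void reply_overwrite(struct arp_hdr *a, uint32_t vip)
{
    a->tpa = a->spa;
    a->spa = vip;              /* the action text differs for every VIP */
}

/* New shared action: arp.tpa <-> arp.spa; */
void reply_swap(struct arp_hdr *a)
{
    uint32_t tmp = a->tpa;     /* tpa already equals the matched VIP */
    a->tpa = a->spa;
    a->spa = tmp;              /* same result, but no VIP in the action */
}

int main(void)
{
    /* ARP request from 192.168.0.5 for VIP 30.0.0.1. */
    struct arp_hdr a = { .spa = 0xC0A80005, .tpa = 0x1E000001 };
    struct arp_hdr b = a;

    reply_overwrite(&a, 0x1E000001 /* 30.0.0.1 */);
    reply_swap(&b);
    printf("same reply: %d\n", a.spa == b.spa && a.tpa == b.tpa);  /* prints 1 */
    return 0;
}
```

Because the flow only matches when arp.tpa already equals one of the VIPs, swapping the two fields produces the same reply as the old per-VIP overwrite, which is what allows `{ 192.168.2.1, 30.0.0.1 }` to share one flow per port.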
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2507