Bug 1978605
| Summary: | [OVN SCALE][ovn-controller] Load_Balancer changes trigger unnecessary additional flow computations. | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Dumitru Ceara <dceara> |
| Component: | OVN | Assignee: | Dumitru Ceara <dceara> |
| Status: | CLOSED ERRATA | QA Contact: | ying xu <yinxu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | FDP 20.H | CC: | ctrautma, i.maximets, jiji, kfida, nusiddiq, surya, trozet |
| Target Milestone: | --- | ||
| Target Release: | FDP 21.I | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | perfscale-ovn | ||
| Fixed In Version: | ovn21.09-21.09.0-10.el8fdp ovn-2021-21.09.0-5.el8fdp ovn2.13-20.12.0-183.el8fdp | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-12-09 15:37:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1996886, 2011385 | ||
Fix sent for review: http://patchwork.ozlabs.org/project/ovn/list/?series=252352&state=*
v2 sent for review: http://patchwork.ozlabs.org/project/ovn/list/?series=253029&state=*
v3 sent for review: http://patchwork.ozlabs.org/project/ovn/list/?series=253094&state=*
Follow-up patch to fix load_balancer add/delete update: http://patchwork.ozlabs.org/project/ovn/list/?series=255245&state=*

Reproduced on version:
rpm -qa|grep ovn
ovn-2021-host-21.06.0-29.el8fdp.x86_64
ovn-2021-central-21.06.0-29.el8fdp.x86_64
ovn-2021-21.06.0-29.el8fdp.x86_64
Reproduce with the script below, which builds this topology:
vm0---------s0-----------r1-------s1----------vm1
|
rn
|
sn
|
vmn (n>=400)
#sw public
ovn-nbctl set NB_GLOBAL . options:northd_probe_interval=180000
ovn-nbctl set connection . inactivity_probe=180000
ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=180
ovs-vsctl set open . external_ids:ovn-remote-probe-interval=180000
ovn-sbctl set connection . inactivity_probe=180000
ovn-nbctl ls-add public
ovn-nbctl lb-add lb0 100.10.1.2:880 172.16.1.2:9000
# r1
i=1
for m in `seq 0 9`;do
for n in `seq 1 99`;do
ovn-nbctl lr-add r${i}
ovn-nbctl lrp-add r${i} r${i}_public 00:de:ad:ff:$m:$n 172.16.$m.$n/16
ovn-nbctl lrp-add r${i} r${i}_s${i} 00:de:ad:fe:$m:$n 173.$m.$n.1/24
ovn-nbctl lr-nat-add r${i} dnat_and_snat 172.16.${m}.$((n+100)) 173.$m.$n.2
ovn-nbctl lrp-set-gateway-chassis r${i}_public hv1
# s1
ovn-nbctl ls-add s${i}
ovn-nbctl ls-lb-add s${i} lb0
# s1 - r1
ovn-nbctl lsp-add s${i} s${i}_r${i}
ovn-nbctl lsp-set-type s${i}_r${i} router
ovn-nbctl lsp-set-addresses s${i}_r${i} "00:de:ad:fe:$m:$n 173.$m.$n.1"
ovn-nbctl lsp-set-options s${i}_r${i} router-port=r${i}_s${i}
# s1 - vm1
ovn-nbctl lsp-add s$i vm$i
ovn-nbctl lsp-set-addresses vm$i "00:de:ad:01:$m:$n 173.$m.$n.2"
ovn-nbctl lrp-add r$i r${i}_public 40:44:00:00:$m:$n 172.16.$m.$n/16
ovn-nbctl lsp-add public public_r${i}
ovn-nbctl lsp-set-type public_r${i} router
ovn-nbctl lsp-set-addresses public_r${i} router
ovn-nbctl lsp-set-options public_r${i} router-port=r${i}_public nat-addresses=router
let i++
if [ $i -gt 300 ];then
break;
fi
done
if [ $i -gt 300 ];then
break;
fi
done
ovn-nbctl lsp-add public ln_p1
ovn-nbctl lsp-set-addresses ln_p1 unknown
ovn-nbctl lsp-set-type ln_p1 localnet
ovn-nbctl lsp-set-options ln_p1 network_name=nattest
After OVN has installed all flows, check the ovn-controller log:
cat /var/log/ovn/ovn-controller.log |tail -n 1
2021-10-26T09:47:42.418Z|00958|timeval|WARN|context switches: 0 voluntary, 2270 involuntary
Then add a new VIP to the LB:
ovn-nbctl lb-add lb0 100.10.2.3:80 172.16.1.2:8000
Wait a while, then check the log again:
cat /var/log/ovn/ovn-controller.log |tail -n 5
2021-10-26T09:47:42.418Z|00958|timeval|WARN|context switches: 0 voluntary, 2270 involuntary
2021-10-26T09:57:14.533Z|00959|timeval|WARN|Unreasonably long 1901ms poll interval (970ms user, 8ms system) -----------------------------here
2021-10-26T09:57:14.533Z|00960|timeval|WARN|faults: 5907 minor, 0 major
2021-10-26T09:57:14.534Z|00961|timeval|WARN|context switches: 0 voluntary, 653 involuntary
2021-10-26T09:57:14.534Z|00962|coverage|INFO|Dropped 6 log messages in last 619 seconds (most recently, 572 seconds ago) due to excessive rate
With the fix, check the log before adding a new VIP to the LB:
# cat /var/log/ovn/ovn-controller.log |tail -n 1
2021-10-26T10:13:59.595Z|00945|coverage|INFO|102 events never hit
After adding the VIP:
# cat /var/log/ovn/ovn-controller.log |tail -n 10
2021-10-26T10:13:59.595Z|00936|coverage|INFO|stream_open 0.0/sec 0.000/sec 0.0019/sec total: 7
2021-10-26T10:13:59.595Z|00937|coverage|INFO|util_xalloc 336.6/sec 730081.517/sec 48148.7742/sec total: 186319518
2021-10-26T10:13:59.595Z|00938|coverage|INFO|vconn_open 0.0/sec 0.000/sec 0.0014/sec total: 5
2021-10-26T10:13:59.595Z|00939|coverage|INFO|vconn_received 0.0/sec 0.700/sec 1.2583/sec total: 4530
2021-10-26T10:13:59.595Z|00940|coverage|INFO|vconn_sent 120.4/sec 468.467/sec 137.4411/sec total: 497031
2021-10-26T10:13:59.595Z|00941|coverage|INFO|netlink_received 0.0/sec 1.133/sec 1.9711/sec total: 7100
2021-10-26T10:13:59.595Z|00942|coverage|INFO|netlink_recv_jumbo 0.0/sec 0.283/sec 0.4925/sec total: 1774
2021-10-26T10:13:59.595Z|00943|coverage|INFO|netlink_sent 0.0/sec 1.133/sec 1.9711/sec total: 7100
2021-10-26T10:13:59.595Z|00944|coverage|INFO|cmap_expand 0.0/sec 0.000/sec 0.0008/sec total: 3 --------------no new logs
2021-10-26T10:13:59.595Z|00945|coverage|INFO|102 events never hit
Deleting the LB does not produce an "Unreasonably long poll interval" warning either.
Verified on version:
# rpm -qa|grep ovn
ovn-2021-host-21.09.0-12.el8fdp.x86_64
ovn-2021-central-21.09.0-12.el8fdp.x86_64
ovn-2021-21.09.0-12.el8fdp.x86_64
Set to verified.
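The manual log check used in this verification can be scripted. The following is a minimal sketch, not part of the original test: the function name `long_polls_since` is hypothetical, and it assumes the ovn-controller log keeps the `timestamp|seq|module|level|message` layout shown in the excerpts above. It prints only the "Unreasonably long ... poll interval" warnings logged after a given timestamp.

```shell
#!/bin/sh
# long_polls_since TIMESTAMP
# Hypothetical helper: reads an ovn-controller log on stdin and prints the
# "Unreasonably long ... poll interval" lines whose timestamp (first
# '|'-separated field) sorts after TIMESTAMP. The timestamps are ISO 8601
# in UTC, so a plain string comparison orders them correctly.
long_polls_since() {
    awk -F'|' -v since="$1" '
        # Appending "" forces a string (not numeric) comparison.
        ($1 "") > (since "") && /Unreasonably long [0-9]+ms poll interval/
    '
}
```

A possible use is to note the timestamp of the last log line before running `ovn-nbctl lb-add`, then run `long_polls_since "$ts" < /var/log/ovn/ovn-controller.log` afterwards. Empty output means no long poll interval was logged for the change.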
Also verified on version:
# rpm -qa|grep ovn
ovn2.13-host-20.12.0-185.el8fdp.x86_64
ovn2.13-central-20.12.0-185.el8fdp.x86_64
ovn2.13-20.12.0-185.el8fdp.x86_64
Before adding a new VIP:
# cat /var/log/ovn/ovn-controller.log |tail -n 10
2021-10-26T11:15:10.079Z|00820|coverage|INFO|vconn_received 0.0/sec 0.517/sec 0.3950/sec total: 1422
2021-10-26T11:15:10.079Z|00821|coverage|INFO|vconn_sent 0.0/sec 522.767/sec 125.8050/sec total: 452898
2021-10-26T11:15:10.079Z|00822|coverage|INFO|netlink_received 0.0/sec 2.000/sec 1.9967/sec total: 7188
2021-10-26T11:15:10.079Z|00823|coverage|INFO|netlink_recv_jumbo 0.0/sec 0.500/sec 0.4989/sec total: 1796
2021-10-26T11:15:10.079Z|00824|coverage|INFO|netlink_sent 0.0/sec 2.000/sec 1.9967/sec total: 7188
2021-10-26T11:15:10.079Z|00825|coverage|INFO|cmap_expand 0.0/sec 0.000/sec 0.0008/sec total: 3
2021-10-26T11:15:10.079Z|00826|coverage|INFO|97 events never hit
2021-10-26T11:15:18.361Z|00827|timeval|WARN|Unreasonably long 7779ms poll interval (3728ms user, 72ms system)
2021-10-26T11:15:18.361Z|00828|timeval|WARN|faults: 80618 minor, 0 major
2021-10-26T11:15:18.361Z|00829|timeval|WARN|context switches: 0 voluntary, 807 involuntary
After adding the VIP, no new logs are shown:
[root@dell-per730-19 bz1776712_broadcast_limit]# ovn-nbctl lb-add lb0 100.10.2.3:80 172.16.1.2:8000
[root@dell-per730-19 bz1776712_broadcast_limit]# cat /var/log/ovn/ovn-controller.log |tail -n 10
2021-10-26T11:15:10.079Z|00820|coverage|INFO|vconn_received 0.0/sec 0.517/sec 0.3950/sec total: 1422
2021-10-26T11:15:10.079Z|00821|coverage|INFO|vconn_sent 0.0/sec 522.767/sec 125.8050/sec total: 452898
2021-10-26T11:15:10.079Z|00822|coverage|INFO|netlink_received 0.0/sec 2.000/sec 1.9967/sec total: 7188
2021-10-26T11:15:10.079Z|00823|coverage|INFO|netlink_recv_jumbo 0.0/sec 0.500/sec 0.4989/sec total: 1796
2021-10-26T11:15:10.079Z|00824|coverage|INFO|netlink_sent 0.0/sec 2.000/sec 1.9967/sec total: 7188
2021-10-26T11:15:10.079Z|00825|coverage|INFO|cmap_expand 0.0/sec 0.000/sec 0.0008/sec total: 3
2021-10-26T11:15:10.079Z|00826|coverage|INFO|97 events never hit
2021-10-26T11:15:18.361Z|00827|timeval|WARN|Unreasonably long 7779ms poll interval (3728ms user, 72ms system)
2021-10-26T11:15:18.361Z|00828|timeval|WARN|faults: 80618 minor, 0 major
2021-10-26T11:15:18.361Z|00829|timeval|WARN|context switches: 0 voluntary, 807 involuntary

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (ovn bug fix and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:5059
Description of problem:

Whenever a Load_Balancer is updated, e.g., a VIP is added, the following sequence of events happens:
1. The Southbound Load_Balancer record is updated.
2. The Southbound Datapath_Binding records on which the Load_Balancer is applied are updated.
3. Southbound ovsdb-server sends updates about the Load_Balancer and Datapath_Binding records to ovn-controller.
4. The IDL layer in ovn-controller processes the updates from step 3, but because of the SB schema references between tables [0], all logical flows referencing the updated Datapath_Binding are marked as "updated". The same is true for Logical_DP_Group records referencing the Datapath_Binding, and also for all logical flows pointing to the newly "updated" datapath groups.
5. ovn-controller ends up recomputing (removing/re-adding) all flows for all these tracked updates.

[0] From the SB Schema:
"Datapath_Binding": {
    "columns": {
        [...]
        "load_balancers": {"type": {"key": {"type": "uuid", "refTable": "Load_Balancer", "refType": "weak"}, "min": 0, "max": "unlimited"}},
[...]
"Load_Balancer": {
    "columns": {
        "datapaths": {
            [...]
            "type": {"key": {"type": "uuid", "refTable": "Datapath_Binding"}, "min": 0, "max": "unlimited"}},
[...]
"Logical_DP_Group": {
    "columns": {
        "datapaths": {"type": {"key": {"type": "uuid", "refTable": "Datapath_Binding", "refType": "weak"}, "min": 0, "max": "unlimited"}}},
[...]
"Logical_Flow": {
    "columns": {
        "logical_datapath": {"type": {"key": {"type": "uuid", "refTable": "Datapath_Binding"}, "min": 0, "max": 1}},
        "logical_dp_group": {"type": {"key": {"type": "uuid", "refTable": "Logical_DP_Group"},

Version-Release number of selected component (if applicable):
Upstream OVN v21.06.0.

Potential solution:
Stop populating the SB.Datapath_Binding.load_balancers column. This would break the "update notification chain" when a load balancer is updated in the southbound database.
This column is used only when a new Datapath_Binding is added, to determine which load balancer flows have to be installed for the new datapath. However, it is quite easy to determine those without explicitly storing the list of load balancers in the datapath record. This way, a Load_Balancer record update will not trigger a Datapath_Binding update and, in turn, will not cause all logical flows corresponding to the datapath to be updated.
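
One way to see whether the update chain described above can still be triggered is to check whether any Datapath_Binding row carries load balancer references. The sketch below is an illustration, not taken from the fix: the helper name `count_dps_with_lbs` is hypothetical, and it assumes that the `load_balancers` column is still present in the installed SB schema and that `ovn-sbctl --bare` prints one line per record (empty for an empty set), with blank lines between records.

```shell
#!/bin/sh
# Hypothetical helper: counts Datapath_Binding rows whose load_balancers
# set is non-empty. It expects on stdin the output of:
#   ovn-sbctl --bare --columns=load_balancers list Datapath_Binding
# Assumption: every non-blank line corresponds to one row that lists at
# least one Load_Balancer UUID.
count_dps_with_lbs() {
    awk 'NF { n++ } END { print n + 0 }'
}
```

For example, `ovn-sbctl --bare --columns=load_balancers list Datapath_Binding | count_dps_with_lbs` printing 0 would suggest that Load_Balancer updates no longer touch Datapath_Binding records through this column.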