Description of problem:

Whenever a Load_Balancer is updated, e.g., a VIP is added, the following sequence of events happens:

1. The Southbound Load_Balancer record is updated.
2. The Southbound Datapath_Binding records to which the Load_Balancer is applied are updated.
3. Southbound ovsdb-server sends updates about the Load_Balancer and Datapath_Binding records to ovn-controller.
4. The IDL layer in ovn-controller processes the updates from step 3, but because of the SB schema references between tables [0], all logical flows referencing the updated Datapath_Binding are marked as "updated". The same is true for Logical_DP_Group records referencing the Datapath_Binding, and also for all logical flows pointing to the newly "updated" datapath groups.
5. ovn-controller ends up recomputing (removing and re-adding) all flows for all these tracked updates.

[0] From the SB schema:

    "Datapath_Binding": {
        "columns": {
            [...]
            "load_balancers": {"type": {"key": {"type": "uuid",
                                                "refTable": "Load_Balancer",
                                                "refType": "weak"},
                                        "min": 0, "max": "unlimited"}},
    [...]
    "Load_Balancer": {
        "columns": {
            "datapaths": {
                [...]
                "type": {"key": {"type": "uuid",
                                 "refTable": "Datapath_Binding"},
                         "min": 0, "max": "unlimited"}},
    [...]
    "Logical_DP_Group": {
        "columns": {
            "datapaths": {"type": {"key": {"type": "uuid",
                                           "refTable": "Datapath_Binding",
                                           "refType": "weak"},
                                   "min": 0, "max": "unlimited"}}},
    [...]
    "Logical_Flow": {
        "columns": {
            "logical_datapath": {"type": {"key": {"type": "uuid",
                                                  "refTable": "Datapath_Binding"},
                                          "min": 0, "max": 1}},
            "logical_dp_group": {"type": {"key": {"type": "uuid",
                                                  "refTable": "Logical_DP_Group"},

Version-Release number of selected component (if applicable):

Upstream OVN v21.06.0.

Potential solution:

Stop populating the SB.Datapath_Binding.load_balancers column. This would break the "update notification chain" when a load balancer is updated in the southbound.

This column is used only when a new Datapath_Binding is added, to determine which load balancer flows have to be installed for the new datapath. However, it's quite easy to determine those without explicitly storing the list of load balancers in the datapath record. That way, a Load_Balancer record update will not trigger a Datapath_Binding update, and in turn it won't cause all logical flows corresponding to the datapath to be updated.
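The lookup described in the potential solution can be illustrated with a small sketch: the load balancers that apply to a datapath are found by scanning each load balancer's own list of datapaths, so no per-datapath list needs to be stored. The input format (one load balancer per line: its name followed by the identifiers of its datapaths) and the function name lbs_for_datapath are hypothetical and chosen only for illustration; this is not how ovn-controller implements the fix.

```shell
#!/bin/sh
# Hypothetical helper (illustration only, not OVN code).
# Prints the name of every load balancer that applies to datapath "$1".
# Input on stdin, one load balancer per line, whitespace-separated:
#   <lb-name> <datapath-id> [<datapath-id> ...]
# This assumed format is a simplified stand-in for Load_Balancer.datapaths.
lbs_for_datapath() {
    awk -v dp="$1" '{
        for (i = 2; i <= NF; i++) {
            # Appending "" forces a string comparison on both sides.
            if (($i "") == (dp "")) {
                print $1
                break
            }
        }
    }'
}

# Example call:
#   printf "lb0 dp1 dp2\nlb1 dp2\n" | lbs_for_datapath dp2
```

With the two example lines shown in the comment, asking for dp2 prints lb0 and lb1, one per line. The point of the sketch is that the answer comes entirely from the load balancer side, so updating a load balancer never requires touching a datapath record.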
Fix sent for review: http://patchwork.ozlabs.org/project/ovn/list/?series=252352&state=*
v2 sent for review: http://patchwork.ozlabs.org/project/ovn/list/?series=253029&state=*
v3 sent for review: http://patchwork.ozlabs.org/project/ovn/list/?series=253094&state=*
Follow-up patch to fix the load_balancer add/delete update: http://patchwork.ozlabs.org/project/ovn/list/?series=255245&state=*
Reproduced on version:

    # rpm -qa|grep ovn
    ovn-2021-host-21.06.0-29.el8fdp.x86_64
    ovn-2021-central-21.06.0-29.el8fdp.x86_64
    ovn-2021-21.06.0-29.el8fdp.x86_64

Use the script below. The topology looks like this:

    vm0---------s0-----------r1-------s1----------vm1
                 |
                 rn
                 |
                 sn
                 |
                 vmn    (n>=400)

    #sw public
    ovn-nbctl set NB_GLOBAL . options:northd_probe_interval=180000
    ovn-nbctl set connection . inactivity_probe=180000
    ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=180
    ovs-vsctl set open . external_ids:ovn-remote-probe-interval=180000
    ovn-sbctl set connection . inactivity_probe=180000

    ovn-nbctl ls-add public
    ovn-nbctl lb-add lb0 100.10.1.2:880 172.16.1.2:9000

    # r1
    i=1
    for m in `seq 0 9`;do
        for n in `seq 1 99`;do
            ovn-nbctl lr-add r${i}
            ovn-nbctl lrp-add r${i} r${i}_public 00:de:ad:ff:$m:$n 172.16.$m.$n/16
            ovn-nbctl lrp-add r${i} r${i}_s${i} 00:de:ad:fe:$m:$n 173.$m.$n.1/24
            ovn-nbctl lr-nat-add r${i} dnat_and_snat 172.16.${m}.$((n+100)) 173.$m.$n.2
            ovn-nbctl lrp-set-gateway-chassis r${i}_public hv1

            # s1
            ovn-nbctl ls-add s${i}
            ovn-nbctl ls-lb-add s${i} lb0

            # s1 - r1
            ovn-nbctl lsp-add s${i} s${i}_r${i}
            ovn-nbctl lsp-set-type s${i}_r${i} router
            ovn-nbctl lsp-set-addresses s${i}_r${i} "00:de:ad:fe:$m:$n 173.$m.$n.1"
            ovn-nbctl lsp-set-options s${i}_r${i} router-port=r${i}_s${i}

            # s1 - vm1
            ovn-nbctl lsp-add s$i vm$i
            ovn-nbctl lsp-set-addresses vm$i "00:de:ad:01:$m:$n 173.$m.$n.2"

            ovn-nbctl lrp-add r$i r${i}_public 40:44:00:00:$m:$n 172.16.$m.$n/16
            ovn-nbctl lsp-add public public_r${i}
            ovn-nbctl lsp-set-type public_r${i} router
            ovn-nbctl lsp-set-addresses public_r${i} router
            ovn-nbctl lsp-set-options public_r${i} router-port=r${i}_public nat-addresses=router

            let i++
            if [ $i -gt 300 ];then
                break;
            fi
        done
        if [ $i -gt 300 ];then
            break;
        fi
    done

    ovn-nbctl lsp-add public ln_p1
    ovn-nbctl lsp-set-addresses ln_p1 unknown
    ovn-nbctl lsp-set-type ln_p1 localnet
    ovn-nbctl lsp-set-options ln_p1 network_name=nattest

After OVN has installed all flows, check the ovn-controller log:

    # cat /var/log/ovn/ovn-controller.log |tail -n 1
    2021-10-26T09:47:42.418Z|00958|timeval|WARN|context switches: 0 voluntary, 2270 involuntary

Then add a new VIP to the LB:

    ovn-nbctl lb-add lb0 100.10.2.3:80 172.16.1.2:8000

Wait some time, then check the log again:

    # cat /var/log/ovn/ovn-controller.log |tail -n 5
    2021-10-26T09:47:42.418Z|00958|timeval|WARN|context switches: 0 voluntary, 2270 involuntary
    2021-10-26T09:57:14.533Z|00959|timeval|WARN|Unreasonably long 1901ms poll interval (970ms user, 8ms system)   -----------------------------here
    2021-10-26T09:57:14.533Z|00960|timeval|WARN|faults: 5907 minor, 0 major
    2021-10-26T09:57:14.534Z|00961|timeval|WARN|context switches: 0 voluntary, 653 involuntary
    2021-10-26T09:57:14.534Z|00962|coverage|INFO|Dropped 6 log messages in last 619 seconds (most recently, 572 seconds ago) due to excessive rate

With the fix, check the log before adding a new VIP to the LB:

    # cat /var/log/ovn/ovn-controller.log |tail -n 1
    2021-10-26T10:13:59.595Z|00945|coverage|INFO|102 events never hit

After adding the VIP:

    # cat /var/log/ovn/ovn-controller.log |tail -n 10
    2021-10-26T10:13:59.595Z|00936|coverage|INFO|stream_open 0.0/sec 0.000/sec 0.0019/sec total: 7
    2021-10-26T10:13:59.595Z|00937|coverage|INFO|util_xalloc 336.6/sec 730081.517/sec 48148.7742/sec total: 186319518
    2021-10-26T10:13:59.595Z|00938|coverage|INFO|vconn_open 0.0/sec 0.000/sec 0.0014/sec total: 5
    2021-10-26T10:13:59.595Z|00939|coverage|INFO|vconn_received 0.0/sec 0.700/sec 1.2583/sec total: 4530
    2021-10-26T10:13:59.595Z|00940|coverage|INFO|vconn_sent 120.4/sec 468.467/sec 137.4411/sec total: 497031
    2021-10-26T10:13:59.595Z|00941|coverage|INFO|netlink_received 0.0/sec 1.133/sec 1.9711/sec total: 7100
    2021-10-26T10:13:59.595Z|00942|coverage|INFO|netlink_recv_jumbo 0.0/sec 0.283/sec 0.4925/sec total: 1774
    2021-10-26T10:13:59.595Z|00943|coverage|INFO|netlink_sent 0.0/sec 1.133/sec 1.9711/sec total: 7100
    2021-10-26T10:13:59.595Z|00944|coverage|INFO|cmap_expand 0.0/sec 0.000/sec 0.0008/sec total: 3   --------------no new logs
    2021-10-26T10:13:59.595Z|00945|coverage|INFO|102 events never hit

Deleting the LB does not trigger an "Unreasonably long poll interval" warning either.

Version:

    # rpm -qa|grep ovn
    ovn-2021-host-21.09.0-12.el8fdp.x86_64
    ovn-2021-central-21.09.0-12.el8fdp.x86_64
    ovn-2021-21.09.0-12.el8fdp.x86_64

Set to verified.
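The manual before/after log check in this comment could be scripted roughly as follows. This is a minimal sketch that assumes the ovn-controller log line format seen in the excerpts of this comment; the function name long_poll_intervals is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (illustration only).
# For every "Unreasonably long <N>ms poll interval" warning in log file "$1",
# print "<timestamp> <N>". Assumes '|'-separated log lines whose first field
# is the timestamp, as in the ovn-controller.log excerpts in this report.
long_poll_intervals() {
    sed -n 's/^\([^|]*\)|.*|Unreasonably long \([0-9][0-9]*\)ms poll interval.*/\1 \2/p' "$1"
}

# Example call:
#   long_poll_intervals /var/log/ovn/ovn-controller.log
```

Running it once before and once after the lb-add and comparing the two outputs shows whether the update produced a new long poll interval.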
Also verified on version:

    # rpm -qa|grep ovn
    ovn2.13-host-20.12.0-185.el8fdp.x86_64
    ovn2.13-central-20.12.0-185.el8fdp.x86_64
    ovn2.13-20.12.0-185.el8fdp.x86_64

Before adding the new VIP:

    # cat /var/log/ovn/ovn-controller.log |tail -n 10
    2021-10-26T11:15:10.079Z|00820|coverage|INFO|vconn_received 0.0/sec 0.517/sec 0.3950/sec total: 1422
    2021-10-26T11:15:10.079Z|00821|coverage|INFO|vconn_sent 0.0/sec 522.767/sec 125.8050/sec total: 452898
    2021-10-26T11:15:10.079Z|00822|coverage|INFO|netlink_received 0.0/sec 2.000/sec 1.9967/sec total: 7188
    2021-10-26T11:15:10.079Z|00823|coverage|INFO|netlink_recv_jumbo 0.0/sec 0.500/sec 0.4989/sec total: 1796
    2021-10-26T11:15:10.079Z|00824|coverage|INFO|netlink_sent 0.0/sec 2.000/sec 1.9967/sec total: 7188
    2021-10-26T11:15:10.079Z|00825|coverage|INFO|cmap_expand 0.0/sec 0.000/sec 0.0008/sec total: 3
    2021-10-26T11:15:10.079Z|00826|coverage|INFO|97 events never hit
    2021-10-26T11:15:18.361Z|00827|timeval|WARN|Unreasonably long 7779ms poll interval (3728ms user, 72ms system)
    2021-10-26T11:15:18.361Z|00828|timeval|WARN|faults: 80618 minor, 0 major
    2021-10-26T11:15:18.361Z|00829|timeval|WARN|context switches: 0 voluntary, 807 involuntary

After adding it, no new logs are shown:

    [root@dell-per730-19 bz1776712_broadcast_limit]# ovn-nbctl lb-add lb0 100.10.2.3:80 172.16.1.2:8000
    [root@dell-per730-19 bz1776712_broadcast_limit]# cat /var/log/ovn/ovn-controller.log |tail -n 10
    2021-10-26T11:15:10.079Z|00820|coverage|INFO|vconn_received 0.0/sec 0.517/sec 0.3950/sec total: 1422
    2021-10-26T11:15:10.079Z|00821|coverage|INFO|vconn_sent 0.0/sec 522.767/sec 125.8050/sec total: 452898
    2021-10-26T11:15:10.079Z|00822|coverage|INFO|netlink_received 0.0/sec 2.000/sec 1.9967/sec total: 7188
    2021-10-26T11:15:10.079Z|00823|coverage|INFO|netlink_recv_jumbo 0.0/sec 0.500/sec 0.4989/sec total: 1796
    2021-10-26T11:15:10.079Z|00824|coverage|INFO|netlink_sent 0.0/sec 2.000/sec 1.9967/sec total: 7188
    2021-10-26T11:15:10.079Z|00825|coverage|INFO|cmap_expand 0.0/sec 0.000/sec 0.0008/sec total: 3
    2021-10-26T11:15:10.079Z|00826|coverage|INFO|97 events never hit
    2021-10-26T11:15:18.361Z|00827|timeval|WARN|Unreasonably long 7779ms poll interval (3728ms user, 72ms system)
    2021-10-26T11:15:18.361Z|00828|timeval|WARN|faults: 80618 minor, 0 major
    2021-10-26T11:15:18.361Z|00829|timeval|WARN|context switches: 0 voluntary, 807 involuntary
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:5059