Bug 1978605 - [OVN SCALE][ovn-controller] Load_Balancer changes trigger unnecessary additional flow computations.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: FDP 20.H
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: FDP 21.I
Assignee: Dumitru Ceara
QA Contact: ying xu
URL:
Whiteboard: perfscale-ovn
Depends On:
Blocks: 1996886 2011385
 
Reported: 2021-07-02 09:28 UTC by Dumitru Ceara
Modified: 2021-12-09 15:37 UTC (History)

Fixed In Version: ovn21.09-21.09.0-10.el8fdp ovn-2021-21.09.0-5.el8fdp ovn2.13-20.12.0-183.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-09 15:37:27 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-1408 0 None None None 2021-10-05 14:26:53 UTC
Red Hat Product Errata RHBA-2021:5059 0 None None None 2021-12-09 15:37:56 UTC

Internal Links: 2011391 2011402

Description Dumitru Ceara 2021-07-02 09:28:03 UTC
Description of problem:
Whenever a Load_Balancer is updated, e.g., a VIP is added, the following
sequence of events happens:

1. The Southbound Load_Balancer record is updated.
2. The Southbound Datapath_Binding records on which the Load_Balancer is
   applied are updated.
3. Southbound ovsdb-server sends updates about the Load_Balancer and
   Datapath_Binding records to ovn-controller.
4. The IDL layer in ovn-controller processes the updates at #3, but
   because of the SB schema references between tables [0] all logical
   flows referencing the updated Datapath_Binding are marked as
   "updated".  The same is true for Logical_DP_Group records
   referencing the Datapath_Binding, and also for all logical flows
   pointing to the new "updated" datapath groups.
5. ovn-controller ends up recomputing (removing/readding) all flows for
   all these tracked updates.
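
The propagation in steps 4 and 5 can be illustrated with a toy model. This is only a sketch under stated assumptions, not the IDL implementation: the tables `dp_groups` and `flows`, their contents, and the helper `flows_marked_updated` are all hypothetical and exist only to show how one updated datapath fans out to flows directly and through datapath groups.

```shell
#!/usr/bin/env bash
# Hypothetical miniature SB contents (illustration only).
# dp_groups: "<group> <datapath> <datapath> ..."
dp_groups='g1 s1 s2
g2 s3'
# flows: "<flow> dp <datapath>" or "<flow> group <dp-group>"
flows='f1 dp s1
f2 dp s2
f3 group g1
f4 group g2'

# Print every flow that would be marked "updated" when datapath $1 changes:
# flows that reference it directly, plus flows that reference a datapath
# group containing it.
flows_marked_updated() {
    local dp=$1
    awk -v dp="$dp" '
        # First input: groups. Remember the groups that contain dp.
        NR == FNR { for (i = 2; i <= NF; i++) if ($i == dp) hit[$1] = 1; next }
        # Second input: flows. Match direct or via-group references.
        ($2 == "dp" && $3 == dp) || ($2 == "group" && ($3 in hit)) { print $1 }
    ' <(printf '%s\n' "$dp_groups") <(printf '%s\n' "$flows")
}

flows_marked_updated s1
```

In this toy data, updating `s1` touches both `f1` (direct reference) and `f3` (through group `g1`), which mirrors why a single Load_Balancer change ends up recomputing many flows.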

[0] From the SB Schema:
        "Datapath_Binding": {
            "columns": {
                [...]
                "load_balancers": {"type": {"key": {"type": "uuid",
                                                   "refTable": "Load_Balancer",
                                                   "refType": "weak"},
                                            "min": 0,
                                            "max": "unlimited"}},
        [...]
        "Load_Balancer": {
            "columns": {
                "datapaths": {
                [...]
                    "type": {"key": {"type": "uuid",
                                     "refTable": "Datapath_Binding"},
                             "min": 0, "max": "unlimited"}},
        [...]
        "Logical_DP_Group": {
            "columns": {
                "datapaths":
                    {"type": {"key": {"type": "uuid",
                                      "refTable": "Datapath_Binding",
                                      "refType": "weak"},
                              "min": 0, "max": "unlimited"}}},
        [...]
        "Logical_Flow": {
            "columns": {
                "logical_datapath":
                    {"type": {"key": {"type": "uuid",
                                      "refTable": "Datapath_Binding"},
                              "min": 0, "max": 1}},
                "logical_dp_group":
                    {"type": {"key": {"type": "uuid",
                                      "refTable": "Logical_DP_Group"},



Version-Release number of selected component (if applicable):
Upstream OVN v21.06.0.

Potential solution:

Stop populating the SB.Datapath_Binding.load_balancers column.  This would
break the "update notification chain" when a load balancer is updated in
the southbound.

The column is used only when a new Datapath_Binding is added, to determine
which load balancer flows have to be installed for this new datapath.

However, it's quite easy to determine those without explicitly storing
the list of load balancers in the datapath record.

This way, a Load_Balancer record update will not trigger a
Datapath_Binding update and, in turn, will not cause all logical flows
corresponding to the datapath to be updated.
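
As a rough sketch of the idea (not the actual patch), the load balancers for a datapath can be derived by scanning the Load_Balancer rows and checking their datapaths column, rather than reading a list stored on the datapath. The sample table `lb_table` and the helper `lbs_for_datapath` below are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample data, one line per Load_Balancer row:
# "<lb-name> <datapath> <datapath> ...", mimicking Load_Balancer.datapaths.
lb_table='lb0 s1 s2 s3
lb1 s2
lb2 s4'

# Print the load balancers applied to datapath $1 by reverse lookup over
# the Load_Balancer rows; nothing is stored on the datapath itself.
lbs_for_datapath() {
    local dp=$1
    awk -v dp="$dp" '
        { for (i = 2; i <= NF; i++) if ($i == dp) { print $1; break } }
    ' <<<"$lb_table"
}

lbs_for_datapath s2
```

The lookup costs a scan over the load balancers when a datapath is added, but it removes the back-reference that made every Load_Balancer update look like a Datapath_Binding update.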

Comment 1 Dumitru Ceara 2021-07-07 14:59:38 UTC
Fix sent for review: http://patchwork.ozlabs.org/project/ovn/list/?series=252352&state=*

Comment 2 Dumitru Ceara 2021-07-12 08:10:03 UTC
v2 sent for review: http://patchwork.ozlabs.org/project/ovn/list/?series=253029&state=*

Comment 3 Dumitru Ceara 2021-07-12 14:18:51 UTC
v3 sent for review: http://patchwork.ozlabs.org/project/ovn/list/?series=253094&state=*

Comment 4 Dumitru Ceara 2021-07-26 14:04:05 UTC
Follow up patch to fix load_balancer add/delete update: http://patchwork.ozlabs.org/project/ovn/list/?series=255245&state=*

Comment 8 ying xu 2021-10-26 10:54:55 UTC
reproduced on version:
 rpm -qa|grep ovn
ovn-2021-host-21.06.0-29.el8fdp.x86_64
ovn-2021-central-21.06.0-29.el8fdp.x86_64
ovn-2021-21.06.0-29.el8fdp.x86_64

Use the script below. The topology looks like this:

vm0---------s0-----------r1-------s1----------vm1
             |
             rn
             |
             sn
             |
             vmn    (n>=400)

                #sw public

                ovn-nbctl set NB_GLOBAL . options:northd_probe_interval=180000
                ovn-nbctl set connection . inactivity_probe=180000
                ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=180
                ovs-vsctl set open . external_ids:ovn-remote-probe-interval=180000
                ovn-sbctl set connection . inactivity_probe=180000

                ovn-nbctl ls-add public
                ovn-nbctl lb-add lb0 100.10.1.2:880 172.16.1.2:9000

                # r1
                i=1
        for m in `seq 0 9`;do
                for n in `seq 1 99`;do
                ovn-nbctl lr-add r${i}
                ovn-nbctl lrp-add r${i} r${i}_public 00:de:ad:ff:$m:$n 172.16.$m.$n/16
                ovn-nbctl lrp-add r${i} r${i}_s${i} 00:de:ad:fe:$m:$n 173.$m.$n.1/24
                ovn-nbctl lr-nat-add r${i} dnat_and_snat 172.16.${m}.$((n+100)) 173.$m.$n.2
                ovn-nbctl lrp-set-gateway-chassis r${i}_public hv1

                # s1
                ovn-nbctl ls-add s${i}
                ovn-nbctl ls-lb-add s${i} lb0

                # s1 - r1
                ovn-nbctl lsp-add s${i} s${i}_r${i}
                ovn-nbctl lsp-set-type s${i}_r${i} router
                ovn-nbctl lsp-set-addresses s${i}_r${i} "00:de:ad:fe:$m:$n 173.$m.$n.1"
                ovn-nbctl lsp-set-options s${i}_r${i} router-port=r${i}_s${i}
                # s1 - vm1
                ovn-nbctl lsp-add s$i vm$i
                ovn-nbctl lsp-set-addresses vm$i "00:de:ad:01:$m:$n 173.$m.$n.2"
                ovn-nbctl lrp-add r$i r${i}_public 40:44:00:00:$m:$n 172.16.$m.$n/16

                ovn-nbctl lsp-add public public_r${i}
                ovn-nbctl lsp-set-type public_r${i} router
                ovn-nbctl lsp-set-addresses public_r${i} router
                ovn-nbctl lsp-set-options public_r${i} router-port=r${i}_public nat-addresses=router
                let i++
                if [ $i -gt 300 ];then
                        break;
                fi
                done
                if [ $i -gt 300 ];then
                        break;
                fi
        done
                ovn-nbctl lsp-add public ln_p1
                ovn-nbctl lsp-set-addresses ln_p1 unknown
                ovn-nbctl lsp-set-type ln_p1 localnet
                ovn-nbctl lsp-set-options ln_p1 network_name=nattest

After OVN installs all flows, check the ovn-controller log:
cat /var/log/ovn/ovn-controller.log |tail -n 1
2021-10-26T09:47:42.418Z|00958|timeval|WARN|context switches: 0 voluntary, 2270 involuntary

Then add a new VIP to the LB:
ovn-nbctl lb-add lb0 100.10.2.3:80 172.16.1.2:8000

Wait some time, then check the log again:
cat /var/log/ovn/ovn-controller.log |tail -n 5
2021-10-26T09:47:42.418Z|00958|timeval|WARN|context switches: 0 voluntary, 2270 involuntary
2021-10-26T09:57:14.533Z|00959|timeval|WARN|Unreasonably long 1901ms poll interval (970ms user, 8ms system)  -----------------------------here
2021-10-26T09:57:14.533Z|00960|timeval|WARN|faults: 5907 minor, 0 major
2021-10-26T09:57:14.534Z|00961|timeval|WARN|context switches: 0 voluntary, 653 involuntary
2021-10-26T09:57:14.534Z|00962|coverage|INFO|Dropped 6 log messages in last 619 seconds (most recently, 572 seconds ago) due to excessive rate
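
The before/after comparison above could be scripted along these lines. This is a minimal sketch: the helper name `new_long_polls` is hypothetical, and it assumes the log is only appended to between the two checks (no rotation).

```shell
#!/usr/bin/env bash
# Count "Unreasonably long ... poll interval" warnings that appear in log
# file $1 after its first $2 lines (the lines already seen before the change).
new_long_polls() {
    local log=$1 seen=$2
    tail -n "+$((seen + 1))" "$log" |
        grep -c 'Unreasonably long [0-9]*ms poll interval'
}

# Possible usage on a test chassis (paths and commands as in this comment):
#   before=$(wc -l < /var/log/ovn/ovn-controller.log)
#   ovn-nbctl lb-add lb0 100.10.2.3:80 172.16.1.2:8000
#   sleep 60
#   new_long_polls /var/log/ovn/ovn-controller.log "$before"
```

A result of 0 corresponds to the "no new logs" outcome; any other value means the VIP change produced at least one long poll interval.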


With the fix, before adding a new VIP to the LB,
check the log:
# cat /var/log/ovn/ovn-controller.log |tail -n 1
2021-10-26T10:13:59.595Z|00945|coverage|INFO|102 events never hit

After adding the VIP:
# cat /var/log/ovn/ovn-controller.log |tail -n 10
2021-10-26T10:13:59.595Z|00936|coverage|INFO|stream_open                0.0/sec     0.000/sec        0.0019/sec   total: 7
2021-10-26T10:13:59.595Z|00937|coverage|INFO|util_xalloc              336.6/sec 730081.517/sec    48148.7742/sec   total: 186319518
2021-10-26T10:13:59.595Z|00938|coverage|INFO|vconn_open                 0.0/sec     0.000/sec        0.0014/sec   total: 5
2021-10-26T10:13:59.595Z|00939|coverage|INFO|vconn_received             0.0/sec     0.700/sec        1.2583/sec   total: 4530
2021-10-26T10:13:59.595Z|00940|coverage|INFO|vconn_sent               120.4/sec   468.467/sec      137.4411/sec   total: 497031
2021-10-26T10:13:59.595Z|00941|coverage|INFO|netlink_received           0.0/sec     1.133/sec        1.9711/sec   total: 7100
2021-10-26T10:13:59.595Z|00942|coverage|INFO|netlink_recv_jumbo         0.0/sec     0.283/sec        0.4925/sec   total: 1774
2021-10-26T10:13:59.595Z|00943|coverage|INFO|netlink_sent               0.0/sec     1.133/sec        1.9711/sec   total: 7100
2021-10-26T10:13:59.595Z|00944|coverage|INFO|cmap_expand                0.0/sec     0.000/sec        0.0008/sec   total: 3              --------------no new logs
2021-10-26T10:13:59.595Z|00945|coverage|INFO|102 events never hit

Deleting the LB does not produce an "Unreasonably long ... poll interval" warning either.

version:
# rpm -qa|grep ovn
ovn-2021-host-21.09.0-12.el8fdp.x86_64
ovn-2021-central-21.09.0-12.el8fdp.x86_64
ovn-2021-21.09.0-12.el8fdp.x86_64

set verified.

Comment 9 ying xu 2021-10-26 13:05:05 UTC
also verified on version:
# rpm -qa|grep ovn
ovn2.13-host-20.12.0-185.el8fdp.x86_64
ovn2.13-central-20.12.0-185.el8fdp.x86_64
ovn2.13-20.12.0-185.el8fdp.x86_64

Before adding the new VIP:
# cat /var/log/ovn/ovn-controller.log |tail -n 10
2021-10-26T11:15:10.079Z|00820|coverage|INFO|vconn_received             0.0/sec     0.517/sec        0.3950/sec   total: 1422
2021-10-26T11:15:10.079Z|00821|coverage|INFO|vconn_sent                 0.0/sec   522.767/sec      125.8050/sec   total: 452898
2021-10-26T11:15:10.079Z|00822|coverage|INFO|netlink_received           0.0/sec     2.000/sec        1.9967/sec   total: 7188
2021-10-26T11:15:10.079Z|00823|coverage|INFO|netlink_recv_jumbo         0.0/sec     0.500/sec        0.4989/sec   total: 1796
2021-10-26T11:15:10.079Z|00824|coverage|INFO|netlink_sent               0.0/sec     2.000/sec        1.9967/sec   total: 7188
2021-10-26T11:15:10.079Z|00825|coverage|INFO|cmap_expand                0.0/sec     0.000/sec        0.0008/sec   total: 3
2021-10-26T11:15:10.079Z|00826|coverage|INFO|97 events never hit
2021-10-26T11:15:18.361Z|00827|timeval|WARN|Unreasonably long 7779ms poll interval (3728ms user, 72ms system)
2021-10-26T11:15:18.361Z|00828|timeval|WARN|faults: 80618 minor, 0 major
2021-10-26T11:15:18.361Z|00829|timeval|WARN|context switches: 0 voluntary, 807 involuntary




After adding the VIP, no new logs are shown:
[root@dell-per730-19 bz1776712_broadcast_limit]# ovn-nbctl lb-add lb0 100.10.2.3:80 172.16.1.2:8000
[root@dell-per730-19 bz1776712_broadcast_limit]# 
[root@dell-per730-19 bz1776712_broadcast_limit]# 
[root@dell-per730-19 bz1776712_broadcast_limit]# 
[root@dell-per730-19 bz1776712_broadcast_limit]# 
[root@dell-per730-19 bz1776712_broadcast_limit]# cat /var/log/ovn/ovn-controller.log |tail -n 10
2021-10-26T11:15:10.079Z|00820|coverage|INFO|vconn_received             0.0/sec     0.517/sec        0.3950/sec   total: 1422
2021-10-26T11:15:10.079Z|00821|coverage|INFO|vconn_sent                 0.0/sec   522.767/sec      125.8050/sec   total: 452898
2021-10-26T11:15:10.079Z|00822|coverage|INFO|netlink_received           0.0/sec     2.000/sec        1.9967/sec   total: 7188
2021-10-26T11:15:10.079Z|00823|coverage|INFO|netlink_recv_jumbo         0.0/sec     0.500/sec        0.4989/sec   total: 1796
2021-10-26T11:15:10.079Z|00824|coverage|INFO|netlink_sent               0.0/sec     2.000/sec        1.9967/sec   total: 7188
2021-10-26T11:15:10.079Z|00825|coverage|INFO|cmap_expand                0.0/sec     0.000/sec        0.0008/sec   total: 3
2021-10-26T11:15:10.079Z|00826|coverage|INFO|97 events never hit
2021-10-26T11:15:18.361Z|00827|timeval|WARN|Unreasonably long 7779ms poll interval (3728ms user, 72ms system)
2021-10-26T11:15:18.361Z|00828|timeval|WARN|faults: 80618 minor, 0 major
2021-10-26T11:15:18.361Z|00829|timeval|WARN|context switches: 0 voluntary, 807 involuntary

Comment 11 errata-xmlrpc 2021-12-09 15:37:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:5059

