The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under FDP project in Jira. Thanks.
Bug 2069623 - [OVN SCALE][ovn-northd] Unnecessary SB record no-op changes added to SB transaction.
Summary: [OVN SCALE][ovn-northd] Unnecessary SB record no-op changes added to SB trans...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: FDP 22.B
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: ---
Assignee: Dumitru Ceara
QA Contact: ying xu
URL:
Whiteboard:
Depends On:
Blocks: 2073473
TreeView+ depends on / blocked
 
Reported: 2022-03-29 09:37 UTC by Dumitru Ceara
Modified: 2022-06-30 18:00 UTC (History)
5 users (show)

Fixed In Version: ovn22.03-22.03.0-24.el8fdp ovn22.03-22.03.0-24.el9fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2073473 (view as bug list)
Environment:
Last Closed: 2022-06-30 18:00:28 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Scale test NB database. (1.74 MB, application/gzip)
2022-03-29 09:37 UTC, Dumitru Ceara
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-1860 0 None None None 2022-03-29 12:51:38 UTC
Red Hat Product Errata RHBA-2022:5447 0 None None None 2022-06-30 18:00:33 UTC

Description Dumitru Ceara 2022-03-29 09:37:38 UTC
Created attachment 1868934 [details]
Scale test NB database.

OVS commit 1cc618c32524 ("ovsdb-idl: Fix atomicity of writes that
don't change a column's value.") [0] explains why writes that don't
change a column's value cannot be optimized out early if the column is
read/write.

In northd, most tables have change tracking enabled, making all their
columns read/write.  That means that a write to a column (even if it
doesn't change the value) will add the row to the current transaction.
Validation is eventually performed before sending the transaction to the
server but if there are lots of such records this becomes costly.

Profiling what happens in northd when running with a NB database taken
from an ovn-k8s-like scale test (16K load balancers applied to 120
logical switches and routers) we notice that ovn-northd was always
writing to the SB.Load_Balancer columns even if nothing changed.

This doesn't translate into any transaction being sent to the SB server
but does, however, increase the amount of work the IDL needs to
perform.

On a test setup, we see that with the attached database, the overhead
is ~6 seconds (out of a total of ~13 seconds ovn-northd spends in a
poll loop).

[0] https://github.com/openvswitch/ovs/commit/1cc618c32524

Comment 1 Dumitru Ceara 2022-03-29 09:41:53 UTC
v2 posted for review: https://patchwork.ozlabs.org/project/ovn/list/?series=292465&state=*

Comment 2 Dumitru Ceara 2022-04-01 11:26:13 UTC
Posted patch to add support for change tracking of write-only columns in the IDL instead:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=293071&state=*

This is a more generic solution.

Comment 3 Dumitru Ceara 2022-04-04 13:57:24 UTC
Moving back to ASSIGNED until the OVS patch is accepted and a patch to bump the OVS submodule in OVN is posted.

Comment 4 Surya Seetharaman 2022-04-11 15:38:47 UTC
We are testing https://patchwork.ozlabs.org/project/ovn/patch/20220401101640.1139426-1-xsimonar@redhat.com/ which delays setting port-up:true in case SBDB is readonly.

Quoting Dumitru:
It's just to avoid some unnecessary full recomputes locally. Right, if ovn-controller wakes up and processes some changes incrementally, if the result of that processing means it has to write something in the SB but the previous transaction it had sent to the SB is still in progress then we abort the incremental processing and trigger a full recompute (of everything) after the in-flight transaction is completed.


We did four runs:

1st run:  8.26p99/4.44avg
2nd run:  7.5 P99 4.12avg
3rd run with logging: 7.77 P99 / 4.22 avg
4th run with logging: 8.16 P99 / 4.39 avg

$ grep ovn-installed ./ovnkube-node-*/ovn-controller/ovn-controller/logs/* | grep -oE 'delayed by [0-9]+ msec' | sort -h -k 3 | tail -50
delayed by 80 msec
delayed by 80 msec
delayed by 82 msec
delayed by 85 msec
delayed by 85 msec
delayed by 85 msec
delayed by 87 msec
delayed by 87 msec
delayed by 87 msec
delayed by 88 msec
delayed by 88 msec
delayed by 89 msec
delayed by 91 msec
delayed by 94 msec
delayed by 95 msec
delayed by 96 msec
delayed by 97 msec
delayed by 99 msec
delayed by 102 msec
delayed by 108 msec
delayed by 114 msec
delayed by 118 msec
delayed by 119 msec
delayed by 121 msec
delayed by 123 msec
delayed by 123 msec
delayed by 132 msec
delayed by 136 msec
delayed by 139 msec
delayed by 139 msec
delayed by 147 msec
delayed by 147 msec
delayed by 147 msec
delayed by 147 msec
delayed by 149 msec
delayed by 151 msec
delayed by 156 msec
delayed by 159 msec
delayed by 168 msec
delayed by 170 msec
delayed by 170 msec
delayed by 170 msec
delayed by 171 msec
delayed by 176 msec
delayed by 176 msec
delayed by 183 msec
delayed by 208 msec
delayed by 208 msec
delayed by 235 msec
delayed by 525 msec

Xavier/Dumitru is investigating the delay on their side. Probably might not get in before FF, but will get in before CF.

Comment 5 Dumitru Ceara 2022-04-11 16:06:45 UTC
(In reply to Surya Seetharaman from comment #4)
> We are testing
> https://patchwork.ozlabs.org/project/ovn/patch/20220401101640.1139426-1-
> xsimonar/ which delays setting port-up:true in case SBDB is
> readonly.
> 

I think there might be some confusion.  Xavier's patch is independent of
the fix for this BZ.  We were just scale testing it in a build which
included the fix for bug 2069623 too.

The proper fix for bug 2069623 is under review upstream and has two
parts:

1. changing the IDL to support change-tracking of write-only columns:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=293071&state=*
2. changing ovn-northd to use the new IDL mode.

> Quoting Dumitru:
> It's just to avoid some unnecessary full recomputes locally. Right, if
> ovn-controller wakes up and processes some changes incrementally, if the
> result of that processing means it has to write something in the SB but the
> previous transaction it had sent to the SB is still in progress then we
> abort the incremental processing and trigger a full recompute (of
> everything) after the in-flight transaction is completed.
> 

[...]

> 
> Xavier/Dumitru is investigating the delay on their side. Probably might not
> get in before FF, but will get in before CF.

For bug 2069623 we don't need Xavier's change.  We just need the two
parts mentioned above to happen before OCP code freeze.

Comment 6 Surya Seetharaman 2022-04-11 16:28:41 UTC
(In reply to Dumitru Ceara from comment #5)
> (In reply to Surya Seetharaman from comment #4)
> > We are testing
> > https://patchwork.ozlabs.org/project/ovn/patch/20220401101640.1139426-1-
> > xsimonar/ which delays setting port-up:true in case SBDB is
> > readonly.
> > 
> 
> I think there might be some confusion.  Xavier's patch is independent of
> the fix for this BZ.  We were just scale testing it in a build which
> included the fix for bug 2069623 too.
> 
> The proper fix for bug 2069623 is under review upstream and has two
> parts:
> 
> 1. changing the IDL to support change-tracking of write-only columns:
> https://patchwork.ozlabs.org/project/openvswitch/list/?series=293071&state=*
> 2. changing ovn-northd to use the new IDL mode.
> 
> > Quoting Dumitru:
> > It's just to avoid some unnecessary full recomputes locally. Right, if
> > ovn-controller wakes up and processes some changes incrementally, if the
> > result of that processing means it has to write something in the SB but the
> > previous transaction it had sent to the SB is still in progress then we
> > abort the incremental processing and trigger a full recompute (of
> > everything) after the in-flight transaction is completed.
> > 
> 
> [...]
> 
> > 
> > Xavier/Dumitru is investigating the delay on their side. Probably might not
> > get in before FF, but will get in before CF.
> 
> For bug 2069623 we don't need Xavier's change.  We just need the two
> parts mentioned above to happen before OCP code freeze.

Thanks Dumitru, I was the confused one, clearly I have too many things clouding my judgement and mixing up bz-es/prs

Comment 7 Dumitru Ceara 2022-04-15 14:24:59 UTC
v2 of the IDL patch:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=295369&state=*

Comment 8 Dumitru Ceara 2022-04-20 19:33:12 UTC
v3 of the IDL patch:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=296026&state=*

Comment 9 Dumitru Ceara 2022-04-26 10:41:19 UTC
v4 of the IDL patch:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=296892&state=*

Comment 10 Dumitru Ceara 2022-04-28 22:57:19 UTC
IDL patch was accepted, posted OVN patch to use the new IDL mode:
https://patchwork.ozlabs.org/project/ovn/list/?series=297426&state=*

Comment 12 Dumitru Ceara 2022-05-11 11:49:40 UTC
The patch to bump the OVS submodule was accepted upstream:
https://github.com/ovn-org/ovn/commit/f93206ce40ae38f26ffaf69c5d79a63abfbd5e3c

And backported downstream to:
ovn22.03-22.03.0-24.el8fdp
ovn22.03-22.03.0-24.el9fdp

Comment 15 ying xu 2022-06-13 03:10:35 UTC
reproduced on version:
# rpm -qa|grep ovn
ovn22.03-central-22.03.0-22.el8fdp.x86_64
ovn22.03-host-22.03.0-22.el8fdp.x86_64
ovn22.03-22.03.0-22.el8fdp.x86_64

	systemctl start openvswitch                          
	systemctl start ovn-northd                                                                            
	ovn-nbctl set-connection ptcp:6641                                                                    
	ovn-sbctl set-connection ptcp:6642
	ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:127.0.0.1:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=127.0.0.1
	systemctl restart ovn-controller
	ovs-vsctl set open . external_ids:ovn-monitor-all=true

	 
	ovn-nbctl ls-add ls -- lsp-add ls lsp
	sleep 2    
	
	ovn-appctl -t ovn-northd vlog/disable-rate-limit
	ovn-appctl -t ovn-northd vlog/set jsonrpc:dbg
	sleep 5
	> /var/log/ovn/ovn-northd.log

	ovn-nbctl --wait=sb sync

	sleep 5
	cat /var/log/ovn/ovn-northd.log |grep OVN_Southbound


cat /var/log/ovn/ovn-northd.log |grep OVN_Southbound'
2022-06-13T03:04:55.054Z|00015|jsonrpc|DBG|unix:/run/ovn/ovnsb_db.sock: send request, method="transact", params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","2a210b2d-6365-4a29-92b4-d31f070dbca3"]]],"row":{"ports":["uuid","68807ccd-7c12-4b9e-bdb9-941b40b58d1c"]},"op":"update","table":"Multicast_Group"},{"where":[["_uuid","==",["uuid","9d3f9379-6535-4aaf-addf-ea214cd08904"]]],"row":{"query_max_resp":1,"query_interval":150,"datapath":["uuid","c5330e17-6e9f-49e1-927a-d3e8cd9270e9"],"table_size":2048,"querier":true,"idle_timeout":300,"enabled":false},"op":"update","table":"IP_Multicast"},{"where":[["_uuid","==",["uuid","68807ccd-7c12-4b9e-bdb9-941b40b58d1c"]]],"row":{"parent_port":["set",[]],"mac":["set",[]],"external_ids":["map",[]],"options":["map",[]],"nat_addresses":["set",[]],"type":"","datapath":["uuid","c5330e17-6e9f-49e1-927a-d3e8cd9270e9"],"tag":["set",[]]},"op":"update","table":"Port_Binding"},{"where":[["_uuid","==",["uuid","814e67e7-2104-4cf8-ae34-fef88f361d54"]]],"row":{"ports":["uuid","68807ccd-7c12-4b9e-bdb9-941b40b58d1c"]},"op":"update","table":"Multicast_Group"},{"where":[["_uuid","==",["uuid","b9be81e3-a57e-4ee9-a2a6-a141a407ae5a"]]],"row":{"addresses":"42:ff:bf:e2:aa:ec"},"op":"update","table":"Address_Set"},{"where":[["_uuid","==",["uuid","d318b1df-dc64-4839-b311-71b3229a1425"]]],"row":{"nb_cfg":1,"options":["map",[["mac_prefix","5a:0b:d4"],["max_tunid","16711680"],["northd_internal_version","22.03.1-20.21.0-58.3"],["svc_monitor_mac","42:ff:bf:e2:aa:ec"]]]},"op":"update","table":"SB_Global"},{"where":[["_uuid","==",["uuid","c5330e17-6e9f-49e1-927a-d3e8cd9270e9"]]],"row":{"external_ids":["map",[["logical-switch","79ed4e8e-0836-46bb-b37b-0a968fb7c5cd"],["name","ls"]]]},"op":"update","table":"Datapath_Binding"},{"comment":"ovn-northd","op":"comment"},{"lock":"ovn_northd","op":"assert"}], id=19



verified on version:

ovn22.03-central-22.03.0-52.el8fdp.x86_64
ovn22.03-host-22.03.0-52.el8fdp.x86_64
ovn22.03-22.03.0-52.el8fdp.x86_64

cat /var/log/ovn/ovn-northd.log |grep OVN_Southbound'
2022-06-13T02:27:46.536Z|00015|jsonrpc|DBG|unix:/run/ovn/ovnsb_db.sock: send request, method="transact", params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","96cdaf1c-9453-446a-aca7-587e0e758bae"]]],"row":{"nb_cfg":1},"op":"update","table":"SB_Global"},{"comment":"ovn-northd","op":"comment"},{"lock":"ovn_northd","op":"assert"}], id=18

Comment 17 errata-xmlrpc 2022-06-30 18:00:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn22.03 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5447


Note You need to log in before you can comment on or make changes to this bug.