Bug 1847125

Summary: [RFE] Improve 20% performance on creating 1000 bridge over 1000 VLANs
Product: Red Hat Enterprise Linux 8
Reporter: Thomas Haller <thaller>
Component: NetworkManager
Assignee: Thomas Haller <thaller>
Status: CLOSED ERRATA
QA Contact: Filip Pokryvka <fpokryvk>
Severity: medium
Priority: medium
Version: 8.0
CC: acardace, aloughla, atragler, bgalvani, desktop-qa-list, fge, fgiudici, jmaxwell, lrintel, pasik, rkhan, sukulkar, thaller, till, vbenes
Target Milestone: rc
Keywords: FutureFeature, Triaged
Target Release: 8.5
Hardware: Unspecified
OS: Unspecified
Fixed In Version: NetworkManager-1.32.2-1.el8
Doc Type: No Doc Update
Clone Of: 1711215
Last Closed: 2021-11-09 19:28:55 UTC
Type: Bug
Bug Depends On: 1711215
Bug Blocks: 1935910

Comment 3 Gris Ge 2021-03-05 04:48:30 UTC
Hi Thomas,

Please provide a way to measure this performance improvement.

Thank you!

Comment 4 Till Maas 2021-03-05 08:28:12 UTC
Gris, please update the summary with the exact scenario that is going to be improved, and also state it in a comment, since the bug now has devel ack. The original proposal was DHCPv4 with 3000 devices, but AFAIU it is now about 1000 devices and no DHCP. Thank you.

Comment 5 Vladimir Benes 2021-03-08 12:24:42 UTC
We have a test for 500 VLANs with DHCPv4, and it takes some time to set everything up:
https://gitlab.freedesktop.org/NetworkManager/NetworkManager-ci/-/merge_requests/726

We can easily scale it to 1000+.

Comment 6 Gris Ge 2021-03-08 13:59:33 UTC
This RHV use case could be the performance baseline:

When creating 1000 VLANs on eth1 (a pre-created veth) and 1000 bridges, one over each VLAN, nmstate takes 10m38.439s.

NetworkManager-1.30.0-2.el8.x86_64
nmstate-1.0.2-5.el8.noarch
trace log disabled.
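
For reference, a desired state of this shape can be generated with a few lines of Python. This is only a sketch of the scenario, not the state Gris actually used: the bridge names (br1, br2, ...) and the VLAN IDs 1..N are assumptions, and the keys follow the nmstate schema for vlan and linux-bridge interfaces as I understand it.

```python
import json


def build_state(base_iface: str, count: int) -> dict:
    """Return an nmstate-style desired state: `count` VLANs on `base_iface`,
    each used as the only port of its own Linux bridge."""
    interfaces = []
    for vlan_id in range(1, count + 1):
        vlan_name = f"{base_iface}.{vlan_id}"
        interfaces.append({
            "name": vlan_name,
            "type": "vlan",
            "state": "up",
            "vlan": {"base-iface": base_iface, "id": vlan_id},
        })
        interfaces.append({
            "name": f"br{vlan_id}",  # hypothetical naming scheme
            "type": "linux-bridge",
            "state": "up",
            "bridge": {"port": [{"name": vlan_name}]},
        })
    return {"interfaces": interfaces}


if __name__ == "__main__":
    # Small preview; the scenario in this comment corresponds to count=1000.
    print(json.dumps(build_state("eth1", 2), indent=2))
```

The JSON output could then be saved to a file and handed to nmstate (for example with nmstatectl apply), timing that step to reproduce the baseline.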

Comment 7 Till Maas 2021-03-08 19:18:41 UTC
Gris, please also add the goal that needs to be achieved for this feature to be considered implemented, and update the summary accordingly. Thanks.

Comment 8 Gris Ge 2021-03-24 04:39:22 UTC
A 20% improvement is good enough for me.

Comment 10 Thomas Haller 2021-06-22 07:52:31 UTC
I did some optimizations:

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/890
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/894
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/900

These will soon be merged to nm-1-32 and will then reach rhel-8.5.


These optimizations mainly make frequently called code run faster.
What they don't do is change the callers to invoke that code less frequently.
As such, they optimize some lower layers (which was simpler, but is limited
in effectiveness). I mean, when we call a function millions of times, making
that function fast helps. But what helps more is to not call it that often
(which is also often harder).
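
To illustrate the difference between the two kinds of optimization, here is a toy Python sketch. It is not NetworkManager code; normalize() and both lookup functions are made-up stand-ins. Both lookups return the same result, but the second one calls the helper roughly half as often by hoisting a loop-invariant call out of the loop.

```python
call_count = 0


def normalize(name: str) -> str:
    """Hypothetical stand-in for a cheap helper that is called very often."""
    global call_count
    call_count += 1
    return name.strip().lower()


def lookup_naive(names, wanted):
    # Calls normalize() twice per element: once for the element, once for `wanted`.
    return [n for n in names if normalize(n) == normalize(wanted)]


def lookup_hoisted(names, wanted):
    # Same result, but normalize(wanted) is computed only once.
    key = normalize(wanted)
    return [n for n in names if normalize(n) == key]


def count_calls(fn, names, wanted):
    """Run fn and report (result, number of normalize() calls it made)."""
    global call_count
    call_count = 0
    result = fn(names, wanted)
    return result, call_count


if __name__ == "__main__":
    names = ["Eth1", "eth1 ", "eth2"]
    print(count_calls(lookup_naive, names, "ETH1"))
    print(count_calls(lookup_hoisted, names, "ETH1"))
```

Making normalize() itself faster helps both variants by the same factor; restructuring the caller is what reduces the call count.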

A rework of the higher layers is already in progress with the Layer3Config rework, so
I did not want to address those layers here.




Some testing:

The test scripts are committed to git:

  https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/tree/26090bafc9fd2eceeccffb758937427ae5dd160b

I created a setup with

  sudo DO_ADD_VLAN_CON=1 NUM_DEVS=1 NUM_VLAN_DEVS=1000 contrib/scripts/test-create-many-device-setup.sh setup

(NetworkManager.dispatcher disabled and dns=none)

then I ran

 #1  time examples/python/gi/nm-up-many.py c-a1.{1..1000}-po
 #2  time examples/python/gi/nm-up-many.py c-a1.{1..1000}-po
 #3  time examples/python/gi/nm-up-many.py c-a1.{1..100}-po
 #4  time examples/python/gi/nm-up-many.py c-a1.{1..200}-po


This only activates the ports and does not wait for the bridges to get their IP addresses.
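
Since single runs fluctuate a lot, it can help to repeat a measurement and look at min/median/max instead of one number. A minimal sketch of such a harness follows; time_command() and summarize() are hypothetical helpers, not part of the committed test scripts, and the command in the example is a placeholder. Anything needed to reset the setup between runs is not shown.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Run argv `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True)  # raises if the command fails
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min, median and max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Placeholder command. For test #3 above, the argument list would be
    # ["examples/python/gi/nm-up-many.py"] + [f"c-a1.{i}-po" for i in range(1, 101)],
    # because the shell brace expansion has to be spelled out in Python.
    runs = time_command([sys.executable, "-c", "pass"], repeats=3)
    print(summarize(runs))
```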


Timings:

Test#   1.32.0  <new>   diff %
#1      484     472     -2.47933884297521

#2      998     827     -17.1342685370741

#3      72      60      -16.6666666666667
#3      42      25      -40.4761904761905
#3      40      25      -37.5
#3      36      35      -2.77777777777778

#4      137     125     -8.75912408759124
#4      218     87      -60.0917431192661
#4      115     78      -32.1739130434783


We see large fluctuations, but I think it is about 20% better :)
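
The diff % column and the overall estimate can be re-derived from the table with a short script. The numbers are copied from the timings above; the two ways of aggregating them (mean of the per-run percentages, and change in the summed time) are my choice for illustration, not necessarily how the estimate was originally made.

```python
# (test label, timing on 1.32.0, timing on <new>), copied from the table above
TIMINGS = [
    ("#1", 484, 472),
    ("#2", 998, 827),
    ("#3", 72, 60),
    ("#3", 42, 25),
    ("#3", 40, 25),
    ("#3", 36, 35),
    ("#4", 137, 125),
    ("#4", 218, 87),
    ("#4", 115, 78),
]


def diff_percent(old: float, new: float) -> float:
    """Relative change from old to new in percent (negative means faster)."""
    return (new - old) / old * 100.0


def mean_of_diffs(timings) -> float:
    """Average of the per-run percentages; every run counts equally."""
    diffs = [diff_percent(old, new) for _, old, new in timings]
    return sum(diffs) / len(diffs)


def diff_of_totals(timings) -> float:
    """Change in the summed time; long runs weigh more."""
    total_old = sum(old for _, old, _ in timings)
    total_new = sum(new for _, _, new in timings)
    return diff_percent(total_old, total_new)


if __name__ == "__main__":
    for label, old, new in TIMINGS:
        print(f"{label}\t{old}\t{new}\t{diff_percent(old, new):.2f}")
    print(f"mean of per-run diffs: {mean_of_diffs(TIMINGS):.2f}%")
    print(f"diff of summed time:   {diff_of_totals(TIMINGS):.2f}%")
```

The summed time goes from 2142 to 1734, which is a reduction of roughly 19% and consistent with the "about 20%" estimate.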



In the future, we need to address the higher layers to improve performance significantly, beyond a small percentage.
Still, this was a useful occasion to look at valgrind runs and to write some test scripts.

Comment 14 Filip Pokryvka 2021-07-29 17:41:33 UTC
I added a test that adds 2000 connections together via libnm. It takes about 540s on 1.30 and 400s on 1.32 (on average), which is at least a 20% improvement, so verifying.

Comment 16 errata-xmlrpc 2021-11-09 19:28:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: NetworkManager security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4361