Bug 1711215

Summary: Improve NetworkManager's performance with many devices
Product: Red Hat Enterprise Linux 8 Reporter: Thomas Haller <thaller>
Component: NetworkManager Assignee: Beniamino Galvani <bgalvani>
Status: CLOSED ERRATA QA Contact: Desktop QE <desktop-qa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 8.0 CC: aloughla, atragler, bgalvani, fgiudici, jmaxwell, lrintel, pasik, rkhan, sukulkar, thaller, vbenes
Target Milestone: rc Flags: pm-rhel: mirror+
Target Release: 8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: NetworkManager-1.26.0-0.1.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1847125 (view as bug list) Environment:
Last Closed: 2020-11-04 01:48:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1807630, 1825061, 1847125    
Attachments:
Description                                 Flags
test script                                 none
Test2 - activate.py                         none
Test2 - setup.sh (needed by activate.py)    none

Description Thomas Haller 2019-05-17 08:48:50 UTC
Created attachment 1569988 [details]
test script

I wrote a naive script that creates a number of veth devices, all connected to a bridge (in another namespace) that runs dnsmasq.
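
For readers without access to the attachment, the topology described above can be sketched roughly as follows. This is not the attached test script. The namespace name, bridge name, interface naming scheme and address are assumptions made for illustration, and dnsmasq is left out. The function only builds the `ip` command lines; running them needs root.

```python
import subprocess


def topology_commands(count, netns="dhcp-test", bridge="br0", bridge_addr="172.25.0.1/16"):
    """Build `ip` commands for `count` veth pairs bridged inside a namespace.

    All names and the address are hypothetical defaults, not taken from the bug.
    """
    cmds = [
        ["ip", "netns", "add", netns],
        ["ip", "-n", netns, "link", "add", bridge, "type", "bridge"],
        ["ip", "-n", netns, "addr", "add", bridge_addr, "dev", bridge],
        ["ip", "-n", netns, "link", "set", bridge, "up"],
    ]
    for i in range(count):
        host, peer = f"veth{i}", f"veth{i}p"
        cmds += [
            # The host end stays in the default namespace, where NetworkManager sees it.
            ["ip", "link", "add", host, "type", "veth", "peer", "name", peer],
            # The peer end moves into the namespace and is attached to the bridge.
            ["ip", "link", "set", peer, "netns", netns],
            ["ip", "-n", netns, "link", "set", peer, "master", bridge],
            ["ip", "-n", netns, "link", "set", peer, "up"],
            ["ip", "link", "set", host, "up"],
        ]
    return cmds


def apply(cmds):
    """Run the commands in order, stopping at the first failure (needs root)."""
    for argv in cmds:
        subprocess.run(argv, check=True)
```

Calling `apply(topology_commands(100))` as root would create the 100 devices that the goal of this bug refers to.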

NetworkManager then creates an auto-default connection and attempts DHCP on them.

That does not work well:

- devices take a long time to reach full activation.

- some devices time out and end up "disconnected" (for good!! Where is the retry?)

- some devices stay in state "unavailable"

- generally, there is a high CPU load.
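
One way to put numbers on the symptoms listed above is to tally the device states that nmcli reports. This is a minimal sketch, assuming the test devices share a "veth" name prefix (an assumption, not something stated in the bug) and that device names contain no colons, so nmcli's terse output can be split on the first ":".

```python
import subprocess
from collections import Counter


def tally_states(nmcli_output, prefix="veth"):
    """Count states in the output of `nmcli -t -f DEVICE,STATE device`.

    Only devices whose name starts with `prefix` (hypothetical) are counted.
    """
    counts = Counter()
    for line in nmcli_output.splitlines():
        device, sep, state = line.partition(":")
        if not sep or not device.startswith(prefix):
            continue
        counts[state] += 1
    return counts


def current_states(prefix="veth"):
    """Query nmcli once and tally the states of the matching devices."""
    out = subprocess.run(
        ["nmcli", "-t", "-f", "DEVICE,STATE", "device"],
        check=True, capture_output=True, text=True,
    ).stdout
    return tally_states(out, prefix)
```

Running `current_states()` a while after the devices were created would show how many ended up "connected", "disconnected" or "unavailable".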



Now, I might have made some mistakes in the script (like dnsmasq not replying quickly to DHCP requests). But for the CPU load there is no excuse.


The script contains some "sleep" calls. If you remove them, it only gets worse.


The goal of this bug is to be able to run the script with at least 100 devices without problems.

Comment 1 sushil kulkarni 2019-10-15 15:49:02 UTC
Parking this for 8.3.

-Sushil

Comment 2 Thomas Haller 2020-04-08 09:29:48 UTC
some tests: https://bugzilla.redhat.com/show_bug.cgi?id=1820009#c2

Comment 4 Beniamino Galvani 2020-05-29 13:15:33 UTC
The long delay to reach activation seems related more to a bottleneck
in dnsmasq than NM. If I change Thomas' script to launch dnsmasq with
'--no-ping', then 100 devices can activate in a few seconds. From what I
could understand, dnsmasq uses a ping by default to determine whether
the address is free and serializes all those requests; with many
devices that mechanism becomes very slow and somewhat unreliable.
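
As an illustration of the change described above, a dnsmasq command line with '--no-ping' could be assembled like this. It is only a sketch: the interface name, the address range, the lease time and the namespace name are assumptions, and the remaining options are just one plausible way to run a DHCP-only dnsmasq, not what the test script actually uses.

```python
def dnsmasq_argv(interface="br0", range_start="172.25.1.1",
                 range_end="172.25.255.254", lease="1h", no_ping=True):
    """Build a DHCP-only dnsmasq command line (all defaults are hypothetical)."""
    argv = [
        "dnsmasq",
        "--keep-in-foreground",
        "--port=0",                 # disable the DNS function, serve DHCP only
        "--bind-interfaces",
        f"--interface={interface}",
        f"--dhcp-range={range_start},{range_end},{lease}",
    ]
    if no_ping:
        # Skip the ICMP probe dnsmasq sends before handing out an address.
        argv.append("--no-ping")
    return argv


def in_netns(netns, argv):
    """Prefix a command so that it runs inside the given network namespace."""
    return ["ip", "netns", "exec", netns] + list(argv)
```

For example, `in_netns("dhcp-test", dnsmasq_argv())` gives the command to start the server next to the bridge, and `dnsmasq_argv(no_ping=False)` gives the default behaviour for comparison.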

I also prepared a couple of scripts to create many veth devices and
measure the time to complete DHCP on them in parallel. These are the
results in a VM with 4 cores, with avahi and NM-dispatcher services
masked to save CPU usage:

 Devices   Time (s)
    50       1
   100       2
   150       4
   200       5
   250       7
   300      11
   350      13
   400      15
   450      18
   500      21
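
A measurement like the one in the table can be approximated by polling until every device reports "connected". The sketch below is not the attached activate.py. It assumes polling nmcli is accurate enough (the result is only as precise as the polling interval), and the function names and the timeout are made up for illustration.

```python
import subprocess
import time


def nmcli_states():
    """Return {device: state} from `nmcli -t -f DEVICE,STATE device`."""
    out = subprocess.run(
        ["nmcli", "-t", "-f", "DEVICE,STATE", "device"],
        check=True, capture_output=True, text=True,
    ).stdout
    return dict(line.split(":", 1) for line in out.splitlines() if ":" in line)


def wait_all_connected(get_states, devices, timeout=60.0, interval=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Return the seconds until all devices are 'connected', or None on timeout.

    `get_states` is any callable returning {device: state}, e.g. nmcli_states.
    A device is dropped from the pending set the first time it is seen connected.
    """
    start = clock()
    pending = set(devices)
    while True:
        states = get_states()
        pending = {d for d in pending if states.get(d) != "connected"}
        elapsed = clock() - start
        if not pending:
            return elapsed
        if elapsed >= timeout:
            return None
        sleep(interval)
```

With hypothetical device names veth0 to veth99, `wait_all_connected(nmcli_states, [f"veth{i}" for i in range(100)])` would give a figure comparable to the 100-device row.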

Comment 5 Beniamino Galvani 2020-05-29 13:17:03 UTC
Created attachment 1693372 [details]
Test2 - activate.py

Comment 6 Beniamino Galvani 2020-05-29 13:17:50 UTC
Created attachment 1693373 [details]
Test2 - setup.sh (needed by activate.py)

Comment 7 Beniamino Galvani 2020-06-08 13:43:49 UTC
Since NM can activate hundreds of devices in a few seconds, I think we can consider this bz done.

Comment 10 Vladimir Benes 2020-07-20 13:34:38 UTC
A test that adds and activates 100 devices in less than 7s has been added to CI:
https://gitlab.freedesktop.org/NetworkManager/NetworkManager-ci/-/merge_requests/606

Comment 13 errata-xmlrpc 2020-11-04 01:48:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4499