Bug 1711215 - Improve NetworkManager's performance with many devices
Summary: Improve NetworkManager's performance with many devices
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: NetworkManager
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.0
Assignee: Beniamino Galvani
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On:
Blocks: 1807630 1825061 1847125
Reported: 2019-05-17 08:48 UTC by Thomas Haller
Modified: 2020-11-04 01:48 UTC
CC List: 11 users

Fixed In Version: NetworkManager-1.26.0-0.1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1847125
Environment:
Last Closed: 2020-11-04 01:48:32 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
test script (1.37 KB, text/plain)
2019-05-17 08:48 UTC, Thomas Haller
Test2 - activate.py (4.55 KB, text/x-python)
2020-05-29 13:17 UTC, Beniamino Galvani
Test2 - setup.sh (needed by activate.py) (1.28 KB, application/x-shellscript)
2020-05-29 13:17 UTC, Beniamino Galvani

Description Thomas Haller 2019-05-17 08:48:50 UTC
Created attachment 1569988
test script

I wrote a naive script that creates a number of veth devices, all connected to a bridge (in another namespace) that runs dnsmasq.

NetworkManager then creates an auto-default connection and attempts DHCP on each of them.

That does not work well:

- devices take a long time to reach full activation.

- some devices time out and end up "disconnected" (for good!! Where is the retry?)

- some devices stay in state "unavailable"

- generally, there is a high CPU load.



Now, I might have made some mistakes in the script (like dnsmasq not replying quickly to DHCP requests). But there is no excuse for the CPU load.


The script contains some "sleep" calls. If you remove them, it only gets worse.


The goal of this bug is for the script to create at least 100 devices without problems.

Comment 1 sushil kulkarni 2019-10-15 15:49:02 UTC
Parking this for 8.3.

-Sushil

Comment 2 Thomas Haller 2020-04-08 09:29:48 UTC
some tests: https://bugzilla.redhat.com/show_bug.cgi?id=1820009#c2

Comment 4 Beniamino Galvani 2020-05-29 13:15:33 UTC
The long delay to reach activation seems related more to a bottleneck
in dnsmasq than in NM. If I change Thomas' script to launch dnsmasq with
'--no-ping', then 100 devices can activate in a few seconds. From what I
could understand, dnsmasq by default pings an address to determine whether
it is free, and serializes all those requests; with many devices that
mechanism becomes very slow and somewhat unreliable.
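The serialization argument can be illustrated with a toy model. It assumes each DHCP offer waits for one address-in-use check of a fixed, hypothetical duration `check_s`, and that the server performs those checks either one after another or concurrently; passing `check_s=0.0` roughly corresponds to running with `--no-ping`. This is a back-of-the-envelope illustration, not a model of dnsmasq internals.

```python
"""Toy model: how serialized address checks delay DHCP offers.

Assumption: every offer needs one check lasting `check_s` seconds
(a hypothetical value, not measured from dnsmasq).
"""


def offer_times(n_clients: int, check_s: float, serialized: bool = True) -> list[float]:
    if serialized:
        # Client i waits for the i checks queued ahead of it plus its own.
        return [check_s * (i + 1) for i in range(n_clients)]
    # Concurrent checks: every client waits for exactly one check.
    return [check_s] * n_clients


if __name__ == "__main__":
    for n in (10, 100):
        last_serial = offer_times(n, 0.5)[-1]
        last_concurrent = offer_times(n, 0.5, serialized=False)[-1]
        print(f"{n} clients: serialized {last_serial:.1f}s, concurrent {last_concurrent:.1f}s")
```

With a hypothetical 0.5 s check, the 100th client in the serialized case waits 50 s for its offer, so clients whose DHCP timeout expires earlier would give up, which would be consistent with the unreliability described above.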

I also prepared a couple of scripts to create many veth devices and
measure the time to complete DHCP on them in parallel. These are the
results in a VM with 4 cores, with the avahi and NM-dispatcher services
masked to reduce CPU usage:

 Devices   Time (s)
    50       1
   100       2
   150       4
   200       5
   250       7
   300      11
   350      13
   400      15
   450      18
   500      21
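Dividing the measured time by the device count gives a rough per-device cost. The numbers below are copied from the table; because the times are whole seconds, the rows with few devices are coarse.

```python
# Measurements copied from the table: devices -> seconds to complete DHCP.
RESULTS = {50: 1, 100: 2, 150: 4, 200: 5, 250: 7, 300: 11,
           350: 13, 400: 15, 450: 18, 500: 21}


def ms_per_device(results: dict[int, float]) -> dict[int, float]:
    # Average wall-clock milliseconds per activated device.
    return {n: 1000.0 * t / n for n, t in results.items()}


if __name__ == "__main__":
    for n, ms in ms_per_device(RESULTS).items():
        print(f"{n:4d} devices: {ms:5.1f} ms/device")
```

This gives about 20 ms per device at 50 and 100 devices and 42 ms at 500, so over this range the total time grows somewhat faster than linearly.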

Comment 5 Beniamino Galvani 2020-05-29 13:17:03 UTC
Created attachment 1693372
Test2 - activate.py

Comment 6 Beniamino Galvani 2020-05-29 13:17:50 UTC
Created attachment 1693373
Test2 - setup.sh (needed by activate.py)

Comment 7 Beniamino Galvani 2020-06-08 13:43:49 UTC
Since NM can activate hundreds of devices in a few seconds, I think we can consider this bz done.

Comment 10 Vladimir Benes 2020-07-20 13:34:38 UTC
Added a test to CI that adds and activates 100 devices in less than 7 s:
https://gitlab.freedesktop.org/NetworkManager/NetworkManager-ci/-/merge_requests/606
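A check in the spirit of that CI test can be sketched as a polling loop over `nmcli -t -f DEVICE,STATE device`. This is not the NetworkManager-ci implementation: the `veth` name prefix and the 0.2 s poll interval are assumptions, the 7 s default comes from the comment above, and the devices are assumed to exist already.

```python
import subprocess
import time


def count_connected(nmcli_output: str, prefix: str = "veth") -> int:
    """Count devices named prefix* in state 'connected', given the terse
    `nmcli -t -f DEVICE,STATE device` output (one 'name:state' per line)."""
    count = 0
    for line in nmcli_output.splitlines():
        name, _, state = line.partition(":")
        if name.startswith(prefix) and state == "connected":
            count += 1
    return count


def wait_for_devices(expected: int, timeout_s: float = 7.0, prefix: str = "veth") -> float:
    """Poll nmcli until `expected` devices are connected; return elapsed seconds.
    The prefix and the 0.2 s poll interval are assumptions for illustration."""
    start = time.monotonic()
    while True:
        out = subprocess.run(
            ["nmcli", "-t", "-f", "DEVICE,STATE", "device"],
            capture_output=True, text=True, check=True,
        ).stdout
        elapsed = time.monotonic() - start
        if count_connected(out, prefix) >= expected:
            return elapsed
        if elapsed > timeout_s:
            raise TimeoutError(f"fewer than {expected} devices connected after {timeout_s}s")
        time.sleep(0.2)


if __name__ == "__main__":
    print(f"100 devices connected in {wait_for_devices(100):.1f}s")
```

Keeping the parsing in `count_connected` separate from the polling loop lets the parsing be checked without a running NetworkManager.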

Comment 13 errata-xmlrpc 2020-11-04 01:48:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4499

