Bug 1711215

Summary:

Improve NetworkManager's performance with many devices

Product:

Red Hat Enterprise Linux 8

Reporter:

Thomas Haller <thaller>

Component:

NetworkManager

Assignee:

Beniamino Galvani <bgalvani>

Status:

CLOSED ERRATA

QA Contact:

Desktop QE <desktop-qa-list>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

8.0

CC:

aloughla, atragler, bgalvani, fgiudici, jmaxwell, lrintel, pasik, rkhan, sukulkar, thaller, vbenes

Target Milestone:

Flags:

pm-rhel: mirror+

Target Release:

8.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

NetworkManager-1.26.0-0.1.el8

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1847125

Environment:

Last Closed:

2020-11-04 01:48:32 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1807630, 1825061, 1847125

Attachments:

Description	Flags
test script	none
Test2 - activate.py	none
Test2 - setup.sh (needed by activate.py)	none

Description Thomas Haller 2019-05-17 08:48:50 UTC

Created attachment 1569988
test script

I wrote a naive script that creates a number of veth devices, all connected to a bridge (in another namespace) that runs dnsmasq.

NetworkManager then creates an auto-default connection for each device and attempts DHCP on it.

That does not work well:

- devices take a long time to reach full activation.

- some devices time out and end up "disconnected" (for good!! Where is the retry?)

- some devices stay in state "unavailable"

- generally, there is a high CPU load.



Now, I might have made some mistakes in the script (like dnsmasq not replying quickly to DHCP requests). But for the CPU load there is no excuse.


The script contains some "sleep" calls. If you remove them, it only gets worse.


The goal of this bug is to run the script, creating at least 100 devices, without problems.
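
For readers without the attachment, the setup described at the top of this comment could be sketched roughly as follows. This is not the attached test script: the namespace name "dhcp-ns", the bridge name "br0" and the vethN/vethNp naming are hypothetical, and the function only builds the ip(8) command lines instead of running them.

```python
def veth_setup_commands(count, netns="dhcp-ns", bridge="br0"):
    """Return ip(8) command lines that create `count` veth pairs.

    One end of each pair stays in the default namespace, where
    NetworkManager sees it; the peer is created in `netns` and attached
    to `bridge`. All names are illustrative assumptions.
    """
    cmds = [
        ["ip", "netns", "add", netns],
        ["ip", "-n", netns, "link", "add", bridge, "type", "bridge"],
        ["ip", "-n", netns, "link", "set", bridge, "up"],
    ]
    for i in range(count):
        dev, peer = f"veth{i}", f"veth{i}p"
        cmds += [
            # Create the pair and place the peer directly in the namespace.
            ["ip", "link", "add", dev, "type", "veth",
             "peer", "name", peer, "netns", netns],
            # Attach the peer to the bridge and bring it up.
            ["ip", "-n", netns, "link", "set", peer, "master", bridge],
            ["ip", "-n", netns, "link", "set", peer, "up"],
        ]
    return cmds


if __name__ == "__main__":
    # Print the commands for a small setup; run them as root to apply.
    for cmd in veth_setup_commands(3):
        print(" ".join(cmd))
```

Running the printed commands as root would leave the veth devices unmanaged by anything but NetworkManager, which is the condition the bug is about.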

Comment 1 sushil kulkarni 2019-10-15 15:49:02 UTC

Parking this for 8.3.

-Sushil

Comment 2 Thomas Haller 2020-04-08 09:29:48 UTC

some tests: https://bugzilla.redhat.com/show_bug.cgi?id=1820009#c2

Comment 4 Beniamino Galvani 2020-05-29 13:15:33 UTC

The long delay to reach activation seems related more to a bottleneck
in dnsmasq than to NM. If I change Thomas' script to launch dnsmasq with
'--no-ping', then 100 devices can activate in a few seconds. From what I
could understand, dnsmasq by default pings an address to determine whether
it is free, and serializes all those requests; with many
devices that mechanism becomes very slow and somewhat unreliable.
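
As an illustration of the change above, a dnsmasq command line with the ping check disabled might be built like this. Only '--no-ping' comes from this comment; the interface name, the DHCP range and the other options are assumptions for the sketch, not the options used in Thomas' script.

```python
def dnsmasq_argv(interface="br0",
                 dhcp_range="172.16.0.2,172.16.3.254,1h",
                 no_ping=True):
    """Build a dnsmasq command line for the test bridge.

    `interface` and `dhcp_range` are hypothetical values. With
    `no_ping=True`, dnsmasq skips the ICMP probe it otherwise sends
    before offering an address.
    """
    argv = [
        "dnsmasq",
        "--keep-in-foreground",
        "--bind-interfaces",
        f"--interface={interface}",
        f"--dhcp-range={dhcp_range}",
    ]
    if no_ping:
        argv.append("--no-ping")
    return argv


if __name__ == "__main__":
    print(" ".join(dnsmasq_argv()))
```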

I also prepared a couple of scripts to create many veth devices and
measure the time to complete DHCP on them in parallel. These are the
results in a VM with 4 cores, with the avahi and NM-dispatcher services
masked to reduce CPU usage:

 Devices   Time (s)
    50       1
   100       2
   150       4
   200       5
   250       7
   300      11
   350      13
   400      15
   450      18
   500      21
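
To read the scaling out of the table above, the per-device cost can be computed from the same numbers. The snippet below is only arithmetic on the measured values; the helper name is made up for the example.

```python
# (devices, seconds) pairs copied from the table above.
RESULTS = [
    (50, 1), (100, 2), (150, 4), (200, 5), (250, 7),
    (300, 11), (350, 13), (400, 15), (450, 18), (500, 21),
]


def per_device_ms(devices, seconds):
    """Average wall-clock time per device, in milliseconds."""
    return 1000.0 * seconds / devices


if __name__ == "__main__":
    for devices, seconds in RESULTS:
        cost = per_device_ms(devices, seconds)
        print(f"{devices:>7} {seconds:>4} s {cost:6.1f} ms/device")
```

The average goes from 20 ms per device at 50 devices to 42 ms per device at 500, so the total time grows somewhat faster than linearly with the number of devices.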

Comment 5 Beniamino Galvani 2020-05-29 13:17:03 UTC

Created attachment 1693372
Test2 - activate.py

Comment 6 Beniamino Galvani 2020-05-29 13:17:50 UTC

Created attachment 1693373
Test2 - setup.sh (needed by activate.py)

Comment 7 Beniamino Galvani 2020-06-08 13:43:49 UTC

Since NM can activate hundreds of devices in a few seconds, I think we can consider this bz done.

Comment 10 Vladimir Benes 2020-07-20 13:34:38 UTC

A test that adds and activates 100 devices in less than 7s has been added to CI:
https://gitlab.freedesktop.org/NetworkManager/NetworkManager-ci/-/merge_requests/606
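
The shape of such a check can be sketched as below. This is not the code from the merge request: the helper names, the worker count and the use of "nmcli device connect" are assumptions for illustration, and the activation step is passed in as a callable so the timing logic can be tried without NetworkManager.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def activate_all(names, activate, budget_s=7.0, workers=32):
    """Activate all devices in parallel and check the time budget.

    `activate` is called once per device name and returns True on
    success. Returns (elapsed_seconds, failed_names, within_budget).
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(activate, names))
    elapsed = time.monotonic() - start
    failed = [name for name, ok in zip(names, results) if not ok]
    return elapsed, failed, (not failed and elapsed < budget_s)


def nmcli_activate(name):
    """Hypothetical activation step: ask NM to connect one device."""
    proc = subprocess.run(
        ["nmcli", "device", "connect", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode == 0


if __name__ == "__main__":
    devices = [f"veth{i}" for i in range(100)]
    elapsed, failed, ok = activate_all(devices, nmcli_activate)
    print(f"elapsed={elapsed:.1f}s failed={len(failed)} within_budget={ok}")
```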

Comment 13 errata-xmlrpc 2020-11-04 01:48:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4499