Bug 1711215
| Summary: | Improve NetworkManager's performance with many devices | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Thomas Haller <thaller> | ||||||||
| Component: | NetworkManager | Assignee: | Beniamino Galvani <bgalvani> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Desktop QE <desktop-qa-list> | ||||||||
| Severity: | medium | Docs Contact: | |||||||||
| Priority: | medium | ||||||||||
| Version: | 8.0 | CC: | aloughla, atragler, bgalvani, fgiudici, jmaxwell, lrintel, pasik, rkhan, sukulkar, thaller, vbenes | ||||||||
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ | ||||||||
| Target Release: | 8.0 | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | NetworkManager-1.26.0-0.1.el8 | Doc Type: | If docs needed, set a value | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | |||||||||||
| : | 1847125 | Environment: | |||||||||
| Last Closed: | 2020-11-04 01:48:32 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 1807630, 1825061, 1847125 | ||||||||||
| Attachments: | ||||||||||
Parking this for 8.3. -Sushil

The long delay to reach activation seems to be caused more by a bottleneck
in dnsmasq than by NM. If I change Thomas' script to launch dnsmasq with
'--no-ping', then 100 devices can activate in a few seconds. As far as I
can tell, dnsmasq by default pings an address to determine whether
it is free, and serializes all those requests; with many
devices that mechanism becomes very slow and somewhat unreliable.
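
For illustration only, here is a minimal sketch of how a test harness might launch dnsmasq with '--no-ping' inside a namespace. The helper name `dnsmasq_argv`, the namespace name, the bridge name and the address range are all assumptions; none of them come from the attached script. Only the dnsmasq options themselves are real.

```python
import shlex


def dnsmasq_argv(netns, interface, range_start, range_end, lease="1h", no_ping=True):
    """Build an argv list that runs dnsmasq as a DHCP server inside a network namespace.

    Hypothetical helper: all names and defaults here are placeholders.
    """
    argv = [
        "ip", "netns", "exec", netns,
        "dnsmasq",
        "--keep-in-foreground",       # stay attached so the caller can stop it
        "--conf-file=/dev/null",      # ignore the system-wide configuration
        "--port=0",                   # DHCP only, no DNS service
        "--bind-interfaces",
        f"--interface={interface}",
        f"--dhcp-range={range_start},{range_end},{lease}",
    ]
    if no_ping:
        # Skip the ICMP echo probe dnsmasq sends before offering an address.
        argv.append("--no-ping")
    return argv


if __name__ == "__main__":
    # Dry run: print the command instead of executing it (running it needs root).
    print(shlex.join(dnsmasq_argv("ns-dhcp", "br0", "172.16.0.10", "172.16.255.200")))
```

Passing `no_ping=False` gives the default behaviour, which makes it easy to compare the two cases with the same harness.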
I also prepared a couple of scripts to create many veth devices and
measure the time to complete DHCP on them in parallel. These are the
results in a VM with 4 cores, with the avahi and NM-dispatcher services
masked to save CPU usage:
Devices Time (s)
50 1
100 2
150 4
200 5
250 7
300 11
350 13
400 15
450 18
500 21
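
To make the scaling easier to read, the numbers above can be converted to a cost per device. This is only arithmetic on the table; the function names are made up for the example, and since the times are whole seconds the marginal figures are coarse.

```python
# (devices, seconds) pairs copied from the table above.
RESULTS = [
    (50, 1), (100, 2), (150, 4), (200, 5), (250, 7),
    (300, 11), (350, 13), (400, 15), (450, 18), (500, 21),
]


def ms_per_device(devices, seconds):
    """Average activation cost per device, in milliseconds."""
    return 1000.0 * seconds / devices


def marginal_ms_per_device(results):
    """Cost of each additional device between consecutive rows, in milliseconds."""
    out = []
    for (d0, t0), (d1, t1) in zip(results, results[1:]):
        out.append((d1, 1000.0 * (t1 - t0) / (d1 - d0)))
    return out


if __name__ == "__main__":
    for devices, seconds in RESULTS:
        print(f"{devices:4d} devices: {ms_per_device(devices, seconds):5.1f} ms/device")
    for devices, cost in marginal_ms_per_device(RESULTS):
        print(f"up to {devices:4d}: {cost:5.1f} ms per extra device")
```

The average goes from 20 ms per device at 50 devices to 42 ms per device at 500, so the growth is somewhat worse than linear but still far from the original behaviour.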
Created attachment 1693372
Test2 - activate.py
Created attachment 1693373
Test2 - setup.sh (needed by activate.py)
Since NM can activate hundreds of devices in a few seconds, I think we can consider this bz done.

A test that adds and activates 100 devices in less than 7s was added to CI:
https://gitlab.freedesktop.org/NetworkManager/NetworkManager-ci/-/merge_requests/606

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4499
Created attachment 1569988
test script

I wrote a naive script that creates a number of veth devices, all connected to a bridge (in another namespace) that runs dnsmasq. NetworkManager then creates an auto-default connection and attempts DHCP on them.

That does not work well:
- devices take a long time to reach full activation.
- some devices time out and end up "disconnected" (for good!! Where is the retry?)
- some devices stay in state "unavailable"
- generally, there is a high CPU load.

Now, I might have made some mistakes in the script (like dnsmasq not replying quickly to DHCP requests). But for the CPU load there is no excuse.

The script contains some "sleep" calls. If you remove them, it only gets worse.

The goal of this bug is to be able to run the script, creating at least 100 devices, without problems.
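
The attached script is not reproduced here. As a rough sketch of the topology described (veth devices whose peers sit on a bridge in another namespace), the `ip` commands could be generated like this. The function name, namespace name, interface names and gateway address are assumptions for the example; whether NetworkManager then manages the host ends depends on the local configuration and is not covered.

```python
import shlex


def topology_commands(count, netns="ns-dhcp", bridge="br0", gateway="172.16.0.1/16"):
    """Return `ip` argv lists for a bridge in `netns` with `count` veth pairs attached.

    Hypothetical helper: every name and address is a placeholder.
    """
    cmds = [
        ["ip", "netns", "add", netns],
        ["ip", "-n", netns, "link", "add", bridge, "type", "bridge"],
        ["ip", "-n", netns, "addr", "add", gateway, "dev", bridge],
        ["ip", "-n", netns, "link", "set", bridge, "up"],
    ]
    for i in range(count):
        # Short names keep us well under the 15-character interface name limit.
        host, peer = f"vt{i}", f"vt{i}p"
        cmds += [
            # The host end stays in the root namespace; the peer is created in netns.
            ["ip", "link", "add", host, "type", "veth",
             "peer", "name", peer, "netns", netns],
            ["ip", "-n", netns, "link", "set", peer, "master", bridge],
            ["ip", "-n", netns, "link", "set", peer, "up"],
            ["ip", "link", "set", host, "up"],
        ]
    return cmds


if __name__ == "__main__":
    # Dry run: print the commands; executing them requires root.
    for argv in topology_commands(100):
        print(shlex.join(argv))
```

Generating the commands as data rather than running them directly keeps the sketch safe to try, and the list can be fed to `subprocess.run` one entry at a time by a privileged harness.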