Bug 1231526
Summary: | nmcli slow with large numbers of VLANs | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Jeremy Harris <jeharris> | ||||||||
Component: | NetworkManager | Assignee: | Lubomir Rintel <lrintel> | ||||||||
Status: | CLOSED ERRATA | QA Contact: | Desktop QE <desktop-qa-list> | ||||||||
Severity: | medium | Docs Contact: | |||||||||
Priority: | medium | ||||||||||
Version: | 7.1 | CC: | aiyengar, aloughla, atragler, bgalvani, dcbw, kzhang, lrintel, mleitner, rkhan, sukulkar, thaller, vbenes | ||||||||
Target Milestone: | rc | ||||||||||
Target Release: | --- | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Linux | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | NetworkManager-1.8.0-0.4.rc1.el7 | Doc Type: | If docs needed, set a value | ||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2017-08-01 09:17:07 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 1393481 | ||||||||||
Attachments: |
|
Created attachment 1038552 [details]
vlan interface online script
I can reproduce the slowness by creating many devices (using the script from the Description) and then running any nmcli operation. For example:

$ nmcli dev | wc -l
263
$ time nmcli general
STATE      CONNECTIVITY  WIFI-HW  WIFI     WWAN-HW  WWAN
connected  full          enabled  enabled  enabled  enabled

real 0m4.912s
user 0m2.475s
sys 0m0.437s

The problem is not the syscalls per se, but rather the intensive use of glib in the libnm library. The nm_client_new() function alone accounts for about 50% of the instructions.

I profiled the 'nmcli general' command with valgrind's callgrind. The log is attached in the next comment. The data can be displayed with:

$ callgrind_annotate callgrind.out.23830

or, better:

$ kcachegrind callgrind.out.23830

which is a very nice GUI tool that presents the data in views with call graphs, maps, etc.

Unfortunately, I don't see any simple culprit or low-hanging fruit there. It is just obvious that the most expensive functions are memory management and various glib functions, because libnm calls them too many times (1 - 3 million), which seems wrong.

Useful link: http://c.learncodethehardway.org/book/ex41.html

Created attachment 1039130 [details]
Callgrind output for 'nmcli general' for NM 1.0.2
Callgrind output generated by:
valgrind --tool=callgrind nmcli general
on Fedora 22 with NetworkManager-1.0.2-1.fc22.x86_64
Display data with:
kcachegrind callgrind.out.23830
Looks like the dbus operations done by g_initable_init could be worth a look (indeed, is that table needed for a plain status inquiry? Not that other nmcli uses shouldn't be faster too, but...).

As libnm currently is, it fetches ~everything~ on initialization. There might be some places to optimize the fetching, but in the end loading everything will take some time on larger systems. We should investigate fetch on-demand for libnm.

Lubomir suggested that porting libnm to use the GDBus ObjectManager interfaces to talk to NetworkManager (which NM 1.2 already implements service-side) is a possible fix here. We want to do that anyway to work around issues with the D-Bus policy on pending reply maximums.

This is already in upstream master and will make it into 7.4.

All currently planned work to improve performance is already in upstream master, and hence part of the upcoming rhel-7.4. According to our tests, it significantly improves the performance of nmcli/libnm. I am marking this bug as fixed, although in the future we should find ways to improve performance further.

nmcli shouldn't be much affected by a large number of devices/connections now.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2299 |
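The difference between fetching everything on initialization and fetching on demand, as discussed in the comments above, can be illustrated with a small Python sketch. This is not libnm's API: the class names, the `fetch` callback and the `vlanN` device names are all hypothetical, and the counter merely stands in for D-Bus round trips.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Props = Dict[str, str]
Fetch = Callable[[str], Props]


class EagerClient:
    """Hypothetical client that loads every device's properties at init."""

    def __init__(self, device_names: Iterable[str], fetch: Fetch) -> None:
        # One fetch per device, paid up front even if nothing is ever read.
        self._props = {name: fetch(name) for name in device_names}

    def get(self, name: str) -> Props:
        return self._props[name]


class LazyClient:
    """Hypothetical client that fetches a device only when it is asked for."""

    def __init__(self, device_names: Iterable[str], fetch: Fetch) -> None:
        self._names = set(device_names)
        self._fetch = fetch
        self._cache: Dict[str, Props] = {}

    def get(self, name: str) -> Props:
        if name not in self._names:
            raise KeyError(name)
        if name not in self._cache:
            # First access pays for the fetch; later accesses hit the cache.
            self._cache[name] = self._fetch(name)
        return self._cache[name]


def make_counting_fetch() -> Tuple[Fetch, List[str]]:
    """Return a fake fetch function plus the list of names it was called with."""
    calls: List[str] = []

    def fetch(name: str) -> Props:
        calls.append(name)
        return {"name": name}

    return fetch, calls


if __name__ == "__main__":
    names = [f"vlan{i}" for i in range(250)]  # made-up device names

    fetch, calls = make_counting_fetch()
    EagerClient(names, fetch)
    print("eager fetches at init:", len(calls))

    fetch, calls = make_counting_fetch()
    lazy = LazyClient(names, fetch)
    lazy.get("vlan7")
    print("lazy fetches after one lookup:", len(calls))
```

Under these assumptions a plain status query that never touches a device would cost nothing per device in the lazy variant, while the eager variant scales with the number of devices regardless of what is queried.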
Created attachment 1038540 [details]
vlan interface creation script

Description of problem:
ifup becomes slow when large numbers of VLANs are created

Version-Release number of selected component (if applicable):
1:NetworkManager-1.0.0-14.git20150121.b4ea599c.el7.x86_64
kernel 3.10.0-229.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. create scripts for 254 VLANs
2. loop over the interface names doing ifup
3.

Actual results:
In a VM on a laptop, the last "ifup" takes multiple seconds to complete. The sequence as a whole appears to show quadratic behaviour.

Expected results:
Better performance.

Additional info:
Strace of the "ifup" shows 3 slow "nmcli" operations and one slow "grep", each on the order of 3 seconds. The first "nmcli" is a simple status inquiry: "nmcli -t --fields running general status"... and it does over 20,000 write syscalls (and equivalently large numbers of other syscalls). This is repeatable manually (with the VLANs in place). On a fresh boot without the VLANs, only 168 write syscalls are done.
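The quadratic behaviour noted above follows if each "ifup" costs work proportional to the number of VLANs already present. A minimal Python sketch of that arithmetic, assuming a purely linear per-call cost (the unit cost and the function name are illustrative, not measured):

```python
def cumulative_cost(n_vlans: int, cost_per_existing: int = 1) -> int:
    """Total work to bring up n_vlans interfaces one at a time.

    Assumes the k-th ifup costs cost_per_existing * k units, i.e. each call
    does work proportional to the number of VLANs present (an assumed model).
    """
    return sum(cost_per_existing * k for k in range(1, n_vlans + 1))


if __name__ == "__main__":
    # Doubling the number of VLANs roughly quadruples the total work.
    for n in (127, 254):
        print(n, cumulative_cost(n))
```

Under this model the total is n(n+1)/2 units, so the sequence as a whole grows with the square of the VLAN count even though each individual call is only linear.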