1231526 – nmcli slow with large numbers of VLANs

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1231526 - nmcli slow with large numbers of VLANs

Summary: nmcli slow with large numbers of VLANs

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	NetworkManager
Sub Component:
Version:	7.1
Hardware:	Unspecified
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Lubomir Rintel
QA Contact:	Desktop QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1393481
TreeView+	depends on / blocked

Reported:	2015-06-14 12:51 UTC by Jeremy Harris
Modified:	2020-12-11 11:50 UTC (History)
CC List:	12 users (show)
Fixed In Version:	NetworkManager-1.8.0-0.4.rc1.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-08-01 09:17:07 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
vlan interface creation script (285 bytes, application/x-shellscript) 2015-06-14 12:51 UTC, Jeremy Harris	no flags	Details
vlan interface online script (129 bytes, text/plain) 2015-06-14 12:53 UTC, Jeremy Harris	no flags	Details
Callgrind output for 'nmcli general' for NM 1.0.2 (851.15 KB, text/plain) 2015-06-15 16:19 UTC, Jirka Klimes	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:2299	0	normal	SHIPPED_LIVE	Moderate: NetworkManager and libnl3 security, bug fix and enhancement update	2017-08-01 12:40:28 UTC

Description Jeremy Harris 2015-06-14 12:51:44 UTC

Created attachment 1038540 [details]
vlan interface creation script

Description of problem:
 ifup becomes slow when large numbers of VLANs are created


Version-Release number of selected component (if applicable):
 1:NetworkManager-1.0.0-14.git20150121.b4ea599c.el7.x86_64
 kernel 3.10.0-229.el7.x86_64

How reproducible:
 100%

Steps to Reproduce:
1. create scripts for 254 VLANs
2. loop over the interface names doing ifup
3.

Actual results:
 In a VM on a laptop, the last "ifup" takes multiple seconds to complete.
 The sequence as a whole appears to show quadratic behaviour

Expected results:
 Better performance.

Additional info:

 Strace of the "ifup" shows 3 slow "nmcli" operations and one slow "grep" -
each on order of 3 seconds.  The first "nmcli" is a simple status inquiry:
"nmcli -t --fields running general status"... and it does over 20,000 write
syscalls (and equivalently large numbers of other syscalls).  This is repeatable
manually (with the VLANs inplace).  On a fresh boot without the VLANs, only 168 write syscalls are done.

Comment 1 Jeremy Harris 2015-06-14 12:53:09 UTC

Created attachment 1038552 [details]
vlan interface online script

Comment 3 Jirka Klimes 2015-06-15 16:15:38 UTC

I can reproduce the slowness by creating many devices (using the script from Description) and then do any operation with nmcli.

For example:
$ nmcli dev | wc -l
263
$ time nmcli general
STATE      CONNECTIVITY  WIFI-HW  WIFI     WWAN-HW  WWAN    
connected  full          enabled  enabled  enabled  enabled 

real	0m4.912s
user	0m2.475s
sys	0m0.437s

The problem is not syscalls per se, but rather in intensive usage of glib in libnm library. The nm_client_new() function itself takes about 50% of the instructions.
I profiled 'nmcli general' command with valgrind's callgrind. The log is attached in the next comment. Data can be displayed with:
$ callgrind_annotate callgrind.out.23830
or better
$ kcachegrind callgrind.out.23830
which is very nice GUI tool to show the data in nice views with callgraphs, maps, etc.

Unfortunately, I don't see any simple culprit or a low hanging fruit there. It is just obvious that the most intensive functions are memory management and various glib functions, because libnm calls them too many times (1 - 3 million), which seems wrong.

Useful link:
http://c.learncodethehardway.org/book/ex41.html

Comment 4 Jirka Klimes 2015-06-15 16:19:35 UTC

Created attachment 1039130 [details]
Callgrind output for 'nmcli general' for NM 1.0.2


Callgrind output generated by:
valgrind --tool=callgrind nmcli general
on Fedora 22 with NetworkManager-1.0.2-1.fc22.x86_64

Dislpay data with:
kcachegrind callgrind.out.23830

Comment 5 Jeremy Harris 2015-06-16 10:27:47 UTC

Looks like dbus operations done by g_initable_init could be worth a look
(indeed, is that table needed for a plain status equiry?  Not that other nmcli uses shouldn't be faster too, but...).

Comment 6 Thomas Haller 2015-06-16 14:14:00 UTC

as libnm currently is, it fetches ~everything~ on initialization. There might be some places to optimize the fetching. But in the end loading everything will take some time on larger systems.

We should investigate fetch on-demand for libnm.

Comment 9 Dan Williams 2016-01-19 16:53:35 UTC

Lubomir suggested that porting libnm to use the GDBus ObjectManager interfaces to talk to NetworkManager (which NM 1.2 already implements service-side) is a possible fix here.  We want to do that anyway to work around issues with D-Bus Policy on pending reply maximums.

Comment 11 Lubomir Rintel 2017-01-16 10:28:06 UTC

This is already in upstream master, will make it into 7.4

Comment 13 Thomas Haller 2017-03-06 20:38:36 UTC

All currently planned work to improve performance is already in upstream master, and hence part of upcoming rhel-7.4.

According to our tests, it significantly improves performance of nmcli/libnm.
I am marking this bug as fixed, although in the future we should find ways to improve performance further.

Comment 15 Vladimir Benes 2017-06-02 07:30:33 UTC

nmcli shouldn't be much affected by a lot of devices/connections now

Comment 16 errata-xmlrpc 2017-08-01 09:17:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2299

Note You need to log in before you can comment on or make changes to this bug.