Bug 609228 - network interface functions cause very high CPU utilization in libvirtd
Summary: network interface functions cause very high CPU utilization in libvirtd
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: augeas
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Lutterkort
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Duplicates: 609214 (view as bug list)
Depends On: 600141
Blocks:
 
Reported: 2010-06-29 17:25 UTC by Laine Stump
Modified: 2010-11-10 19:54 UTC (History)
CC: 15 users

Fixed In Version: augeas-0.7.2-2
Doc Type: Bug Fix
Doc Text:
Clone Of: 600141
Environment:
Last Closed: 2010-11-10 19:54:55 UTC
Target Upstream Version:
Embargoed:



Description Laine Stump 2010-06-29 17:25:08 UTC
+++ This bug was initially created as a clone of Bug #600141 +++

Description of problem:

top - 16:07:01 up 11 days,  4:56,  6 users,  load average: 0.14, 0.11, 0.05
Tasks: 201 total,   1 running, 200 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.7%us,  2.9%sy,  0.0%ni, 83.1%id,  2.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1982808k total,  1942568k used,    40240k free,    98920k buffers
Swap:  3964920k total,    69992k used,  3894928k free,   621348k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
20194 root      20   0  608m  41m 3512 S 13.3  2.1   0:06.98 libvirtd 

Version-Release number of selected component (if applicable):

I've seen this with all versions of libvirt, on RHEL 5, RHEL 6, and Fedora, on different machines and CPUs. Filing upstream because that is where it should be fixed.

How reproducible: 100%

Steps to Reproduce:
1. Start virt-manager
2. run top
3. ... profit?
  
Actual results:

Libvirtd sits around consuming 13-20% of CPU even when completely idle and no VMs are running.  Modern software shouldn't be doing whatever kind of continuous polling it is using to rack up such CPU time.  If I leave the system alone, overnight, libvirtd will rack up more CPU than X, firefox, and all other idle processes combined.

Expected results:


Additional info:

--- Additional comment from crobinso on 2010-06-09 11:24:27 EDT ---

It's because virt-manager is running. virt-manager polls libvirt to detect VM/network/storage state changes, and this polling causes lots of CPU churn. Libvirt supports async lifecycle events for domains, but virt-manager doesn't use those APIs yet. The other objects (network, storage, interface, host devices) don't have async APIs.

So the solution is:

1) Add async lifecycle APIs for all libvirt objects
2) Have virt-manager actually use those APIs
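The polling-versus-events distinction can be sketched with a toy observer in Python. This is illustrative only, not the real libvirt API: the actual async mechanism for domains is virConnectDomainEventRegisterAny with VIR_DOMAIN_EVENT_ID_LIFECYCLE, which virt-manager did not yet consume at the time.

```python
class DomainRegistry:
    """Toy event source: callers register callbacks instead of polling."""

    def __init__(self):
        self.state = {}             # domain name -> "running" | "shutoff"
        self.callbacks = []

    def register(self, cb):
        # Async style: subscribe once, no repeated list/poll calls.
        self.callbacks.append(cb)

    def set_state(self, name, state):
        # Push a notification only when something actually changed,
        # so an idle system generates no work at all.
        if self.state.get(name) != state:
            self.state[name] = state
            for cb in self.callbacks:
                cb(name, state)
```

With callbacks, an idle daemon does no per-interval work, which is exactly what the polling loop in virt-manager could not avoid.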

--- Additional comment from berrange on 2010-06-25 07:23:05 EDT ---

libvirtd shows about 8% CPU usage when virt-manager is running on my system. Evidence is better than guesswork, so I ran oprofile against libvirtd.

Augeas comes out on top by a country mile, followed by loads of malloc/memcpy stuff which I bet is all due to augeas too:

samples  %        image name               symbol name
31677    41.0255  libaugeas.so.0.10.2      /usr/lib64/libaugeas.so.0.10.2
14469    18.7391  libc-2.11.2.so           wcscoll_l
5401      6.9949  libc-2.11.2.so           wcscmp
4913      6.3629  libc-2.11.2.so           _int_free
4020      5.2064  libc-2.11.2.so           _int_malloc
3780      4.8955  libc-2.11.2.so           memcpy
2486      3.2197  libc-2.11.2.so           malloc
1676      2.1706  libc-2.11.2.so           malloc_consolidate
1446      1.8727  libc-2.11.2.so           memset
1359      1.7601  libc-2.11.2.so           wcscoll
1299      1.6824  libc-2.11.2.so           realloc
1097      1.4207  libc-2.11.2.so           calloc
714       0.9247  libc-2.11.2.so           free
517       0.6696  libc-2.11.2.so           _int_realloc
278       0.3600  libc-2.11.2.so           __GI___strncmp_ssse3
253       0.3277  libc-2.11.2.so           __strstr_sse2
174       0.2254  libc-2.11.2.so           __strlen_sse2
116       0.1502  libc-2.11.2.so           vfprintf
111       0.1438  libc-2.11.2.so           strnlen
106       0.1373  libpthread-2.11.2.so     pthread_mutex_lock
95        0.1230  libvirtd                 virLogMessage
73        0.0945  libpthread-2.11.2.so     pthread_mutex_unlock
66        0.0855  libc-2.11.2.so           __strchr_sse2
64        0.0829  libc-2.11.2.so           __GI___strcmp_ssse3
61        0.0790  libc-2.11.2.so           __ctype_b_loc
45        0.0583  libc-2.11.2.so           btowc
39        0.0505  libc-2.11.2.so           _IO_default_xsputn
35        0.0453  libc-2.11.2.so           strcat
30        0.0389  libc-2.11.2.so           strndup
30        0.0389  libvirtd                 virEventRunOnce


Editing virt-manager code to disable the 'update_interfaces' method in connection.py and re-run the oprofile test, libvirtd now consumes < 1% CPU and oprofile shows a completely different trace:


samples  %        image name               symbol name
1519     42.7165  libc-2.11.2.so           memset
346       9.7300  libc-2.11.2.so           __strstr_sse2
214       6.0180  libpthread-2.11.2.so     pthread_mutex_lock
136       3.8245  libc-2.11.2.so           __strchr_sse2
130       3.6558  libvirtd                 virLogMessage
117       3.2902  libpthread-2.11.2.so     pthread_mutex_unlock
77        2.1654  libc-2.11.2.so           vfprintf
57        1.6029  libc-2.11.2.so           _int_malloc
53        1.4904  libvirtd                 virEventRunOnce
45        1.2655  libc-2.11.2.so           calloc
40        1.1249  libc-2.11.2.so           _int_free
25        0.7030  libc-2.11.2.so           malloc_consolidate
21        0.5906  libc-2.11.2.so           _IO_default_xsputn
20        0.5624  libc-2.11.2.so           __strlen_sse2
20        0.5624  libc-2.11.2.so           poll
20        0.5624  libvirtd                 nodeListDevices
19        0.5343  libpthread-2.11.2.so     pthread_cond_wait@@GLIBC_2.3.2
19        0.5343  libvirt.so.0.8.1         virHashComputeKey
19        0.5343  libvirtd                 qemudDispatchClientEvent
18        0.5062  libvirtd                 remoteDispatchClientCall


So the problem appears to be that augeas (and netcf by implication) has exceedingly high CPU utilization :-(  Need advice from David on how to proceed here. Maybe augeas simply hasn't had any performance optimization work done on it yet ? Otherwise maybe we need to change netcf to not use augeas+xslt for converting the ifcfg files into XML format.

--- Additional comment from berrange on 2010-06-25 07:59:18 EDT ---

A more targeted profile this time, from just running ncftool directly:

samples  %        image name               symbol name
85058    20.7509  libc-2.11.2.so           wcscoll_l
29644     7.2320  libc-2.11.2.so           wcscmp
26191     6.3896  libc-2.11.2.so           _int_malloc
25622     6.2508  libaugeas.so.0.11.0      parse_expression
18443     4.4994  libxml2.so.2.7.6         /usr/lib64/libxml2.so.2.7.6
16999     4.1471  libc-2.11.2.so           memcpy
13354     3.2579  libc-2.11.2.so           _int_free
9402      2.2937  libaugeas.so.0.11.0      set_regs
9019      2.2003  libc-2.11.2.so           wcscoll
8788      2.1439  libc-2.11.2.so           malloc_consolidate
8457      2.0632  libaugeas.so.0.11.0      re_node_set_add_intersect
8062      1.9668  libfa.so.1.3.1           re_as_string
7832      1.9107  libfa.so.1.3.1           cset_contains
7343      1.7914  libaugeas.so.0.11.0      re_node_set_contains
6031      1.4713  libaugeas.so.0.11.0      re_acquire_state
5113      1.2474  libaugeas.so.0.11.0      sift_states_backward
4680      1.1417  libaugeas.so.0.11.0      re_node_set_insert_last
4680      1.1417  libaugeas.so.0.11.0      re_search_internal
4587      1.1191  libaugeas.so.0.11.0      re_node_set_insert
4451      1.0859  libc-2.11.2.so           calloc
4315      1.0527  libc-2.11.2.so           free
4189      1.0220  libc-2.11.2.so           malloc

--- Additional comment from crobinso on 2010-06-25 10:08:25 EDT ---

I think every interface 'list' operation causes netcf to parse /etc/sysconfig/network-scripts/*, which would explain things.

--- Additional comment from zamsden on 2010-06-25 17:07:26 EDT ---

I'm more than happy to run some profiling here as well.  On my particular system (F13 with encrypted disk on dual core Turion64) it's enough to make the system unresponsive and barely usable with a 2-VCPU VM.

If the network scripts are implicated, mine are pretty complex: I'm running split DNS, wired/wireless bonding, and a vpnc tunnel. It's possible one of those things exaggerates the cost of this quite a bit.

--- Additional comment from lutter on 2010-06-28 20:30:48 EDT ---

Cole is right: every netcf operation causes the network scripts to be parsed again. There's not enough change tracking to guard against other programs modifying those files in between netcf operations.

There are a few ways to address that, varying in complexity and hackishness:

(1) Instead of reparsing the files on every netcf operation, only reparse after some small amount of time
(2) Within netcf, watch pertinent files with inotify and reread them only upon actual changes (or complain if both netcf and some outside program try to make changes)
(3) Do the same within augeas

Of course, another option would be to reduce how often virt-manager does an ncf_list.
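The stat-before-reparse idea behind these options can be sketched in a few lines of Python. This is a hypothetical FileCache, not netcf or augeas code: a file is reparsed only when stat() reports a changed mtime.

```python
import os

class FileCache:
    """Toy mtime-based cache: reparse a config file only when it
    has changed on disk since the last parse."""

    def __init__(self, parse):
        self.parse = parse          # function: path -> parsed representation
        self.entries = {}           # path -> (mtime, parsed)

    def get(self, path):
        mtime = os.stat(path).st_mtime
        cached = self.entries.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]        # unchanged: skip the expensive reparse
        parsed = self.parse(path)   # new or modified: reparse once
        self.entries[path] = (mtime, parsed)
        return parsed
```

The trade-off is the one David describes below: a cheap stat() per file per operation instead of a full reparse, at the cost of a little bookkeeping.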

Comment 2 RHEL Program Management 2010-06-29 17:43:07 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 3 Laine Stump 2010-06-29 18:04:26 UTC
This BZ has been added to track necessary changes to augeas and/or netcf to
avoid unnecessary re-reading of unmodified config files, either by using
inotify, or just by stat'ing each file before it is re-read to see if it's been
modified.

There is also some talk of improving the efficiency of the parsing, but nothing
concrete there yet.

Comment 4 David Lutterkort 2010-06-29 19:24:03 UTC
I am working on avoiding all those reparses transparently in augeas - with that change, reparsing unmodified files will be avoided at the cost of doing stats on those files and some internal bookkeeping.

Comment 5 David Lutterkort 2010-06-30 02:19:47 UTC
Please see corresponding Fedora bug for progress: https://bugzilla.redhat.com/show_bug.cgi?id=600141#c7

Comment 6 David Lutterkort 2010-06-30 15:47:43 UTC
Since basic smoke tests by Laine and Cole indicate that it does improve the situation dramatically and does not cause regressions, this should be fixed with augeas-0.7.2-2.

Comment 8 Cole Robinson 2010-07-02 14:44:45 UTC
*** Bug 609214 has been marked as a duplicate of this bug. ***

Comment 9 Mohua Li 2010-08-09 09:34:51 UTC
Hi, I'm not quite sure about the reproduction steps. I did see in augeas git that aug_load was changed to load only updated/modified files. Here are my steps to verify the bug.

Software versions:
augeas-libs-0.7.2-2.el6.x86_64
augeas-0.7.2-2.el6.x86_64
libvirt-0.8.1-21.el6.x86_64


1. Run virt-manager with 2 network scripts available (eth0, eth1).
2. Keep virt-manager running for a long time, like 3 hours.
3. top -p $libvirt_pid
4. If the bug is not resolved, virt-manager will keep polling the network configuration, augeas will keep reparsing all the network profiles, and libvirtd will show high CPU utilization. In my test, however, I didn't see the CPU utilization grow noticeably. Does this mean the bug is fixed?


Can someone help me check whether these steps are right, or whether I missed something?


+    /* To avoid unnecessary loads of files, we reload an existing file in
+     * several steps:
+     * (1) mark all file nodes under /augeas/files as dirty (and only those)
+     * (2) process all files matched by a lens; we check (in
+     *     transform_load) if the file has been modified. If it has, we
+     *     reparse it. Either way, we clear the dirty flag. We also need to
+     *     reread the file if part or all of it has been modified in the
+     *     tree but not been saved yet
+     * (3) remove all files from the tree that still have a dirty entry
+     *     under /augeas/files. Those files are not processed by any lens
+     *     anymore
+     * (4) Remove entries from /augeas/files and /files that correspond
+     *     to directories without any files of interest
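The four steps quoted above can be modeled with a toy Python function. This is not the actual augeas tree code, just a sketch of the dirty-flag reload; step (4), pruning empty directory entries, is omitted.

```python
import os

def reload_files(tree, parse, paths_of_interest):
    """Toy model of the reload: `tree` maps path -> (mtime, parsed),
    `parse` reparses one file, `paths_of_interest` is what a lens matches."""
    # (1) mark every known file node as dirty
    dirty = set(tree)
    # (2) process each file matched by a lens: reparse only if modified,
    #     then clear its dirty flag either way
    for path in paths_of_interest:
        try:
            mtime = os.stat(path).st_mtime
        except FileNotFoundError:
            continue
        if path not in tree or tree[path][0] != mtime:
            tree[path] = (mtime, parse(path))
        dirty.discard(path)
    # (3) drop files that are still dirty: no lens processes them anymore
    for path in dirty:
        del tree[path]
    return tree
```

Repeated calls on an unchanged directory then cost only stat() calls, which is the behavior the fix in augeas-0.7.2-2 is meant to achieve.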

Comment 10 David Lutterkort 2010-08-09 14:09:08 UTC
(In reply to comment #9)
> hi, i'm not quite sure the reproduce steps, i did saw in augeas git, the
> aug_load changed to load only update/modify file,  and here is my steps to
> verify the bug,

Yes, those are the right steps to verify the bug - it's a little up to interpretation what constitutes 'high load'. Certainly, the 13-20% of CPU usage that was initially reported was unacceptable.

I'd say if you see loads that are significantly lower than that, this should be considered fixed.

Comment 11 Mohua Li 2010-08-10 04:32:38 UTC
According to comments 9 and 10: in my test, after running virt-manager for several hours with one guest running and several guests shut off, the load of libvirtd was only about 0.7-3%, which is much lower than the initially reported 13-20%, so the bug is fixed.

Comment 12 releng-rhel@redhat.com 2010-11-10 19:54:55 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

