Bug 609214 - virt-manager: interface polling causes very high CPU utilization in libvirtd
Summary: virt-manager: interface polling causes very high CPU utilization in libvirtd
Keywords:
Status: CLOSED DUPLICATE of bug 609228
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: virt-manager
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Cole Robinson
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-06-29 16:52 UTC by Cole Robinson
Modified: 2010-07-02 14:44 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 600141
Environment:
Last Closed: 2010-07-02 14:44:45 UTC
Target Upstream Version:
Embargoed:


Attachments
Slow down interface polling (3.55 KB, text/plain)
2010-06-29 17:28 UTC, Cole Robinson

Description Cole Robinson 2010-06-29 16:52:34 UTC
Cloning from the F13 libvirt bug to RHEL6 virt-manager, where we will work around this for RHEL6 GA

+++ This bug was initially created as a clone of Bug #600141 +++

Description of problem:

top - 16:07:01 up 11 days,  4:56,  6 users,  load average: 0.14, 0.11, 0.05
Tasks: 201 total,   1 running, 200 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.7%us,  2.9%sy,  0.0%ni, 83.1%id,  2.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1982808k total,  1942568k used,    40240k free,    98920k buffers
Swap:  3964920k total,    69992k used,  3894928k free,   621348k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
20194 root      20   0  608m  41m 3512 S 13.3  2.1   0:06.98 libvirtd 

Version-Release number of selected component (if applicable):

I've seen this with all versions of libvirt, on RHEL 5, RHEL 6, and Fedora, on different machines and CPUs.  Filing upstream because that is where it should be fixed.

How reproducible: 100%

Steps to Reproduce:
1. Start virt-manager
2. run top
3. Observe libvirtd's CPU usage while the system is otherwise idle
  
Actual results:

libvirtd sits at 13-20% CPU even when the system is completely idle and no VMs are running.  Modern software shouldn't be doing the kind of continuous polling needed to rack up that much CPU time.  If I leave the system alone overnight, libvirtd will accumulate more CPU time than X, firefox, and all other idle processes combined.

Expected results:


Additional info:

--- Additional comment from crobinso on 2010-06-09 11:24:27 EDT ---

It's because virt-manager is running. virt-manager polls libvirt to detect VM/network/storage state changes, and this polling causes lots of CPU churn. Libvirt supports async lifecycle events for domains, but virt-manager doesn't use those APIs yet. The other objects (networks, storage, interfaces, host devices) don't have async APIs at all.

So the solution is:

1) Add async lifecycle APIs for all libvirt objects
2) Have virt-manager actually use those APIs (a sketch of the domain-event approach follows below)
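
For illustration only, here is a minimal sketch of the event-driven alternative for domains, using libvirt-python's domainEventRegisterAny() with VIR_DOMAIN_EVENT_ID_LIFECYCLE. This is not virt-manager code: virEventRegisterDefaultImpl() comes from newer libvirt bindings than this bug was filed against, and a real integration would hook events into virt-manager's GLib main loop rather than the blocking loop shown here.

import libvirt

def lifecycle_cb(conn, dom, event, detail, opaque):
    # Called by libvirt whenever a domain starts, stops, is defined, etc.
    print("domain %s: event=%d detail=%d" % (dom.name(), event, detail))

libvirt.virEventRegisterDefaultImpl()        # install libvirt's simple event loop
conn = libvirt.open("qemu:///system")
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                            lifecycle_cb, None)

while True:                                  # dispatch events as they arrive
    libvirt.virEventRunDefaultImpl()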

--- Additional comment from berrange on 2010-06-25 07:23:05 EDT ---

libvirtd shows about 8% CPU usage when virt-manager is running on my system. Evidence is better than guesswork, so I ran oprofile against libvirtd.

Augeas comes out on top by a country mile, followed by loads of malloc/memcpy stuff, which I bet is all due to augeas too:

samples  %        image name               symbol name
31677    41.0255  libaugeas.so.0.10.2      /usr/lib64/libaugeas.so.0.10.2
14469    18.7391  libc-2.11.2.so           wcscoll_l
5401      6.9949  libc-2.11.2.so           wcscmp
4913      6.3629  libc-2.11.2.so           _int_free
4020      5.2064  libc-2.11.2.so           _int_malloc
3780      4.8955  libc-2.11.2.so           memcpy
2486      3.2197  libc-2.11.2.so           malloc
1676      2.1706  libc-2.11.2.so           malloc_consolidate
1446      1.8727  libc-2.11.2.so           memset
1359      1.7601  libc-2.11.2.so           wcscoll
1299      1.6824  libc-2.11.2.so           realloc
1097      1.4207  libc-2.11.2.so           calloc
714       0.9247  libc-2.11.2.so           free
517       0.6696  libc-2.11.2.so           _int_realloc
278       0.3600  libc-2.11.2.so           __GI___strncmp_ssse3
253       0.3277  libc-2.11.2.so           __strstr_sse2
174       0.2254  libc-2.11.2.so           __strlen_sse2
116       0.1502  libc-2.11.2.so           vfprintf
111       0.1438  libc-2.11.2.so           strnlen
106       0.1373  libpthread-2.11.2.so     pthread_mutex_lock
95        0.1230  libvirtd                 virLogMessage
73        0.0945  libpthread-2.11.2.so     pthread_mutex_unlock
66        0.0855  libc-2.11.2.so           __strchr_sse2
64        0.0829  libc-2.11.2.so           __GI___strcmp_ssse3
61        0.0790  libc-2.11.2.so           __ctype_b_loc
45        0.0583  libc-2.11.2.so           btowc
39        0.0505  libc-2.11.2.so           _IO_default_xsputn
35        0.0453  libc-2.11.2.so           strcat
30        0.0389  libc-2.11.2.so           strndup
30        0.0389  libvirtd                 virEventRunOnce


After editing the virt-manager code to disable the 'update_interfaces' method in connection.py and re-running the oprofile test, libvirtd now consumes < 1% CPU and oprofile shows a completely different trace:


samples  %        image name               symbol name
1519     42.7165  libc-2.11.2.so           memset
346       9.7300  libc-2.11.2.so           __strstr_sse2
214       6.0180  libpthread-2.11.2.so     pthread_mutex_lock
136       3.8245  libc-2.11.2.so           __strchr_sse2
130       3.6558  libvirtd                 virLogMessage
117       3.2902  libpthread-2.11.2.so     pthread_mutex_unlock
77        2.1654  libc-2.11.2.so           vfprintf
57        1.6029  libc-2.11.2.so           _int_malloc
53        1.4904  libvirtd                 virEventRunOnce
45        1.2655  libc-2.11.2.so           calloc
40        1.1249  libc-2.11.2.so           _int_free
25        0.7030  libc-2.11.2.so           malloc_consolidate
21        0.5906  libc-2.11.2.so           _IO_default_xsputn
20        0.5624  libc-2.11.2.so           __strlen_sse2
20        0.5624  libc-2.11.2.so           poll
20        0.5624  libvirtd                 nodeListDevices
19        0.5343  libpthread-2.11.2.so     pthread_cond_wait@@GLIBC_2.3.2
19        0.5343  libvirt.so.0.8.1         virHashComputeKey
19        0.5343  libvirtd                 qemudDispatchClientEvent
18        0.5062  libvirtd                 remoteDispatchClientCall


So the problem appears to be that augeas (and netcf by implication) has exceedingly high CPU utilization :-(  Need advice from David on how to proceed here. Maybe augeas simply hasn't had any performance optimization work done on it yet? Otherwise we may need to change netcf to not use augeas+XSLT for converting the ifcfg files into XML format.

--- Additional comment from berrange on 2010-06-25 07:59:18 EDT ---

A more targeted profile this time, from just running ncftool directly:

samples  %        image name               symbol name
85058    20.7509  libc-2.11.2.so           wcscoll_l
29644     7.2320  libc-2.11.2.so           wcscmp
26191     6.3896  libc-2.11.2.so           _int_malloc
25622     6.2508  libaugeas.so.0.11.0      parse_expression
18443     4.4994  libxml2.so.2.7.6         /usr/lib64/libxml2.so.2.7.6
16999     4.1471  libc-2.11.2.so           memcpy
13354     3.2579  libc-2.11.2.so           _int_free
9402      2.2937  libaugeas.so.0.11.0      set_regs
9019      2.2003  libc-2.11.2.so           wcscoll
8788      2.1439  libc-2.11.2.so           malloc_consolidate
8457      2.0632  libaugeas.so.0.11.0      re_node_set_add_intersect
8062      1.9668  libfa.so.1.3.1           re_as_string
7832      1.9107  libfa.so.1.3.1           cset_contains
7343      1.7914  libaugeas.so.0.11.0      re_node_set_contains
6031      1.4713  libaugeas.so.0.11.0      re_acquire_state
5113      1.2474  libaugeas.so.0.11.0      sift_states_backward
4680      1.1417  libaugeas.so.0.11.0      re_node_set_insert_last
4680      1.1417  libaugeas.so.0.11.0      re_search_internal
4587      1.1191  libaugeas.so.0.11.0      re_node_set_insert
4451      1.0859  libc-2.11.2.so           calloc
4315      1.0527  libc-2.11.2.so           free
4189      1.0220  libc-2.11.2.so           malloc

--- Additional comment from crobinso on 2010-06-25 10:08:25 EDT ---

I think every interface 'list' operation causes netcf to parse /etc/sysconfig/network-scripts/*, which would explain things.
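
As a rough sketch of what one interface poll boils down to at the libvirt level (the exact calls virt-manager makes may differ), each of the calls below goes through the netcf driver, which reparses the network scripts every time:

import libvirt

conn = libvirt.open("qemu:///system")

# Two netcf "list" operations (active + inactive interfaces)
names = conn.listInterfaces() + conn.listDefinedInterfaces()
for name in names:
    iface = conn.interfaceLookupByName(name)   # another netcf lookup per interface
    xml = iface.XMLDesc(0)                     # ifcfg -> interface XML conversion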

--- Additional comment from zamsden on 2010-06-25 17:07:26 EDT ---

I'm more than happy to run some profiling here as well.  On my particular system (F13 with an encrypted disk on a dual-core Turion64) it's enough to make the system unresponsive and barely usable with a 2-VCPU VM.

If the network configuration is implicated, especially the network scripts, mine are pretty complex: I'm running split DNS, wired/wireless bonding, and a vpnc tunnel. It's possible one of those things exaggerates the cost of this quite a bit.

--- Additional comment from lutter on 2010-06-28 20:30:48 EDT ---

Cole is right: every netcf operation causes the network scripts to be parsed again. There's not enough change tracking to guard against other programs modifying those files in between netcf operations.

There are a few ways to address that, varying in complexity and hackishness (a sketch of option (1) follows below):

(1) Instead of reparsing the files on every netcf operation, only reparse after some small amount of time
(2) Within netcf, watch pertinent files with inotify and reread them only upon actual changes (or complain if both netcf and some outside program try to make changes)
(3) Do the same within augeas

Of course, another option would be to reduce how often virt-manager does an ncf_list.
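
A minimal sketch of option (1) in Python, just to show the shape of the idea; the real change would live in netcf's C code, and parse_network_scripts() below is a hypothetical stand-in for the expensive augeas parse of /etc/sysconfig/network-scripts/*:

import time

REPARSE_INTERVAL = 5.0        # seconds between forced reparses (illustrative)

_last_parse = 0.0
_cached_tree = None

def get_ifcfg_tree(parse_network_scripts):
    # Return parsed ifcfg data, reparsing at most every REPARSE_INTERVAL seconds.
    global _last_parse, _cached_tree
    now = time.time()
    if _cached_tree is None or now - _last_parse >= REPARSE_INTERVAL:
        _cached_tree = parse_network_scripts()
        _last_parse = now
    return _cached_tree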

Comment 1 Cole Robinson 2010-06-29 16:53:35 UTC
We've decided to just work around this in virt-manager for GA by limiting how often virt-manager polls netcf.

Comment 2 Cole Robinson 2010-06-29 17:28:44 UTC
Created attachment 427744 [details]
Slow down interface polling

The patch polls interfaces only on every 10th virt-manager tick. State-changing operations like start, stop, define, and undefine force an interface refresh, so user interaction is still responsive.
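
The attachment itself isn't reproduced inline, but a sketch of the approach it describes might look like the following; aside from update_interfaces() (mentioned earlier as living in connection.py), the names are illustrative rather than actual virt-manager code:

INTERFACE_POLL_STRIDE = 10    # poll interfaces only on every 10th tick

class Connection(object):
    def __init__(self):
        self._tick_count = 0
        self._force_interface_refresh = False

    def tick(self):
        # Periodic poll driven by the main loop.
        self._tick_count += 1
        if (self._force_interface_refresh or
                self._tick_count % INTERFACE_POLL_STRIDE == 0):
            self._force_interface_refresh = False
            self.update_interfaces()

    def request_interface_refresh(self):
        # Called after start/stop/define/undefine so the UI stays responsive.
        self._force_interface_refresh = True

    def update_interfaces(self):
        pass  # the expensive netcf-backed interface poll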

Comment 3 Cole Robinson 2010-06-30 14:35:31 UTC
Hmm, it sounds like lutter has a proper fix cooking for augeas, so I think I'll hold off on applying this. If the augeas change doesn't go in, we can apply the virt-manager hack and revert it in the 6.1 cycle.

Comment 4 Cole Robinson 2010-07-02 14:44:45 UTC
Okay, lutter took care of this in augeas, so duping to that bug:

*** This bug has been marked as a duplicate of bug 609228 ***

