600141 – network interface functions cause very high CPU utilization in libvirtd

Summary: network interface functions cause very high CPU utilization in libvirtd

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	augeas
Sub Component:
Version:	13
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	David Lutterkort
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	609228

Reported:	2010-06-04 02:15 UTC by Zachary Amsden
Modified:	2013-04-30 23:41 UTC
CC List:	13 users
Fixed In Version:	augeas-0.7.2-3.fc13
Clone Of:
Clones:	609214 609228
Environment:
Last Closed:	2010-07-23 02:30:07 UTC
Type:	---
Embargoed:
Dependent Products:

Description Zachary Amsden 2010-06-04 02:15:32 UTC

Description of problem:

top - 16:07:01 up 11 days,  4:56,  6 users,  load average: 0.14, 0.11, 0.05
Tasks: 201 total,   1 running, 200 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.7%us,  2.9%sy,  0.0%ni, 83.1%id,  2.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1982808k total,  1942568k used,    40240k free,    98920k buffers
Swap:  3964920k total,    69992k used,  3894928k free,   621348k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
20194 root      20   0  608m  41m 3512 S 13.3  2.1   0:06.98 libvirtd 

Version-Release number of selected component (if applicable):

I've seen this with all versions of libvirt, on RHEL 5 and 6 and on Fedora, across different machines and CPUs.  Filing upstream because that is where it should be fixed.

How reproducible: 100%

Steps to Reproduce:
1. Start virt-manager
2. run top
3. ... profit?
  
Actual results:

Libvirtd sits around consuming 13-20% of CPU even when completely idle and no VMs are running.  Modern software shouldn't be doing whatever kind of continuous polling it is using to rack up such CPU time.  If I leave the system alone, overnight, libvirtd will rack up more CPU than X, firefox, and all other idle processes combined.

Expected results:


Additional info:

Comment 1 Cole Robinson 2010-06-09 15:24:27 UTC

It's because virt-manager is running. virt-manager polls libvirt to detect VM/network/storage state changes, and this polling causes lots of CPU churn. Libvirt supports async lifecycle events for domains, but virt-manager doesn't use those APIs yet. The other objects (network, storage, interface, host devices) don't have an async API.

So the solution is:

1) Add async lifecycle APIs for all libvirt objects
2) Have virt-manager actually use those APIs
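
To illustrate the difference between the two models, here is a minimal, self-contained sketch. It does not use libvirt or virt-manager code; `InterfaceRegistry`, `poll`, and `EventClient` are hypothetical names made up for this example. The polling client asks for a full listing on every tick, while the event-driven client lists once and then relies on change notifications.

```python
# Hypothetical sketch: none of these names are libvirt or virt-manager APIs.


class InterfaceRegistry:
    """Stands in for a daemon that owns a set of named objects."""

    def __init__(self):
        self._names = set()
        self._callbacks = []
        self.list_calls = 0  # how many full listings clients have requested

    def list_names(self):
        # In the polling model, every call is a full (possibly expensive) listing.
        self.list_calls += 1
        return sorted(self._names)

    def register_callback(self, callback):
        self._callbacks.append(callback)

    def add(self, name):
        self._names.add(name)
        # Notify every registered client about the change.
        for callback in self._callbacks:
            callback("added", name)


def poll(registry, ticks):
    """Polling client: requests the full list once per tick."""
    seen = []
    for _ in range(ticks):
        seen = registry.list_names()
    return seen


class EventClient:
    """Event-driven client: lists once, then applies change notifications."""

    def __init__(self, registry):
        self.names = set(registry.list_names())
        registry.register_callback(self._on_event)

    def _on_event(self, kind, name):
        if kind == "added":
            self.names.add(name)
```

With ten polling ticks the registry serves ten full listings even if nothing changed; the event-driven client triggers exactly one listing regardless of how long it runs.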

Comment 2 Daniel Berrangé 2010-06-25 11:23:05 UTC

libvirtd shows about 8% CPU usage when virt-manager is running on my system. Evidence is better than guesswork, so I ran oprofile against libvirtd.

Augeas comes out on top by a country mile, followed by loads of malloc/memcpy stuff which I bet is all due to augeas too:

samples  %        image name               symbol name
31677    41.0255  libaugeas.so.0.10.2      /usr/lib64/libaugeas.so.0.10.2
14469    18.7391  libc-2.11.2.so           wcscoll_l
5401      6.9949  libc-2.11.2.so           wcscmp
4913      6.3629  libc-2.11.2.so           _int_free
4020      5.2064  libc-2.11.2.so           _int_malloc
3780      4.8955  libc-2.11.2.so           memcpy
2486      3.2197  libc-2.11.2.so           malloc
1676      2.1706  libc-2.11.2.so           malloc_consolidate
1446      1.8727  libc-2.11.2.so           memset
1359      1.7601  libc-2.11.2.so           wcscoll
1299      1.6824  libc-2.11.2.so           realloc
1097      1.4207  libc-2.11.2.so           calloc
714       0.9247  libc-2.11.2.so           free
517       0.6696  libc-2.11.2.so           _int_realloc
278       0.3600  libc-2.11.2.so           __GI___strncmp_ssse3
253       0.3277  libc-2.11.2.so           __strstr_sse2
174       0.2254  libc-2.11.2.so           __strlen_sse2
116       0.1502  libc-2.11.2.so           vfprintf
111       0.1438  libc-2.11.2.so           strnlen
106       0.1373  libpthread-2.11.2.so     pthread_mutex_lock
95        0.1230  libvirtd                 virLogMessage
73        0.0945  libpthread-2.11.2.so     pthread_mutex_unlock
66        0.0855  libc-2.11.2.so           __strchr_sse2
64        0.0829  libc-2.11.2.so           __GI___strcmp_ssse3
61        0.0790  libc-2.11.2.so           __ctype_b_loc
45        0.0583  libc-2.11.2.so           btowc
39        0.0505  libc-2.11.2.so           _IO_default_xsputn
35        0.0453  libc-2.11.2.so           strcat
30        0.0389  libc-2.11.2.so           strndup
30        0.0389  libvirtd                 virEventRunOnce


After editing the virt-manager code to disable the 'update_interfaces' method in connection.py and re-running the oprofile test, libvirtd now consumes < 1% CPU and oprofile shows a completely different trace:


samples  %        image name               symbol name
1519     42.7165  libc-2.11.2.so           memset
346       9.7300  libc-2.11.2.so           __strstr_sse2
214       6.0180  libpthread-2.11.2.so     pthread_mutex_lock
136       3.8245  libc-2.11.2.so           __strchr_sse2
130       3.6558  libvirtd                 virLogMessage
117       3.2902  libpthread-2.11.2.so     pthread_mutex_unlock
77        2.1654  libc-2.11.2.so           vfprintf
57        1.6029  libc-2.11.2.so           _int_malloc
53        1.4904  libvirtd                 virEventRunOnce
45        1.2655  libc-2.11.2.so           calloc
40        1.1249  libc-2.11.2.so           _int_free
25        0.7030  libc-2.11.2.so           malloc_consolidate
21        0.5906  libc-2.11.2.so           _IO_default_xsputn
20        0.5624  libc-2.11.2.so           __strlen_sse2
20        0.5624  libc-2.11.2.so           poll
20        0.5624  libvirtd                 nodeListDevices
19        0.5343  libpthread-2.11.2.so     pthread_cond_wait@@GLIBC_2.3.2
19        0.5343  libvirt.so.0.8.1         virHashComputeKey
19        0.5343  libvirtd                 qemudDispatchClientEvent
18        0.5062  libvirtd                 remoteDispatchClientCall


So the problem appears to be that augeas (and netcf by implication) has exceedingly high CPU utilization :-(  Need advice from David on how to proceed here. Maybe augeas simply hasn't had any performance optimization work done on it yet? Otherwise, maybe we need to change netcf to not use augeas+xslt for converting the ifcfg files into XML format.
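
As a rough way to check the "who is on top" reading of an oprofile report, the per-symbol percentages can be summed per image. The sketch below is only an illustration: `share_by_image` and `SAMPLE` are hypothetical names, and `SAMPLE` contains just the first four rows of the first report in this comment. It assumes the column layout shown in this comment (samples, percent, image name, symbol name).

```python
from collections import defaultdict

# First four rows of the first oprofile report in this comment.
SAMPLE = """\
samples  %        image name               symbol name
31677    41.0255  libaugeas.so.0.10.2      /usr/lib64/libaugeas.so.0.10.2
14469    18.7391  libc-2.11.2.so           wcscoll_l
5401      6.9949  libc-2.11.2.so           wcscmp
4913      6.3629  libc-2.11.2.so           _int_free
"""


def share_by_image(report):
    """Sum the percentage column per image name, skipping the header line."""
    totals = defaultdict(float)
    for line in report.splitlines():
        fields = line.split()
        # Data rows start with an integer sample count; the header does not.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        totals[fields[2]] += float(fields[1])
    return dict(totals)


if __name__ == "__main__":
    for image, percent in sorted(share_by_image(SAMPLE).items()):
        print(f"{image:24s} {percent:8.4f}")
```

Note that a per-image sum only attributes time to the library where the samples landed; it cannot by itself show that the libc time (wcscoll_l, malloc, memcpy) was spent on behalf of augeas.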

Comment 3 Daniel Berrangé 2010-06-25 11:59:18 UTC

A more targeted profile this time, from just running ncftool directly:

samples  %        image name               symbol name
85058    20.7509  libc-2.11.2.so           wcscoll_l
29644     7.2320  libc-2.11.2.so           wcscmp
26191     6.3896  libc-2.11.2.so           _int_malloc
25622     6.2508  libaugeas.so.0.11.0      parse_expression
18443     4.4994  libxml2.so.2.7.6         /usr/lib64/libxml2.so.2.7.6
16999     4.1471  libc-2.11.2.so           memcpy
13354     3.2579  libc-2.11.2.so           _int_free
9402      2.2937  libaugeas.so.0.11.0      set_regs
9019      2.2003  libc-2.11.2.so           wcscoll
8788      2.1439  libc-2.11.2.so           malloc_consolidate
8457      2.0632  libaugeas.so.0.11.0      re_node_set_add_intersect
8062      1.9668  libfa.so.1.3.1           re_as_string
7832      1.9107  libfa.so.1.3.1           cset_contains
7343      1.7914  libaugeas.so.0.11.0      re_node_set_contains
6031      1.4713  libaugeas.so.0.11.0      re_acquire_state
5113      1.2474  libaugeas.so.0.11.0      sift_states_backward
4680      1.1417  libaugeas.so.0.11.0      re_node_set_insert_last
4680      1.1417  libaugeas.so.0.11.0      re_search_internal
4587      1.1191  libaugeas.so.0.11.0      re_node_set_insert
4451      1.0859  libc-2.11.2.so           calloc
4315      1.0527  libc-2.11.2.so           free
4189      1.0220  libc-2.11.2.so           malloc

Comment 4 Cole Robinson 2010-06-25 14:08:25 UTC

I think every interface 'list' operation causes netcf to parse /etc/sysconfig/network-scripts/*, which would explain things.

Comment 5 Zachary Amsden 2010-06-25 21:07:26 UTC

I'm more than happy to run some profiling here as well.  On my particular system (F13 with encrypted disk on dual core Turion64) it's enough to make the system unresponsive and barely usable with a 2-VCPU VM.

If network is implicated, especially network scripts, well, mine are pretty complex: I'm running split DNS, wired/wireless bonding, and a vpnc tunnel.  It's possible one of those things exaggerates the cost of this quite a bit.

Comment 6 David Lutterkort 2010-06-29 00:30:48 UTC

Cole is right: every netcf operation causes the network scripts to be parsed again. There's not enough change tracking to guard against other programs modifying those files in between netcf operations.

There are a few ways to address that, varying in complexity and hackishness:

(1) Instead of reparsing the files on every netcf operation, only reparse after some small amount of time
(2) Within netcf, watch pertinent files with inotify and reread them only upon actual changes (or complain if both netcf and some outside program try to make changes)
(3) Do the same within augeas

Of course, another option would be to reduce how often virt-manager does an ncf_list.
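
Option (1) can be sketched as a small time-based cache. This is not netcf code; `TimedCache`, `loader`, and `max_age` are hypothetical names, and the injectable `clock` exists only so the behaviour can be tested without sleeping.

```python
import time


class TimedCache:
    """Re-run an expensive loader only when the cached result is older than max_age seconds."""

    def __init__(self, loader, max_age, clock=time.monotonic):
        self._loader = loader      # hypothetical stand-in for "parse all config files"
        self._max_age = max_age
        self._clock = clock
        self._value = None
        self._loaded_at = None     # None means nothing has been loaded yet

    def get(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._max_age
        if expired:
            self._value = self._loader()
            self._loaded_at = now
        return self._value
```

The trade-off is the one hinted at above: a change made by another program inside the `max_age` window goes unnoticed until the cached result expires, which is why watching the files (options 2 and 3) is more robust, if more complex.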

Comment 7 David Lutterkort 2010-06-30 01:36:31 UTC

This can be alleviated by teaching augeas to be smarter about when to reparse a file and when not.
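
One common way to decide whether a file needs reparsing is to remember its modification time and size and compare them before each load. The sketch below shows that idea only; it is an assumption about the general technique, not a description of what the augeas patches actually do. `FileCache` and `parse_shell_vars` are hypothetical names, and the toy parser only loosely resembles ifcfg-style KEY=value files.

```python
import os


class FileCache:
    """Cache parsed file contents by path; reparse only when stat data changes."""

    def __init__(self, parse):
        self._parse = parse
        self._entries = {}     # path -> (signature, parsed value)
        self.parse_count = 0   # number of real parses, for illustration

    @staticmethod
    def _signature(path):
        st = os.stat(path)
        # Modification time plus size; a change in either triggers a reparse.
        return (st.st_mtime_ns, st.st_size)

    def load(self, path):
        signature = self._signature(path)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == signature:
            return cached[1]   # unchanged since last parse: skip the work
        with open(path, encoding="utf-8") as handle:
            value = self._parse(handle.read())
        self.parse_count += 1
        self._entries[path] = (signature, value)
        return value


def parse_shell_vars(text):
    """Toy parser for KEY=value lines; comments and blank lines are ignored."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            result[key.strip()] = value.strip()
    return result
```

A stat call per file is far cheaper than a full parse, so repeated list operations over unchanged files cost very little. A check based on modification time can miss an edit that keeps the size identical and lands within the filesystem's timestamp granularity.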

I have posted patches upstream on augeas-devel (review much welcome) and built augeas-0.7.2-2 based on these. You need to get the packages out of koji for now from http://koji.fedoraproject.org/koji/packageinfo?packageID=6131

It would be great if somebody could independently verify that these patches address the virt-manager issues (and do not introduce any regressions in libvirt's network config handling).

Comment 8 Laine Stump 2010-06-30 09:04:46 UTC

On my F13 system with virt-manager running, prior to this fix CPU usage by libvirtd was over 11%. With the new augeas installed, that drops to 2.3%, so there is definitely a huge difference!

Functional testing will take a bit more time ;-)

Comment 9 Fedora Update System 2010-06-30 22:26:47 UTC

augeas-0.7.2-2.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/augeas-0.7.2-2.fc13

Comment 10 Fedora Update System 2010-07-01 18:46:45 UTC

augeas-0.7.2-2.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update augeas'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/augeas-0.7.2-2.fc13

Comment 11 Fedora Update System 2010-07-02 08:58:17 UTC

augeas-0.7.2-3.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/augeas-0.7.2-3.fc13

Comment 12 Fedora Update System 2010-07-05 22:06:36 UTC

augeas-0.7.2-3.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update augeas'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/augeas-0.7.2-3.fc13

Comment 13 Aaron Faanes 2010-07-11 03:18:28 UTC

Using augeas-0.7.2-3.fc13 showed no excessive CPU load for me.

Comment 14 Fedora Update System 2010-07-23 02:29:52 UTC

augeas-0.7.2-3.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.
