+++ This bug was initially created as a clone of Bug #590073 +++

Created an attachment (id=412398)
output of Valgrind when libvirtd was run under Valgrind

Description of problem:
The leak is pretty slow, but becomes problematic for systems that have been
running a long time. For example, on one of the customer's systems that had
been running for ~2 months with ~25 VMs, libvirtd was using ~2.3GiB of memory
(over half of the total memory allocated to the dom0).

Snippet from the output when libvirtd was run under Valgrind:

valgrind -v --leak-check=full --show-reachable=yes --log-file=libvirtd.memcheck /usr/sbin/libvirtd --daemon

==3876== 789,432 bytes in 8,938 blocks are definitely lost in loss record 417 of 417
==3876==    at 0x4A05F1D: realloc (vg_replace_malloc.c:476)
==3876==    by 0x384F0191AE: virReallocN (memory.c:160)
==3876==    by 0x384F06166A: xenUnifiedAddDomainInfo (xen_unified.c:1688)
==3876==    by 0x384F076A01: xenStoreDomainIntroduced (xs_internal.c:1373)
==3876==    by 0x384F0775FD: xenStoreWatchEvent (xs_internal.c:1303)
==3876==    by 0x40E3FE: virEventRunOnce (event.c:451)
==3876==    by 0x40F7DE: qemudRunLoop (qemud.c:2079)
==3876==    by 0x413DE8: main (qemud.c:2956)

Version-Release number of selected component (if applicable):
libvirt-0.6.3-20.1.el5_4

How reproducible:
Consistently

Steps to Reproduce:
1. Run libvirtd on a dom0 long enough (at least a week).
2. Make sure the dom0 has numerous guests.
3. Check the memory usage of libvirtd with: ps auxwww | grep libvirtd

Actual results:
libvirtd slowly leaks memory.

Expected results:
libvirtd should not leak memory.

Additional info:
1) There are quite a few fixes for memory leaks in the upstream code which
aren't included in RHEL 5. For example (upstream fix
7be5c26d746643b5ba889d62212a615050fed772):

 virDomainPtr xenXMDomainDefineXML(virConnectPtr conn, const char *xml) {
     virDomainPtr ret;
-    virDomainPtr olddomain;
<snip>
-    /* XXX wtf.com is this line for - it appears to be a memory leak */
-    if (!(olddomain = virGetDomain(conn, def->name, entry->def->uuid)))
-        goto error;

virGetDomain() allocates the domain, which is never used or freed here.

2) I've also attached the output of Valgrind when libvirtd was run under
Valgrind (libvirtd.memcheck).

--- Additional comment from laine on 2010-06-22 12:58:17 EDT ---

Here are Dan Berrange's comments on the two biggest offenders in the valgrind
output.

On 06/08/2010 04:53 AM, Daniel P. Berrange wrote:
> On Mon, Jun 07, 2010 at 11:02:21PM -0400, Laine Stump wrote:
>> Before I dig into this, do either of the following memory leaks look
>> familiar to any libvirt guys? (as subject says, they're from 0.6.3 in
>> RHEL 5.5)
>>
>> ==3876== 357,840 bytes in 8,946 blocks are definitely lost in loss
>> record 416 of 417
>> ==3876==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
>> ==3876==    by 0x346A2019D9: read_message (xs.c:768)
>> ==3876==    by 0x346A201B4B: read_thread (xs.c:824)
>> ==3876==    by 0x346760673C: start_thread (pthread_create.c:301)
>> ==3876==    by 0x3466ED3D1C: clone (in /lib64/libc-2.5.so)
>
> That's a XenStore bug and I'm not sure it's easily fixable. When you
> register for a watch notification with xenstore it spawns a background
> thread for that. When you close your xenstore handle it uses the pure
> evil pthread_cancel() to kill that thread. Memory cleanup? What's
> that? Would need to do something with cancellation handlers or rewrite
> the code to not use pthread_cancel().
>
> You'll leak one record for each libvirt Xen connection you open &
> close.
>
>> ==3876==
>> ==3876== 789,432 bytes in 8,938 blocks are definitely lost in loss
>> record 417 of 417
>> ==3876==    at 0x4A05F1D: realloc (vg_replace_malloc.c:476)
>> ==3876==    by 0x384F0191AE: virReallocN (memory.c:160)
>> ==3876==    by 0x384F06166A: xenUnifiedAddDomainInfo (xen_unified.c:1688)
>> ==3876==    by 0x384F076A01: xenStoreDomainIntroduced (xs_internal.c:1373)
>> ==3876==    by 0x384F0775FD: xenStoreWatchEvent (xs_internal.c:1303)
>> ==3876==    by 0x40E3FE: virEventRunOnce (event.c:451)
>> ==3876==    by 0x40F7DE: qemudRunLoop (qemud.c:2079)
>> ==3876==    by 0x413DE8: main (qemud.c:2956)
>
> I'm not sure what this is caused by offhand. I might say that it was a
> result of the virConnectPtr ref counting not being right, but if that
> were the case I'd expect to see valgrind report that 'virConnectPtr' was
> leaked too, but it doesn't. So it must be some other issue.
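To make the pthread_cancel() leak Dan describes concrete, here is a minimal
standalone demonstration. This is not libvirt or xenstore code; the file name
and function names are invented for illustration. A reader thread mallocs a
buffer and blocks at a cancellation point, the main thread cancels it (the way
xs_daemon_close() cancels the watch thread), and valgrind then reports the
buffer as definitely lost, matching the read_message()/read_thread() pattern
above:

/* Compile with: gcc -pthread cancel_leak.c -o cancel_leak
 * Run with:     valgrind --leak-check=full ./cancel_leak        */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static void *reader_thread(void *arg)
{
    (void)arg;
    for (;;) {
        char *buf = malloc(1024);  /* analogous to read_message()'s allocation */
        /* sleep() is a cancellation point, like the blocking read() in xs.c:
         * if the thread is cancelled while waiting here, buf is never freed. */
        sleep(10);
        free(buf);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, reader_thread, NULL);
    sleep(1);              /* let the thread allocate its buffer and block */
    pthread_cancel(tid);   /* the shortcut xs_daemon_close() takes */
    pthread_join(tid, NULL);
    return 0;              /* valgrind reports the 1,024 bytes as definitely lost */
}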
Upstream c/s 21356.
(In reply to comment #3)
> Upstream c/s 21356.

I guess you overlooked something, since I'm seeing this as c/s 21353 [1]. But
this is OK ;)

Michal

[1] http://xenbits.xensource.com/staging/xen-unstable.hg?rev/2dd3141b3e3e
*** Bug 507545 has been marked as a duplicate of this bug. ***
Created attachment 426553 [details]
Patch to fix xenstore leaks

This is the patch to fix the xenstore leaks. Basically it's a backport of
upstream c/s 21353. Upstream uses a different codebase that can be built both
with and without pthreads, but since we always use pthreads, the upstream
macro definitions were replaced by direct use of the pthread_cleanup_push()
and pthread_cleanup_pop() macros.

Upstream relationship: xen-unstable c/s 21353

Testing: since the leak is slow, I used valgrind to find the memory leaks. The
command I used is

valgrind -v --leak-check=full --show-reachable=yes --log-file=file /usr/sbin/libvirtd

so that all the libvirt daemon messages are still visible. I then set up
libvirt to accept remote connections (although only from localhost) so that
the libvirt daemon would not be bypassed (`virsh list` worked even when
libvirtd was not running, most likely because it accesses xenstore directly,
but I needed virsh to actually connect to the libvirt daemon), and I issued
many `virsh -c xen+ssh://localhost list` commands while libvirtd was running
under valgrind. After the interrupt signal (Ctrl+C) valgrind dumped the data
to the file; this is the snippet from the valgrind output without the patch
applied:

==5045== 1,760 bytes in 44 blocks are definitely lost in loss record 317 of 328
==5045==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==5045==    by 0x4C3D147: read_message (xs.c:768)
==5045==    by 0x4C3D38D: read_thread (xs.c:824)
==5045==    by 0x3D3B80673C: start_thread (pthread_create.c:301)
==5045==    by 0x3D3ACD3D1C: clone (in /lib64/libc-2.5.so)

With the patch applied there is no read_message occurrence and no such block
could be found, which means that xenstore is no longer leaking.

Michal
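For readers who don't have c/s 21353 in front of them, the following sketch
shows the general shape of the cleanup the backport relies on. It is a
simplified approximation, not the actual xs.c change; struct pending_msg,
free_pending_msg() and read_one_message() are invented names standing in for
the real message structure and read_message():

#include <pthread.h>
#include <stdlib.h>

/* Invented stand-in for a partially-read xenstore message. */
struct pending_msg {
    char *body;
};

/* Cancellation cleanup handler: free the message (and its body) that
 * was being assembled when pthread_cancel() hit the reader thread. */
static void free_pending_msg(void *arg)
{
    struct pending_msg *msg = *(struct pending_msg **)arg;
    if (msg) {
        free(msg->body);
        free(msg);
    }
}

/* Called from the reader thread's loop, like read_message() is. */
static int read_one_message(struct pending_msg **out)
{
    struct pending_msg *msg = calloc(1, sizeof(*msg));
    int ret = -1;

    if (!msg)
        return -1;

    /* Register the handler before reaching any cancellation point. */
    pthread_cleanup_push(free_pending_msg, &msg);

    msg->body = malloc(1024);
    if (msg->body) {
        /* ... the blocking reads of the message header and body go here;
         * read() is a cancellation point, so without the handler a cancel
         * delivered during the read would leak both allocations, which is
         * exactly the "definitely lost" block valgrind was showing ... */
        ret = 0;
    }

    /* Leave the scope without running the handler on the normal path;
     * push and pop must pair up in the same lexical block. */
    pthread_cleanup_pop(0);

    if (ret != 0) {
        free(msg);
        return -1;
    }
    *out = msg;
    return 0;
}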
QA verified this bug on xen-3.0.3-115.el5:

1. Start libvirtd under valgrind with the following command:
# valgrind -v --leak-check=full --show-reachable=yes --log-file=logfile /usr/sbin/libvirtd

2. Connect to libvirtd remotely via virsh for 10 minutes with the following command:
# count=0; while [ $count -lt 200 ]; do virsh -c xen+ssh://localhost list; sleep 3; let count=$count+1; done

For the xen package without this patch (xen-3.0.3-113.el5), we could find the
memory leak in the logfile:

==25406== 8,000 bytes in 200 blocks are definitely lost in loss record 291 of 294
==25406==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==25406==    by 0x4C299D9: read_message (xs.c:768)
==25406==    by 0x4C29B4B: read_thread (xs.c:824)
==25406==    by 0x3A61C0673C: start_thread (in /lib64/libpthread-2.5.so)
==25406==    by 0x3A610D3D1C: clone (in /lib64/libc-2.5.so)

For xen-3.0.3-115.el5, no such memory leak is found, so changing this bug to
VERIFIED.
An advisory has been issued which should help the problem described in this
bug report. This report is therefore being closed with a resolution of
ERRATA. For more information on the solution and/or where to find the updated
files, please follow the link below. You may reopen this bug report if the
solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html