Bug 606919

Summary: Memory leak in xenstore
Product: Red Hat Enterprise Linux 5 Reporter: Paolo Bonzini <pbonzini>
Component: xen    Assignee: Michal Novotny <minovotn>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 5.4    CC: alexander, areis, chris.lober, hbrock, herrold, iaslanidis, james.brown, jdenemar, juzhang, jwest, llim, minovotn, mjenner, mrezanin, nachandr, plyons, samuel.kielek, slords, tao, virt-maint, xen-maint, yuzhang
Target Milestone: rc    Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: xen-3.0.3-114.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 590073 Environment:
Last Closed: 2011-01-13 22:22:19 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 590073    
Bug Blocks: 514499, 593339, 616091    
Attachments:
Patch to fix xenstore leaks (Flags: none)

Description Paolo Bonzini 2010-06-22 18:10:36 UTC
+++ This bug was initially created as a clone of Bug #590073 +++

Created an attachment (id=412398)
output of Valgrind when libvirtd was run under Valgrind

Description of problem:
The leak is pretty slow, but becomes problematic for systems that have been running a long time. For example, on one of the customer's systems that had been running for ~2 months with ~25 VMs, libvirtd was using ~2.3 GiB of memory (over half of the total memory allocated to the dom0).

Snippet from the output of Valgrind when libvirtd was run under Valgrind:
valgrind -v --leak-check=full --show-reachable=yes --log-file=libvirtd.memcheck /usr/sbin/libvirtd --daemon
==3876== 789,432 bytes in 8,938 blocks are definitely lost in loss record 417 of 417
==3876==    at 0x4A05F1D: realloc (vg_replace_malloc.c:476)
==3876==    by 0x384F0191AE: virReallocN (memory.c:160)
==3876==    by 0x384F06166A: xenUnifiedAddDomainInfo  (xen_unified.c:1688)
==3876==    by 0x384F076A01: xenStoreDomainIntroduced (xs_internal.c:1373)
==3876==    by 0x384F0775FD: xenStoreWatchEvent       (xs_internal.c:1303)
==3876==    by 0x40E3FE: virEventRunOnce (event.c:451)
==3876==    by 0x40F7DE: qemudRunLoop (qemud.c:2079)
==3876==    by 0x413DE8: main (qemud.c:2956)

Version-Release number of selected component (if applicable):
libvirt-0.6.3-20.1.el5_4


How reproducible:
Consistently


Steps to Reproduce:
1. Run libvirtd on a dom0 long enough (at least a week).
2. Make sure dom0 has numerous guests.
3. Verify the memory usage of libvirtd using:
ps auxwww|grep libvirtd
  
Actual results:
libvirtd slowly leaks memory

Expected results:
libvirtd should not leak memory

Additional info:
1) There are quite a few fixes for memory leaks in the upstream code which aren't included in RHEL 5.
For example:
(upstream fix: 7be5c26d746643b5ba889d62212a615050fed772)
virDomainPtr xenXMDomainDefineXML(virConnectPtr conn, const char *xml) {
    virDomainPtr ret;
-    virDomainPtr olddomain;
<snip>
-        /* XXX wtf.com is this line for - it appears to be a memory leak */
-        if (!(olddomain = virGetDomain(conn, def->name, entry->def->uuid)))
-            goto error;

virGetDomain() allocates the domain object, and the removed code never freed olddomain, which is where the leak came from.

2) I've also attached the full output from running libvirtd under Valgrind (libvirtd.memcheck).


--- Additional comment from laine on 2010-06-22 12:58:17 EDT ---

Here are Dan Berrange's comments on the two biggest offenders in the valgrind output.

On 06/08/2010 04:53 AM, Daniel P. Berrange wrote:

> On Mon, Jun 07, 2010 at 11:02:21PM -0400, Laine Stump wrote:
>> Before I dig into this, do either of the following memory leaks look 
>> familiar to any libvirt guys? (as subject says, they're from 0.6.3 in 
>> RHEL 5.5)
>>
>> ==3876== 357,840 bytes in 8,946 blocks are definitely lost in loss 
>> record 416 of 417
>> ==3876==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
>> ==3876==    by 0x346A2019D9: read_message (xs.c:768)
>> ==3876==    by 0x346A201B4B: read_thread (xs.c:824)
>> ==3876==    by 0x346760673C: start_thread (pthread_create.c:301)
>> ==3876==    by 0x3466ED3D1C: clone (in /lib64/libc-2.5.so)
>
> That's a XenStore bug and I'm not sure it's easily fixable. When you
> register for a watch notification with xenstore it spawns a background
> thread for that. When you close your xenstore handle it uses the pure
> evil pthread_cancel() to kill that thread. Memory cleanup? What's
> that? We would need to do something with cancellation handlers or rewrite
> the code to not use pthread_cancel().
>
> You'll leak one record for each libvirt Xen connection you open &
> close
>
>> ==3876==
>> ==3876== 789,432 bytes in 8,938 blocks are definitely lost in loss 
>> record 417 of 417
>> ==3876==    at 0x4A05F1D: realloc (vg_replace_malloc.c:476)
>> ==3876==    by 0x384F0191AE: virReallocN (memory.c:160)
>> ==3876==    by 0x384F06166A: xenUnifiedAddDomainInfo (xen_unified.c:1688)
>> ==3876==    by 0x384F076A01: xenStoreDomainIntroduced (xs_internal.c:1373)
>> ==3876==    by 0x384F0775FD: xenStoreWatchEvent (xs_internal.c:1303)
>> ==3876==    by 0x40E3FE: virEventRunOnce (event.c:451)
>> ==3876==    by 0x40F7DE: qemudRunLoop (qemud.c:2079)
>> ==3876==    by 0x413DE8: main (qemud.c:2956)
>
> I'm not sure what this is caused by offhand. I might say that it was a
> result of the virConnectPtr ref counting not being right, but if that
> were the case I'd expect to see valgrind report that 'virConnectPtr' was
> leaked too, but it doesn't. So it must be some other issue.
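
A minimal, self-contained sketch of the first pattern described above (illustrative only; this is not the actual xs.c code): the reader thread owns a malloc()ed message when it is killed with pthread_cancel(), and nothing frees it.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void *reader_thread(void *arg)
{
    (void)arg;
    for (;;) {
        char *msg = malloc(1024);   /* one "message" per iteration */
        if (!msg)
            return NULL;
        strcpy(msg, "pretend this came from the xenstore socket");

        /* sleep() is a cancellation point: if another thread calls
         * pthread_cancel() while we are blocked here, this thread dies
         * with msg still allocated -- one leaked block per cancel. */
        sleep(1);

        free(msg);                  /* never reached after a cancel */
    }
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, reader_thread, NULL);
    sleep(2);

    /* The asynchronous kill: nothing frees whatever the thread was
     * holding at its cancellation point. */
    pthread_cancel(tid);
    pthread_join(tid, NULL);
    return 0;
}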

Comment 3 Paolo Bonzini 2010-06-23 16:10:05 UTC
Upstream c/s 21356.

Comment 4 Michal Novotny 2010-06-23 16:20:25 UTC
(In reply to comment #3)
> Upstream c/s 21356.    

I guess you overlooked something since I'm seeing this as c/s 21353 [1].
But this is OK ;)

Michal

[1] http://xenbits.xensource.com/staging/xen-unstable.hg?rev/2dd3141b3e3e

Comment 5 Paolo Bonzini 2010-06-23 17:17:26 UTC
*** Bug 507545 has been marked as a duplicate of this bug. ***

Comment 6 Michal Novotny 2010-06-24 12:46:00 UTC
Created attachment 426553 [details]
Patch to fix xenstore leaks

This is the patch to fix the xenstore leaks. Basically it's a backport of upstream c/s 21353. The upstream code base can be built either with or without pthreads, so it hides the cleanup handlers behind macro definitions; since we always use pthreads, those macro definitions were replaced with direct calls to the pthread_cleanup_push() and pthread_cleanup_pop() macros.
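
For illustration only (this is a sketch, not the backported patch itself), the mechanism the fix relies on looks roughly like this: pthread_cleanup_push() registers a handler that frees the message buffer if the thread is cancelled, and pthread_cleanup_pop(1) frees it on the normal path as well.

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static void free_msg(void *arg)
{
    free(arg);
}

static void *reader_thread(void *arg)
{
    (void)arg;
    for (;;) {
        char *msg = malloc(1024);
        if (!msg)
            return NULL;

        /* From here until pthread_cleanup_pop(), a cancellation runs
         * free_msg(msg) automatically, so the block is not leaked. */
        pthread_cleanup_push(free_msg, msg);

        /* stand-in for blocking on the xenstore socket; sleep() is a
         * cancellation point, and being cancelled here is now safe */
        sleep(1);

        pthread_cleanup_pop(1);     /* 1: also free msg on the normal path */
    }
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, reader_thread, NULL);
    sleep(2);
    pthread_cancel(tid);            /* the cleanup handler frees the buffer */
    pthread_join(tid, NULL);
    return 0;
}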

Upstream relationship: xen-unstable c/s 21353

Testing: since the leak is slow, I used valgrind to find it. The command line was `valgrind -v --leak-check=full --show-reachable=yes --log-file=file /usr/sbin/libvirtd`, so that all libvirt daemon messages were captured. I then configured libvirt to accept remote connections (although only from localhost) so that virsh would not bypass the daemon: `virsh list` works even when the libvirt daemon is not running, most likely because it accesses xenstore directly, but I needed to connect to the libvirt daemon itself. While libvirtd was running under valgrind I issued many `virsh -c xen+ssh://localhost list` commands. After an interrupt signal (Ctrl+C) valgrind dumped the data to the file; this is a snippet from the run without the patch applied:

==5045== 1,760 bytes in 44 blocks are definitely lost in loss record 317 of 328
==5045==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==5045==    by 0x4C3D147: read_message (xs.c:768)
==5045==    by 0x4C3D38D: read_thread (xs.c:824)
==5045==    by 0x3D3B80673C: start_thread (pthread_create.c:301)
==5045==    by 0x3D3ACD3D1C: clone (in /lib64/libc-2.5.so)

With this patch applied there is no read_message occurrence and no such block can be found, which means xenstore is no longer leaking.

Michal

Comment 12 Yufang Zhang 2010-08-31 11:36:36 UTC
QA verified this bug on xen-3.0.3-115.el5:

1. Start libvirtd under valgrind with the following command:
#valgrind -v --leak-check=full --show-reachable=yes --log-file=logfile /usr/sbin/libvirtd

2. Remotely connect to libvirtd via virsh for 10 minutes with the following command:
#count=0; while [ $count -lt 200 ]; do virsh -c xen+ssh://localhost list; sleep 3; let count=$count+1; done;


For a xen package without this patch (xen-3.0.3-113.el5), the memory leak shows up in the logfile:

==25406== 8,000 bytes in 200 blocks are definitely lost in loss record 291 of 294
==25406==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==25406==    by 0x4C299D9: read_message (xs.c:768)
==25406==    by 0x4C29B4B: read_thread (xs.c:824)
==25406==    by 0x3A61C0673C: start_thread (in /lib64/libpthread-2.5.so)
==25406==    by 0x3A610D3D1C: clone (in /lib64/libc-2.5.so)


For xen-3.0.3-115.el5, no such memory leak is found. 

So, changing this bug to VERIFIED.

Comment 14 errata-xmlrpc 2011-01-13 22:22:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html