Bug 656795

Summary: Start and shutdown domain lead to memory leak
Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Reporter: xhu
Assignee: Eric Blake <eblake>
QA Contact: Virtualization Bugs <virt-bugs>
CC: ajia, aliguori, berrange, bsarathy, bugproxy, ccui, dallan, dyuan, eblake, fnovak, gren, llim, mzhan, plyons, rkhadgar, tao, vbian, xen-maint, yoyzhang
Target Milestone: rc
Keywords: ZStream
Fixed In Version: libvirt-0.8.7-1.el6
Doc Type: Bug Fix
Doc Text: Memory buffer was not freed properly on domain startup and shutdown, which led to a memory leak that increased each time the domain was started or shut down. This update removes this memory leak.
Last Closed: 2011-05-19 13:24:25 UTC
Bug Depends On: 620345, 658571, 658657, 682240
Bug Blocks: 672549, 679164, 682249
Attachments:
- leak memory check script
- valgrind log for libvirtd
- Patch to fix memory leak
- libvirtd_memory_check.sh.log for libvirt-0.8.7-1.el6
- leak test log for libvirt-0.8.7-6.el6

Description xhu 2010-11-24 06:15:27 UTC
Created attachment 462542 [details]
leak memory check script

Description of problem:
Starting and shutting down a domain leads to a memory leak.

Version-Release number of selected component (if applicable):
kernel-2.6.32-71.el6.x86_64
libvirt-0.8.1-27.el6.x86_64
qemu-kvm-0.12.1.2-2.113.el6.x86_64

How reproducible:
every time

Steps to Reproduce:
1. install a domain named "kvm1"
2. disable selinux:
# setenforce 0
3. run "libvirtd_memory_check.sh" attachment script
  
Actual results:
While the script from step 3 is running, the "libvirtd_memory_check.sh.log" attachment (a valgrind log) shows a memory leak, and the amount of leaked memory increases after every start/shutdown cycle.

Expected results:
No memory leak caused by starting and shutting down a domain.

Additional info:
The memory leak statistics from the "libvirtd_memory_check.sh.log" attachment (a valgrind log) are as follows:

==10198== LEAK SUMMARY:
==10198==    definitely lost: 124,172 bytes in 148 blocks
==10198==    indirectly lost: 3,527,409 bytes in 30,656 blocks
==10198==      possibly lost: 26,229 bytes in 138 blocks
==10198==    still reachable: 2,373,241 bytes in 17,808 blocks
==10198==         suppressed: 0 bytes in 0 blocks
==10198== Rerun with --leak-check=full to see details of leaked memory

Comment 1 xhu 2010-11-24 06:18:45 UTC
Created attachment 462543 [details]
valgrind log for libvirtd

Comment 2 Eric Blake 2010-11-24 18:11:36 UTC
Upstream patch posted for the worst offender (at least 1024 bytes on every qemu monitor connection, which is one per start/stop sequence):
https://www.redhat.com/archives/libvir-list/2010-November/msg01100.html

There appear to be other leaks (148 blocks of 1024 bytes each would total 151,552 bytes, more than the 124,172 bytes reported as definitely lost, so not all of those blocks are this large), but they are smaller and might not be as frequent; it will take more analysis to decide whether anything else is worth plugging.

Comment 3 Frank Novak 2010-11-25 14:05:16 UTC
Mike Strosaker on our team had a simple cron job run every 15 minutes to capture libvirtd memory usage. As far as we can tell, the same four VMs have been running on the system for the entirety of monitoring, so there has been no provisioning activity.


DATE                                 %MEM    RSS
2010-11-23-16:45:01                  10.4    6731.32
2010-11-23-17:00:01                  11.3    7341.36
2010-11-23-17:15:01                  12.3    7948.18
2010-11-23-17:30:01                  13.2    8555.17
2010-11-23-17:45:01                  14.2    9168.53
2010-11-23-18:00:01                  15.2    9799.07
2010-11-23-18:15:01                  16.2    10436.68
2010-11-23-18:30:01                  17.1    11057.58
2010-11-23-18:45:01                  18.1    11666.37
2010-11-23-19:00:01                  19.1    12307.74
2010-11-23-19:15:01                  20.0    12945.53
2010-11-23-19:30:01                  21.0    13572.84
2010-11-23-19:45:01                  22.0    14213.44
2010-11-23-20:00:01                  23.0    14863.93
2010-11-23-20:15:01                  24.0    15502.12
2010-11-23-20:30:01                  25.0    16155.37
2010-11-23-20:45:01                  26.1    16817.77
2010-11-23-21:00:01                  27.1    17472.32
2010-11-23-21:15:01                  28.1    18116.05
2010-11-23-21:30:01                  29.1    18758.14
2010-11-23-21:45:01                  30.0    19380.17
2010-11-23-22:00:01                  31.0    20029.62
2010-11-23-22:15:01                  32.0    20654.29
2010-11-23-22:30:01                  33.0    21280.00
2010-11-23-22:45:01                  33.9    21890.36
2010-11-23-23:00:01                  34.9    22500.23
2010-11-23-23:15:01                  35.8    23103.05
2010-11-23-23:30:01                  36.8    23714.33

Additional data is being collected.

Comment 4 Anthony Liguori 2010-12-01 00:09:41 UTC
The nasty leak has something to do with disk information.  Based on a core extracted from a leaking libvirtd process, there's a repeating pattern of:

002ca150: 2f73 746f 7261 6765 2f70 726f 642f 6570  /storage/prod/ep
002ca160: 6865 6d65 7261 6c2f 2f76 686f 7374 3037  hemeral//vhost07
002ca170: 3239 2f76 686f 7374 3037 3239 2e69 6d67  29/vhost0729.img
002ca180: 0000 0058 767f 0000 2500 0000 0000 0000  ...Xv...%.......
002ca190: 6964 6530 2d30 2d30 0000 0058 767f 0000  ide0-0-0...Xv...
002ca1a0: 2000 0000 0000 0000 3500 0000 0000 0000   .......5.......
002ca1b0: 656f 7468 6572 0000 9800 0058 767f 0000  eother.....Xv...
002ca1c0: bf89 cacb 1b4d 2a7b 3000 0030 767f 0000  .....M*{0..0v...
002ca1d0: 3000 0000 0000 0000 4500 0000 0000 0000  0.......E.......

That is repeated over 2 million times, which clearly looks like a high-frequency leak. I'm still trying to find the right data structure that would contain this information.

Comment 5 Anthony Liguori 2010-12-01 00:12:55 UTC
There are three other guests running. vhost0728 has 200k hits in the core file, but the other two guests only have 3 hits.

Looks like the leak is specific to particular guests. It's possible we're running some sort of API call frequently, but only for certain guests.

Comment 6 Eric Blake 2010-12-01 00:27:31 UTC
I've identified further leaks in libnl and libselinux that impact libvirt, and I'm still in the process of tracking down root causes of other valgrind leak reports.  I'm definitely making progress on plugging leaks via upstream patches, and will be working on backporting them to RHEL as fast as I can.

Comment 7 Eric Blake 2010-12-01 03:45:15 UTC
*** Bug 620334 has been marked as a duplicate of this bug. ***

Comment 8 Daniel Berrangé 2010-12-01 10:37:08 UTC
(In reply to comment #4)
> The nasty leak has something to do with disk information.  Based on a core
> extracted from a leaking libvirtd process, there's a repeating pattern of:
> 
> 002ca150: 2f73 746f 7261 6765 2f70 726f 642f 6570  /storage/prod/ep
> 002ca160: 6865 6d65 7261 6c2f 2f76 686f 7374 3037  hemeral//vhost07
> 002ca170: 3239 2f76 686f 7374 3037 3239 2e69 6d67  29/vhost0729.img
> 002ca180: 0000 0058 767f 0000 2500 0000 0000 0000  ...Xv...%.......
> 002ca190: 6964 6530 2d30 2d30 0000 0058 767f 0000  ide0-0-0...Xv...
> 002ca1a0: 2000 0000 0000 0000 3500 0000 0000 0000   .......5.......
> 002ca1b0: 656f 7468 6572 0000 9800 0058 767f 0000  eother.....Xv...
> 002ca1c0: bf89 cacb 1b4d 2a7b 3000 0030 767f 0000  .....M*{0..0v...
> 002ca1d0: 3000 0000 0000 0000 4500 0000 0000 0000  0.......E.......
> 
> That is repeated over 2 million times which looks clearly like a high frequency
> leak.  Still trying to find a the right data structure that would contain this
> information.

That pattern shows 'path', 'devAlias', <some integer>, 'reason'. In other words, it is an instance of a virDomainEvent for an I/O error. Likely from

    ioErrorEvent2 = virDomainEventIOErrorReasonNewFromObj(vm, srcPath, devAlias, action, reason);

in qemuHandleDomainIOError.

This allocated object is put on the event queue

            qemuDomainEventQueue(driver, ioErrorEvent2);


A short while later, qemuDomainEventFlush runs and invokes

    virDomainEventQueueDispatch(&tempQueue,
                                driver->domainEventCallbacks,
                                qemuDomainEventDispatchFunc,
                                driver);

this should iterate over all queued events, dispatch them, and then call virDomainEventFree().

The only way I could see it leak is if the qemuDomainEventFlush method never got run.

Comment 9 Anthony Liguori 2010-12-01 15:02:43 UTC
Created attachment 463999 [details]
Patch to fix memory leak

This is untested and against upstream, but I think this is the source of the problem.

Comment 11 Eric Blake 2010-12-14 14:52:11 UTC
Proposed patch series for z-stream:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2010-December/msg00305.html

Comment 12 IBM Bug Proxy 2010-12-21 09:52:04 UTC
------- Comment From bnpoorni@in.ibm.com 2010-12-21 04:47 EDT-------
*** Bug 68847 has been marked as a duplicate of this bug. ***

Comment 13 Jiri Denemark 2011-01-09 23:57:32 UTC
Built into libvirt-0.8.7-1.el6

Comment 14 Cui Chun 2011-01-11 05:57:48 UTC
Verified. 

Please confirm whether the "LEAK SUMMARY" below is acceptable. I will continue running the script and try to finish 36000 cycles tonight.

-----------------
Test environment:
libvirt-0.8.7-1.el6
qemu-kvm-0.12.1.2-2.128.el6
kernel-2.6.32-94.el6

Steps:
1. install a domain named "rhel6-clone"
2. disable selinux:
# setenforce 0
3. run "libvirtd_memory_check.sh" attachment script 
4. check the "libvirtd_memory_check.sh.log" after running 400 cycles; the leak is no longer found.

==21443== LEAK SUMMARY:
==21443==    definitely lost: 0 bytes in 0 blocks
==21443==    indirectly lost: 0 bytes in 0 blocks
==21443==      possibly lost: 349 bytes in 18 blocks
==21443==    still reachable: 2,540 bytes in 47 blocks
==21443==         suppressed: 0 bytes in 0 blocks
==21443== Rerun with --leak-check=full to see details of leaked memory

Comment 15 Cui Chun 2011-01-11 06:01:07 UTC
Created attachment 472740 [details]
libvirtd_memory_check.sh.log for libvirt-0.8.7-1.el6

Comment 16 Laine Stump 2011-01-13 21:36:40 UTC
*** Bug 583083 has been marked as a duplicate of this bug. ***

Comment 19 Vivian Bian 2011-02-15 12:08:58 UTC
Created attachment 478866 [details]
leak test log for libvirt-0.8.7-6.el6

Retested with libvirt-0.8.7-6.el6.x86_64: PASS. Setting bug status to VERIFIED.

1. install a domain named "rhel6-clone"
2. disable selinux:
# setenforce 0
3. run "libvirtd_memory_check.sh" attachment script 
4. check the "libvirtd_memory_check.sh.log" after running 36000 cycles; the leak is no longer found.


==5593== LEAK SUMMARY:
==5593==    definitely lost: 0 bytes in 0 blocks
==5593==    indirectly lost: 0 bytes in 0 blocks
==5593==      possibly lost: 349 bytes in 18 blocks
==5593==    still reachable: 1,840 bytes in 39 blocks
==5593==         suppressed: 0 bytes in 0 blocks
==5593== Rerun with --leak-check=full to see details of leaked memory

Comment 20 Martin Prpič 2011-04-15 14:24:25 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Starting and shutting down a domain led to a memory leak due to the memory buffer not being freed properly. With this update, starting and shutting down a domain no longer leads to a memory leak.

Comment 23 Laura Bailey 2011-05-04 04:30:36 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Starting and shutting down a domain led to a memory leak due to the memory buffer not being freed properly. With this update, starting and shutting down a domain no longer leads to a memory leak.+Memory buffer was not freed properly on domain startup and shutdown, which led to a memory leak that increased each time the domain was started or shut down. This update removes this memory leak.

Comment 24 errata-xmlrpc 2011-05-19 13:24:25 UTC
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html