Bug 656795 - Start and shutdown domain lead to memory leak
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
6.0
Unspecified Unspecified
urgent Severity high
: rc
: ---
Assigned To: Eric Blake
Virtualization Bugs
: ZStream
Duplicates: 583083 620334
Depends On: 620345 658571 658657 682240
Blocks: 672549 679164 682249
 
Reported: 2010-11-24 01:15 EST by xhu
Modified: 2015-09-27 22:27 EDT
19 users

See Also:
Fixed In Version: libvirt-0.8.7-1.el6
Doc Type: Bug Fix
Doc Text:
Memory buffer was not freed properly on domain startup and shutdown, which led to a memory leak that increased each time the domain was started or shut down. This update removes this memory leak.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-05-19 09:24:25 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
leak memory check script (247 bytes, text/plain)
2010-11-24 01:15 EST, xhu
valgrind log for libvirtd (411.96 KB, text/plain)
2010-11-24 01:18 EST, xhu
Patch to fix memory leak (812 bytes, patch)
2010-12-01 10:02 EST, Anthony Liguori
libvirtd_memory_check.sh.log for libvirt-0.8.7-1.el6 (2.35 KB, application/octet-stream)
2011-01-11 01:01 EST, Cui Chun
leak test log for libvirt-0.8.7-6.el6 (2.30 KB, text/plain)
2011-02-15 07:08 EST, Vivian Bian


External Trackers
Tracker ID Priority Status Summary Last Updated
IBM Linux Technology Center 68645 None None None Never

Description xhu 2010-11-24 01:15:27 EST
Created attachment 462542 [details]
leak memory check script

Description of problem:
Starting and shutting down a domain leads to a memory leak.

Version-Release number of selected component (if applicable):
kernel-2.6.32-71.el6.x86_64
libvirt-0.8.1-27.el6.x86_64
qemu-kvm-0.12.1.2-2.113.el6.x86_64

How reproducible:
Every time

Steps to Reproduce:
1. Install a domain named "kvm1".
2. Disable SELinux enforcement:
# setenforce 0
3. Run the attached "libvirtd_memory_check.sh" script.
  
Actual results:
While the script from step 3 is running, the valgrind log in the "libvirtd_memory_check.sh.log" attachment shows a memory leak.
The leaked memory increases after every start/shutdown cycle.

Expected results:
No memory leak caused by start and shutdown domain

Additional info:
The leak statistics from the valgrind log in the "libvirtd_memory_check.sh.log" attachment are as follows:

LEAK SUMMARY:
==10198==    definitely lost: 124,172 bytes in 148 blocks
==10198==    indirectly lost: 3,527,409 bytes in 30,656 blocks
==10198==      possibly lost: 26,229 bytes in 138 blocks
==10198==    still reachable: 2,373,241 bytes in 17,808 blocks
==10198==         suppressed: 0 bytes in 0 blocks
==10198== Rerun with --leak-check=full to see details of leaked memory
Comment 1 xhu 2010-11-24 01:18:45 EST
Created attachment 462543 [details]
valgrind log for libvirtd
Comment 2 Eric Blake 2010-11-24 13:11:36 EST
Upstream patch posted for the worst offender (at least 1024 bytes on every qemu monitor connection, which is one per start/stop sequence):
https://www.redhat.com/archives/libvir-list/2010-November/msg01100.html

There appear to be other leaks as well (148 blocks of 1024 bytes each would exceed the 124,172 bytes definitely lost), but they are smaller and may not occur as often; more analysis is needed to decide whether anything else is worth plugging.
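The bound in the parenthetical is plain arithmetic: 148 × 1024 = 151,552 bytes, which is more than the 124,172 bytes definitely lost, and 124,172 / 148 = 839 bytes per block on average. A tiny C sketch of the check (the function names are illustrative, not libvirt code):

```c
#include <assert.h>

/* True if every lost block could be a 1024-byte buffer, i.e. the lost
 * total is at least blocks * 1024 bytes. */
static int all_blocks_could_be_1k(long lost_bytes, long lost_blocks)
{
    return lost_blocks * 1024L <= lost_bytes;
}

/* Average size of a "definitely lost" block. */
static double average_block_size(long lost_bytes, long lost_blocks)
{
    assert(lost_blocks > 0);
    return (double)lost_bytes / (double)lost_blocks;
}
```

For the summary in the description, `all_blocks_could_be_1k(124172, 148)` is false and the average block size is 839 bytes, so at most some of the lost blocks can be the 1024-byte monitor buffers.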
Comment 3 Frank Novak 2010-11-25 09:05:16 EST
Mike Strosaker on our team set up a simple cron job that runs every 15 minutes and captures libvirtd memory usage. As far as we can tell, the same four VMs have been running on the system for the entire monitoring period, so there has been no provisioning activity.


DATE                                 %MEM    RSS
2010-11-23-16:45:01                  10.4    6731.32
2010-11-23-17:00:01                  11.3    7341.36
2010-11-23-17:15:01                  12.3    7948.18
2010-11-23-17:30:01                  13.2    8555.17
2010-11-23-17:45:01                  14.2    9168.53
2010-11-23-18:00:01                  15.2    9799.07
2010-11-23-18:15:01                  16.2    10436.68
2010-11-23-18:30:01                  17.1    11057.58
2010-11-23-18:45:01                  18.1    11666.37
2010-11-23-19:00:01                  19.1    12307.74
2010-11-23-19:15:01                  20.0    12945.53
2010-11-23-19:30:01                  21.0    13572.84
2010-11-23-19:45:01                  22.0    14213.44
2010-11-23-20:00:01                  23.0    14863.93
2010-11-23-20:15:01                  24.0    15502.12
2010-11-23-20:30:01                  25.0    16155.37
2010-11-23-20:45:01                  26.1    16817.77
2010-11-23-21:00:01                  27.1    17472.32
2010-11-23-21:15:01                  28.1    18116.05
2010-11-23-21:30:01                  29.1    18758.14
2010-11-23-21:45:01                  30.0    19380.17
2010-11-23-22:00:01                  31.0    20029.62
2010-11-23-22:15:01                  32.0    20654.29
2010-11-23-22:30:01                  33.0    21280.00
2010-11-23-22:45:01                  33.9    21890.36
2010-11-23-23:00:01                  34.9    22500.23
2010-11-23-23:15:01                  35.8    23103.05
2010-11-23-23:30:01                  36.8    23714.33

Additional data is being collected.
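The RSS column grows almost linearly. A sketch that computes the average growth per 15-minute sample from the 28 values in the table (the report does not state the RSS unit; the array and function names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* RSS column from the table above, one sample every 15 minutes
 * (16:45 to 23:30). Unit not stated in the report. */
static const double rss_samples[] = {
    6731.32,  7341.36,  7948.18,  8555.17,  9168.53,  9799.07,  10436.68,
    11057.58, 11666.37, 12307.74, 12945.53, 13572.84, 14213.44, 14863.93,
    15502.12, 16155.37, 16817.77, 17472.32, 18116.05, 18758.14, 19380.17,
    20029.62, 20654.29, 21280.00, 21890.36, 22500.23, 23103.05, 23714.33
};

/* Average growth per sampling interval: (last - first) / (n - 1). */
static double average_growth_per_interval(const double *samples, size_t n)
{
    assert(n >= 2);
    return (samples[n - 1] - samples[0]) / (double)(n - 1);
}
```

Over 27 intervals the RSS rises by 16,983.01, about 629 units per 15 minutes (roughly 2,500 per hour), with no provisioning activity on the host.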
Comment 4 Anthony Liguori 2010-11-30 19:09:41 EST
The nasty leak has something to do with disk information.  Based on a core extracted from a leaking libvirtd process, there's a repeating pattern of:

002ca150: 2f73 746f 7261 6765 2f70 726f 642f 6570  /storage/prod/ep
002ca160: 6865 6d65 7261 6c2f 2f76 686f 7374 3037  hemeral//vhost07
002ca170: 3239 2f76 686f 7374 3037 3239 2e69 6d67  29/vhost0729.img
002ca180: 0000 0058 767f 0000 2500 0000 0000 0000  ...Xv...%.......
002ca190: 6964 6530 2d30 2d30 0000 0058 767f 0000  ide0-0-0...Xv...
002ca1a0: 2000 0000 0000 0000 3500 0000 0000 0000   .......5.......
002ca1b0: 656f 7468 6572 0000 9800 0058 767f 0000  eother.....Xv...
002ca1c0: bf89 cacb 1b4d 2a7b 3000 0030 767f 0000  .....M*{0..0v...
002ca1d0: 3000 0000 0000 0000 4500 0000 0000 0000  0.......E.......

That pattern is repeated over 2 million times, which clearly looks like a high-frequency leak. Still trying to find the right data structure that would contain this information.
Comment 5 Anthony Liguori 2010-11-30 19:12:55 EST
There are three other guests running.  vhost0728 has 200k hits in the core file but the other two hosts only have 3 hits.

It looks like the leak is specific to particular guests. Possibly some API call is being run frequently, but only for certain guests.
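Per-guest hit counts like these can be obtained by scanning the core for each guest's image path. A minimal sketch, assuming the core file has already been read into memory (the function name is hypothetical, and this is not necessarily the tool used here). It uses `memcmp` rather than `strstr` because a core dump contains embedded NUL bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count (possibly overlapping) occurrences of a byte pattern in a buffer,
 * e.g. a guest's disk image path inside a core dump. */
static size_t count_pattern(const unsigned char *buf, size_t len,
                            const unsigned char *pat, size_t plen)
{
    size_t hits = 0;

    assert(plen > 0);
    if (plen > len)
        return 0;
    for (size_t i = 0; i + plen <= len; i++) {
        if (memcmp(buf + i, pat, plen) == 0)
            hits++;
    }
    return hits;
}
```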
Comment 6 Eric Blake 2010-11-30 19:27:31 EST
I've identified further leaks in libnl and libselinux that impact libvirt, and I'm still in the process of tracking down root causes of other valgrind leak reports.  I'm definitely making progress on plugging leaks via upstream patches, and will be working on backporting them to RHEL as fast as I can.
Comment 7 Eric Blake 2010-11-30 22:45:15 EST
*** Bug 620334 has been marked as a duplicate of this bug. ***
Comment 8 Daniel Berrange 2010-12-01 05:37:08 EST
(In reply to comment #4)
> The nasty leak has something to do with disk information.  Based on a core
> extracted from a leaking libvirtd process, there's a repeating pattern of:
> 
> 002ca150: 2f73 746f 7261 6765 2f70 726f 642f 6570  /storage/prod/ep
> 002ca160: 6865 6d65 7261 6c2f 2f76 686f 7374 3037  hemeral//vhost07
> 002ca170: 3239 2f76 686f 7374 3037 3239 2e69 6d67  29/vhost0729.img
> 002ca180: 0000 0058 767f 0000 2500 0000 0000 0000  ...Xv...%.......
> 002ca190: 6964 6530 2d30 2d30 0000 0058 767f 0000  ide0-0-0...Xv...
> 002ca1a0: 2000 0000 0000 0000 3500 0000 0000 0000   .......5.......
> 002ca1b0: 656f 7468 6572 0000 9800 0058 767f 0000  eother.....Xv...
> 002ca1c0: bf89 cacb 1b4d 2a7b 3000 0030 767f 0000  .....M*{0..0v...
> 002ca1d0: 3000 0000 0000 0000 4500 0000 0000 0000  0.......E.......
> 
> That is repeated over 2 million times which looks clearly like a high frequency
> leak.  Still trying to find a the right data structure that would contain this
> information.

That pattern shows 'path', 'devAlias', <some integer>, 'reason'. In other words, it is an instance of a virDomainEvent for an I/O error, likely from

    ioErrorEvent2 = virDomainEventIOErrorReasonNewFromObj(vm, srcPath, devAlias, action, reason);

In qemuHandleDomainIOError.
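To see why one unfreed event leaves that exact sequence in memory, here is a sketch of an I/O-error event object holding its own copies of the strings. The struct and functions are hypothetical stand-ins, NOT libvirt's actual virDomainEvent layout; they only mirror the path / devAlias / integer / reason fields read from the dump:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical I/O-error event: each string is a separate heap copy. */
struct io_error_event {
    char *src_path;
    char *dev_alias;
    int action;
    char *reason;
};

static char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy)
        memcpy(copy, s, n);
    return copy;
}

static void io_error_event_free(struct io_error_event *ev)
{
    if (!ev)
        return;
    free(ev->src_path);
    free(ev->dev_alias);
    free(ev->reason);
    free(ev);
}

/* Four allocations per event (struct plus three strings): if the event is
 * never freed, all four leak, leaving path, alias and reason side by side. */
static struct io_error_event *io_error_event_new(const char *src_path,
                                                 const char *dev_alias,
                                                 int action,
                                                 const char *reason)
{
    struct io_error_event *ev;

    assert(src_path && dev_alias && reason);
    ev = calloc(1, sizeof(*ev));
    if (!ev)
        return NULL;
    ev->src_path = dup_string(src_path);
    ev->dev_alias = dup_string(dev_alias);
    ev->reason = dup_string(reason);
    ev->action = action;
    if (!ev->src_path || !ev->dev_alias || !ev->reason) {
        io_error_event_free(ev);
        return NULL;
    }
    return ev;
}
```

The `action` value in any usage is arbitrary; the dump does not say which action code was stored.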

This allocated object is put on the event queue

            qemuDomainEventQueue(driver, ioErrorEvent2);


A short while later, qemuDomainEventFlush runs and invokes

    virDomainEventQueueDispatch(&tempQueue,
                                driver->domainEventCallbacks,
                                qemuDomainEventDispatchFunc,
                                driver);

This should iterate over all queued events, dispatch them, and then call virDomainEventFree() on each.

The only way I can see it leaking is if the qemuDomainEventFlush method never gets run.
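The ownership rule described above (the queue owns each event until a flush dispatches and frees it) can be sketched with a hypothetical queue; the names are illustrative and not libvirt's API. Detaching the whole list before dispatching loosely mirrors the `tempQueue` in the snippet. If the flush never runs, `pending` simply keeps growing, which is the leak scenario:

```c
#include <assert.h>
#include <stdlib.h>

struct event {
    int id;
    struct event *next;
};

struct event_queue {
    struct event *head;
    struct event *tail;
    size_t pending;     /* events owned by the queue, not yet freed */
};

typedef void (*dispatch_fn)(const struct event *ev, void *opaque);

/* The queue takes ownership of ev. */
static void queue_push(struct event_queue *q, struct event *ev)
{
    assert(ev != NULL);
    ev->next = NULL;
    if (q->tail)
        q->tail->next = ev;
    else
        q->head = ev;
    q->tail = ev;
    q->pending++;
}

/* Detach all queued events, dispatch each one, then free it.
 * Returns the number of events flushed. */
static size_t queue_flush(struct event_queue *q, dispatch_fn dispatch,
                          void *opaque)
{
    struct event *ev = q->head;
    size_t n = 0;

    q->head = q->tail = NULL;
    q->pending = 0;
    while (ev) {
        struct event *next = ev->next;
        if (dispatch)
            dispatch(ev, opaque);
        free(ev);
        ev = next;
        n++;
    }
    return n;
}
```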
Comment 9 Anthony Liguori 2010-12-01 10:02:43 EST
Created attachment 463999 [details]
Patch to fix memory leak

This is untested and against upstream, but I think this is the source of the problem.
Comment 11 Eric Blake 2010-12-14 09:52:11 EST
Proposed patch series for z-stream:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2010-December/msg00305.html
Comment 12 IBM Bug Proxy 2010-12-21 04:52:04 EST
------- Comment From bnpoorni@in.ibm.com 2010-12-21 04:47 EDT-------
*** Bug 68847 has been marked as a duplicate of this bug. ***
Comment 13 Jiri Denemark 2011-01-09 18:57:32 EST
Built into libvirt-0.8.7-1.el6
Comment 14 Cui Chun 2011-01-11 00:57:48 EST
Verified. 

Please confirm if the "LEAK SUMMARY" is acceptable. I will continue to run the script and try to finish 36000 cycles tonight. 

-----------------
Test environment:
libvirt-0.8.7-1.el6
qemu-kvm-0.12.1.2-2.128.el6
kernel-2.6.32-94.el6

Steps:
1. Install a domain named "rhel6-clone".
2. Disable SELinux enforcement:
# setenforce 0
3. Run the attached "libvirtd_memory_check.sh" script.
4. Checked "libvirtd_memory_check.sh.log" after running 400 cycles; no leak was found.

==21443== LEAK SUMMARY:
==21443==    definitely lost: 0 bytes in 0 blocks
==21443==    indirectly lost: 0 bytes in 0 blocks
==21443==      possibly lost: 349 bytes in 18 blocks
==21443==    still reachable: 2,540 bytes in 47 blocks
==21443==         suppressed: 0 bytes in 0 blocks
==21443== Rerun with --leak-check=full to see details of leaked memory
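A natural check for these summaries is whether the "definitely lost" and "indirectly lost" lines have dropped to zero (here, "possibly lost" stays at 349 bytes in 18 blocks in both verification runs). A sketch of a parser for LEAK SUMMARY lines, handling valgrind's comma-grouped numbers; the helper names are hypothetical, and the attached libvirtd_memory_check.sh script's own logic is not shown in this report:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Parse a number that may contain thousands separators ("124,172").
 * Returns a pointer just past the number. */
static const char *parse_grouped(const char *p, long *out)
{
    long v = 0;

    while (isdigit((unsigned char)*p) || *p == ',') {
        if (*p != ',')
            v = v * 10 + (*p - '0');
        p++;
    }
    *out = v;
    return p;
}

/* Extract bytes and blocks from a line such as
 * "==21443==    definitely lost: 0 bytes in 0 blocks".
 * kind is e.g. "definitely lost"; returns false if the line doesn't match. */
static bool parse_leak_line(const char *line, const char *kind,
                            long *bytes, long *blocks)
{
    const char *p = strstr(line, kind);

    if (!p)
        return false;
    p += strlen(kind);
    if (*p++ != ':')
        return false;
    while (*p == ' ')
        p++;
    if (!isdigit((unsigned char)*p))
        return false;
    p = parse_grouped(p, bytes);
    if (strncmp(p, " bytes in ", 10) != 0)
        return false;
    p += 10;
    if (!isdigit((unsigned char)*p))
        return false;
    parse_grouped(p, blocks);
    return true;
}
```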
Comment 15 Cui Chun 2011-01-11 01:01:07 EST
Created attachment 472740 [details]
libvirtd_memory_check.sh.log for libvirt-0.8.7-1.el6
Comment 16 Laine Stump 2011-01-13 16:36:40 EST
*** Bug 583083 has been marked as a duplicate of this bug. ***
Comment 19 Vivian Bian 2011-02-15 07:08:58 EST
Created attachment 478866 [details]
leak test log for libvirt-0.8.7-6.el6

Retested with libvirt-0.8.7-6.el6.x86_64: PASS. Setting bug status to VERIFIED.

1. Install a domain named "rhel6-clone".
2. Disable SELinux enforcement:
# setenforce 0
3. Run the attached "libvirtd_memory_check.sh" script.
4. Checked "libvirtd_memory_check.sh.log" after running 36000 cycles; no leak was found.


==5593== LEAK SUMMARY:
==5593==    definitely lost: 0 bytes in 0 blocks
==5593==    indirectly lost: 0 bytes in 0 blocks
==5593==      possibly lost: 349 bytes in 18 blocks
==5593==    still reachable: 1,840 bytes in 39 blocks
==5593==         suppressed: 0 bytes in 0 blocks
==5593== Rerun with --leak-check=full to see details of leaked memory
Comment 20 Martin Prpic 2011-04-15 10:24:25 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Starting and shutting down a domain led to a memory leak due to the memory buffer not being freed properly. With this update, starting and shutting down a domain no longer leads to a memory leak.
Comment 23 Laura Bailey 2011-05-04 00:30:36 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Starting and shutting down a domain led to a memory leak due to the memory buffer not being freed properly. With this update, starting and shutting down a domain no longer leads to a memory leak.+Memory buffer was not freed properly on domain startup and shutdown, which led to a memory leak that increased each time the domain was started or shut down. This update removes this memory leak.
Comment 24 errata-xmlrpc 2011-05-19 09:24:25 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html
