Created attachment 462542 [details]
leak memory check script

Description of problem:
Starting and shutting down a domain leads to a memory leak.

Version-Release number of selected component (if applicable):
kernel-2.6.32-71.el6.x86_64
libvirt-0.8.1-27.el6.x86_64
qemu-kvm-0.12.1.2-2.113.el6.x86_64

How reproducible:
every time

Steps to Reproduce:
1. install a domain named "kvm1"
2. disable selinux: # setenforce 0
3. run the attached "libvirtd_memory_check.sh" script

Actual results:
While the script from step 3 is running, the attached valgrind log "libvirtd_memory_check.sh.log" shows a memory leak, and the amount of leaked memory grows after every start/shutdown cycle.

Expected results:
No memory leak caused by starting and shutting down a domain.

Additional info:
The leak statistics from the attached valgrind log "libvirtd_memory_check.sh.log" are as follows:

LEAK SUMMARY:
==10198==    definitely lost: 124,172 bytes in 148 blocks
==10198==    indirectly lost: 3,527,409 bytes in 30,656 blocks
==10198==      possibly lost: 26,229 bytes in 138 blocks
==10198==    still reachable: 2,373,241 bytes in 17,808 blocks
==10198==         suppressed: 0 bytes in 0 blocks
==10198== Rerun with --leak-check=full to see details of leaked memory
Created attachment 462543 [details]
valgrind log for libvirtd
Upstream patch posted for the worst offender (at least 1024 bytes on every qemu monitor connection, which is one per start/stop sequence):
https://www.redhat.com/archives/libvir-list/2010-November/msg01100.html

There appear to be other leaks as well (148 blocks at 1024 bytes each would exceed the 124,172 lost bytes, so some of those blocks must come from other leaks), but they are smaller in size and might not be as frequent; it will take more analysis to decide whether anything else is worth plugging.
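For illustration only, the general shape of this class of leak is a fixed-size buffer allocated on every monitor connection and never released when the connection is torn down. The sketch below is purely schematic C with hypothetical names (monitor_open/monitor_close are not libvirt functions); the real fix is in the patch linked above.

#include <stdlib.h>

/* Hypothetical stand-in for a per-connection monitor context. */
struct monitor {
    char *buffer;          /* 1024-byte scratch buffer, one per connection */
    size_t buffer_size;
};

struct monitor *monitor_open(void)
{
    struct monitor *mon = calloc(1, sizeof(*mon));
    if (!mon)
        return NULL;
    mon->buffer_size = 1024;
    mon->buffer = malloc(mon->buffer_size);  /* allocated on every connection */
    if (!mon->buffer) {
        free(mon);
        return NULL;
    }
    return mon;
}

void monitor_close(struct monitor *mon)
{
    if (!mon)
        return;
    free(mon->buffer);  /* omitting this free loses 1024 bytes per start/stop cycle */
    free(mon);
}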
Mike Strosaker on our team set up a simple cron job to run every 15 minutes and capture libvirtd memory usage. As far as we can tell, the same four VMs have been running on the system for the entire monitoring period, so there has been no provisioning activity.

DATE                 %MEM   RSS
2010-11-23-16:45:01  10.4   6731.32
2010-11-23-17:00:01  11.3   7341.36
2010-11-23-17:15:01  12.3   7948.18
2010-11-23-17:30:01  13.2   8555.17
2010-11-23-17:45:01  14.2   9168.53
2010-11-23-18:00:01  15.2   9799.07
2010-11-23-18:15:01  16.2   10436.68
2010-11-23-18:30:01  17.1   11057.58
2010-11-23-18:45:01  18.1   11666.37
2010-11-23-19:00:01  19.1   12307.74
2010-11-23-19:15:01  20.0   12945.53
2010-11-23-19:30:01  21.0   13572.84
2010-11-23-19:45:01  22.0   14213.44
2010-11-23-20:00:01  23.0   14863.93
2010-11-23-20:15:01  24.0   15502.12
2010-11-23-20:30:01  25.0   16155.37
2010-11-23-20:45:01  26.1   16817.77
2010-11-23-21:00:01  27.1   17472.32
2010-11-23-21:15:01  28.1   18116.05
2010-11-23-21:30:01  29.1   18758.14
2010-11-23-21:45:01  30.0   19380.17
2010-11-23-22:00:01  31.0   20029.62
2010-11-23-22:15:01  32.0   20654.29
2010-11-23-22:30:01  33.0   21280.00
2010-11-23-22:45:01  33.9   21890.36
2010-11-23-23:00:01  34.9   22500.23
2010-11-23-23:15:01  35.8   23103.05
2010-11-23-23:30:01  36.8   23714.33

Additional data is being collected.
The nasty leak has something to do with disk information. Based on a core extracted from a leaking libvirtd process, there's a repeating pattern of:

002ca150: 2f73 746f 7261 6765 2f70 726f 642f 6570  /storage/prod/ep
002ca160: 6865 6d65 7261 6c2f 2f76 686f 7374 3037  hemeral//vhost07
002ca170: 3239 2f76 686f 7374 3037 3239 2e69 6d67  29/vhost0729.img
002ca180: 0000 0058 767f 0000 2500 0000 0000 0000  ...Xv...%.......
002ca190: 6964 6530 2d30 2d30 0000 0058 767f 0000  ide0-0-0...Xv...
002ca1a0: 2000 0000 0000 0000 3500 0000 0000 0000   .......5.......
002ca1b0: 656f 7468 6572 0000 9800 0058 767f 0000  eother.....Xv...
002ca1c0: bf89 cacb 1b4d 2a7b 3000 0030 767f 0000  .....M*{0..0v...
002ca1d0: 3000 0000 0000 0000 4500 0000 0000 0000  0.......E.......

That is repeated over 2 million times, which clearly looks like a high-frequency leak. Still trying to find the right data structure that would contain this information.
There are three other guests running. vhost0728 has 200k hits in the core file, but the other two guests only have 3 hits. It looks like the leak is specific to particular guests. It's possible we're running some sort of API call frequently, but only for certain guests.
I've identified further leaks in libnl and libselinux that impact libvirt, and I'm still in the process of tracking down root causes of other valgrind leak reports. I'm definitely making progress on plugging leaks via upstream patches, and will be working on backporting them to RHEL as fast as I can.
*** Bug 620334 has been marked as a duplicate of this bug. ***
(In reply to comment #4)
> The nasty leak has something to do with disk information. Based on a core
> extracted from a leaking libvirtd process, there's a repeating pattern of:
>
> 002ca150: 2f73 746f 7261 6765 2f70 726f 642f 6570  /storage/prod/ep
> 002ca160: 6865 6d65 7261 6c2f 2f76 686f 7374 3037  hemeral//vhost07
> 002ca170: 3239 2f76 686f 7374 3037 3239 2e69 6d67  29/vhost0729.img
> 002ca180: 0000 0058 767f 0000 2500 0000 0000 0000  ...Xv...%.......
> 002ca190: 6964 6530 2d30 2d30 0000 0058 767f 0000  ide0-0-0...Xv...
> 002ca1a0: 2000 0000 0000 0000 3500 0000 0000 0000   .......5.......
> 002ca1b0: 656f 7468 6572 0000 9800 0058 767f 0000  eother.....Xv...
> 002ca1c0: bf89 cacb 1b4d 2a7b 3000 0030 767f 0000  .....M*{0..0v...
> 002ca1d0: 3000 0000 0000 0000 4500 0000 0000 0000  0.......E.......
>
> That is repeated over 2 million times, which clearly looks like a
> high-frequency leak. Still trying to find the right data structure that
> would contain this information.

That pattern shows 'path', 'devAlias', <some integer>, 'reason'. In other words, it is an instance of a virDomainEvent for an I/O error, likely from

  ioErrorEvent2 = virDomainEventIOErrorReasonNewFromObj(vm, srcPath, devAlias, action, reason);

in qemuHandleDomainIOError. This allocated object is put on the event queue with

  qemuDomainEventQueue(driver, ioErrorEvent2);

A short while later qemuDomainEventFlush runs and invokes

  virDomainEventQueueDispatch(&tempQueue, driver->domainEventCallbacks, qemuDomainEventDispatchFunc, driver);

which should iterate over all queued events, dispatch them, and then call virDomainEventFree(). The only way I can see this leaking is if the qemuDomainEventFlush method never gets run.
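To make the lifecycle described above concrete, here is a minimal sketch of the queue-then-flush pattern, assuming simplified stand-in types (struct event, struct event_queue) rather than libvirt's real virDomainEvent machinery; none of the names below are the actual libvirt symbols. The point it illustrates is that freeing only happens inside the flush, so if the flush never runs, every queued I/O-error event stays allocated.

#include <stdlib.h>

/* Simplified stand-ins for the libvirt structures; hypothetical, not the real code. */
struct event {
    char *path;        /* e.g. ".../vhost0729.img" */
    char *dev_alias;   /* e.g. "ide0-0-0" */
    int action;
    char *reason;      /* e.g. "eother" */
    struct event *next;
};

struct event_queue {
    struct event *head;
    struct event *tail;
};

/* Analogue of qemuDomainEventQueue(): hand the allocated event to the queue. */
static void queue_event(struct event_queue *q, struct event *ev)
{
    ev->next = NULL;
    if (q->tail)
        q->tail->next = ev;
    else
        q->head = ev;
    q->tail = ev;
}

/* Analogue of virDomainEventFree(): release the strings and the event itself. */
static void free_event(struct event *ev)
{
    free(ev->path);
    free(ev->dev_alias);
    free(ev->reason);
    free(ev);
}

/*
 * Analogue of qemuDomainEventFlush() + virDomainEventQueueDispatch():
 * dispatch every queued event, then free it.  If this flush is never
 * scheduled, every queued I/O-error event leaks, which would produce
 * exactly the repeating path/devAlias/reason pattern seen in the core.
 */
static void flush_events(struct event_queue *q, void (*dispatch)(struct event *))
{
    struct event *ev = q->head;
    q->head = q->tail = NULL;
    while (ev) {
        struct event *next = ev->next;
        dispatch(ev);
        free_event(ev);
        ev = next;
    }
}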
Created attachment 463999 [details]
Patch to fix memory leak

This is untested and against upstream, but I think this is the source of the problem.
Proposed patch series for z-stream: http://post-office.corp.redhat.com/archives/rhvirt-patches/2010-December/msg00305.html
------- Comment From bnpoorni@in.ibm.com 2010-12-21 04:47 EDT -------
*** Bug 68847 has been marked as a duplicate of this bug. ***
Built into libvirt-0.8.7-1.el6
Verified. Please confirm whether the "LEAK SUMMARY" below is acceptable. I will continue to run the script and try to finish 36,000 cycles tonight.

-----------------
Test environment:
libvirt-0.8.7-1.el6
qemu-kvm-0.12.1.2-2.128.el6
kernel-2.6.32-94.el6

Steps:
1. install a domain named "rhel6-clone"
2. disable selinux: # setenforce 0
3. run the attached "libvirtd_memory_check.sh" script
4. check "libvirtd_memory_check.sh.log" after running 400 cycles; the leak is no longer found.

==21443== LEAK SUMMARY:
==21443==    definitely lost: 0 bytes in 0 blocks
==21443==    indirectly lost: 0 bytes in 0 blocks
==21443==      possibly lost: 349 bytes in 18 blocks
==21443==    still reachable: 2,540 bytes in 47 blocks
==21443==         suppressed: 0 bytes in 0 blocks
==21443== Rerun with --leak-check=full to see details of leaked memory
Created attachment 472740 [details]
libvirtd_memory_check.sh.log for libvirt-0.8.7-1.el6
*** Bug 583083 has been marked as a duplicate of this bug. ***
Created attachment 478866 [details]
leak test log for libvirt-0.8.7-6.el6

Retested with libvirt-0.8.7-6.el6.x86_64: PASS. Setting bug status to VERIFIED.

1. install a domain named "rhel6-clone"
2. disable selinux: # setenforce 0
3. run the attached "libvirtd_memory_check.sh" script
4. check "libvirtd_memory_check.sh.log" after running 36,000 cycles; the leak is no longer found.

==5593== LEAK SUMMARY:
==5593==    definitely lost: 0 bytes in 0 blocks
==5593==    indirectly lost: 0 bytes in 0 blocks
==5593==      possibly lost: 349 bytes in 18 blocks
==5593==    still reachable: 1,840 bytes in 39 blocks
==5593==         suppressed: 0 bytes in 0 blocks
==5593== Rerun with --leak-check=full to see details of leaked memory
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Starting and shutting down a domain led to a memory leak due to the memory buffer not being freed properly. With this update, starting and shutting down a domain no longer leads to a memory leak.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Starting and shutting down a domain led to a memory leak due to the memory buffer not being freed properly. With this update, starting and shutting down a domain no longer leads to a memory leak.
+A memory buffer was not freed properly on domain startup and shutdown, which led to a memory leak that increased each time the domain was started or shut down. This update removes this memory leak.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html