Bug 670848
Summary: | [Libvirt] Libvirt daemon crashes on virDomainDefFree at conf/domain_conf.c:793 | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | David Naori <dnaori>
Component: | libvirt | Assignee: | Laine Stump <laine>
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | urgent | Priority: | low
Version: | 6.1 | CC: | ajia, berrange, dallan, dnaori, eblake, hateya, jdenemar, mgoldboi, mjenner, veillard, xen-maint, yoyzhang
Target Milestone: | rc | Target Release: | 6.1
Hardware: | x86_64 | OS: | Unspecified
Fixed In Version: | libvirt-0.8.7-10.el6 | Doc Type: | Bug Fix
Last Closed: | 2011-05-19 13:25:55 UTC | |
Looking at the stack trace, the sequence appears to be:

1. qemudDomainBlockStats(domain, "hda") finishes and calls qemuDomainObjEndJob(obj) at line 7761.
2. qemuDomainObjEndJob(obj) finishes and calls virDomainObjUnref(obj) at line 499 of src/qemu/qemu_domain.c.
3. The domain reference count drops to 0, so virDomainObjFree(dom) is called at line 910, which starts by calling virDomainDefFree(dom->def) at line 884.
4. virDomainDefFree() segfaults at the beginning while dereferencing the allocated graphics array.

It seems that something managed to free the domain in the meantime. qemudDomainBlockStats() can take some time to process; maybe a domain kill happened concurrently. Still, the locking should have guaranteed that the object was preserved, so I would guess something in the framework is not handling locking/unlocking properly.

Daniel

BTW, the domain logs are not present in the tarball; for some mysterious reason it is a tar of the gdb.txt. Can you provide the libvirt logs as a separate attachment to this bug? Thanks!

Daniel

Created attachment 474427 [details]
libvirt log.
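To make the suspected failure mode concrete, here is a minimal sketch of a double-unref race on a reference-counted object. The names are hypothetical and simplified, not libvirt's actual implementation:

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t lock;
    int refs;
    char **graphics;   /* stands in for the def->graphics array */
} DomainObj;

static void domain_obj_unref(DomainObj *obj)
{
    pthread_mutex_lock(&obj->lock);
    int refs = --obj->refs;
    pthread_mutex_unlock(&obj->lock);
    if (refs == 0) {
        free(obj->graphics);   /* virDomainDefFree() in the real code */
        free(obj);
    }
}

/* If thread A (the RPC worker finishing qemudDomainBlockStats) and
 * thread B (e.g. the event loop tearing down the domain) both drop
 * what each believes is the last reference, the count reaches zero
 * twice: the second caller dereferences and frees already-freed
 * memory, which is why the crash lands in virDomainDefFree() while
 * walking the graphics array. */
```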
The qemudDomainBlockStats() method looks correct, so I have to assume that elsewhere there is something that is unref'ing the virDomainObj twice. Is this crash easily reproducible, and if so, does it occur on earlier 0.8.1 builds of libvirt? If not, that would indicate a regression, which would help in tracking down the problem.

Setting needinfo to make sure David Naori sees the question.

We may have a fix. It fixes a crash caused by starting and stopping large numbers of VMs. See: https://www.redhat.com/archives/libvir-list/2011-January/msg00876.html

Since the fix is not in the latest build, I have made a set of rpms to test, available at http://veillard.com/libvirt/test/. They include the above patch on top of RHEL-6 libvirt-0.8.7-3. Please give them a try and see if the issue can be reproduced with those.

Daniel

David, in the description you say "create 20+ VMs". How many did you actually create?

I created bug 671564 and bug 671567 to track two separate issues found while investigating event.c. The first is easily reproducible by creating then destroying 60 VMs, but has a stack trace different from the original report. The second is something that I have not been able to reproduce (I found it by inspection), but since it involves a data race window, it may be the more likely culprit for fixing this report. If we can definitively prove that either one of those bugs solves this issue, then we can mark this as a duplicate.

Dave Allan, to your question: it was 27 VMs, to be precise. Daniel, unfortunately I did not check earlier versions.

dnaori, can you reproduce the issue with the special build of libvirt at http://veillard.com/libvirt/test/ ? Please upgrade the libvirt in your testing environment to that version, try to reproduce the problem, and report the result. Thanks!

Daniel

David, there is an additional fix that Eric produced after Daniel spun that build. Eric will get you a scratch build with it today.

So far, it seems like this build (http://veillard.com/libvirt/test/) solves the problem.

I managed to reproduce the crash with both Eric's build and veillard's. Core dump bt:

```
#0  virDomainDefFree (def=0x7f05d4017080) at conf/domain_conf.c:793
#1  0x00007f05ea2bb6a0 in virDomainObjFree (dom=0x7f05d401e6c0) at conf/domain_conf.c:884
#2  virDomainObjUnref (dom=0x7f05d401e6c0) at conf/domain_conf.c:910
#3  0x000000000043fca9 in qemudDomainBlockStats (dom=<value optimized out>, path=0x7f05d00dfe70 "hda", stats=<value optimized out>) at qemu/qemu_driver.c:7767
#4  0x00007f05ea303429 in virDomainBlockStats (dom=0x7f05d00e8860, path=0x7f05d00dfe70 "hda", stats=0x7f05dbffeae0, size=40) at libvirt.c:4501
#5  0x000000000042706a in remoteDispatchDomainBlockStats (server=<value optimized out>, client=<value optimized out>, conn=0x7f05d00009e0, hdr=<value optimized out>, rerr=0x7f05dbffeb90, args=0x7f05dbffecd0, ret=0x7f05dbffec70) at remote.c:918
#6  0x000000000042c5ba in remoteDispatchClientCall (server=0x15e1620, client=0x7f05dc0013a0, msg=0x7f05dc001520) at dispatch.c:530
#7  remoteDispatchClientRequest (server=0x15e1620, client=0x7f05dc0013a0, msg=0x7f05dc001520) at dispatch.c:408
#8  0x000000000041c2c8 in qemudWorker (data=0x7f05dc000908) at libvirtd.c:1582
#9  0x0000003e45a077e1 in ?? ()
#10 0x00007f05dbfff710 in ?? ()
#11 0x0000000000000000 in ?? ()
```

Argh ... that's not good ... let's sync on IRC to try to get a live gdb session for it. It's nearly impossible in that case to guess just from a stack trace; actually having detailed debug logs of the daemon may be useful. Let's try to set this up together.
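For readers following along, a simplified sketch of the begin/end-job contract the comments above assume (hypothetical names; the real functions are qemuDomainObjBeginJob() and qemuDomainObjEndJob(), and the real code holds the object lock around the count):

```c
#include <assert.h>

typedef struct {
    int refs;   /* protected by the object lock in the real code */
} DomainObj;

static void begin_job(DomainObj *obj)
{
    obj->refs++;   /* pin the object for the duration of the job */
}

static int end_job(DomainObj *obj)
{
    assert(obj->refs > 0);
    /* Frame #3 of the backtraces is this unref: it should free the
     * object only if every other holder has already let go. A crash
     * here means some other code path dropped a reference it did
     * not own. */
    return --obj->refs;
}
```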
Daniel

(In reply to comment #18)
> these upstream patches (two approved, one still pending review):
> https://www.redhat.com/archives/libvir-list/2011-January/msg00985.html
> https://www.redhat.com/archives/libvir-list/2011-January/msg01151.html
> https://www.redhat.com/archives/libvir-list/2011-January/msg01106.html

Bug 673588 now tracks getting those three patches into RHEL, whether or not they prove to be the root cause of this crash.

Started testing the new scratch build https://brewweb.devel.redhat.com/taskinfo?taskID=3072886, and the crash still occurs. Attached gdb bt of the core dump:

```
#0  0x00007fde21408d5c in virDomainDefFree (def=0x7fddfc001ed0) at conf/domain_conf.c:800
#1  0x00007fde21409315 in virDomainObjFree (dom=0x7fddfc069bb0) at conf/domain_conf.c:891
#2  0x00007fde21409457 in virDomainObjUnref (dom=0x7fddfc069bb0) at conf/domain_conf.c:917
#3  0x0000000000471993 in qemuDomainObjEndJob (obj=0x7fddfc069bb0) at qemu/qemu_domain.c:499
#4  0x0000000000450360 in qemudDomainBlockStats (dom=0x7fde04169b40, path=0x7fde04017ff0 "hda", stats=0x7fde19753a00) at qemu/qemu_driver.c:7821
#5  0x00007fde21455849 in virDomainBlockStats (dom=0x7fde04169b40, path=0x7fde04017ff0 "hda", stats=0x7fde19753ab0, size=40) at libvirt.c:4501
#6  0x0000000000423ddb in remoteDispatchDomainBlockStats (server=0x1b6d640, client=0x7fde14001240, conn=0x7fde080009e0, hdr=0x7fde14110860, rerr=0x7fde19753c00, args=0x7fde19753bb0, ret=0x7fde19753b50) at remote.c:918
#7  0x0000000000430e36 in remoteDispatchClientCall (server=0x1b6d640, client=0x7fde14001240, msg=0x7fde140d0850, qemu_protocol=false) at dispatch.c:530
#8  0x0000000000430a01 in remoteDispatchClientRequest (server=0x1b6d640, client=0x7fde14001240, msg=0x7fde140d0850) at dispatch.c:408
#9  0x000000000041d603 in qemudWorker (data=0x7fde140008c0) at libvirtd.c:1582
#10 0x00000030440077e1 in start_thread (arg=0x7fde19754710) at pthread_create.c:301
#11 0x00000030438e153d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
```

Using these steps from Wen Congyang:

1. Use gdb to debug libvirtd, and set a breakpoint in the function qemuConnectMonitor().
2. Start a VM; libvirtd will be stopped in qemuConnectMonitor().
3. `kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)`
4. Continue to run libvirtd in gdb; libvirtd will be blocked in the function qemuMonitorSetCapabilities().
5. `kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)`
6. Continue to run libvirtd in gdb.

I saw libvirt crash:

```
11:12:44.882: 17952: error : qemuRemoveCgroup:335 : internal error Unable to find cgroup for windows_2008-32
11:12:44.882: 17952: warning : qemudShutdownVMDaemon:3109 : Failed to remove cgroup for windows_2008-32

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff0aaf700 (LWP 17950)]
0x0000003021675705 in malloc_consolidate () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003021675705 in malloc_consolidate () from /lib64/libc.so.6
#1  0x0000003021677f38 in _int_free () from /lib64/libc.so.6
#2  0x00007ffff79e2d73 in virFree (ptrptr=0x7ffff0aae7a0) at util/memory.c:311
#3  0x000000000041dc75 in qemudClientMessageRelease (client=0x7fffec0012f0, msg=0x7fffe0014e10) at libvirtd.c:2065
#4  0x000000000041dd16 in qemudDispatchClientWrite (client=0x7fffec0012f0) at libvirtd.c:2095
#5  0x000000000041dfbe in qemudDispatchClientEvent (watch=8, fd=18, events=2, opaque=0x6fadb0) at libvirtd.c:2165
#6  0x00000000004189ee in virEventDispatchHandles (nfds=7, fds=0x7fffec0011b0) at event.c:467
#7  0x0000000000419082 in virEventRunOnce () at event.c:599
#8  0x000000000041e1c1 in qemudOneLoop () at libvirtd.c:2265
#9  0x000000000041e6cd in qemudRunLoop (opaque=0x6fadb0) at libvirtd.c:2375
#10 0x0000003021e077e1 in start_thread () from /lib64/libpthread.so.0
#11 0x00000030216e151d in clone () from /lib64/libc.so.6
```

which, although a different trace, has all the same earmarks of being a case of the event dispatcher referencing double-freed memory. Then, applying Wen's patch and reproducing the scenario, I no longer see the crash. Therefore, it may be worth merging this and bug 673588.

Here is valgrind output from a system that experienced this crash. It shows the thread that previously freed the domain object. I haven't analyzed it yet:

```
==13237== Invalid read of size 4
==13237==    at 0x333084F3A3: virDomainObjUnref (domain_conf.c:979)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e6cf78 is 40 bytes inside a block of size 104 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F44F: virDomainObjUnref (domain_conf.c:965)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==
==13237== Invalid read of size 4
==13237==    at 0x366B60A640: pthread_mutex_unlock (in /lib64/libpthread-2.12.so)
==13237==    by 0x333084F3BF: virDomainObjUnref (domain_conf.c:980)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e6cf60 is 16 bytes inside a block of size 104 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F44F: virDomainObjUnref (domain_conf.c:965)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==
==13237== Invalid read of size 4
==13237==    at 0x366B60A1F0: __pthread_mutex_unlock_full (in /lib64/libpthread-2.12.so)
==13237==    by 0x333084F3BF: virDomainObjUnref (domain_conf.c:980)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e6cf60 is 16 bytes inside a block of size 104 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F44F: virDomainObjUnref (domain_conf.c:965)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==
==13237== Invalid read of size 8
==13237==    at 0x333084F3F7: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e6cf88 is 56 bytes inside a block of size 104 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F44F: virDomainObjUnref (domain_conf.c:965)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==
==13237== Invalid read of size 4
==13237==    at 0x333084EC59: virDomainDefFree (domain_conf.c:859)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e2fcb8 is 312 bytes inside a block of size 632 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F1A7: virDomainDefFree (domain_conf.c:945)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==
==13237== Invalid read of size 8
==13237==    at 0x333084EC70: virDomainDefFree (domain_conf.c:860)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e2fcc0 is 320 bytes inside a block of size 632 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F1A7: virDomainDefFree (domain_conf.c:945)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==
==13237== Invalid read of size 8
==13237==    at 0x333084EC7C: virDomainDefFree (domain_conf.c:860)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==13237==
==13237==
==13237== Process terminating with default action of signal 11 (SIGSEGV)
==13237==  Access not within mapped region at address 0x0
==13237==    at 0x333084EC7C: virDomainDefFree (domain_conf.c:860)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  If you believe this happened as a result of a stack
==13237==  overflow in your program's main thread (unlikely but
==13237==  possible), you can try to increase the size of the
==13237==  main thread stack using the --main-stacksize= flag.
==13237==  The main thread stack size used in this run was 10485760.
```

Note that the above crash happened while running libvirt-0.8.7-8. Eric says that build has all the fixes discussed above.

I have a local test setup that creates and destroys a transient guest while doing domainblkstats of a block device on the domain, and an application called referential supplied by danpb that keeps track of the refcount on every domain object (when libvirt is run under referential, that is). So far this has not reproduced the crash. Now, in a quest to more exactly replicate the behavior on the reproducing system, I'm wondering exactly which libvirt APIs vdsm is calling to "stop the vms". (I'm calling the destroy API. Maybe I need to be calling "shutdown"? And in that case, what is the minimum client needed to respond properly to that?)

Using the output of referential as a guide (along with Dan B's brain), I found two potential races in the code that deals with removing inactive DomainObjs when a domain is shut down. The patches for these have been posted upstream, and after review/push, I will submit rebased versions for the RHEL6 version of libvirt: https://www.redhat.com/archives/libvir-list/2011-March/msg00112.html

Patches sent to rhvirt-patches: http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-March/msg00100.html

Shoot - I think this introduces a deadlock regression into 'virsh save domain file'. gdb shows the following backtrace:

```
Thread 7 (Thread 0x7fffe97fb700 (LWP 29297)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007ffff79e97ef in virCondWait (c=0x6fee68, m=0x6fee40) at util/threads-pthread.c:112
#2  0x000000000041cdb9 in qemudWorker (data=0x7fffec000920) at libvirtd.c:1608
#3  0x00000032984077e1 in start_thread (arg=0x7fffe97fb700) at pthread_create.c:301
#4  0x0000003297ce5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 6 (Thread 0x7fffea1fc700 (LWP 29296)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007ffff79e97ef in virCondWait (c=0x6fee68, m=0x6fee40) at util/threads-pthread.c:112
#2  0x000000000041cdb9 in qemudWorker (data=0x7fffec000908) at libvirtd.c:1608
#3  0x00000032984077e1 in start_thread (arg=0x7fffea1fc700) at pthread_create.c:301
#4  0x0000003297ce5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 0x7fffeabfd700 (LWP 29295)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007ffff79e97ef in virCondWait (c=0x6fee68, m=0x6fee40) at util/threads-pthread.c:112
#2  0x000000000041cdb9 in qemudWorker (data=0x7fffec0008f0) at libvirtd.c:1608
#3  0x00000032984077e1 in start_thread (arg=0x7fffeabfd700) at pthread_create.c:301
#4  0x0000003297ce5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7fffeb5fe700 (LWP 29294)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007ffff79e97ef in virCondWait (c=0x6fee68, m=0x6fee40) at util/threads-pthread.c:112
#2  0x000000000041cdb9 in qemudWorker (data=0x7fffec0008d8) at libvirtd.c:1608
#3  0x00000032984077e1 in start_thread (arg=0x7fffeb5fe700) at pthread_create.c:301
#4  0x0000003297ce5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7fffebfff700 (LWP 29293)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x0000003298409345 in _L_lock_870 () from /lib64/libpthread.so.0
#2  0x0000003298409217 in __pthread_mutex_lock (mutex=0x6fc6a0) at pthread_mutex_lock.c:61
#3  0x00007ffff79e9739 in virMutexLock (m=0x6fc6a0) at util/threads-pthread.c:80
#4  0x000000000041805d in virEventUpdateTimeoutImpl (timer=1, frequency=0) at event.c:247
#5  0x00007ffff79d5212 in virEventUpdateTimeout (timer=1, timeout=0) at util/event.c:70
#6  0x00000000004631da in qemuDomainEventQueue (driver=0x73b1a0, event=0x7fffdc011dd0) at qemu/qemu_domain.c:97
#7  0x000000000043ed12 in qemudDomainSaveFlag (driver=0x73b1a0, dom=0x7fffdc001970, vm=0x761950, path=0x7fffdc000e10 "/var/run/libvirt/qemu/fed12.img", compressed=0) at qemu/qemu_driver.c:2074
#8  0x000000000043efd2 in qemudDomainSave (dom=0x7fffdc001970, path=0x7fffdc000e10 "/var/run/libvirt/qemu/fed12.img") at qemu/qemu_driver.c:2137
#9  0x00007ffff7a40a9d in virDomainSave (domain=0x7fffdc001970, to=0x7fffdc000e10 "/var/run/libvirt/qemu/fed12.img") at libvirt.c:2280
#10 0x0000000000426267 in remoteDispatchDomainSave (server=0x6fee40, client=0x7fffec001190, conn=0x7fffd8000b00, hdr=0x7fffec041430, rerr=0x7fffebffec00, args=0x7fffebffebb0, ret=0x7fffebffeb50) at remote.c:2273
#11 0x000000000043061a in remoteDispatchClientCall (server=0x6fee40, client=0x7fffec001190, msg=0x7fffec001420, qemu_protocol=false) at dispatch.c:529
#12 0x00000000004301e5 in remoteDispatchClientRequest (server=0x6fee40, client=0x7fffec001190, msg=0x7fffec001420) at dispatch.c:407
#13 0x000000000041ce79 in qemudWorker (data=0x7fffec0008c0) at libvirtd.c:1629
#14 0x00000032984077e1 in start_thread (arg=0x7fffebfff700) at pthread_create.c:301
#15 0x0000003297ce5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 2 (Thread 0x7ffff0aaa700 (LWP 29292)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x0000003298409345 in _L_lock_870 () from /lib64/libpthread.so.0
#2  0x0000003298409217 in __pthread_mutex_lock (mutex=0x761950) at pthread_mutex_lock.c:61
#3  0x00007ffff79e9739 in virMutexLock (m=0x761950) at util/threads-pthread.c:80
#4  0x00007ffff7a0dbd2 in virDomainObjLock (obj=0x761950) at conf/domain_conf.c:8447
#5  0x000000000046feb7 in qemuProcessHandleMonitorDestroy (mon=0x7fffdc001a60, vm=0x761950) at qemu/qemu_process.c:599
#6  0x0000000000478893 in qemuMonitorFree (mon=0x7fffdc001a60) at qemu/qemu_monitor.c:209
#7  0x0000000000478931 in qemuMonitorUnref (mon=0x7fffdc001a60) at qemu/qemu_monitor.c:229
#8  0x000000000047896d in qemuMonitorUnwatch (monitor=0x7fffdc001a60) at qemu/qemu_monitor.c:242
#9  0x0000000000418fd4 in virEventCleanupHandles () at event.c:538
#10 0x00000000004192c4 in virEventRunOnce () at event.c:603
#11 0x000000000041e45e in qemudOneLoop () at libvirtd.c:2285
#12 0x000000000041e96a in qemudRunLoop (opaque=0x6fee40) at libvirtd.c:2395
#13 0x00000032984077e1 in start_thread (arg=0x7ffff0aaa700) at pthread_create.c:301
#14 0x0000003297ce5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7ffff7765800 (LWP 29290)):
#0  0x000000329840803d in pthread_join (threadid=140737231103744, thread_return=0x0) at pthread_join.c:89
#1  0x0000000000421ae6 in main (argc=1, argv=0x7fffffffe1f8) at libvirtd.c:3411
```

This additional patch resolves the deadlock; v0.8.7-9 is DOA with just the first 2 of 3 patches, but 0.8.7-10 should have the third patch and solve this bug. http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-March/msg00125.html
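Taking a step back, the deadlock in the backtrace above reduces to a classic lock-ordering inversion: thread 3 holds the domain lock and blocks acquiring the event-loop lock (via virEventUpdateTimeout), while thread 2 holds the event-loop lock and blocks acquiring the domain lock (via qemuProcessHandleMonitorDestroy). A minimal sketch of that shape, with hypothetical, simplified locks rather than libvirt's actual code:

```c
#include <pthread.h>

static pthread_mutex_t domain_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t event_lock  = PTHREAD_MUTEX_INITIALIZER;

/* Thread 3 above: qemudDomainSaveFlag() holds the domain lock, then
 * qemuDomainEventQueue() -> virEventUpdateTimeout() wants the event
 * loop's lock. */
static void *save_worker(void *arg)
{
    pthread_mutex_lock(&domain_lock);
    pthread_mutex_lock(&event_lock);      /* blocks forever */
    pthread_mutex_unlock(&event_lock);
    pthread_mutex_unlock(&domain_lock);
    return arg;
}

/* Thread 2 above: virEventCleanupHandles() holds the event loop's
 * lock, then qemuProcessHandleMonitorDestroy() -> virDomainObjLock()
 * wants the domain lock. */
static void *event_loop(void *arg)
{
    pthread_mutex_lock(&event_lock);
    pthread_mutex_lock(&domain_lock);     /* blocks forever */
    pthread_mutex_unlock(&domain_lock);
    pthread_mutex_unlock(&event_lock);
    return arg;
}
```

The usual cures are to release one lock before taking the other, or to defer the callback until the event-loop lock has been dropped; the sketch only illustrates the inversion, not what the third patch actually does.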
Verified: started libvirtd in gdb and performed several shut-down scenarios (acpi, kill, kill -9); libvirt didn't crash.

libvirt-0.8.7-10.el6.x86_64
qemu-kvm-0.12.1.2-2.149.el6.x86_64
vdsm-4.9-52.el6.x86_64

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0596.html
Created attachment 474278 [details]
libvirt and gdb logs.

Description of problem:
The libvirt daemon crashes while running several VMs using RHEVM (virDomainDefFree at conf/domain_conf.c:793).

Version-Release number of selected component (if applicable):
- libvirt-0.8.7-2.el6
- vdsm-cli-4.9-43
- qemu-kvm-0.12.1.2-2.129.el6
- RHEL 6.1 based host

Steps to Reproduce:
On RHEVM:
1. Create 20+ VMs on a single host.
2. Start and stop the VMs (a rough standalone approximation is sketched below).

Additional info:
libvirt and gdb logs attached.
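For reference, the steps above can be approximated outside RHEVM with a small standalone client against the public libvirt C API. This is a sketch under assumptions: the per-VM XML files (vm00.xml, ...) and the disk name "hda" are hypothetical, and error handling is minimal:

```c
#include <libvirt/libvirt.h>
#include <stdio.h>
#include <stdlib.h>

#define NVMS 20

/* Read a whole (small) XML file into a heap buffer. */
static char *read_file(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return NULL;
    char *buf = malloc(65536);
    size_t n = fread(buf, 1, 65535, f);
    buf[n] = '\0';
    fclose(f);
    return buf;
}

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect to libvirtd\n");
        return 1;
    }

    for (int iter = 0; iter < 100; iter++) {
        virDomainPtr doms[NVMS] = { 0 };

        /* Start NVMS transient domains from assumed XML files. */
        for (int i = 0; i < NVMS; i++) {
            char path[64];
            snprintf(path, sizeof(path), "vm%02d.xml", i);
            char *xml = read_file(path);
            if (xml) {
                doms[i] = virDomainCreateXML(conn, xml, 0);
                free(xml);
            }
        }

        /* Poll block stats -- the call in the crashing stack frames. */
        for (int i = 0; i < NVMS; i++) {
            virDomainBlockStatsStruct st;
            if (doms[i])
                virDomainBlockStats(doms[i], "hda", &st, sizeof(st));
        }

        /* "Stop the vms" -- vdsm may use shutdown rather than destroy. */
        for (int i = 0; i < NVMS; i++) {
            if (doms[i]) {
                virDomainDestroy(doms[i]);
                virDomainFree(doms[i]);
            }
        }
    }

    virConnectClose(conn);
    return 0;
}
```

Looping this (compiled with `gcc repro.c -lvirt`) against a single host approximates the start/stop-under-polling load described above.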