Description of problem:
Incorrect reference counting and excessive cleanup code can result in hangs
and/or crashes when a qemu domain is closed too many times after the monitor
is no longer available. This may be a cause of bug 670848, although that has
not yet been proven; this bug has been opened to track a definite fix while
investigation continues there.

Version-Release number of selected component (if applicable):
libvirt-0.8.7-4.el6

How reproducible:
The first procedure below produces a deadlock 100% of the time; in real-life
operation it is probably a racy scenario that does not always trigger.

Steps to Reproduce:

From one upstream patch:
1. service libvirtd start
2. virsh start <domain>
3. kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)
4. service libvirtd restart
5. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)

From another:
1. Use gdb to debug libvirtd, and set a breakpoint in the function
   qemuConnectMonitor().
2. Start a VM; libvirtd will stop in qemuConnectMonitor().
3. kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)
4. Continue running libvirtd in gdb; libvirtd will block in the function
   qemuMonitorSetCapabilities().
5. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)

Actual results:
On the first test, libvirtd hangs. On the second test, the log shows the
domain shut down twice:

LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin ...
char device redirected to /dev/pts/3
2011-01-27 09:38:48.101: shutting down
2011-01-27 09:41:26.401: shutting down

Expected results:
On the first test, libvirtd should never hang; on the second, a domain
should only be shut down once.

Additional info:
As of writing this bug, two of the three patches have been accepted
upstream:

commit d96431f9104a3a7fd12865b941a78b4cf7c6ec09
Author: Wen Congyang <wency.com>
Date:   Tue Jan 25 14:43:43 2011 +0800

    avoid vm to be deleted if qemuConnectMonitor failed
    ...
    We should add an extra reference of vm to avoid vm to be deleted if
    qemuConnectMonitor() failed.

commit e85247e7c3a9ee2697b49ca5bbcabd3d2d493f95
Author: Daniel P. Berrange <berrange>
Date:   Thu Jan 27 18:28:15 2011 +0000

    When qemuMonitorSetCapabilities() fails, there is no need to call
    qemuMonitorClose(), because the caller will already see the error code
    and tear down the entire VM. The extra call to qemuMonitorClose
    resulted in a double-free due to it removing a ref count prematurely.

    * src/qemu/qemu_driver.c: Remove premature close of monitor

A third patch is pending upstream review (a sketch of the pattern in the
first patch follows below):
https://www.redhat.com/archives/libvir-list/2011-January/msg01106.html

    The vm is shut down twice. I do not know whether this behavior has
    side effect, but I think we should shutdown the vm only once.

    Signed-off-by: Wen Congyang <wency cn fujitsu com>
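For reference, the core of the first accepted patch is a reference-counting
pattern: take an extra reference on the vm object before connecting the
monitor, so that an error path inside the call that drops a reference cannot
free the object out from under the caller. The following is a minimal,
self-contained C sketch of that pattern, not libvirt's actual code:
domain_obj, domain_ref(), domain_unref(), and connect_monitor() are
hypothetical stand-ins modeling virDomainObj, virDomainObjRef(),
virDomainObjUnref(), and qemuConnectMonitor().

/* Minimal sketch of the extra-reference pattern; all names here are
 * simplified stand-ins for the real libvirt types and helpers. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int refs;
    char *name;
} domain_obj;

static void domain_ref(domain_obj *vm)
{
    vm->refs++;
}

/* Drop one reference; returns the remaining count (0 means freed). */
static int domain_unref(domain_obj *vm)
{
    if (--vm->refs > 0)
        return vm->refs;
    free(vm->name);
    free(vm);
    return 0;
}

/* Stand-in for qemuConnectMonitor(): its failure path drops one
 * reference on 'vm' while tearing down the half-set-up monitor. */
static int connect_monitor(domain_obj *vm, int simulate_failure)
{
    if (simulate_failure) {
        domain_unref(vm);   /* cleanup path drops a reference */
        return -1;
    }
    return 0;
}

int main(void)
{
    domain_obj *vm = calloc(1, sizeof(*vm));
    if (!vm)
        return 1;
    vm->refs = 1;

    /* The fix: hold an extra reference across the call so the failure
     * path cannot take the refcount to zero and free 'vm' mid-call. */
    domain_ref(vm);
    int ret = connect_monitor(vm, 1 /* simulate failure */);
    if (domain_unref(vm) == 0)
        vm = NULL;          /* all references gone; object was freed */

    if (ret < 0)
        fprintf(stderr, "monitor connection failed (vm %s)\n",
                vm ? "still referenced" : "fully released");
    else if (vm)
        domain_unref(vm);   /* normal teardown drops the original ref */
    return 0;
}

Without the domain_ref()/domain_unref() pair around the call, the failure
branch would free the object while the caller still holds a pointer to it,
which is exactly the use-after-free/double-free territory seen in this bug.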
Patches posted upstream: http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-January/msg01517.html
Not just deadlock, but also crash. Using these steps from Wen Congyang:

1. Use gdb to debug libvirtd, and set a breakpoint in the function
   qemuConnectMonitor().
2. Start a VM; libvirtd will stop in qemuConnectMonitor().
3. kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)
4. Continue running libvirtd in gdb; libvirtd will block in the function
   qemuMonitorSetCapabilities().
5. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)
6. Continue running libvirtd in gdb.

I saw libvirtd crash:

11:12:44.882: 17952: error : qemuRemoveCgroup:335 : internal error Unable to find cgroup for windows_2008-32
11:12:44.882: 17952: warning : qemudShutdownVMDaemon:3109 : Failed to remove cgroup for windows_2008-32

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff0aaf700 (LWP 17950)]
0x0000003021675705 in malloc_consolidate () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003021675705 in malloc_consolidate () from /lib64/libc.so.6
#1  0x0000003021677f38 in _int_free () from /lib64/libc.so.6
#2  0x00007ffff79e2d73 in virFree (ptrptr=0x7ffff0aae7a0) at util/memory.c:311
#3  0x000000000041dc75 in qemudClientMessageRelease (client=0x7fffec0012f0, msg=0x7fffe0014e10) at libvirtd.c:2065
#4  0x000000000041dd16 in qemudDispatchClientWrite (client=0x7fffec0012f0) at libvirtd.c:2095
#5  0x000000000041dfbe in qemudDispatchClientEvent (watch=8, fd=18, events=2, opaque=0x6fadb0) at libvirtd.c:2165
#6  0x00000000004189ee in virEventDispatchHandles (nfds=7, fds=0x7fffec0011b0) at event.c:467
#7  0x0000000000419082 in virEventRunOnce () at event.c:599
#8  0x000000000041e1c1 in qemudOneLoop () at libvirtd.c:2265

The third upstream patch has been reposted, and needs to be ACK'd and
resubmitted to RHEL (a sketch of its approach follows below):
https://www.redhat.com/archives/libvir-list/2011-February/msg00074.html
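The third patch addresses the "shut down twice" half of the problem by
making the shutdown path idempotent: teardown bails out early if the domain
is no longer marked active, so a second caller cannot rerun the cleanup.
A minimal, self-contained C sketch of that guard follows; shutdown_vm() and
the 'active' flag are hypothetical stand-ins, modeling the kind of
virDomainObjIsActive()-style check used at the top of libvirt's shutdown
path rather than the actual patch.

/* Minimal sketch of a "shut down only once" guard; names are stand-ins. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    bool active;
    const char *name;
} domain;

static void shutdown_vm(domain *vm)
{
    /* Guard: if another caller already tore the domain down, do nothing.
     * Without this, both the monitor-EOF handler and the error path in
     * the startup code can run the cleanup, logging "shutting down"
     * twice and freeing per-domain state twice. */
    if (!vm->active) {
        fprintf(stderr, "VM '%s' not active, skipping shutdown\n", vm->name);
        return;
    }
    vm->active = false;
    printf("%s: shutting down\n", vm->name);
    /* ... release monitor, cgroup, and other per-domain resources ... */
}

int main(void)
{
    domain vm = { .active = true, .name = "demo" };
    shutdown_vm(&vm);   /* performs the real teardown */
    shutdown_vm(&vm);   /* racing second call is now a harmless no-op */
    return 0;
}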
In POST; two patches per comment 1 and one more patch at: http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-February/msg00372.html
Back in POST, since 0.8.7-5.el6 is incomplete: http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-February/msg00963.html
Checked with:
libvirt-0.8.7-4.el6.x86_64.rpm --- reproducer
libvirt-0.8.7-7.el6.x86_64.rpm --- verification

From one terminal:
1. virsh start <domain>
2. kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)
3. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)

From the other terminal:
1. gdb libvirtd
   (gdb) b qemuConnectMonitor
   Breakpoint 1 at 0x434ff0: file qemu/qemu_driver.c, line 1246.
   (gdb) r
   Start a VM; libvirtd will stop in qemuConnectMonitor().
2. kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)
3. Continue running libvirtd in gdb; libvirtd will block in the function
   qemuMonitorSetCapabilities().
   (gdb) c
4. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)

[reproducer]
tail -F /var/log/libvirt/qemu/domain.log
eges of VM to 107:107
char device redirected to /dev/pts/2
char device redirected to /dev/pts/4
Using CPU model "cpu64-rhel6"
2011-02-21 06:41:11.178: shutting down
2011-02-21 06:41:11.318: shutting down

NB: although the guest could be seen shutting down twice, I did not
encounter the libvirtd crash.

[verification]
red_worker_main: begin
handle_dev_input: start
2011-02-21 06:49:36.566: shutting down

On the first test, libvirtd never hung and never crashed; on the second,
the domain was only shut down once.

Please check whether the steps are correct. If so, I will set the bug
status to VERIFIED; if not, I will retest to meet the exact request.
(In reply to comment #7)
> Checked with:
> libvirt-0.8.7-4.el6.x86_64.rpm --- reproducer
> libvirt-0.8.7-7.el6.x86_64.rpm --- verification
>
> 3. Continue running libvirtd in gdb; libvirtd will block in the function
>    qemuMonitorSetCapabilities().
>    (gdb) c
> 4. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)

I believe the reason I saw libvirtd crash in my testing after killing the
qemu pid is that I also had virt-manager running at the same time, so there
were multi-threaded interactions competing for status about the domain.
Also, the crash depends on race conditions, so I'm not sure how reliably it
can be recreated. But even if you don't see the crash:

> [reproducer]
> tail -F /var/log/libvirt/qemu/domain.log
> eges of VM to 107:107
> char device redirected to /dev/pts/2
> char device redirected to /dev/pts/4
> Using CPU model "cpu64-rhel6"
> 2011-02-21 06:41:11.178: shutting down
> 2011-02-21 06:41:11.318: shutting down

this is a valid detection of the bug,

> [verification]
> red_worker_main: begin
> handle_dev_input: start
> 2011-02-21 06:49:36.566: shutting down

and this is a valid verification that the bug was fixed.

> Please check whether the steps are correct. If so, I will set the bug
> status to VERIFIED; if not, I will retest to meet the exact request.

I'm satisfied with moving the bug to VERIFIED even if you can't reproduce
the crash, although it would be nicer to have a crash reproducer as well.
(I tried again today but failed to reproduce it; then again, when I first
reproduced the crash it was while using upstream libvirt.git, and there may
be other interactions upstream, not present in any RHEL build, that affect
the likelihood of a crash.)
According to comment 7 and comment 8, setting the bug status to VERIFIED.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0596.html