Bug 688774
Summary: | libvirtd can hang or crash from reference counting bugs when qemu process dies | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Eric Blake <eblake> |
Component: | libvirt | Assignee: | Eric Blake <eblake> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 6.1 | CC: | dallan, eblake, mjenner, weizhan, yoyzhang |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | libvirt-0.8.7-16.el6 | Doc Type: | Bug Fix |
Doc Text: |
libvirt was not careful about object locking rules when managing KVM guests, which resulted in a number of unexpected actions. If a guest shut down without notice, libvirt could crash or loop indefinitely. Locking code in libvirt has been improved to avoid accessing data outside locks, and to avoid deadlocks when multiple threads are interacting with the same domain, so libvirt no longer hangs or crashes when a guest shuts down.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2011-05-19 13:29:14 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Eric Blake
2011-03-17 23:44:45 UTC
Another demonstration of reference bugs causing libvirtd confusion:

```
commit d5df67be3c5d7a70bd2018fa5267733f23b1ae5d
Author: Wen Congyang <wency.com>
Date:   Wed Mar 16 17:01:23 2011 +0800

    do not unref obj in qemuDomainObjExitMonitor*
```

Steps to reproduce this bug:

```
# cat test.sh
#! /bin/bash -x
virsh start domain
sleep 5
virsh qemu-monitor-command domain 'cpu_set 2 online' --hmp
# while true; do ./test.sh ; done
```

Then libvirtd will crash. The reason: we add a reference to obj when we open the monitor, and we drop that reference when we free the monitor. When the monitor's reference count reaches 0, the monitor is freed automatically and the reference on obj is dropped. But in qemuDomainObjExitMonitorWithDriver(), we drop this reference *again* when the monitor's reference count is 0, which causes obj to be freed in qemuDomainObjEndJob(). When we then start the domain again, libvirtd crashes in virDomainObjListSearchName(), because we pass a null pointer (obj->def->name) to strcmp().

Both test cases from Wen assumed that qemu's 'cpu_set 2 online' command crashes qemu; depending on which qemu binary you are testing with, this might not be the case. However, it is possible to reproduce the testing scenarios by using gdb on libvirtd to stop execution at the point of the qemu monitor command, then manually `kill -9` the qemu process, then resume libvirtd, with the same effect (basically, Wen's formula depends on qemu dying at the right moment, not on how it dies).

Additionally, this upstream patch is worth including as long as we are fixing code related to ref-count and locking bugs:

```
commit 496084175a78b02312129e0398ec14c5927d75ba
Author: Eric Blake <eblake>
Date:   Mon Mar 14 20:20:53 2011 -0600

    qemu: respect locking rules

    THREADS.txt states that the contents of vm should not be read or
    modified while the vm lock is not held, but that the lock must not
    be held while performing a monitor command.  This fixes all the
    offenders that I could find.
```
```
    * src/qemu/qemu_process.c (qemuProcessStartCPUs)
    (qemuProcessInitPasswords, qemuProcessStart): Don't modify or
    refer to vm state outside lock.
    * src/qemu/qemu_driver.c (qemudDomainHotplugVcpus): Likewise.
    * src/qemu/qemu_hotplug.c (qemuDomainChangeGraphicsPasswords):
    Likewise.
```

(In reply to comment #2)
> Both test cases from Wen assumed that the 'cpu_set 2 online' command of qemu
> crashes; depending on which qemu binary you are testing with, this might not
> be the case. However, it is possible to reproduce the testing scenarios by
> using gdb in libvirtd to stop execution at the point of the qemu monitor
> command, then manually kill -9 the qemu process, then resume libvirt, with
> the same effect (basically, Wen's formula depends on qemu dying at the right
> moment, not how it dies).

I did the following steps according to the above comments to reproduce the bug on libvirt-0.8.7-13.el6.x86_64:

```
1. start a guest
2. # gdb /usr/sbin/libvirtd `ps aux | grep [l]ibvirtd | awk '{print $2}'`
3. on gdb:
   (gdb) break qemuMonitorJSONArbitraryCommand
   (gdb) r
4. open another terminal and do:
   # virsh qemu-monitor-command win2008_i386 'cpu_set 2 online' --hmp
5. open another terminal and do:
   # kill -9 `pidof qemu-kvm`
6. on the gdb terminal, do:
   (gdb) c
```

Then I see that libvirtd is still running and there is no crash, but when I stop libvirtd it shows:

```
# service libvirtd stop
Stopping libvirtd daemon:                                  [FAILED]
# service libvirtd status
libvirtd dead but subsys locked
```

Is that the result we expect for this bug? And on libvirt-0.8.7-14.el6.x86_64, the result is the same for the same test steps.

(In reply to comment #6)
> I do the following steps according to above comments to reproduce the bug on
> libvirt-0.8.7-13.el6.x86_64
> 1. start a guest
> 2. gdb /usr/sbin/libvirtd `ps aux |grep [l]ibvirtd |awk '{print $2}'`
> 3. on gdb
> (gdb) break qemuMonitorJSONArbitraryCommand

Hmm, I wonder if the breakpoint is accurate.
This breaks _before_ libvirt has sent a command over the monitor and waited for the response, whereas in Wen's case it was after the command was sent that qemu crashed.

> (gdb) r
> 4. open another terminal do
> # virsh qemu-monitor-command win2008_i386 'cpu_set 2 online' --hmp
> 5. open another terminal do
> # kill -9 `pidof qemu-kvm`
> 6. on gdb terminal, do
> (gdb) c
>
> Then see the libvirtd status is still running and no crash,

Hmm, that result is good for -14, but not so good for -13, where you are trying to reproduce the bug; the problem in step 3 might explain it, though.

> but when I stop libvirtd, it shows
> # service libvirtd stop
> Stopping libvirtd daemon: [FAILED]
> # service libvirtd status
> libvirtd dead but subsys locked

Did you detach gdb from libvirtd first? I was able to reproduce your scenario by forgetting to detach gdb, which means the libvirtd process is still running as a slave to gdb, and service can't clean it up until gdb quits. To avoid the confusing messages from service, issue '(gdb) q' and answer yes that you are okay detaching from the inferior process, prior to stopping the service.

> Is that the result we expect for this bug?

But since that didn't reproduce the hang, you need to modify the process a bit. I'm still trying to reproduce the failure on -13 and success on -14 to give you a more reliable formula.

Moving back to ASSIGNED - more upstream patches have been posted, which means the problem has not been completely resolved. I should have a more reproducible test case before moving this back to POST.

https://www.redhat.com/archives/libvir-list/2011-March/msg01408.html

```
commit 025e19981008662dc230562b5e9d8faa86027384
Author: Hu Tao <hutao.com>
Date:   Wed Mar 30 10:34:16 2011 +0800

    qemu: unlock qemu driver before return from domain save

    qemuDriverUnlock() wasn't called on 2 exit paths the driver before
    exiting on error
```

Hi Eric,
Thanks for your detailed reply. I also set a breakpoint at qemu/qemu_monitor.c:729 after the first break on qemuMonitorJSONArbitraryCommand, and the result is the same as before. It reports:

```
error: cannot send monitor command '{"execute":"human-monitor-command","arguments":{"command-line":"cpu_set 2 online"}}': Input/output error
```
But libvirtd is still running. I still cannot reproduce it on libvirt-0.8.7-13.
> Did you detach gdb from libvirt first? I was able to reproduce your scenario
> by forgetting to detach gdb, which means the libvirt process is still running
> as a slave to gdb, and service can't clean it up until gdb quits. To avoid the
> confusing messages from service, issue '(gdb) q' and say yes that you are okay
> detaching from the inferior process, prior to stopping the service.
Yes, you are right. After I detach gdb, libvirtd stops successfully with no "libvirtd dead but subsys locked" problem, on both -13 and -14.
This may also be related to bug 680730.

Wen pointed out another problem for which there is no upstream patch yet: https://www.redhat.com/archives/libvir-list/2011-March/msg01492.html. I'm not sure how soon that will be resolved, or if it should be spawned into a separate BZ so that we can make progress on this one with the following commits:

```
commit cc2424fc65a6b4b307f0d7a314d595cd3f15589e
Author: Wen Congyang <wency.com>
Date:   Wed Mar 30 09:43:25 2011 +0800

    do not send monitor command after monitor meet error

    If the monitor met a error, and we will call
    qemuProcessHandleMonitorEOF(). But we may try to send monitor
    command after qemuProcessHandleMonitorEOF() returned. Then libvirtd
    will be blocked in qemuMonitorSend().

    Steps to reproduce this bug:
    1. use gdb to attach libvirtd, and set a breakpoint in the function
       qemuConnectMonitor()
    2. start a vm
    3. let the libvirtd to run until qemuMonitorOpen() returns.
    4. kill the qemu process
    5. continue running libvirtd

commit 025e19981008662dc230562b5e9d8faa86027384
Author: Hu Tao <hutao.com>
Date:   Wed Mar 30 10:34:16 2011 +0800

    qemu: unlock qemu driver before return from domain save

    qemuDriverUnlock() wasn't called on 2 exit paths the driver before
    exiting on error
```

I was able to reproduce virsh hanging with the steps from commit cc2424fc using libvirt-0.8.7-15.el6, using three shells:

```
1# gdb --pid `pidof libvirtd`
   (gdb) b qemuMonitorOpen
   (gdb) c
2# virsh start dom
1# (gdb) b qemuMonitorOpen
   (gdb) c
   (gdb) fin
3# kill `pidof qemu-kvm`
1# (gdb) c
```

Terminal 2 is stuck - the virsh command never completes because libvirtd has deadlocked on a lock.
Ctrl-c in terminal 1, followed by 't a a bt', shows that libvirtd is indeed stuck in one of the threads:

```
Thread 4 (Thread 0x7f53d35fe700 (LWP 19074)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x000000312d640c66 in virCondWait (c=<value optimized out>,
    m=<value optimized out>) at util/threads-pthread.c:108
#2  0x000000000046d226 in qemuMonitorSend (mon=0x7f53c0001820,
    msg=<value optimized out>) at qemu/qemu_monitor.c:730
#3  0x000000000047655d in qemuMonitorJSONCommandWithFd (mon=0x7f53c0001820,
    cmd=<value optimized out>, scm_fd=-1, reply=0x7f53d35fb1f8)
    at qemu/qemu_monitor_json.c:220
...
```

After applying the patch, the same scenario no longer deadlocks; terminal 2 fails with this (expected, since qemu died too soon to attach to it):

```
error: Failed to start domain dom
error: operation failed: failed to retrieve chardev info in qemu with 'info chardev'
```

In POST: http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-March/msg00942.html

Verify PASS on:
qemu-kvm-0.12.1.2-2.153.el6.x86_64
libvirt-0.8.7-16.el6.x86_64
kernel-2.6.32-125.el6.x86_64

Steps:

```
1. on console 1
   1# gdb /usr/sbin/libvirtd `pidof libvirtd`
   (gdb) b qemuMonitorOpen
   (gdb) c
2. on console 2
   2# virsh start domain
3. on console 1
   1# (gdb) fin
4. on console 3
   3# kill `pidof qemu-kvm`
5. on console 1
   1# (gdb) c
```

Then check the console 2 result. I have already run this several times and no hang occurs.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Cause: libvirt was not careful with respect to object locking rules when managing KVM guests.
Consequence: Various unexpected actions, such as a guest shutting down at the wrong time, could result in libvirt crashing or going into an infinite loop.
Fix: Locking code in libvirt has been improved to avoid accessing data outside of locks and to avoid deadlocks when multiple threads are interacting with the same domain.
Result: libvirt should no longer hang or crash when a guest shuts down.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:

```
@@ -1,8 +1 @@
-Cause:
-    Libvirt was not careful with respect to object locking rules when managing kvm guests
-Consequence:
-    Various unexpected actions, such as a guest shutting down at the wrong time, could result in libvirt crashing or going into an infinite loop
-Fix:
-    Locking code in libvirt has been improved to avoid accessing data outside of locks and to avoid deadlocks when multiple threads are interacting with the same domain
-Result:
-    Libvirt should no longer hang or crash when a guest shuts down
+libvirt was not careful about object locking rules when managing KVM guests, which resulted in a number of unexpected actions. If a guest shut down without notice, libvirt could crash or loop indefinitely. Locking code in libvirt has been improved to avoid accessing data outside locks, and to avoid deadlocks when multiple threads are interacting with the same domain, so libvirt no longer hangs or crashes when a guest shuts down.
```

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2011-0596.html