Bug 673588

Summary:	libvirt can deadlock from double-closing qemu monitors
Product:	Red Hat Enterprise Linux 6	Reporter:	Eric Blake <eblake>
Component:	libvirt	Assignee:	Eric Blake <eblake>
Status:	CLOSED ERRATA	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	6.1	CC:	dallan, dyuan, eblake, jdenemar, kxiong, mjenner, vbian, xen-maint
Target Milestone:	rc
Target Release:	6.1
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	libvirt-0.8.7-7.el6	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-05-19 13:26:36 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Eric Blake 2011-01-28 20:23:41 UTC

Description of problem:
Incorrect reference counting and excessive cleanup code can result in hangs and/or crashes from closing a qemu domain too many times when the monitor is no longer available.  This may be a cause of bug 670848, although that has not yet been proven, so this bug has been opened to track a definite bug fix while investigation continues on that bug.
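
The failure mode can be illustrated with a minimal sketch in C. This is not libvirt code: DemoObj, demoRef() and demoUnref() are hypothetical names, and the sketch records the free in a flag instead of calling free() so that the unbalanced second unref can be detected rather than corrupting the heap.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reference-counted object; not a libvirt structure. */
typedef struct {
    int refs;     /* number of live references */
    bool freed;   /* the sketch records the free instead of calling free() */
} DemoObj;

/* Take an additional reference. */
void demoRef(DemoObj *obj) { obj->refs++; }

/* Drop one reference.
 * Returns 0 for a balanced unref, or -1 if the object had already been
 * released, which corresponds to the "closed too many times" case. */
int demoUnref(DemoObj *obj)
{
    if (obj->freed)
        return -1;          /* real code would be touching freed memory here */
    if (--obj->refs == 0)
        obj->freed = true;  /* last reference gone: object is released */
    return 0;
}
```

With a single reference held, the first demoUnref() releases the object and a second call reports the imbalance. In real code that second call would operate on freed memory, which is consistent with both the hang and the crash described here.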

Version-Release number of selected component (if applicable):
libvirt-0.8.7-4.el6

How reproducible:
The steps below produce a deadlock 100% of the time; in real-life operation the bug is probably a race that does not always trigger.

Steps to Reproduce:
from one upstream patch:
    1. service libvirtd start
    2. virsh start <domain>
    3. kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)
    4. service libvirtd restart
    5. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)

from another:
    1. use gdb to debug libvirtd, and set breakpoint in the function
       qemuConnectMonitor()
    2. start a vm, and the libvirtd will be stopped in qemuConnectMonitor()
    3. kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)
    4. continue to run libvirtd in gdb, and libvirtd will be blocked in the
       function qemuMonitorSetCapabilities()
    5. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)
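
Both recipes rely on two signal behaviours: SIGSTOP freezes the qemu process (so, presumably, it stops answering the monitor), and SIGKILL still terminates a process that is stopped. The sketch below demonstrates only those two behaviours on a throwaway child process. It assumes a POSIX system, does not involve qemu or libvirt, and demoStopThenKill() is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, stop it with SIGSTOP, then kill it with SIGKILL.
 * Returns 0 if the child was observed stopped and then died from
 * SIGKILL, -1 otherwise. */
int demoStopThenKill(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();            /* child: idle until a signal arrives */
    }

    /* Equivalent of "kill -STOP <pid>": the child stops running. */
    if (kill(pid, SIGSTOP) < 0 ||
        waitpid(pid, &status, WUNTRACED) != pid ||
        !WIFSTOPPED(status)) {
        kill(pid, SIGKILL);     /* best-effort cleanup on failure */
        waitpid(pid, NULL, 0);
        return -1;
    }

    /* Equivalent of "kill -9 <pid>": delivered even though it is stopped. */
    if (kill(pid, SIGKILL) < 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) ? 0 : -1;
}
```

In the recipes, the window between the two signals is where libvirtd is left waiting on a monitor that will never reply.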
 
  
Actual results:
on the first test, libvirtd hangs
on the second test, the log shows:
    LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin ...
    char device redirected to /dev/pts/3
    2011-01-27 09:38:48.101: shutting down
    2011-01-27 09:41:26.401: shutting down

Expected results:
on the first test, libvirtd should never hang
on the second, a domain should only be shut down once
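
The second expectation can be checked mechanically by counting "shutting down" lines in the domain log. A minimal sketch in C, assuming the log contents are already in memory as a string; demoCountOccurrences() is a hypothetical helper, not part of libvirt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of needle in text.
 * An empty needle is treated as "no matches". */
size_t demoCountOccurrences(const char *text, const char *needle)
{
    size_t count = 0;
    size_t len = strlen(needle);
    const char *p;

    if (len == 0)
        return 0;
    for (p = strstr(text, needle); p != NULL; p = strstr(p + len, needle))
        count++;
    return count;
}
```

Called with the needle ": shutting down", a count greater than one for a single domain start indicates the double shutdown.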

Additional info:
as of writing this bug, two out of three patches have been accepted:

commit d96431f9104a3a7fd12865b941a78b4cf7c6ec09
Author: Wen Congyang <wency.com>
Date:   Tue Jan 25 14:43:43 2011 +0800

    avoid vm to be deleted if qemuConnectMonitor failed
...
    We should add an extra reference of vm to avoid vm to be deleted if
    qemuConnectMonitor() failed.
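
The idea behind this commit can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: DemoVm and the demo*() functions are hypothetical, the failing connect step is assumed to drop one reference on its error path, and the sketch sets a flag instead of calling free() so the state stays inspectable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical domain object; not libvirt's real structure. */
typedef struct {
    int refs;     /* number of live references */
    bool freed;   /* set instead of calling free(), for inspection */
} DemoVm;

void demoVmRef(DemoVm *vm) { vm->refs++; }

void demoVmUnref(DemoVm *vm)
{
    if (--vm->refs == 0)
        vm->freed = true;   /* a real implementation would free() here */
}

/* Assumed connect step: on failure, cleanup drops one reference. */
int demoConnectMonitor(DemoVm *vm, bool fail)
{
    if (fail) {
        demoVmUnref(vm);
        return -1;
    }
    return 0;
}

/* Caller pattern: hold an extra reference across the call so the
 * object cannot disappear while the caller still uses it.
 * *refs_after_connect reports the count seen right after the call. */
int demoStartVm(DemoVm *vm, bool fail, int *refs_after_connect)
{
    int rc;

    demoVmRef(vm);                      /* extra reference */
    rc = demoConnectMonitor(vm, fail);
    *refs_after_connect = vm->refs;     /* still valid, even on failure */
    demoVmUnref(vm);                    /* release the extra reference */
    return rc;
}
```

Without the extra reference, the count would already be zero when the failing call returns, and the caller would continue to use a deleted object.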

commit e85247e7c3a9ee2697b49ca5bbcabd3d2d493f95
Author: Daniel P. Berrange <berrange>
Date:   Thu Jan 27 18:28:15 2011 +0000

    
    When qemuMonitorSetCapabilities() fails, there is no need to
    call qemuMonitorClose(), because the caller will already see
    the error code and tear down the entire VM. The extra call to
    qemuMonitorClose resulted in a double-free due to it removing
    a ref count prematurely.
    
    * src/qemu/qemu_driver.c: Remove premature close of monitor
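
The ownership rule described in this commit message can be sketched as follows: the failing step only reports the error, and the caller performs the single teardown. DemoMon and the demo*() functions are hypothetical stand-ins, and the sketch counts close calls instead of releasing anything.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical monitor that counts how often it was closed. */
typedef struct {
    int close_calls;
} DemoMon;

void demoMonClose(DemoMon *mon) { mon->close_calls++; }

/* Stand-in for capability negotiation: on failure it returns an
 * error and deliberately does NOT close the monitor itself. */
int demoSetCapabilities(DemoMon *mon, bool fail)
{
    (void)mon;
    return fail ? -1 : 0;
}

/* The caller sees the error code and tears down exactly once. */
int demoConnect(DemoMon *mon, bool fail)
{
    if (demoSetCapabilities(mon, fail) < 0) {
        demoMonClose(mon);   /* the only close on the error path */
        return -1;
    }
    return 0;
}
```

If demoSetCapabilities() also closed the monitor on failure, the error path would close it twice, which is the pattern this commit removes.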

and a third patch is pending upstream review:
https://www.redhat.com/archives/libvir-list/2011-January/msg01106.html
   The vm is shut down twice. I do not know whether this behavior has
   side effect, but I think we should shutdown the vm only once.

   Signed-off-by: Wen Congyang <wency cn fujitsu com>
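
One common way to guarantee a single shutdown is to make the teardown idempotent. The sketch below shows that technique only; it is an assumption for illustration and not necessarily how the pending patch solves the problem. DemoDomain and demoShutdown() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical domain state; not libvirt's real structure. */
typedef struct {
    bool active;          /* true while the domain is running */
    int shutdown_lines;   /* stand-in for "shutting down" log lines */
} DemoDomain;

/* Tear the domain down.
 * Returns true only for the call that actually did the work. */
bool demoShutdown(DemoDomain *dom)
{
    if (!dom->active)
        return false;        /* already shut down: nothing to do */
    dom->active = false;
    dom->shutdown_lines++;   /* real code would log "shutting down" here */
    return true;
}
```

A second call then becomes a no-op, so only one "shutting down" line would be logged per domain start.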

Comment 1 Eric Blake 2011-01-28 20:31:16 UTC

Patches posted upstream:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-January/msg01517.html

Comment 2 Eric Blake 2011-02-02 18:41:46 UTC

Not just deadlock, but also crash:

Using these steps from Wen Congyang:

1. use gdb to debug libvirtd, and set breakpoint in the function
   qemuConnectMonitor()
2. start a vm, and the libvirtd will be stopped in qemuConnectMonitor()
3. kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)
4. continue to run libvirtd in gdb, and libvirtd will be blocked in the
   function qemuMonitorSetCapabilities()
5. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)
6. continue to run libvirtd in gdb

I saw libvirt crash:
11:12:44.882: 17952: error : qemuRemoveCgroup:335 : internal error Unable to
find cgroup for windows_2008-32
11:12:44.882: 17952: warning : qemudShutdownVMDaemon:3109 : Failed to remove
cgroup for windows_2008-32

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff0aaf700 (LWP 17950)]
0x0000003021675705 in malloc_consolidate () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003021675705 in malloc_consolidate () from /lib64/libc.so.6
#1  0x0000003021677f38 in _int_free () from /lib64/libc.so.6
#2  0x00007ffff79e2d73 in virFree (ptrptr=0x7ffff0aae7a0) at util/memory.c:311
#3  0x000000000041dc75 in qemudClientMessageRelease (client=0x7fffec0012f0, 
    msg=0x7fffe0014e10) at libvirtd.c:2065
#4  0x000000000041dd16 in qemudDispatchClientWrite (client=0x7fffec0012f0)
    at libvirtd.c:2095
#5  0x000000000041dfbe in qemudDispatchClientEvent (watch=8, fd=18, events=2, 
    opaque=0x6fadb0) at libvirtd.c:2165
#6  0x00000000004189ee in virEventDispatchHandles (nfds=7, fds=0x7fffec0011b0)
    at event.c:467
#7  0x0000000000419082 in virEventRunOnce () at event.c:599
#8  0x000000000041e1c1 in qemudOneLoop () at libvirtd.c:2265

The third upstream patch has been reposted, and needs to be ACK'd and resubmitted to rhel:
https://www.redhat.com/archives/libvir-list/2011-February/msg00074.html

Comment 3 Eric Blake 2011-02-03 16:14:14 UTC

In POST; two patches per comment 1 and one more patch at:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-February/msg00372.html

Comment 5 Eric Blake 2011-02-14 20:20:21 UTC

Back in POST, since 0.8.7-5.el6 is incomplete:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-February/msg00963.html

Comment 7 Vivian Bian 2011-02-21 07:06:11 UTC

checked with 
libvirt-0.8.7-4.el6.x86_64.rpm  --- reproducer 
libvirt-0.8.7-7.el6.x86_64.rpm  --- verification 

from one terminal:

    1. virsh start <domain>
    2. kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)
    3. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)

from the other terminal:

1. gdb libvirtd
  (gdb) b qemuConnectMonitor 
  Breakpoint 1 at 0x434ff0: file qemu/qemu_driver.c, line 1246.
  (gdb) r
  start a vm, and the libvirtd will be stopped in qemuConnectMonitor()
2. kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)
3. continue to run libvirtd in gdb, and libvirtd will be blocked in the
   function qemuMonitorSetCapabilities()
  (gdb) c
4. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)


[reproducer]
tail -F /var/log/libvirt/qemu/domain.log
eges of VM to 107:107
char device redirected to /dev/pts/2
char device redirected to /dev/pts/4
Using CPU model "cpu64-rhel6"
2011-02-21 06:41:11.178: shutting down
2011-02-21 06:41:11.318: shutting down

NB: although I could see the guest being shut down twice, I did not encounter the libvirt crash.


[verification]
red_worker_main: begin
handle_dev_input: start
2011-02-21 06:49:36.566: shutting down

on the first test, libvirtd never hung and never crashed
on the second, the domain was shut down only once


Please check whether the steps are correct. If so, I will set the bug status to VERIFIED; if not, I will retest to meet the exact request.

Comment 8 Eric Blake 2011-02-21 19:13:59 UTC

(In reply to comment #7)
> checked with 
> libvirt-0.8.7-4.el6.x86_64.rpm  --- reproducer 
> libvirt-0.8.7-7.el6.x86_64.rpm  --- verification 
> 

> 3. continue to run libvirtd in gdb, and libvirtd will be blocked in the
>    function qemuMonitorSetCapabilities()
>   (gdb) c
> 4. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)

I believe the reason I saw libvirt crash in my testing after killing the qemu pid is that I also had virt-manager running at the same time, so that there were multi-threaded interactions competing for status about the domain.  Also, the crash depends on race conditions, so I'm not sure how reliably it can be reproduced.  But even if you don't see the crash:

> [reproducer]
> tail -F /var/log/libvirt/qemu/domain.log
> eges of VM to 107:107
> char device redirected to /dev/pts/2
> char device redirected to /dev/pts/4
> Using CPU model "cpu64-rhel6"
> 2011-02-21 06:41:11.178: shutting down
> 2011-02-21 06:41:11.318: shutting down

This is a valid detection of the bug,

> [verification]
> red_worker_main: begin
> handle_dev_input: start
> 2011-02-21 06:49:36.566: shutting down
> 

and this is a valid verification that the bug was fixed.

> Please check whether the steps are correct , if so , will set the bug status to
> VERIFIED . if not , will retest it to meet the exact request .

I'm satisfied with moving the bug to VERIFIED even if you can't reproduce the crash, although it would be nicer to have a crash reproducer as well.  I tried again today but failed to reproduce it; then again, when I first reproduced the crash I was using upstream libvirt.git, and there may have been other interactions upstream, not present in any RHEL build, that affect the likelihood of a crash.

Comment 9 Vivian Bian 2011-02-22 07:16:47 UTC

according to comment 8 and comment 7, set bug status to VERIFIED

Comment 12 errata-xmlrpc 2011-05-19 13:26:36 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html