Bug 688774 - libvirtd can hang or crash from reference counting bugs when qemu process dies
Summary: libvirtd can hang or crash from reference counting bugs when qemu process dies
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Eric Blake
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-17 23:44 UTC by Eric Blake
Modified: 2015-09-28 02:02 UTC (History)
5 users

Fixed In Version: libvirt-0.8.7-16.el6
Doc Type: Bug Fix
Doc Text:
libvirt was not careful about object locking rules when managing KVM guests, which resulted in a number of unexpected actions. If a guest shut down without notice, libvirt could crash or loop indefinitely. Locking code in libvirt has been improved to avoid accessing data outside locks, and to avoid deadlocks when multiple threads are interacting with the same domain, so libvirt no longer hangs or crashes when a guest shuts down.
Clone Of:
Environment:
Last Closed: 2011-05-19 13:29:14 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0596 0 normal SHIPPED_LIVE libvirt bug fix and enhancement update 2011-05-18 17:56:36 UTC

Description Eric Blake 2011-03-17 23:44:45 UTC
Description of problem:
Steps to reproduce this bug:
# virsh qemu-monitor-command domain 'cpu_set 2 online' --hmp
The domain has 2 CPUs, and we try to set the third CPU online.
qemu crashes, and this command hangs.

The reason is that the reference count is not 1 when we unwatch
the monitor. We lock the monitor but never unlock it, so virCondWait()
blocks forever.


Version-Release number of selected component (if applicable):
libvirt-0.8.7-13.el6

How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:
hang

Expected results:
no hang

Additional info:
See upstream post and patch:
https://www.redhat.com/archives/libvir-list/2011-March/msg00740.html

Comment 1 Eric Blake 2011-03-18 18:29:07 UTC
Another demonstration of reference bugs causing libvirtd confusion:

commit d5df67be3c5d7a70bd2018fa5267733f23b1ae5d
Author: Wen Congyang <wency.com>
Date:   Wed Mar 16 17:01:23 2011 +0800

    do not unref obj in qemuDomainObjExitMonitor*
    
    Steps to reproduce this bug:
    # cat test.sh
      #! /bin/bash -x
      virsh start domain
      sleep 5
      virsh qemu-monitor-command domain 'cpu_set 2 online' --hmp
    # while true; do ./test.sh ; done
    
    Then libvirtd will crash.
    
    The reason is that:
    we add a reference to obj when we open the monitor, and we drop that
    reference when we free the monitor.
    
    If the reference count of the monitor reaches 0, we free the monitor
    automatically and drop the reference to obj.
    
    But in the function qemuDomainObjExitMonitorWithDriver(), we drop this
    reference again when the reference count of the monitor is 0.
    
    This causes obj to be freed in the function qemuDomainObjEndJob().
    
    Then we start the domain again, and libvirtd crashes in the function
    virDomainObjListSearchName(), because we pass a null pointer
    (obj->def->name) to strcmp().

Comment 2 Eric Blake 2011-03-18 18:32:44 UTC
Both of Wen's test cases assumed that qemu crashes on the 'cpu_set 2 online' command; depending on which qemu binary you are testing with, this might not be the case.  However, it is possible to reproduce the test scenarios by using gdb on libvirtd to stop execution at the point of the qemu monitor command, then manually kill -9 the qemu process, then resume libvirtd, with the same effect (basically, Wen's formula depends on qemu dying at the right moment, not on how it dies).

Comment 4 Eric Blake 2011-03-18 22:26:20 UTC
Additionally, this upstream patch is worth including as long as we are fixing code related to ref-count and locking bugs:

commit 496084175a78b02312129e0398ec14c5927d75ba
Author: Eric Blake <eblake>
Date:   Mon Mar 14 20:20:53 2011 -0600

    qemu: respect locking rules
    
    THREADS.txt states that the contents of vm should not be read or
    modified while the vm lock is not held, but that the lock must not
    be held while performing a monitor command.  This fixes all the
    offenders that I could find.
    
    * src/qemu/qemu_process.c (qemuProcessStartCPUs)
    (qemuProcessInitPasswords, qemuProcessStart): Don't modify or
    refer to vm state outside lock.
    * src/qemu/qemu_driver.c (qemudDomainHotplugVcpus): Likewise.
    * src/qemu/qemu_hotplug.c (qemuDomainChangeGraphicsPasswords):
    Likewise.

Comment 6 weizhang 2011-03-23 08:15:10 UTC
(In reply to comment #2)
> Both test cases from Wen assumed that the 'cpu_set 2 online' command of qemu
> crashes; depending on which qemu binary you are testing with, this might not be
> the case.  However, it is possible to reproduce the testing scenarios by using
> gdb in libvirtd to stop execution at the point of the qemu monitor command,
> then manually kill -9 the qemu process, then resume libvirt, with the same
> effect (basically, Wen's formula depends on qemu dying at the right moment, not
> how it dies).

I did the following steps, according to the above comments, to reproduce the bug on libvirt-0.8.7-13.el6.x86_64:
1. start a guest
2. gdb /usr/sbin/libvirtd ` ps aux |grep [l]ibvirtd |awk '{print $2}'`
3. on gdb
(gdb)break qemuMonitorJSONArbitraryCommand
(gdb)r
4. open another terminal do
# virsh qemu-monitor-command win2008_i386 'cpu_set 2 online' --hmp
5. open another terminal do
# kill -9 `pidof qemu-kvm`
6. on gdb terminal, do
(gdb)c

Then I see that libvirtd is still running with no crash, but when I stop libvirtd, it shows:
# service libvirtd stop
Stopping libvirtd daemon:                                  [FAILED]
# service libvirtd status
libvirtd dead but subsys locked

Is that the result we expect for this bug?

Comment 7 weizhang 2011-03-23 08:28:23 UTC
And on libvirt-0.8.7-14.el6.x86_64, the result is the same for the same test steps.

Comment 8 Eric Blake 2011-03-29 23:23:21 UTC
(In reply to comment #6)
> I do the following steps according to above comments to reproduce the bug on
> libvirt-0.8.7-13.el6.x86_64
> 1. start a guest
> 2. gdb /usr/sbin/libvirtd ` ps aux |grep [l]ibvirtd |awk '{print $2}'`
> 3. on gdb
> (gdb)break qemuMonitorJSONArbitraryCommand

Hmm, I wonder if the breakpoint is accurate.  This breaks _before_ libvirt has sent a command over the monitor and waited for the response, whereas in Wen's case qemu crashed after the command was sent.

> (gdb)r
> 4. open another terminal do
> # virsh qemu-monitor-command win2008_i386 'cpu_set 2 online' --hmp
> 5. open another terminal do
> # kill -9 `pidof qemu-kvm`
> 6. on gdb terminal, do
> (gdb)c
> 
> Then see the libvirtd status is still running and no crash,

Hmm, that result is good for -14, but not so good for -13, where you are trying to reproduce the bug; the problem in step 3 might explain it, though.

> but when I stop
> libvirtd, it shows 
> # service libvirtd stop
> Stopping libvirtd daemon:                                  [FAILED]
> # virsh-rail]# service libvirtd status
> libvirtd dead but subsys locked

Did you detach gdb from libvirt first?  I was able to reproduce your scenario by forgetting to detach gdb, which means the libvirt process is still running as a slave to gdb, and service can't clean it up until gdb quits.  To avoid the confusing messages from service, issue '(gdb) q' and say yes that you are okay detaching from the inferior process, prior to stopping the service.

> 
> Is that the result we expect for this bug?

But since that didn't reproduce the hang, you need to modify the process a bit.  I'm still trying to reproduce the failure on -13 and success on -14 to give you a more reliable formula.

Comment 9 Eric Blake 2011-03-30 14:24:43 UTC
Moving back to assigned - more upstream patches have been posted which means the problem has not been completely resolved.  I should have a more reproducible test case before moving this back to POST.

https://www.redhat.com/archives/libvir-list/2011-March/msg01408.html

commit 025e19981008662dc230562b5e9d8faa86027384
Author: Hu Tao <hutao.com>
Date:   Wed Mar 30 10:34:16 2011 +0800

    qemu: unlock qemu driver before return from domain save
    
    qemuDriverUnlock() wasn't called on 2 exit paths, leaving the driver
    locked before exiting on error

Comment 10 weizhang 2011-03-31 05:11:31 UTC
Hi Eric,

Thanks for your detailed reply. I also set a breakpoint on qemu/qemu_monitor.c:729 after the first break on qemuMonitorJSONArbitraryCommand, and the result is the same as before. It reports
"error: cannot send monitor command '{"execute":"human-monitor-command","arguments":{"command-line":"cpu_set 2 online"}}': Input/output error"

But libvirtd is still running. I still cannot reproduce it on libvirt-0.8.7-13.

> Did you detach gdb from libvirt first?  I was able to reproduce your scenario
> by forgetting to detach gdb, which means the libvirt process is still running
> as a slave to gdb, and service can't clean it up until gdb quits.  To avoid the
> confusing messages from service, issue '(gdb) q' and say yes that you are okay
> detaching from the inferior process, prior to stopping the service.

Yes, you are right. After I detach gdb, libvirtd stops successfully with no "libvirtd dead but subsys locked" problem both on -13 and -14.

Comment 11 Eric Blake 2011-03-31 11:51:07 UTC
This may also be related to bug 680730

Comment 12 Eric Blake 2011-03-31 18:43:34 UTC
Wen pointed out another problem for which there is no upstream patch yet:

https://www.redhat.com/archives/libvir-list/2011-March/msg01492.html

I'm not sure how soon that will be resolved, or whether it should be spun off into a separate BZ so that this one can make progress with the following commits:

commit cc2424fc65a6b4b307f0d7a314d595cd3f15589e
Author: Wen Congyang <wency.com>
Date:   Wed Mar 30 09:43:25 2011 +0800

    do not send monitor command after monitor meet error
    
    If the monitor meets an error, we will call qemuProcessHandleMonitorEOF().
    But we may try to send a monitor command after qemuProcessHandleMonitorEOF()
    has returned. Then libvirtd will be blocked in qemuMonitorSend().
    
    Steps to reproduce this bug:
    1. use gdb to attach libvirtd, and set a breakpoint in the function
       qemuConnectMonitor()
    2. start a vm
    3. let the libvirtd to run until qemuMonitorOpen() returns.
    4. kill the qemu process
    5. continue running libvirtd

commit 025e19981008662dc230562b5e9d8faa86027384
Author: Hu Tao <hutao.com>
Date:   Wed Mar 30 10:34:16 2011 +0800

    qemu: unlock qemu driver before return from domain save
    
    qemuDriverUnlock() wasn't called on 2 exit paths, leaving the driver
    locked before exiting on error

Comment 13 Eric Blake 2011-03-31 23:01:11 UTC
I was able to reproduce the virsh hang with the steps from commit cc2424fc, using libvirt-0.8.7-15.el6 and three shells:

1# gdb --pid `pidof libvirtd`
(gdb) b qemuMonitorOpen
(gdb) c

2# virsh start dom

1# (gdb) b qemuMonitorOpen
(gdb) c
(gdb) fin

3# kill `pidof qemu-kvm`

1# c

terminal 2 is stuck - the virsh command never completes because libvirtd has deadlocked on a lock.  Ctrl-c in terminal 1, followed by 't a a bt' (thread apply all bt), shows that libvirtd is indeed stuck in one of the threads:
Thread 4 (Thread 0x7f53d35fe700 (LWP 19074)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x000000312d640c66 in virCondWait (c=<value optimized out>, 
    m=<value optimized out>) at util/threads-pthread.c:108
#2  0x000000000046d226 in qemuMonitorSend (mon=0x7f53c0001820, 
    msg=<value optimized out>) at qemu/qemu_monitor.c:730
#3  0x000000000047655d in qemuMonitorJSONCommandWithFd (mon=0x7f53c0001820, 
    cmd=<value optimized out>, scm_fd=-1, reply=0x7f53d35fb1f8)
    at qemu/qemu_monitor_json.c:220
...

Comment 14 Eric Blake 2011-03-31 23:34:26 UTC
After applying the patch, the same scenario no longer deadlocks; terminal 2 fails with this (expected, since qemu died too soon for libvirt to attach to it):

error: Failed to start domain dom
error: operation failed: failed to retrieve chardev info in qemu with 'info chardev'

In POST:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-March/msg00942.html

Comment 17 weizhang 2011-04-07 08:08:26 UTC
Verify PASS on 
qemu-kvm-0.12.1.2-2.153.el6.x86_64
libvirt-0.8.7-16.el6.x86_64
kernel-2.6.32-125.el6.x86_64

steps:
1.on console 1 
1# gdb /usr/sbin/libvirtd `pidof libvirtd`
(gdb) b qemuMonitorOpen
(gdb) c

2.on console 2
2# virsh start domain

3.on console 1
1# (gdb)fin

4.on console 3
3# kill `pidof qemu-kvm`

5.on console 1
1# (gdb)c

6. check the result on console 2

I have run this several times and no hang occurs.

Comment 18 Eric Blake 2011-05-02 21:58:48 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
 Libvirt was not careful with respect to object locking rules when managing kvm guests
Consequence:
 Various unexpected actions, such as a guest shutting down at the wrong time, could result in libvirt crashing or going into an infinite loop
Fix:
 Locking code in libvirt has been improved to avoid accessing data outside of locks and to avoid deadlocks when multiple threads are interacting with the same domain
Result:
 Libvirt should no longer hang or crash when a guest shuts down

Comment 21 Laura Bailey 2011-05-04 02:29:20 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,8 +1 @@
-Cause:
- Libvirt was not careful with respect to object locking rules when managing kvm guests
-Consequence:
- Various unexpected actions, such as a guest shutting down at the wrong time, could result in libvirt crashing or going into an infinite loop
-Fix:
- Locking code in libvirt has been improved to avoid accessing data outside of locks and to avoid deadlocks when multiple threads are interacting with the same domain
-Result:
- Libvirt should no longer hang or crash when a guest shuts down
+libvirt was not careful about object locking rules when managing KVM guests, which resulted in a number of unexpected actions. If a guest shut down without notice, libvirt could crash or loop indefinitely. Locking code in libvirt has been improved to avoid accessing data outside locks, and to avoid deadlocks when multiple threads are interacting with the same domain, so libvirt no longer hangs or crashes when a guest shuts down.

Comment 22 errata-xmlrpc 2011-05-19 13:29:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html

