Bug 950286 - libvirtd crash on race with auto-destroy guests
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Eric Blake
QA Contact: Virtualization Bugs
Keywords: ZStream
Depends On:
Blocks: 951073
Reported: 2013-04-09 22:17 EDT by Eric Blake
Modified: 2013-11-21 03:58 EST
CC List: 12 users

See Also:
Fixed In Version: libvirt-0.10.2-19.el6
Doc Type: Bug Fix
Doc Text:
Under certain conditions, when a connection was closed, guests set to be automatically destroyed failed to be destroyed and the libvirtd daemon terminated unexpectedly. A series of patches addressing various crash scenarios has been provided and libvirtd no longer crashes while auto-destroying guests.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-21 03:58:37 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Eric Blake 2013-04-09 22:17:31 EDT
Description of problem:
There are a number of upstream patches that solve deadlock and/or crash scenarios due to use-after-free when handling auto-destroy guests.  These need to be backported to RHEL.  While VDSM doesn't directly use auto-destroy guests, the act of migration uses an auto-destroy guest on the destination until the source has gotten far enough along in the migration process; also, virt-sandbox uses auto-destroy.

Version-Release number of selected component (if applicable):
libvirt-0.10.2-18.el6_4.3

How reproducible:
I found these commits by reading git logs.  It is probably difficult to trigger the races; but I could probably add some strategic sleep() statements if we need to prove the existence of at least one of the problems fixed in this series.

Steps to Reproduce:
1. create some autodestroy guests
2. close the connection; guests should be destroyed, but libvirtd should not crash
  
Actual results:
If the race hits, valgrind can report a use-after-free, or libvirtd can even crash.

Expected results:
No crash.

Additional info:
the following upstream patches are worth backporting:

commit 96b893f092b3972bc04be975945830dc974af2b3
Author: Daniel P. Berrange <berrange@redhat.com>
Date:   Thu Feb 28 13:30:49 2013 +0000

    Fix deadlock in QEMU close callback APIs

commit 7ccad0b16d12d7616c7c21b1359f6a55a9677521
Author: Daniel P. Berrange <berrange@redhat.com>
Date:   Thu Feb 28 12:18:48 2013 +0000

    Fix crash in QEMU auto-destroy with transient guests

commit b4a124efc328ac221ff4e8a6fde3a1a0c0202d68
Author: Daniel P. Berrange <berrange@redhat.com>
Date:   Wed Feb 27 16:23:16 2013 +0000

    Fix autodestroy of QEMU guests

commit 4e4c6620e2e2937da03d37720d39368d297f5743
Author: Daniel P. Berrange <berrange@redhat.com>
Date:   Wed Jan 23 17:22:27 2013 +0000

    Avoid use of free'd memory in auto destroy callback

(Note that 568a6cd also touched autodestroy, but was later reverted by 9c4ecb3; we don't necessarily need to backport either of those patches, unless doing so helps avoid merge conflicts in the other patches)
Comment 1 Eric Blake 2013-04-10 10:38:09 EDT
Moving to POST, since all of these patches are upstream and will be picked up by the rebase.
Comment 2 EricLee 2013-04-11 04:10:33 EDT
Hi Eric,

I cannot reproduce this bug using simple migration. My steps:

1. Use Ctrl+C to terminate the migration before it finishes:
# virsh migrate --live mig qemu+ssh://10.66.85.217/system --verbose
Migration: [ 85 %]^Cerror: operation aborted: migration job: canceled by client

2. Neither the source nor the destination libvirtd crashed:
# service libvirtd status
libvirtd (pid  27426) is running...

3. And the guest is destroyed and disappears on the destination.

I saw you said it is a race problem, so is it difficult to reproduce?
Or am I missing something?

Thanks,
EricLee
Comment 4 Eric Blake 2013-04-11 19:11:32 EDT
Backport notes: looks like it is also important to have this one:

commit 3898ba7f2cf067ae5852c40d68460c64fb06c94f
Author: Jiri Denemark <jdenemar@redhat.com>
Date:   Fri Feb 15 13:05:12 2013 +0100

    qemu: Turn closeCallbacks into virObjectLockable
    
    To avoid having to hold the qemu driver lock while iterating through
    close callbacks and calling them. This fixes a real deadlock when a
    domain which is being migrated from another host gets autodestoyed as a
    result of broken connection to the other host.
Comment 5 Eric Blake 2013-04-12 10:14:14 EDT
(In reply to comment #2)
> Hi Eric,
> 
> I cannot reproduce this bug using simple migration. My steps:
> 
> 1. Use Ctrl+C to terminate the migration before it finishes:
> # virsh migrate --live mig qemu+ssh://10.66.85.217/system --verbose
> Migration: [ 85 %]^Cerror: operation aborted: migration job: canceled by
> client

Based on additional feedback on my patches for 6.4:

http://post-office.corp.redhat.com/archives/rhvirt-patches/2013-April/msg00276.html

The only observable bugs in 6.4 are the use-after-free in commit 4e4c6620 (but it is only a read, not a write, so the symptoms are limited to printing garbage in a log or observing a complaint when run under valgrind), and a potential use-after-free of a mutex in commit 7ccad0b (try migrating a transient guest).

The other upstream commits mentioned in comment 0 deal with deadlock that was only present in a small window of upstream libvirt.git when we dropped the big qemu driver lock; given that the deadlock was not present until commit a9e97e0 (upstream 1.0.3), it is not present in 6.4, and the rebase for 6.5 has already patched things.  The fact that you can't reproduce a deadlock on formal builds is good; you'd have to compare a reproducer against a specific build from libvirt.git.


> I saw you said it is a race problem, so is it difficult to reproduce?
> Or am I missing something?

I can still try and come up with a temporary patch that uses strategic sleep() to make the use-after-free of the mutex on a migrated transient guest a bit more obvious.
Comment 6 Eric Blake 2013-04-12 17:50:17 EDT
Since all the issues are already fixed in 6.5, I'm posting the reproducer formulas to the 6.4.z counterpart, bug 951073. Of the two upstream commits we ended up backporting to 6.4, I've already found at least one reproducer using valgrind.
Comment 8 zhenfeng wang 2013-07-09 23:55:51 EDT
Hi erric
I need to verify this bug on the latest libvirt version; however, I can only reproduce the mem leak and can't reproduce the libvirtd crash. I saw the "How reproducible" description in comment 0 ("It is probably difficult to trigger the races; but I could probably add some strategic sleep() statements"), but I'm not very clear about it. Can you offer me the detailed steps, or other methods to verify this bug? Thanks
Comment 9 zhenfeng wang 2013-07-09 23:59:42 EDT
Hi Eric,
I'm very sorry for not writing your name correctly.
Comment 10 Eric Blake 2013-07-11 15:52:36 EDT
(In reply to zhenfeng wang from comment #8)
> Hi erric
> I need to verify this bug on the latest libvirt version; however, I can only
> reproduce the mem leak and can't reproduce the libvirtd crash.

Verifying that the mem leak existed (formula in bug 951073) and has now been fixed should be good enough to verify this bug. It's an observable symptom of the bug, and I don't know if it is worth the effort to try to come up with a more impressive symptom.
Comment 11 zhenfeng wang 2013-07-17 03:43:47 EDT
Thanks for Eric's reply. Per comment 10, I retested this bug on libvirt-0.10.2-19.el6 and found the mem leak is gone, so I am marking this bug verified.
Comment 13 errata-xmlrpc 2013-11-21 03:58:37 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1581.html
