Bug 950286
Summary: | libvirtd crash on race with auto-destroy guests | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Eric Blake <eblake> |
Component: | libvirt | Assignee: | Eric Blake <eblake> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 6.4 | CC: | acathrow, bili, cpelland, cwei, dyuan, eblake, jentrena, lyarwood, mjenner, mzhan, ydu, zhwang |
Target Milestone: | rc | Keywords: | ZStream |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | libvirt-0.10.2-19.el6 | Doc Type: | Bug Fix |
Doc Text: |
Under certain conditions, when a connection was closed, guests set to be automatically destroyed failed to be destroyed and the libvirtd daemon terminated unexpectedly. A series of patches addressing various crash scenarios has been provided and libvirtd no longer crashes while auto-destroying guests.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2013-11-21 08:58:37 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 951073 |
Description
Eric Blake
2013-04-10 02:17:31 UTC
Moving to POST, since all of these patches are upstream and will be picked up by the rebase.

---

Hi Eric,

I cannot reproduce this bug using simple migration. My steps:

1. Use Ctrl+C to terminate the migration before it finishes:

    # virsh migrate --live mig qemu+ssh://10.66.85.217/system --verbose
    Migration: [ 85 %]^Cerror: operation aborted: migration job: canceled by client

2. Neither the source nor the destination libvirtd crashes:

    # service libvirtd status
    libvirtd (pid 27426) is running...

3. The guest is destroyed and disappears on the destination.

I saw you said it is a race problem, so is it difficult to reproduce? Or did I miss something?

Thanks,
EricLee

---

Backport notes: it looks like it is also important to have this one:

    commit 3898ba7f2cf067ae5852c40d68460c64fb06c94f
    Author: Jiri Denemark <jdenemar>
    Date:   Fri Feb 15 13:05:12 2013 +0100

        qemu: Turn closeCallbacks into virObjectLockable

        To avoid having to hold the qemu driver lock while iterating through
        close callbacks and calling them. This fixes a real deadlock when a
        domain which is being migrated from another host gets auto-destroyed
        as a result of a broken connection to the other host.

---

(In reply to comment #2)
> Hi Eric,
>
> I cannot reproduce this bug using simple migration. My steps:
>
> 1. Use Ctrl+C to terminate the migration before it finishes:
>     # virsh migrate --live mig qemu+ssh://10.66.85.217/system --verbose
>     Migration: [ 85 %]^Cerror: operation aborted: migration job: canceled by client

Based on additional feedback on my patches for 6.4:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2013-April/msg00276.html

The only observable bugs in 6.4 are the use-after-free in commit 4e4c6620 (but it is only a read, not a write, so the symptoms are limited to printing garbage in a log or seeing a complaint when run under valgrind), and a potential use-after-free of a mutex in commit 7ccad0b (try migrating a transient guest).
The other upstream commits mentioned in comment 0 deal with a deadlock that was only present in a small window of upstream libvirt.git, when we dropped the big qemu driver lock. Given that the deadlock was not introduced until commit a9e97e0 (upstream 1.0.3), it is not present in 6.4, and the rebase for 6.5 has already patched things. The fact that you can't reproduce a deadlock on formal builds is good; you'd have to compare a reproducer against a specific build from libvirt.git.

> I saw you said it is a race problem, so is it difficult to reproduce?
> Or did I miss something?

I can still try to come up with a temporary patch that uses strategic sleep() calls to make the use-after-free of the mutex on a migrated transient guest a bit more obvious.

---

Since all the issues are already fixed in 6.5, I'm posting the reproducer formulas to the 6.4.z counterpart, bug 951073. Of the two upstream commits we ended up backporting to 6.4, I've already found at least one reproducer using valgrind.

---

Hi Eric,

I need to verify this bug on the latest libvirt version. However, I can only reproduce the memory leak; I can't reproduce the libvirtd crash. I saw the "How reproducible" description in comment 0 saying "It is probably difficult to trigger the races; but I could probably add some strategic sleep() statements", but I'm not very clear about it, so can you offer me the detailed steps, or another method, to verify this bug? Thanks.

---

Hi Eric, I'm very sorry for not writing your name correctly.

---

(In reply to zhenfeng wang from comment #8)
> Hi Eric
> I need to verify this bug on the latest libvirt version. However, I can only
> reproduce the memory leak; I can't reproduce the libvirtd crash.

Reproducing that the mem leak existed (formula in bug 951073), and has now been fixed, should be good enough to verify this bug. It's an observable symptom of the bug, and I don't know whether it is worth the effort to come up with a more impressive symptom.

---

Thanks for Eric's reply.
According to comment 10, I retested this bug on libvirt-0.10.2-19.el6 and found the memory leak is gone, so I am marking this bug verified.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1581.html