Bug 950286
Summary: | libvirtd crash on race with auto-destroy guests | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Eric Blake <eblake> |
Component: | libvirt | Assignee: | Eric Blake <eblake> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 6.4 | CC: | acathrow, bili, cpelland, cwei, dyuan, eblake, jentrena, lyarwood, mjenner, mzhan, ydu, zhwang |
Target Milestone: | rc | Keywords: | ZStream |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | libvirt-0.10.2-19.el6 | Doc Type: | Bug Fix |
Doc Text: |
Under certain conditions, when a connection was closed, guests set to be automatically destroyed failed to be destroyed and the libvirtd daemon terminated unexpectedly. A series of patches addressing various crash scenarios has been provided and libvirtd no longer crashes while auto-destroying guests.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2013-11-21 08:58:37 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 951073 |
Description
Eric Blake
2013-04-10 02:17:31 UTC
Moving to POST, since all of these patches are upstream and will be picked up by the rebase.

---

Hi Eric,

I cannot reproduce this bug using simple migration. My steps:

1. Use Ctrl+C to terminate the migration before it finishes:

    # virsh migrate --live mig qemu+ssh://10.66.85.217/system --verbose
    Migration: [ 85 %]^Cerror: operation aborted: migration job: canceled by client

2. Neither the source nor the destination libvirtd crashes:

    # service libvirtd status
    libvirtd (pid 27426) is running...

3. The guest is destroyed and disappears on the destination.

I saw you said it is a race problem, so is it difficult to reproduce? Or did I miss something?

Thanks,
EricLee

---

Backport notes: it looks like it is also important to have this one:

    commit 3898ba7f2cf067ae5852c40d68460c64fb06c94f
    Author: Jiri Denemark <jdenemar>
    Date:   Fri Feb 15 13:05:12 2013 +0100

        qemu: Turn closeCallbacks into virObjectLockable

        To avoid having to hold the qemu driver lock while iterating through
        close callbacks and calling them. This fixes a real deadlock when a
        domain which is being migrated from another host gets auto-destroyed
        as a result of a broken connection to the other host.

---

(In reply to comment #2)
> Hi Eric,
>
> I cannot reproduce this bug using simple migration. My steps:
>
> 1. Use Ctrl+C to terminate the migration before it finishes:
>     # virsh migrate --live mig qemu+ssh://10.66.85.217/system --verbose
>     Migration: [ 85 %]^Cerror: operation aborted: migration job: canceled by client

Based on additional feedback on my patches for 6.4:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2013-April/msg00276.html

The only observable bugs in 6.4 are the use-after-free in commit 4e4c6620 (but it is only a read, not a write, so the symptoms are limited to printing garbage in a log or seeing a complaint when run under valgrind), and a potential use-after-free of a mutex in commit 7ccad0b (try migrating a transient guest).
The other upstream commits mentioned in comment 0 deal with a deadlock that was only present in a small window of upstream libvirt.git, when we dropped the big qemu driver lock. Given that the deadlock was not introduced until commit a9e97e0 (upstream 1.0.3), it is not present in 6.4, and the rebase for 6.5 has already patched things. The fact that you can't reproduce a deadlock on formal builds is good; you'd have to compare a reproducer against a specific build from libvirt.git.

> I saw you said it is a race problem, so is it difficult to reproduce?
> Or did I miss something?

I can still try to come up with a temporary patch that uses strategic sleep() calls to make the use-after-free of the mutex on a migrated transient guest a bit more obvious.

---

Since all the issues are already fixed in 6.5, I'm posting the reproducer formulas to the 6.4.z counterpart, bug 951073. Of the two upstream commits we ended up backporting to 6.4, I've already found at least one reproducer using valgrind.

---

Hi Eric,

I need to verify this bug on the latest libvirt version. However, I can only reproduce the memory leak; I can't reproduce the libvirtd crash. I saw the "How reproducible" description in comment 0 saying "It is probably difficult to trigger the races; but I could probably add some strategic sleep() statements", but I'm not very clear about it, so can you offer me the detailed steps, or another method, to verify this bug? Thanks.

---

Hi Eric, I'm very sorry for not writing your name correctly.

---

(In reply to zhenfeng wang from comment #8)
> Hi Eric
> I need to verify this bug on the latest libvirt version. However, I can only
> reproduce the memory leak; I can't reproduce the libvirtd crash.

Reproducing that the mem leak existed (formula in bug 951073), and has now been fixed, should be good enough to verify this bug. It's an observable symptom of the bug, and I don't know whether it is worth the effort to come up with a more impressive symptom.

---

Thanks for Eric's reply.
According to comment 10, I retested this bug on libvirt-0.10.2-19.el6 and found the memory leak is gone, so I am marking this bug verified.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1581.html