Red Hat Bugzilla – Bug 950286
libvirtd crash on race with auto-destroy guests
Last modified: 2013-11-21 03:58:37 EST
Description of problem:
There are a number of upstream patches that solve deadlock and/or crash scenarios due to use-after-free when handling auto-destroy guests. These need to be backported to RHEL. While VDSM doesn't directly use auto-destroy guests, the act of migration uses an auto-destroy guest on the destination until the source has gotten far enough along in the migration process; also, virt-sandbox uses auto-destroy.

Version-Release number of selected component (if applicable):
libvirt-0.10.2-18.el6_4.3

How reproducible:
I found these commits by reading git logs. It is probably difficult to trigger the races; but I could probably add some strategic sleep() statements if we need to prove the existence of at least one of the problems fixed in this series.

Steps to Reproduce:
1. create some autodestroy guests (see the sketch after this comment)
2. close the connection; guests should be destroyed, but libvirtd should not crash

Actual results:
if the race hits, valgrind can report a use-after-free, or libvirtd can even crash

Expected results:
no crash

Additional info:
the following upstream patches are worth backporting:

commit 96b893f092b3972bc04be975945830dc974af2b3
Author: Daniel P. Berrange <berrange@redhat.com>
Date:   Thu Feb 28 13:30:49 2013 +0000

    Fix deadlock in QEMU close callback APIs

commit 7ccad0b16d12d7616c7c21b1359f6a55a9677521
Author: Daniel P. Berrange <berrange@redhat.com>
Date:   Thu Feb 28 12:18:48 2013 +0000

    Fix crash in QEMU auto-destroy with transient guests

commit b4a124efc328ac221ff4e8a6fde3a1a0c0202d68
Author: Daniel P. Berrange <berrange@redhat.com>
Date:   Wed Feb 27 16:23:16 2013 +0000

    Fix autodestroy of QEMU guests

commit 4e4c6620e2e2937da03d37720d39368d297f5743
Author: Daniel P. Berrange <berrange@redhat.com>
Date:   Wed Jan 23 17:22:27 2013 +0000

    Avoid use of free'd memory in auto destroy callback

(Note that 568a6cd also touched autodestroy, but was later reverted by 9c4ecb3; we don't necessarily need to backport either of those patches, unless doing so helps avoid merge conflicts in the other patches)
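For step 1, a minimal sketch using the libvirt Python bindings; the connection URI, XML path, and sleep length are placeholders picked for illustration, not part of this report:

#!/usr/bin/env python
# Hypothetical reproducer sketch; assumes a working domain XML at
# /tmp/autodestroy-guest.xml and access to qemu:///system.
import time
import libvirt

conn = libvirt.open('qemu:///system')
xml = open('/tmp/autodestroy-guest.xml').read()

# Start a transient guest that libvirtd will auto-destroy as soon as this
# connection goes away.
dom = conn.createXML(xml, libvirt.VIR_DOMAIN_START_AUTODESTROY)
print('started %s' % dom.name())
time.sleep(5)

# Closing the connection triggers the auto-destroy path in libvirtd; if the
# race hits, libvirtd can log garbage or crash (step 2 above).
conn.close()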
Moving to POST, since all of these patches are upstream and will be picked up by rebase
Hi Eric,

I cannot reproduce this bug using simple migration. My steps:

1. Use Ctrl+C to terminate the migration before it finishes:
# virsh migrate --live mig qemu+ssh://10.66.85.217/system --verbose
Migration: [ 85 %]^Cerror: operation aborted: migration job: canceled by client

2. Neither the source nor the destination libvirtd crashes:
# service libvirtd status
libvirtd (pid 27426) is running...

3. The guest is destroyed and disappears on the destination.

I saw you said it is a race problem, so is it difficult to reproduce? Or did I miss something?

Thanks,
EricLee
Backport notes: looks like it is also important to have this one:

commit 3898ba7f2cf067ae5852c40d68460c64fb06c94f
Author: Jiri Denemark <jdenemar@redhat.com>
Date:   Fri Feb 15 13:05:12 2013 +0100

    qemu: Turn closeCallbacks into virObjectLockable

    To avoid having to hold the qemu driver lock while iterating through
    close callbacks and calling them. This fixes a real deadlock when a
    domain which is being migrated from another host gets autodestroyed
    as a result of broken connection to the other host.
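To illustrate the pattern that commit message describes, here is a hedged Python sketch (illustrative only; the names CloseCallbacks, driver_lock and run_all are made up and are not libvirt symbols): the callback list gets its own lock, the callbacks are snapshotted under that lock, and they are then invoked with no lock held, so a callback that needs the driver lock cannot deadlock against the iteration.

# Illustrative sketch only; names here do not correspond to libvirt code.
import threading

driver_lock = threading.Lock()      # stand-in for the old big qemu driver lock

class CloseCallbacks(object):
    """Callback registry with its own lock (the virObjectLockable idea)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._callbacks = []        # list of (domain_name, callback)

    def register(self, name, cb):
        with self._lock:
            self._callbacks.append((name, cb))

    def run_all(self):
        # Snapshot under our own lock, then invoke with no lock held, so a
        # callback that grabs driver_lock cannot deadlock against this loop.
        with self._lock:
            todo, self._callbacks = self._callbacks, []
        for name, cb in todo:
            cb(name)

def autodestroy(name):
    with driver_lock:               # safe: run_all() holds no lock here
        print('auto-destroying %s' % name)

ccb = CloseCallbacks()
ccb.register('mig-guest', autodestroy)
ccb.run_all()

Per the commit message, the deadlock existed because the equivalent of run_all() was previously done while still holding the qemu driver lock.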
(In reply to comment #2)
> Hi Eric,
>
> I cannot reproduce this bug using simple migration. My steps:
>
> 1. Use Ctrl+C to terminate the migration before it finishes:
> # virsh migrate --live mig qemu+ssh://10.66.85.217/system --verbose
> Migration: [ 85 %]^Cerror: operation aborted: migration job: canceled by
> client

Based on additional feedback on my patches for 6.4:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2013-April/msg00276.html

The only observable bugs in 6.4 are the use-after-free in commit 4e4c6620 (but it is only a read, not a write, so the symptoms are limited to printing garbage in a log or observing a complaint when run under valgrind), and a potential use-after-free of a mutex in commit 7ccad0b (try migrating a transient guest; see the sketch below).

The other upstream commits mentioned in comment 0 deal with a deadlock that was only present in a small window of upstream libvirt.git when we dropped the big qemu driver lock; given that the deadlock was not introduced until commit a9e97e0 (upstream 1.0.3), it is not present in 6.4, and the rebase for 6.5 has already picked it up. The fact that you can't reproduce a deadlock on formal builds is good; you'd have to run a reproducer against a specific build from libvirt.git.

> I saw you said it is a race problem, so is it difficult to reproduce?
> Or did I miss something?

I can still try to come up with a temporary patch that uses strategic sleep() calls to make the use-after-free of the mutex on a migrated transient guest more obvious.
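For reference, a minimal sketch of that transient-guest migration scenario with the libvirt Python bindings; the URIs and XML path are placeholders of my own, not values from this bug:

#!/usr/bin/env python
# Hypothetical sketch of the commit 7ccad0b scenario: live-migrate a
# transient guest, then drop the connections.
import libvirt

src = libvirt.open('qemu:///system')
dst = libvirt.open('qemu+ssh://dest.example.com/system')

xml = open('/tmp/transient-guest.xml').read()

# createXML() with no flags gives a transient (undefined) guest.
dom = src.createXML(xml, 0)

# Live-migrate it; per comment 0, the destination runs it as an
# auto-destroy guest until migration completes.
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

# Closing the connections afterwards is where the freed-mutex race could
# show up in an unfixed libvirtd (watch the logs or run under valgrind).
src.close()
dst.close()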
Since all the issues are already fixed in 6.5, I'm posting the reproducer formulas to the 6.4.z counterpart, bug 951073. Of the two upstream commits we ended up backporting to 6.4, I've already found a valgrind-based reproducer for at least one of them.
Hi erric,

I need to verify this bug on the latest libvirt version. However, I can only reproduce the mem leak; I can't reproduce the libvirtd crash. I saw the "How reproducible" description in comment 0 saying "It is probably difficult to trigger the races; but I could probably add some strategic sleep() statements", but I'm not very clear about it, so can you offer me the detailed steps, or other methods to verify this bug?

Thanks
Hi Eric, I'm very sorry I did not write your name correctly.
(In reply to zhenfeng wang from comment #8)
> Hi erric,
>
> I need to verify this bug on the latest libvirt version. However, I can
> only reproduce the mem leak; I can't reproduce the libvirtd crash.

Reproducing that the mem leak existed (formula in bug 951703) and has now been fixed should be good enough to verify this bug. It's an observable symptom of the bug, and I don't know if it is worth the effort to come up with a more impressive symptom.
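The exact formula lives in that bug; as a rough generic check (my own suggestion, not the formula itself), you can run libvirtd under valgrind and look for "definitely lost" or "Invalid read" records around the auto-destroy path, for example:

# service libvirtd stop
# valgrind --leak-check=full --log-file=/tmp/libvirtd-valgrind.log /usr/sbin/libvirtd

then run the auto-destroy reproducer from comment 0 and inspect the log.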
Thanks for Eric's reply. According to comment 10, I retested this bug on libvirt-0.10.2-19.el6 and found the mem leak is gone, so I am marking this bug verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1581.html