Red Hat Bugzilla – Bug 1020216
libvirt fails to shut down domain: could not destroy libvirt domain: Requested operation is not valid: domain is not running
Last modified: 2016-04-26 10:12:29 EDT
Description of problem:
I get this error intermittently when calling virDomainDestroyFlags:
Fatal error: exception Guestfs.Error("could not destroy libvirt domain: Requested operation is not valid: domain is not running [code=55 domain=10]")
The domain has possibly exited by itself before we call virDomainDestroyFlags.
However, and this is strange: if I add a sleep to the guest
so it doesn't shut down immediately, eg. 'sleep 30', then
virDomainDestroyFlags will hang for 30 seconds, and *then*
give the same error as above.
There are no errors in the qemu log file.
qemu does not appear to be segfaulting (so different from bug 853369).
Version-Release number of selected component (if applicable):
(Will try updating to qemu from Rawhide shortly)
Not reliably reproducible. Right now on my laptop it's happening
90% of the time, but usually it doesn't happen at all.
Steps to Reproduce:
1. Run a virt tool such as virt-resize.
Some more random data points:
If the machine is loaded with disk activity, then the bug doesn't
happen. It seems like a race condition of some sort.
Upgrading to qemu-1.6.0-10.fc21 does appear to have made the bug
happen less often.
I'm afraid I don't have a good reproducer for this. It may
be connected with ./configure --enable-valgrind-daemon which is
a debugging option that changes the order of shutdown: in production
builds we always rely on libvirt actively killing qemu, but when
--enable-valgrind-daemon is used, the appliance can shut itself
down. Production builds would never have this option enabled.
For reference the command I'm actually using to reproduce this locally is:
LIBGUESTFS_DEBUG=1 ./run ./builder/website/test-guest.sh fedora-18
(In reply to Richard W.M. Jones from comment #0)
> However, and this is strange: if I add a sleep to the guest
> so it doesn't shut down immediately, eg. 'sleep 30', then
> virDomainDestroyFlags will hang for 30 seconds, and *then*
> give the same error as above.
Note: This part is NOT strange. The hang here was in libguestfs.
Just ignore this paragraph in the bug description.
On the surface this doesn't really look like a bug. If the guest is not running when virDomainDestroyFlags is called, then getting back this error code is expected. So the real question here is why QEMU exited before libguestfs expected it to.
Can you capture a trace of libvirtd with the following log settings:
LIBVIRT_LOG_OUTPUTS="1:qemu 1:command 1:security 1:process 1:cgroup"
while triggering the 'virDomainDestroyFlags' API, and also provide the corresponding /var/log/libvirt/qemu/$GUEST.log. The timestamps between the two may let us identify the sequencing.
Unfortunately, the overhead of debugging makes the bug go away ...
Here is the script I'm using:
rm -f $vfile $gfile
export LIBVIRT_LOG_OUTPUTS="1:qemu 1:command 1:security 1:process 1:cgroup 1:file:$vfile"
$dir/run $dir/builder/virt-builder \
fedora-19 --output /tmp/fedora-19.img --size 10G |& tee $gfile
ls -l $vfile $gfile
Why does that script never write to libvirt.log?
(In reply to Daniel Berrange from comment #3)
> On the surface this doesn't really look like a bug. If the guest is not
> running when virDomainDestroyFlags is called, then getting back this error
> code is expected. So the real question here is why QEMU exited before
> libguestfs expected it to.
As I mentioned on IRC:
(1) We need to find out if qemu segfaulted during shutdown.
That's the reason for the graceful flag (VIR_DOMAIN_DESTROY_GRACEFUL).
(2) While it may be true that currently virDomainDestroyFlags acts
like you've described, it's not useful behaviour. What we really
want is more like how Unix kill + waitpid works, i.e. you can kill
a process and wait for its exit status, and that works even if the
process exits by itself before or between the two system calls.
My bad, I gave the wrong env variable name:
LIBVIRT_LOG_FILTERS="1:qemu 1:command 1:security 1:process 1:cgroup"
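For completeness, a sketch of the corrected setup combining both variables (the log file path is an arbitrary example): LIBVIRT_LOG_FILTERS selects which categories to log at which level, while LIBVIRT_LOG_OUTPUTS chooses where the messages are written.

```shell
# Select what to log (level 1 = debug) ...
export LIBVIRT_LOG_FILTERS="1:qemu 1:command 1:security 1:process 1:cgroup"
# ... and where to write it ("level:file:path"; path here is just an example)
export LIBVIRT_LOG_OUTPUTS="1:file:/tmp/libvirtd.log"
echo "filters: $LIBVIRT_LOG_FILTERS"
```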
For some reason this bug has started happening again.
I'll see if I can collect some debug information this time ...
This bug appears to have been reported against 'rawhide' during the Fedora 22 development cycle.
Changing version to '22'.
More information and reason for this action is here:
Haven't heard much on this bug for a while, so assuming it's gone. If anyone is still hitting this, please reopen.