Bug 518032

Summary: Restoring a qemu guest from a saved state file using -incoming sometimes fails and hangs
Product: [Fedora] Fedora
Reporter: Daniel Berrangé <berrange>
Component: qemu
Assignee: Juan Quintela <quintela>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: medium
Version: 13
CC: berrange, dwmw2, gcosta, itamar, jaswinder, markmc, maurizio.antillon, quintela, rbinkhor, virt-maint
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: qemu-0.12.3-5.fc13
Doc Type: Bug Fix
Clone Of:
: 570174 (view as bug list) Environment:
Last Closed: 2011-06-27 14:20:52 UTC
Bug Blocks: 498969, 570174    
Attachments:
Description Flags
Clear fd on migration errors none

Description Daniel Berrangé 2009-08-18 14:40:45 UTC
Description of problem:
Running the libvirt-tck on rawhide periodically results in a hung / stuck QEMU.

The QEMU log says

"load of migration failed"

but the QEMU process did not exit after this - it just sits there doing nothing.

So there are two bugs here:

 1. restore from migration should not fail
 2. if it does fail, the process MUST exit


I'm reproducing this problem using the libvirt-tck inside a F12 x86_64 guest, so QEMU is using TCG here.


The full QEMU log / command is

LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin /usr/bin/qemu-system-x86_64 -S -M pc-0.11 -no-kvm -m 64 -smp 1 -name test -uuid b5749a08-d73b-f3fa-fffd-5695c39bf751 -nographic -monitor unix:/var/run/libvirt/qemu/test.monitor,server,nowait -boot c -kernel /var/cache/libvirt-tck/os-i686-hvm/vmlinuz -initrd /var/cache/libvirt-tck/os-i686-hvm/initrd -drive file=/var/cache/libvirt-tck/os-i686-hvm/disk.img,if=virtio,index=0,boot=on -net none -serial pty -parallel none -usb -incoming exec:cat 
char device redirected to /dev/pts/3
load of migration failed

the libvirt XML for this was

<domain type="qemu">
  <name>test</name>
  <memory>65536</memory>
  <currentMemory>65536</currentMemory>
  <os>
    <type>hvm</type>
    <kernel>/var/cache/libvirt-tck/os-i686-hvm/vmlinuz</kernel>
    <initrd>/var/cache/libvirt-tck/os-i686-hvm/initrd</initrd>
  </os>
  <features>
    <acpi />
    <apic />
  </features>
  <devices>
    <disk type="file">
      <source file="/var/cache/libvirt-tck/os-i686-hvm/disk.img" />
      <target dev="vda" />
    </disk>
    <console type="pty" />
  </devices>
</domain>


The vmlinuz/initrd file I'm using is a F11 x86_64 install kernel, but that probably doesn't matter, since I'm not actually running it for long.

The test case will boot QEMU, immediately save it to a file using migrate + exec, and then immediately restore it.

It hangs roughly 30% of the time.

Version-Release number of selected component (if applicable):
qemu-0.10.91-0.5.rc1.fc12.x86_64
perl-Sys-Virt-TCK-0.1.0-4.fc12.noarch
libvirt-0.7.0-4.fc12.x86_64

How reproducible:
Sometimes, perhaps 30%

Steps to Reproduce:
1. Provision a F12 x86_64 guest
2. Install qemu, libvirt, perl-Sys-Virt-TCK
3. Reboot
4. Edit /etc/libvirt-tck/default.cfg, and set uri=qemu:///system
5. Run  

libvirt-tck -v -t /usr/share/libvirt-tck/tests/domain/100-transient-save-restore.t

  
Actual results:
It'll periodically get stuck at

1..5
# Creating a new transient domain
ok 1 - created transient domain object
# Checking that transient domain has gone away
ok 2 - NO_DOMAIN error raised from missing domain

and /var/log/libvirt/qemu/test.log shows
"load of migration failed"

You have to manually kill the qemu process to get it unstuck


Expected results:
Restore always completes successfully

1..5
# Creating a new transient domain
ok 1 - created transient domain object
# Checking that transient domain has gone away
ok 2 - NO_DOMAIN error raised from missing domain
ok 3 - domain has been restored
ok 4 - restored domain is still there
# Destroying the transient domain
# Checking that transient domain has gone away
ok 5 - NO_DOMAIN error raised from missing domain
ok
All tests successful.
Files=1, Tests=5,  3 wallclock secs ( 0.05 usr  0.05 sys +  0.45 cusr  0.21 csys =  0.76 CPU)
Result: PASS


Additional info:

Comment 1 Daniel Berrangé 2009-08-18 15:01:58 UTC
This may or may not be TCG specific. I've not got spare hardware available to reproduce with F12 KVM.

Comment 2 Mark McLoughlin 2009-08-18 16:45:10 UTC
Juan's patches look like they might help with the "we should exit on error" problem:

  http://lists.gnu.org/archive/html/qemu-devel/2009-08/msg00876.html

Comment 3 Juan Quintela 2009-09-30 17:08:44 UTC
Taking a look at this.

Comment 4 Daniel Berrangé 2009-10-16 16:32:27 UTC
AFAICT, Juan's patches are for a different bit of code - savevm.


The libvirt restore process uses incoming migration. I've since found the error message 'load of migration failed' in the guest logs, which points to this code in migration-exec.c:

static void exec_accept_incoming_migration(void *opaque)
{
    QEMUFile *f = opaque;
    int ret;

    ret = qemu_loadvm_state(f);
    if (ret < 0) {
        fprintf(stderr, "load of migration failed\n");
        goto err;
    }
    qemu_announce_self();
    dprintf("successfully loaded vm state\n");
    /* we've successfully migrated, close the fd */
    qemu_set_fd_handler2(qemu_stdio_fd(f), NULL, NULL, NULL, NULL);
    if (autostart)
        vm_start();

err:
    qemu_fclose(f);
}

This method just prints an error message upon load failure, and then carries on letting QEMU run as if all was well. In fact QEMU is pretty much trashed at this point, spinning in a 100% CPU loop and not responding to any monitor commands at all. It needs to abort/exit if it can't load an incoming migration.

The TCP migration code suffers from the same problem of just printing an error and pretending all is well.
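To make the requested behaviour concrete, here is a minimal sketch of "exit on load failure". It is not the actual QEMU code or the eventual patch; must_exit_after_load() and finish_incoming_load() are hypothetical names, and the only thing taken from the quoted handler is the "ret < 0" failure convention and the error message.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: a negative load result means the incoming
 * state could not be loaded, mirroring the "ret < 0" check in the
 * handler quoted above. Returns 1 if the process must terminate. */
int must_exit_after_load(int load_ret)
{
    return load_ret < 0;
}

/* Hypothetical wrapper around the load result: report the error and
 * terminate with a non-zero status instead of returning to the main
 * loop with a half-loaded guest. */
void finish_incoming_load(int load_ret)
{
    if (must_exit_after_load(load_ret)) {
        fprintf(stderr, "load of migration failed\n");
        exit(EXIT_FAILURE);
    }
    /* On success the caller would carry on and start the VM. */
}
```

The point of the sketch is only that the failure path ends the process, so whatever launched it can observe a non-zero exit status.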

I really need this fixed in F11/12/rawhide asap because it causes the libvirt test suite to hang, when we try to test our handling of broken migration/restore data.

Comment 5 Daniel Berrangé 2009-10-16 16:35:53 UTC
FYI, you can reproduce this problem very easily

 1. virsh  save GUEST  guest.saved
 2. truncate --size 50MB  guest.saved 
 3. virsh restore guest.saved

i.e., just destroy the end of the guest's saved state file, without touching the initial libvirt XML header.
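Step 2 can also be done from a test program. The following is a rough sketch using POSIX stat() and truncate(); saved_state_size() and chop_saved_state() are hypothetical helper names, not part of libvirt or QEMU. Note that GNU truncate treats "50MB" as 50 * 1000 * 1000 bytes, and that it would grow a smaller file, which this sketch deliberately refuses to do since growing the file does not destroy any state.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the size of the file in bytes, or -1 on error. */
off_t saved_state_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror("stat");
        return -1;
    }
    return st.st_size;
}

/* Cut the file down to new_size bytes so the tail of the saved state
 * is lost. Refuses to grow (or leave unchanged) the file.
 * Returns 0 on success, -1 on error or refusal. */
int chop_saved_state(const char *path, off_t new_size)
{
    off_t cur = saved_state_size(path);

    if (cur < 0 || new_size < 0 || new_size >= cur) {
        return -1;
    }
    if (truncate(path, new_size) != 0) {
        perror("truncate");
        return -1;
    }
    return 0;
}
```

For the reproducer above, the call would be along the lines of chop_saved_state("guest.saved", 50 * 1000 * 1000), assuming the saved file is larger than that.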

Comment 6 Fedora Admin XMLRPC Client 2010-03-09 16:54:27 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 7 Fedora Admin XMLRPC Client 2010-03-09 17:20:06 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 8 Juan Quintela 2010-03-09 23:33:40 UTC
Created attachment 398965 [details]
Clear fd on migration errors

Comment 9 Juan Quintela 2010-03-09 23:34:48 UTC
The attached patch fixes the problem. It has already been sent upstream for review.

Comment 10 Fedora Update System 2010-03-10 20:28:26 UTC
qemu-0.12.3-1.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/qemu-0.12.3-1.fc12

Comment 11 Fedora Update System 2010-03-10 20:41:23 UTC
qemu-0.12.3-5.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/qemu-0.12.3-5.fc13

Comment 12 Fedora Update System 2010-03-11 13:30:24 UTC
qemu-0.12.3-5.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 13 Fedora Update System 2010-03-16 04:09:58 UTC
qemu-0.12.3-2.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/qemu-0.12.3-2.fc12

Comment 14 Daniel Berrangé 2010-03-16 10:32:58 UTC
The provided fix does not ensure that QEMU exits. It merely stops it using 100% CPU. QEMU needs to exit, since there is no other way to detect that the incoming migration has failed.
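To illustrate why the exit matters from the management side, here is a rough sketch of how a parent process can learn of the failure through the child's exit status. This is not libvirt's actual code; child_failed() is a hypothetical helper built only on the standard fork(), execv() and waitpid() calls. If the child never exits, waitpid() simply never returns, which is the hang described in this bug.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the program argv[0] with arguments argv and wait for it.
 * Returns 0 if it exited with status 0, 1 if it exited non-zero or
 * was killed by a signal, and -1 if fork() or waitpid() failed. */
int child_failed(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: replace ourselves with the requested program. */
        execv(argv[0], argv);
        _exit(127); /* only reached if execv() itself failed */
    }
    /* Parent: block until the child terminates. */
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        return 0;
    }
    return 1;
}
```

With a child that exits non-zero on a failed load, the caller gets a definite answer; with one that keeps running after printing the error, it has nothing to act on.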

Comment 15 Bug Zapper 2010-03-16 12:18:40 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle.
Changing version to '13'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 16 Fedora Update System 2010-03-16 23:15:01 UTC
qemu-0.12.3-2.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update qemu'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/qemu-0.12.3-2.fc12

Comment 17 Fedora Update System 2010-03-20 03:36:14 UTC
qemu-0.12.3-2.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update qemu'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/qemu-0.12.3-2.fc12

Comment 18 Fedora Update System 2010-04-26 12:17:09 UTC
qemu-0.12.3-4.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/qemu-0.12.3-4.fc12

Comment 19 Fedora Update System 2010-04-28 01:12:43 UTC
qemu-0.12.3-4.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update qemu'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/qemu-0.12.3-4.fc12

Comment 20 Bug Zapper 2011-06-02 17:49:32 UTC
This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 13 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 21 Bug Zapper 2011-06-27 14:20:52 UTC
Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.