Bug 1354341

Summary: guest hang after cancel migration then migrate again
Product: Red Hat Enterprise Linux 7 Reporter: mazhang <mazhang>
Component: qemu-kvm-rhev Assignee: Thomas Huth <thuth>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 7.3 CC: amit.shah, dgibson, dgilbert, knoel, michen, qzhang, thuth, virt-maint
Target Milestone: rc Keywords: Regression
Target Release: ---   
Hardware: ppc64le   
OS: Linux   
Fixed In Version: qemu-kvm-rhev-2.6.0-16.el7
Last Closed: 2016-11-07 21:22:44 UTC Type: Bug

Description mazhang 2016-07-11 07:21:46 UTC
Description of problem:


Version-Release number of selected component (if applicable):

Host:
3.10.0-462.el7.ppc64le
qemu-kvm-rhev-2.6.0-11.el7.ppc64le

Guest:
3.10.0-460.el7.ppc64

How reproducible:
100%

Steps to Reproduce:
1. Boot the guest on the source host:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pseries  \
    -nodefaults  \
    -vga std  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_XgxM_L/monitor-qmpmonitor1-20160705-073204-vVF3o9IV,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_XgxM_L/monitor-catch_monitor-20160705-073204-vVF3o9IV,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_XgxM_L/serial-serial0-20160705-073204-vVF3o9IV,server,nowait \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -device pci-ohci,id=usb1,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-legacy=off,disable-modern=on \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/RHEL-Server-7.3-ppc64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-pci,mac=9a:6c:6d:6e:6f:70,id=idpBLLIM,vectors=4,netdev=idsGsxBH,bus=pci.0,addr=05,disable-legacy=off,disable-modern=on  \
    -netdev tap,id=idsGsxBH,vhost=on \
    -m 16384  \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0  \
    -rtc base=utc,clock=host  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -monitor stdio \
    -device virtio-mouse-pci,id=mouse0 \
    -device virtio-keyboard-pci,id=kbd0 \
    -device virtio-tablet-pci,id=tbt0

2. Start qemu-kvm process on destination host with incoming connections on 5800.
${source command line} -incoming tcp:0:5800

3. Start migrate process then cancel it.
(qemu) migrate -d tcp:xxxx:5800
(qemu) migrate_cancel
(qemu) qemu-kvm: socket_writev_buffer: Got err=32 for (107041/18446744073709551615)   <--- this error is also hit on x86.

4. The qemu-kvm process on the destination host quits during step 3. Start it again, then start the migration again.
   After the migration completes and the guest is running on the destination host, check the guest status.
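
The error line in step 3 can be decoded with a small helper. This is only a sketch under two assumptions that the report does not state: that the host is Linux, where errno 32 is EPIPE (the peer closed the socket, which fits the destination process quitting), and that the second number in parentheses is a signed -1 printed as an unsigned 64-bit value. The function names describe_err and is_minus_one_as_unsigned are illustrative and do not come from QEMU.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Returns the C library's description of an errno value.
 * strerror() is not guaranteed to be thread-safe. */
static const char *describe_err(int err)
{
    assert(err >= 0);
    return strerror(err);
}

/* True if `reported` is what a signed 64-bit -1 becomes when it is
 * reinterpreted as an unsigned 64-bit value (that is, 2^64 - 1). */
static int is_minus_one_as_unsigned(uint64_t reported)
{
    return reported == (uint64_t)(int64_t)-1;
}
```

Calling describe_err(32) on a Linux host returns the library's text for EPIPE, and is_minus_one_as_unsigned(18446744073709551615ULL) is true, while it is false for the first number, 107041.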
 

Actual results:
The guest hangs: there is no response in the VNC window, and the guest cannot be accessed over the serial console either.

Expected results:
The guest works normally.

Additional info:
1. Without cancelling the migration, the guest works normally.
2. The x86 platform does not hit this problem.

Comment 1 mazhang 2016-07-11 08:05:42 UTC
After downgrading to qemu-kvm-rhev-2.3.0-31.el7.ppc64le and re-testing, the problem does not occur, so this bug is a regression.

Comment 4 Thomas Huth 2016-07-20 11:28:03 UTC
I think the problem might be that when doing migrate_cancel, close_htab_fd() is currently not called, so the spapr->htab_fd file descriptor stays valid. During the next migration attempt, the htab is migrated using the old file descriptor, so it likely misses the beginning of the htab. I think we should make sure that the htab_fd is closed when migration fails. Something like this seems to work:

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1614,10 +1613,18 @@ static int htab_load(QEMUFile *f, void *opaque, int version_id)
     return 0;
 }
 
+static void htab_cleanup(void *opaque)
+{
+    sPAPRMachineState *spapr = opaque;
+
+    close_htab_fd(spapr);
+}
+
 static SaveVMHandlers savevm_htab_handlers = {
     .save_live_setup = htab_save_setup,
     .save_live_iterate = htab_save_iterate,
     .save_live_complete_precopy = htab_save_complete,
+    .cleanup = htab_cleanup,
     .load_state = htab_load,
 };

Not sure yet whether this is really the right solution, though...
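
The suspected failure mode can be modelled outside QEMU. The following is a simplified, hypothetical sketch and not QEMU code: FakeMachine, FakeHandlers, fake_setup, fake_cleanup and run_attempt are invented names, and a plain counter stands in for the read position of the real HTAB file descriptor. It assumes, as the comment above suggests, that a descriptor left open by a cancelled attempt is reused by the next one.

```c
#include <assert.h>
#include <stddef.h>

#define HTAB_ENTRIES 8  /* hypothetical table size, for illustration only */

/* Hypothetical stand-in for the machine state (not sPAPRMachineState). */
typedef struct {
    int htab_fd;    /* -1 when closed, any other value means "open" */
    size_t fd_pos;  /* read position of the open stream */
} FakeMachine;

typedef struct {
    void (*setup)(FakeMachine *m);
    void (*cleanup)(FakeMachine *m);  /* NULL models the unpatched handlers */
} FakeHandlers;

/* Opens a fresh stream only when none is open, so a stale descriptor
 * from an earlier attempt is reused together with its old position. */
static void fake_setup(FakeMachine *m)
{
    assert(m != NULL);
    if (m->htab_fd < 0) {
        m->htab_fd = 3;  /* arbitrary placeholder descriptor number */
        m->fd_pos = 0;
    }
}

/* Models the proposed cleanup hook: close the descriptor. */
static void fake_cleanup(FakeMachine *m)
{
    assert(m != NULL);
    m->htab_fd = -1;
}

/* Runs one migration attempt that stops after at most `cancel_after`
 * entries (pass HTAB_ENTRIES for an uncancelled run). Returns the index
 * of the first table entry this attempt would send. */
static size_t run_attempt(FakeMachine *m, const FakeHandlers *h,
                          size_t cancel_after)
{
    size_t first, sent = 0;

    h->setup(m);
    first = m->fd_pos;
    while (m->fd_pos < HTAB_ENTRIES && sent < cancel_after) {
        m->fd_pos++;
        sent++;
    }
    if (h->cleanup) {
        h->cleanup(m);
    }
    return first;
}
```

With handlers that have no cleanup hook, an attempt cancelled after three entries followed by a full attempt makes the second attempt start at entry 3, so the beginning of the table is never sent. With fake_cleanup installed, the second attempt starts at entry 0 again.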

Comment 5 Thomas Huth 2016-07-21 09:47:56 UTC
I've now sent the patch with the htab_cleanup() fix upstream:
http://news.gmane.org/find-root.php?message_id=1469092894-12801-1-git-send-email-thuth@redhat.com

Comment 6 Miroslav Rezanina 2016-07-26 06:57:14 UTC
Fix included in qemu-kvm-rhev-2.6.0-16.el7

Comment 8 mazhang 2016-08-02 02:48:42 UTC
Tested this bug on qemu-kvm-rhev-2.6.0-17.el7.ppc64le 3 times, following the steps in comment#0; the problem no longer occurs.

Host:
3.10.0-481.el7.ppc64le
qemu-kvm-rhev-2.6.0-17.el7.ppc64le

Guest:
3.10.0-481.el7.ppc64

So this bug is fixed.

Comment 11 errata-xmlrpc 2016-11-07 21:22:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html