Bug 1354341

Summary:	Guest hangs after canceling a migration and then migrating again
Product:	Red Hat Enterprise Linux 7	Reporter:	mazhang <mazhang>
Component:	qemu-kvm-rhev	Assignee:	Thomas Huth <thuth>
Status:	CLOSED ERRATA	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	high	Docs Contact:
Priority:	high
Version:	7.3	CC:	amit.shah, dgibson, dgilbert, knoel, michen, qzhang, thuth, virt-maint
Target Milestone:	rc	Keywords:	Regression
Target Release:	---
Hardware:	ppc64le
OS:	Linux
Whiteboard:
Fixed In Version:	qemu-kvm-rhev-2.6.0-16.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-11-07 21:22:44 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description mazhang 2016-07-11 07:21:46 UTC

Description of problem:
The guest hangs on the destination host when a migration is canceled and then started again.


Version-Release number of selected component (if applicable):

Host:
3.10.0-462.el7.ppc64le
qemu-kvm-rhev-2.6.0-11.el7.ppc64le

Guest:
3.10.0-460.el7.ppc64

How reproducible:
100%

Steps to Reproduce:
1. Boot the guest on the source host:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pseries  \
    -nodefaults  \
    -vga std  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_XgxM_L/monitor-qmpmonitor1-20160705-073204-vVF3o9IV,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_XgxM_L/monitor-catch_monitor-20160705-073204-vVF3o9IV,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_XgxM_L/serial-serial0-20160705-073204-vVF3o9IV,server,nowait \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -device pci-ohci,id=usb1,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-legacy=off,disable-modern=on \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/RHEL-Server-7.3-ppc64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-pci,mac=9a:6c:6d:6e:6f:70,id=idpBLLIM,vectors=4,netdev=idsGsxBH,bus=pci.0,addr=05,disable-legacy=off,disable-modern=on  \
    -netdev tap,id=idsGsxBH,vhost=on \
    -m 16384  \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0  \
    -rtc base=utc,clock=host  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -monitor stdio \
    -device virtio-mouse-pci,id=mouse0 \
    -device virtio-keyboard-pci,id=kbd0 \
    -device virtio-tablet-pci,id=tbt0

2. Start a qemu-kvm process on the destination host, listening for the incoming migration on port 5800.
${source command line} -incoming tcp:0:5800

3. Start the migration, then cancel it.
(qemu) migrate -d tcp:xxxx:5800
(qemu) migrate_cancel
(qemu) qemu-kvm: socket_writev_buffer: Got err=32 for (107041/18446744073709551615)                 <--- this error also occurs on x86.

4. The qemu-kvm process on the destination host quits during step 3. Restart it, then start the migration again.
   After the migration completes and the guest is running on the destination host, check the guest status.
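
For reference, the two numbers in the step 3 error message can be decoded with a short C check. This is only a sketch: the helper names `is_broken_pipe()`, `looks_like_minus_one()` and `check_report_values()` are made up for illustration, and reading the second value as a -1 printed as an unsigned 64-bit size is an assumption, not something this report confirms.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Returns 1 if the reported error number is EPIPE ("Broken pipe"). */
static int is_broken_pipe(int err)
{
    return err == EPIPE;
}

/* Returns 1 if the value is what -1 looks like as an unsigned 64-bit number. */
static int looks_like_minus_one(uint64_t value)
{
    return value == (uint64_t)-1;
}

/* Checks the two values copied from the error message in step 3. */
static void check_report_values(void)
{
    /* err=32 is EPIPE on Linux: the peer closed the socket. */
    assert(is_broken_pipe(32));
    /* 18446744073709551615 is 2^64 - 1. */
    assert(looks_like_minus_one(18446744073709551615ULL));
}
```

On Linux both checks hold, which fits a write to a socket that the other side closed once the migration was canceled.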
 

Actual results:
The guest hangs: the VNC window does not respond, and the guest cannot be reached over the serial console either.

Expected results:
The guest works normally.

Additional info:
1. Without canceling the migration, the guest works normally.
2. The x86 platform does not hit this problem.

Comment 1 mazhang 2016-07-11 08:05:42 UTC

After downgrading to qemu-kvm-rhev-2.3.0-31.el7.ppc64le and re-testing, the problem does not occur, so this bug is a regression.

Comment 4 Thomas Huth 2016-07-20 11:28:03 UTC

I think the problem might be that when doing migrate_cancel, close_htab_fd() is currently not called, so the spapr->htab_fd file descriptor stays valid. During the next migration attempt, the htab is migrated using the old file descriptor, so it likely misses the beginning of the htab. I think we should make sure that the htab_fd is closed when migration fails. Something like this seems to work:

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1614,10 +1613,18 @@ static int htab_load(QEMUFile *f, void *opaque, int version_id)
     return 0;
 }
 
+static void htab_cleanup(void *opaque)
+{
+    sPAPRMachineState *spapr = opaque;
+
+    close_htab_fd(spapr);
+}
+
 static SaveVMHandlers savevm_htab_handlers = {
     .save_live_setup = htab_save_setup,
     .save_live_iterate = htab_save_iterate,
     .save_live_complete_precopy = htab_save_complete,
+    .cleanup = htab_cleanup,
     .load_state = htab_load,
 };

Not sure yet whether this is really the right solution, though...
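
To illustrate the suspected mechanism, here is a toy model in C. It is not QEMU code: a temporary regular file stands in for the htab file descriptor, and `struct machine`, `stream_chunk()`, `cleanup()` and `first_byte_of_second_attempt()` are hypothetical names invented for this sketch. It only shows that a descriptor kept open across a canceled attempt resumes where the first attempt stopped, while closing it makes the next attempt start from the beginning.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for the machine state; only the cached fd matters. */
struct machine {
    int htab_fd;        /* -1 means "not open" */
    const char *path;   /* file that plays the role of the hash table source */
};

/* Read up to n bytes for a migration attempt, opening the source lazily. */
static ssize_t stream_chunk(struct machine *m, char *buf, size_t n)
{
    assert(m != NULL && m->path != NULL);
    if (m->htab_fd < 0) {
        m->htab_fd = open(m->path, O_RDONLY);
        if (m->htab_fd < 0) {
            return -1;
        }
    }
    return read(m->htab_fd, buf, n);
}

/* Cleanup hook: drop the cached descriptor so the next attempt starts over. */
static void cleanup(struct machine *m)
{
    if (m->htab_fd >= 0) {
        close(m->htab_fd);
        m->htab_fd = -1;
    }
}

/*
 * Attempt 1 reads 4 bytes and is "canceled"; attempt 2 reads 4 more bytes.
 * Returns the first byte seen by attempt 2, or -1 on error.
 */
static int first_byte_of_second_attempt(int run_cleanup_on_cancel)
{
    char path[] = "/tmp/htab_demo_XXXXXX";
    char buf[4];
    int result = -1;
    int tmp = mkstemp(path);

    if (tmp < 0) {
        return -1;
    }
    if (write(tmp, "ABCDEFGH", 8) != 8) {
        close(tmp);
        unlink(path);
        return -1;
    }
    close(tmp);

    struct machine m = { -1, path };
    if (stream_chunk(&m, buf, 4) == 4) {          /* attempt 1, then cancel */
        if (run_cleanup_on_cancel) {
            cleanup(&m);
        }
        if (stream_chunk(&m, buf, 4) == 4) {      /* attempt 2 */
            result = (unsigned char)buf[0];
        }
    }
    cleanup(&m);
    unlink(path);
    return result;
}
```

In this model, `first_byte_of_second_attempt(0)` returns 'E' (the second attempt misses "ABCD"), while `first_byte_of_second_attempt(1)` returns 'A'.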

Comment 5 Thomas Huth 2016-07-21 09:47:56 UTC

I've now sent the patch with the htab_cleanup() fix upstream:
http://news.gmane.org/find-root.php?message_id=1469092894-12801-1-git-send-email-thuth@redhat.com

Comment 6 Miroslav Rezanina 2016-07-26 06:57:14 UTC

Fix included in qemu-kvm-rhev-2.6.0-16.el7

Comment 8 mazhang 2016-08-02 02:48:42 UTC

Tested this bug on qemu-kvm-rhev-2.6.0-17.el7.ppc64le 3 times following the steps in comment #0; the problem no longer occurs.

Host:
3.10.0-481.el7.ppc64le
qemu-kvm-rhev-2.6.0-17.el7.ppc64le

Guest:
3.10.0-481.el7.ppc64

So this bug has been fixed.

Comment 11 errata-xmlrpc 2016-11-07 21:22:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html