Bug 1355683

Summary: QEMU core dumps when doing postcopy migration again after canceling a migration in the postcopy phase
Product: Red Hat Enterprise Linux 7 Reporter: Qianqian Zhu <qizhu>
Component: qemu-kvm-rhev Assignee: Dr. David Alan Gilbert <dgilbert>
Status: CLOSED ERRATA QA Contact: Qianqian Zhu <qizhu>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.3 CC: amit.shah, chayang, juzhang, knoel, quintela, virt-maint
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.6.0-17.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-07 21:23:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Qianqian Zhu 2016-07-12 08:51:13 UTC
Description of problem:
QEMU core dumps when doing postcopy migration again after canceling a migration in the postcopy phase.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-12.el7.x86_64
kernel-3.10.0-461.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Launch the source guest:
gdb /usr/libexec/qemu-kvm
(gdb) run -name linux -cpu Westmere,check -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 7bef3814-631a-48bb-bae8-2b1de75f7a13 -nodefaults -monitor stdio -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot order=c,menu=on -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/nfsmount/RHEL-Server-7.3-64-virtio.qcow2,if=none,cache=writeback,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on -spice port=5901,disable-ticketing -vga qxl -global qxl-vga.revision=3 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=3C:D9:2B:09:AB:44,bus=pci.0,addr=0x3

2. Launch the guest on the destination host with the same command line.
3. Start postcopy migration, then cancel it immediately:
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:10.73.72.55:1234
(qemu) migrate_start_postcopy
(qemu) migrate_cancel

4. Launch the guest on the destination host again.
5. Start postcopy migration again:
(qemu) migrate -d tcp:10.73.72.55:1234
(qemu) migrate_start_postcopy
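
For reference, the same reproduce sequence can be driven over QMP instead of the HMP monitor. The sketch below is only illustrative: the QMP port and timing are assumptions (a real script would poll query-migrate until the migration is active before issuing migrate-start-postcopy), but the QMP command names used are the standard ones.

import json
import socket

def qmp_command(sock, reader, name, **arguments):
    # Send one QMP command and return the first reply that is not an async event.
    cmd = {"execute": name}
    if arguments:
        cmd["arguments"] = arguments
    sock.sendall((json.dumps(cmd) + "\n").encode())
    while True:
        msg = json.loads(reader.readline())
        if "event" not in msg:
            return msg

def qmp_connect(host, port):
    # Connect to the QMP socket, consume the greeting, negotiate capabilities.
    sock = socket.create_connection((host, port))
    reader = sock.makefile("r")
    json.loads(reader.readline())                      # greeting
    qmp_command(sock, reader, "qmp_capabilities")
    return sock, reader

# Source QEMU assumed to have been started with an extra
# "-qmp tcp::4444,server,nowait" (port is illustrative).
src, src_r = qmp_connect("localhost", 4444)

# Step 3: enable postcopy, start the migration, switch to postcopy, cancel.
qmp_command(src, src_r, "migrate-set-capabilities",
            capabilities=[{"capability": "postcopy-ram", "state": True}])
qmp_command(src, src_r, "migrate", uri="tcp:10.73.72.55:1234")
qmp_command(src, src_r, "migrate-start-postcopy")      # real code should poll query-migrate first
qmp_command(src, src_r, "migrate_cancel")

# Steps 4-5: relaunch the destination QEMU by hand, then retry; with
# qemu-kvm-rhev-2.6.0-12 this second attempt aborted the source QEMU.
qmp_command(src, src_r, "migrate", uri="tcp:10.73.72.55:1234")
qmp_command(src, src_r, "migrate-start-postcopy")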

Actual results:
Qemu core dump:
(qemu) 2016-07-12T08:42:34.819057Z qemu-kvm: invalid runstate transition: 'finish-migrate' -> 'finish-migrate'

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff567fe700 (LWP 28314)]
0x00007fffec5041d7 in raise () from /lib64/libc.so.6
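
The abort comes from QEMU's runstate machinery: every runstate change is checked against a table of allowed (from, to) transitions and QEMU abort()s when the requested pair is not in the table. After the cancelled postcopy attempt the source apparently stayed in 'finish-migrate', so the second attempt requested 'finish-migrate' -> 'finish-migrate', which is not an allowed pair. The following is only an illustrative Python sketch of that kind of allow-list check, not QEMU's actual code (the real table and runstate_set() are in QEMU's C source), and the listed transitions are an assumed subset.

# Illustrative allow-list check; the real table has many more entries.
ALLOWED_TRANSITIONS = {
    ("running", "finish-migrate"),
    ("paused", "finish-migrate"),
    ("finish-migrate", "postmigrate"),
    ("postmigrate", "running"),
}

current_state = "running"

def runstate_set(new_state):
    # QEMU prints "invalid runstate transition: '...' -> '...'" and abort()s
    # when the requested pair is not in the table.
    global current_state
    if (current_state, new_state) not in ALLOWED_TRANSITIONS:
        raise SystemExit(
            f"invalid runstate transition: '{current_state}' -> '{new_state}'")
    current_state = new_state

runstate_set("finish-migrate")   # first migration attempt: allowed
runstate_set("finish-migrate")   # second attempt without leaving the state: aborts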

Expected results:
Postcopy migration succeeds.

Additional info:

Comment 2 Dr. David Alan Gilbert 2016-07-15 10:12:36 UTC
Yes, I can recreate this.

It should be an unusual circumstance in practice; cancelling after postcopy has started is unsafe unless you control the destination. If the destination hasn't started running, it's OK to restart the source and try again, so libvirt could potentially do that; however, it would issue a 'cont' to the source before retrying the migration, so it wouldn't hit this case.
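
A minimal sketch of that recovery sequence, assuming the destination never started running: cancel, resume the source with "cont" so it leaves 'finish-migrate', and only then retry the migration. The QMP-over-TCP setup and port are illustrative, as in the reproducer sketch above.

import json
import socket

sock = socket.create_connection(("localhost", 4444))    # illustrative QMP port
reader = sock.makefile("r")
json.loads(reader.readline())                           # QMP greeting

def execute(name, **arguments):
    # Send one QMP command and return the first reply that is not an async event.
    cmd = {"execute": name}
    if arguments:
        cmd["arguments"] = arguments
    sock.sendall((json.dumps(cmd) + "\n").encode())
    while True:
        msg = json.loads(reader.readline())
        if "event" not in msg:
            return msg

execute("qmp_capabilities")
execute("migrate_cancel")                       # abandon the postcopy attempt
execute("cont")                                 # resume the source guest first
execute("migrate", uri="tcp:10.73.72.55:1234")  # then retry against a fresh destination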

I'll look into it.

Comment 4 Qianqian Zhu 2016-07-20 08:18:16 UTC
Test with:
qemu-kvm-rhev-2.6.0-13.el7.1355683a.x86_64
kernel-3.10.0-461.el7.x86_64

Steps:
1. Launch the source guest.
2. Launch the guest on the destination host with the same command line.
3. Start postcopy migration, then cancel it immediately:
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:10.73.72.55:1234
(qemu) migrate_start_postcopy
(qemu) migrate_cancel

4. Launch the guest on the destination host again.
5. Start postcopy migration again:
(qemu) migrate -d tcp:10.73.72.55:1234
(qemu) migrate_start_postcopy

Results:
No core dump; postcopy migration succeeds and the guest works well after step 5.


Cancelling a normal (precopy) migration succeeds, but with the error below:
(qemu) migrate_cancel 
(qemu) 2016-07-20T08:14:09.855908Z qemu-kvm: socket_writev_buffer: Got err=32 for (73885/18446744073709551615)

Cancelling in the postcopy phase:
(qemu) 2016-07-20T08:06:34.581064Z qemu-kvm: socket_writev_buffer: Got err=32 for (131337/18446744073709551615)
2016-07-20T08:06:34.581090Z qemu-kvm: RP: Received invalid message 0x0000 length 0x0000
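
For what it's worth, the err=32 in the socket_writev_buffer messages is the Linux errno value EPIPE ("Broken pipe"): once the migration has been cancelled the peer closes the socket, so the remaining write fails. A quick way to confirm the mapping:

import errno
import os

print(errno.EPIPE)                  # 32
print(os.strerror(errno.EPIPE))     # Broken pipe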

Comment 6 Miroslav Rezanina 2016-07-29 09:12:12 UTC
Fix included in qemu-kvm-rhev-2.6.0-17.el7

Comment 8 Qianqian Zhu 2016-08-23 05:44:48 UTC
Verified with:
qemu-kvm-rhev-2.6.0-20.el7.x86_64
kernel-3.10.0-491.el7.x86_64

Steps same as comment 4.
cli:
/usr/libexec/qemu-kvm -name linux -cpu SandyBridge -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 7bef3814-631a-48bb-bae8-2b1de75f7a13 -nodefaults -monitor stdio -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot order=c,menu=on -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/mntnfs/RHEL-Server-7.3-64-virtio.qcow2,if=none,cache=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on -spice port=5901,disable-ticketing -vga qxl -global qxl-vga.revision=3 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=3C:D9:2B:09:AB:44,bus=pci.0,addr=0x3 -qmp tcp::5555,server,nowait

Result:
Postcopy migration succeeds and the guest works well.
Cancelling still produces the same warnings:
2016-07-20T08:06:34.581064Z qemu-kvm: socket_writev_buffer: Got err=32 for (131337/18446744073709551615)
2016-07-20T08:06:34.581090Z qemu-kvm: RP: Received invalid message 0x0000 length 0x0000

Comment 9 Qianqian Zhu 2016-08-23 05:45:38 UTC
Moving to VERIFIED as per comment 8

Comment 11 errata-xmlrpc 2016-11-07 21:23:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html