Bug 1368422

Summary: Post-copy migration fails with XBZRLE compression
Product: Red Hat Enterprise Linux 7
Reporter: Milan Zamazal <mzamazal>
Component: qemu-kvm-rhev
Assignee: Dr. David Alan Gilbert <dgilbert>
Status: CLOSED ERRATA
QA Contact: xianwang <xianwang>
Severity: unspecified
Priority: high
Version: 7.3
CC: chayang, dgilbert, hhuang, jherrman, juzhang, michal.skrivanek, mrezanin, mtessun, mzamazal, qizhu, qzhang, virt-maint, xianwang
Target Milestone: rc
Keywords: ZStream
Target Release: 7.4
Hardware: Unspecified
OS: Unspecified
Fixed In Version: qemu-kvm-rhev-2.8.0-1
Doc Type: Bug Fix
Doc Text:
Using post-copy migration with XOR-based zero run-length encoding (XBZRLE) compression previously caused the migration to fail and the guest to stay in a paused state. This update disables XBZRLE page compression for post-copy migration, and thus avoids the described problem.
Cloned by: 1395360 (view as bug list)
Last Closed: 2017-08-01 23:34:44 UTC
Type: Bug
Bug Blocks: 1395265, 1395360, 1401400

Description Milan Zamazal 2016-08-19 10:49:44 UTC
Description of problem:

When I migrate a VM with XBZRLE compression enabled and switch the migration to post-copy mode after several unsuccessful pre-copy iterations, the migration fails and the VM remains in a paused state.

Version-Release number of selected component (if applicable):

2.6.0-20.el7.x86_64

How reproducible:

Most of the time.

Steps to Reproduce:

1. Run a VM:

     virsh create DOMAIN.xml

2. Run a memory intensive application in the VM.

3. Limit migration bandwidth to prevent success of pre-copy migration, e.g.:

     virsh migrate-setspeed DOMAIN 10

4. Migrate the VM with XBZRLE compression and postcopy enabled:

     virsh migrate DOMAIN qemu+tcp://root@HOST/system --verbose --live --compressed --comp-methods xbzrle --postcopy

5. Wait a couple of iterations (progress can be watched as noted after these steps), then switch to post-copy from another shell:

     virsh migrate-postcopy DOMAIN

6. The migration fails with an error like this:

     qemu-kvm: Unknown combination of migration flags: 0x40 (postcopy mode)
     qemu-kvm: error while loading state section id 2(ram)
     qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -22

   and the VM gets paused.
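
Note on step 5: iteration progress can be watched from another shell with, for example:

     virsh domjobinfo DOMAIN

   (the exact fields vary with the libvirt version, but the data-processed and memory-dirtying figures show whether pre-copy is converging).

For context on the error in step 6: flag 0x40 corresponds to RAM_SAVE_FLAG_XBZRLE in QEMU's migration code, i.e. an XBZRLE-compressed page. The post-copy listen thread on the destination does not handle XBZRLE pages, so it rejects the incoming stream (loadvm failed: -22, i.e. -EINVAL) and the migration aborts; the fix described in the Doc Text stops the source from sending XBZRLE pages once post-copy is active.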

Actual results:

The migration fails and the VM gets paused.

Expected results:

The migration succeeds.
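
For background, XBZRLE compresses a dirty page by comparing it with a cached copy of the page's previous contents: XORing the two turns unchanged bytes into zeros, and the result is sent as runs of "unchanged" lengths plus the literal changed bytes. Below is a minimal standalone C sketch of that idea, written for this report; the names (xbzrle_encode, PAGE) and the record layout are illustrative only, not QEMU's actual encoder or wire format.

/* Illustrative sketch of the XBZRLE idea, not QEMU's implementation.
 * A dirty page is compared against a cached copy of its previous
 * contents (equivalent to XORing and finding zero runs) and encoded as
 * <unchanged-run length, changed-run length, changed bytes> records. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define PAGE 4096

/* Returns the encoded size, or -1 if the encoding would not fit in
 * outmax (a real migrator would then just send the raw page). */
static int xbzrle_encode(const uint8_t *oldp, const uint8_t *newp,
                         uint8_t *out, int outmax)
{
    int i = 0, pos = 0;
    while (i < PAGE) {
        /* run of unchanged bytes (their XOR would be zero) */
        int zrun = 0;
        while (i + zrun < PAGE && oldp[i + zrun] == newp[i + zrun])
            zrun++;
        i += zrun;
        if (i >= PAGE)
            break;
        /* run of changed bytes that must be sent literally */
        int nzrun = 0;
        while (i + nzrun < PAGE && oldp[i + nzrun] != newp[i + nzrun])
            nzrun++;
        if (pos + 4 + nzrun > outmax)
            return -1;
        out[pos++] = (uint8_t)(zrun & 0xff);
        out[pos++] = (uint8_t)(zrun >> 8);
        out[pos++] = (uint8_t)(nzrun & 0xff);
        out[pos++] = (uint8_t)(nzrun >> 8);
        memcpy(out + pos, newp + i, nzrun);
        pos += nzrun;
        i += nzrun;
    }
    return pos;
}

int main(void)
{
    static uint8_t oldp[PAGE], newp[PAGE], out[PAGE];
    memcpy(newp, oldp, PAGE);
    newp[10]++;              /* dirty two bytes of the page */
    newp[2000]++;
    int n = xbzrle_encode(oldp, newp, out, sizeof(out));
    printf("%d-byte page encoded to %d bytes\n", PAGE, n);
    return 0;
}

Because only the changed bytes travel, a workload like the "test" program below, which touches only a few bytes per page, compresses extremely well under XBZRLE.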

Comment 11 xianwang 2017-02-14 03:17:33 UTC
This bug has been verified on both ppc and x86.

Bug reproduced on the PPC platform:
Version-Release number of selected component (if applicable):
Host:
kernel: 3.10.0-558.el7.ppc64le
qemu-kvm-rhev-2.6.0-22.el7.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch
Guest:
3.10.0-558.el7.ppc64le

Steps to Reproduce:

1. Boot a VM on the src host with this qemu command line:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -nodefaults  \
    -machine pseries-rhel7.3.0 \
    -vga std  \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x4 \
    -chardev socket,id=devorg.qemu.guest_agent.0,path=/tmp/virtio_port-org.qemu.guest_agent.0-20160516-164929-dHQ00mMM,server,nowait \
    -device virtserialport,chardev=devorg.qemu.guest_agent.0,name=org.qemu.guest_agent.0,id=org.qemu.guest_agent.0,bus=virtio_serial_pci0.0  \
    -device nec-usb-xhci,id=usb1,addr=1d.7,multifunction=on,bus=pci.0 \
    -drive file=/root/RHEL.7.3.qcow2,if=none,id=blk1 \
    -device virtio-blk-pci,scsi=off,drive=blk1,id=blk-disk1,bootindex=1 \
    -drive id=drive_cd1,if=none,snapshot=off,aio=native,cache=none,media=cdrom,file=/root/RHEL-7.3-20161019.0-Server-ppc64le-dvd1.iso \
    -device scsi-cd,id=cd1,drive=drive_cd1,bootindex=2 \
    -device virtio-net-pci,mac=9a:7b:7c:7d:7e:71,id=idtlLxAk,vectors=4,netdev=idlkwV8e,bus=pci.0,addr=05 \
    -netdev tap,id=idlkwV8e,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 8G \
    -smp 2 \
    -cpu host \
    -device usb-kbd \
    -device usb-tablet \
    -qmp tcp:0:8881,server,nowait \
    -vnc :1  \
    -msg timestamp=on \
    -rtc base=localtime,clock=vm,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -monitor stdio \
    -enable-kvm
2. Boot a VM on the dst host with the same qemu command line as the src host, appending "-incoming tcp:0:5801".
3. In the guest, run "test", a program that generates memory-intensive load and keeps producing dirty pages during migration; its source is given under "Additional info" below:
#gcc test.c -o test
#./test
4. Set the migration capabilities in HMP and start the migration:
(qemu) migrate_set_capability xbzrle on
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:10.19.112.39:5801
5. Check the migration status; once dirty pages are being produced, switch to post-copy:
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
dirty sync count: 7
dirty pages rate: 13587 pages
.......other info....
(qemu) migrate_start_postcopy
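
Note: the XBZRLE page cache can also be resized in HMP if the default (64 MiB in QEMU of this era) is too small for the guest's working set, for example:

(qemu) migrate_set_cache_size 512m

This is incidental here; the bug reproduces with the default cache size.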

Actual results:

The migration fails and the VM gets paused.
In src HMP:
(qemu) migrate_start_postcopy 
(qemu) 2017-02-13T06:56:00.913043Z qemu-kvm: RP: Sibling indicated error 1
2017-02-13T06:56:01.105488Z qemu-kvm: socket_writev_buffer: Got err=104 for (32768/18446744073709551615)
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on 
Migration status: failed
total time: 0 milliseconds
(qemu) info status 
VM status: paused (postmigrate)

While in dst HMP:
(qemu) 2017-02-13T06:56:00.911136Z qemu-kvm: Unknown combination of migration flags: 0x40 (postcopy m)
2017-02-13T06:56:00.911222Z qemu-kvm: error while loading state section id 2(ram)
2017-02-13T06:56:00.911233Z qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -22
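
(For reference, err=104 is ECONNRESET: the destination rejected the XBZRLE-flagged page and closed the connection, so the source's socket write failed and the migration was marked as failed.)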


Additional info:

(1) The program specified in step 3:
#gcc test.c -o test
#./test
#cat test.c
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>   /* for alarm() */

/* SIGALRM handler: exit cleanly when the timer fires. */
void wakeup(int sig)
{
    exit(0);
}

int main(void)
{
    signal(SIGALRM, wakeup);
    alarm(120);   /* stop after 120 seconds */

    /* 40960 pages of 4096 bytes = 160 MiB working set */
    char *buf = (char *) calloc(40960, 4096);

    while (1) {
        int i;
        /* Touch one byte in every 1 KiB, dirtying all pages each pass. */
        for (i = 0; i < 40960 * 4; i++) {
            buf[i * 4096 / 4]++;
        }
        printf(".");
        fflush(stdout);   /* flush so progress dots actually appear */
    }
}
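
For a sense of scale: the buffer is 40960 x 4096 bytes = 160 MiB, and each pass of the inner loop increments one byte in every 1 KiB, so every pass dirties the whole 160 MiB working set. As long as one pass completes faster than the migration link can transfer 160 MiB (easily true under the 10 MiB/s cap used in the libvirt reproducer), pre-copy cannot converge, forcing the switch to post-copy that triggers the bug.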

Bug verified on the PPC platform, with the following versions:
Host:
kernel: 3.10.0-558.el7.ppc64le
qemu-kvm-rhev-2.8.0-1.el7.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch
Guest:
3.10.0-558.el7.ppc64le

Steps:
The same as in the reproduction above.

Actual results:
The migration succeeded and the VM is running.
In src HMP:
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off 
Migration status: completed
dirty sync count: 22
postcopy request count: 1492
In dst HMP:
(qemu) info status
VM status: running

Bug verified on the x86 platform, with the following versions:
Host:
3.10.0-563.el7.x86_64
qemu-kvm-rhev-2.8.0-1.el7.x86_64
Guest:
3.10.0-514.10.1.el7.x86_64

Steps:
The same as in the reproduction above.

Actual results:
The migration succeeded and the VM is running.
In src HMP:
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off 
Migration status: completed
dirty sync count: 14
postcopy request count: 3010
In dst HMP:
(qemu) info status
VM status: running

So, this bug is fixed.

Comment 13 errata-xmlrpc 2017-08-01 23:34:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392
