Bug 1395360

Summary: Post-copy migration fails with XBZRLE compression
Product: Red Hat Enterprise Linux 7 Reporter: Marcel Kolaja <mkolaja>
Component: qemu-kvm-rhev Assignee: Dr. David Alan Gilbert <dgilbert>
Status: CLOSED ERRATA QA Contact: xianwang <xianwang>
Severity: unspecified Docs Contact:
Priority: high    
Version: 7.3 CC: chayang, dgilbert, hhuang, jherrman, juzhang, knoel, michal.skrivanek, mrezanin, mzamazal, qizhu, qzhang, virt-maint, xianwang, zhengtli
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.6.0-28.el7_3.1 Doc Type: Bug Fix
Doc Text:
Using post-copy migration with XOR-based zero run-length encoding (XBZRLE) compression previously caused the migration to fail and the guest to stay in a paused state. This update disables XBZRLE page compression for post-copy migration, and thus avoids the described problem.
Story Points: ---
Clone Of: 1368422 Environment:
Last Closed: 2017-01-17 20:10:19 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1368422    
Bug Blocks:    

Description Marcel Kolaja 2016-11-15 19:12:16 UTC
This bug has been copied from bug #1368422 and has been proposed
to be backported to 7.3 z-stream (EUS).

Comment 3 Miroslav Rezanina 2016-11-30 10:43:21 UTC
Fix included in qemu-kvm-rhev-2.6.0-28.el7_3.1
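
The Doc Text describes the fix as disabling XBZRLE page compression for post-copy migration. A minimal sketch of that kind of sender-side guard follows. The struct and function names are hypothetical and are not taken from the QEMU source; the actual patch may be structured differently.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the sender's state. These names are
 * illustrative only and do not come from QEMU.
 */
struct send_state {
    bool xbzrle_enabled;  /* "migrate_set_capability xbzrle on" */
    bool in_postcopy;     /* set once "migrate_start_postcopy" has run */
};

/* Return true if a dirty page may be sent XBZRLE-encoded. */
bool may_send_xbzrle(const struct send_state *s)
{
    if (!s->xbzrle_enabled) {
        return false;
    }
    /* The guard the Doc Text describes: no XBZRLE pages in post-copy. */
    if (s->in_postcopy) {
        return false;
    }
    return true;
}
```

With such a guard the xbzrle capability can stay "on" for the pre-copy phase, which matches the verified "info migrate" output where xbzrle is still listed as on and the migration completes.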

Comment 5 Miroslav Rezanina 2016-12-01 10:26:43 UTC
Hi Qunfang,

There are issues with the build target configuration. The package with the fix should be qemu-kvm-rhev-2.6.0-28.el7_3.1. As soon as the target is fixed, I'll build the correct version.

Version -28 added some ARM-only fixes, so it wasn't released for x86_64/ppc64. However, we will keep -28 for z-stream instead of -27.

Mirek

Comment 6 Qunfang Zhang 2016-12-02 02:08:27 UTC
Hi, Mirek

Got it, thanks for the information.

Comment 7 xianwang 2016-12-05 13:14:35 UTC
Bug reproduced:

This bug has been reproduced on the PPC platform.

Version-Release number of selected component (if applicable):
Host:
kernel:3.10.0-514.el7.ppc64le
qemu-kvm-rhev-2.6.0-20.el7.ppc64le
SLOF-20160223-6.gitdbbfda4.el7
Guest:
3.10.0-514.el7.ppc64le

Steps to Reproduce:

1. This reproduction uses a single host, i.e. src = dst.
Boot a VM with the qemu CLI on the ppc64le host (the full command line is under "Additional info"), then boot a second VM on the same host with the same command line plus "-incoming tcp:0:5801".

2. In the guest, run "test", a memory-intensive program that keeps producing dirty pages during migration (its source is under "Additional info").
#gcc test.c -o test
#./test

3. Set the migration configuration in HMP and start the migration
(qemu) migrate_set_speed 10
(qemu) migrate_set_capability xbzrle on
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:127.0.0.1:5801

4. Check the migration status; once dirty pages are being produced, switch to post-copy.
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on
dirty sync count: 8
dirty pages rate: 5650 pages
.......other info....
(qemu) migrate_start_postcopy


Actual results:

The migration fails and the VM gets paused.
In src HMP:
(qemu) 2016-12-05T08:48:52.199109Z qemu-kvm: RP: Sibling indicated error 1
2016-12-05T08:48:52.279863Z qemu-kvm: socket_writev_buffer: Got err=104 for (32768/18446744073709551615)
(qemu) info status
VM status: paused (postmigrate)
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on 
Migration status: failed
total time: 0 milliseconds
While in dst HMP:
(qemu) 2016-12-05T08:48:52.157951Z qemu-kvm: Unknown combination of migration flags: 0x40 (postcopy mode)
2016-12-05T08:48:52.158039Z qemu-kvm: error while loading state section id 2(ram)
2016-12-05T08:48:52.158049Z qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -22
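
The numbers in these messages can be decoded with a few lines of C. The sketch below rests on assumptions that should be checked against the QEMU source and the host's errno.h: that 0x40 is the value of QEMU's RAM_SAVE_FLAG_XBZRLE page flag, that err=104 and -22 are Linux errno values (ECONNRESET and EINVAL on Linux), and that 18446744073709551615 is most likely a -1 return value printed as an unsigned 64-bit number. The function names are made up for this example.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Assumed value of QEMU's RAM_SAVE_FLAG_XBZRLE; verify in migration/ram.c. */
#define ASSUMED_RAM_SAVE_FLAG_XBZRLE 0x40u

/* Name the page-type flag reported by the destination, if recognized. */
const char *flag_name(unsigned flags)
{
    return (flags == ASSUMED_RAM_SAVE_FLAG_XBZRLE) ? "XBZRLE page" : "unknown";
}

/* Describe an errno value given either as 104 or negated as -22. */
const char *err_text(int err)
{
    return strerror(err < 0 ? -err : err);
}

/* True if a logged length equals 2^64 - 1, i.e. -1 printed as unsigned. */
int is_minus_one(uint64_t logged)
{
    return logged == UINT64_MAX;
}
```

Read this way, the destination rejected an XBZRLE-encoded page while in post-copy mode and failed the load with an invalid-argument error, and the source then saw its connection reset.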

Additional info:
(1) The full qemu CLI:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -nodefaults  \
    -machine pseries-rhel7.3.0 \
    -vga std  \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x4 \
    -device ich9-usb-ehci1,id=usb1,addr=1d.7,multifunction=on,bus=pci.0 \
    -chardev socket,id=console0,path=/tmp/console0,server,nowait \
    -device spapr-vty,chardev=console0 \
    -chardev socket,id=console1,path=/tmp/console1,server,nowait \
    -device spapr-vty,chardev=console1 \
    -drive file=/root/R1.qcow2,if=none,id=blk1 \
    -device virtio-blk-pci,scsi=off,drive=blk1,id=blk-disk1,bootindex=1 \
    -device virtio-net-pci,mac=9a:7b:7c:7d:7e:71,id=idtlLxAk,vectors=4,netdev=idlkwV8e,bus=pci.0,addr=05 \
    -netdev tap,id=idlkwV8e,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 4096 \
    -smp 4 \
    -cpu host \
    -device usb-kbd \
    -device usb-mouse \
    -qmp tcp:0:8881,server,nowait \
    -vnc :1  \
    -msg timestamp=on \
    -rtc base=localtime,clock=vm,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -monitor stdio \
    -enable-kvm
(2) The program used in step 2:
#gcc test.c -o test
#./test
#cat test.c
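#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

/* SIGALRM handler: end the test once the timer expires. */
static void wakeup(int sig)
{
    (void)sig;
    _exit(0);   /* _exit() is async-signal-safe; exit() is not */
}

int main(void)
{
    /* Stop automatically after 120 seconds. */
    signal(SIGALRM, wakeup);
    alarm(120);

    /* 40960 * 4096 bytes = 160 MiB, zero-initialized. */
    char *buf = calloc(40960, 4096);
    if (buf == NULL) {
        perror("calloc");
        return 1;
    }

    /* Keep dirtying memory: increment one byte every 1024 bytes. */
    while (1) {
        int i;
        for (i = 0; i < 40960 * 4; i++) {
            buf[i * 4096 / 4]++;
        }
        printf(".");
    }
}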
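
To see how much memory the loop in test.c dirties per pass, the arithmetic can be written out as below. This is only a sketch: the constants mirror the ones in test.c, the function names are invented here, and the page size is a parameter because the report does not state the guest page size (4096 and 65536 are example inputs only).

```c
#include <stddef.h>

#define BUF_BYTES (40960UL * 4096UL)  /* calloc(40960, 4096): 160 MiB */
#define STRIDE    (4096UL / 4UL)      /* index step from i * 4096 / 4 */
#define WRITES    (40960UL * 4UL)     /* loop iterations per pass */

/* Number of distinct pages written in one pass over the buffer. */
unsigned long pages_per_pass(unsigned long page_size)
{
    unsigned long span = WRITES * STRIDE;       /* bytes covered per pass */
    return (span + page_size - 1) / page_size;  /* round up to whole pages */
}

/* Writes landing in each page per pass (page_size a multiple of STRIDE). */
unsigned long writes_per_page(unsigned long page_size)
{
    return page_size / STRIDE;
}
```

Since WRITES * STRIDE equals BUF_BYTES, every pass touches the whole 160 MiB buffer, so each page is dirtied again on every iteration of the outer loop.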

Bug verification:
The fix is verified on both ppc64le and x86 with qemu-kvm-rhev-2.6.0-28.el7_3.1.
Verification on ppc64le:
Host:
3.10.0-514.el7.ppc64le
qemu-kvm-rhev-2.6.0-28.el7_3.1.ppc64le
SLOF-20160223-6.gitdbbfda4.el7
Guest:
3.10.0-514.el7.ppc64le

Steps:
Same as in the reproduction above.

Actual results:
The migration succeeded and the VM is running.
In src HMP:
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on 
Migration status: completed
dirty sync count: 7
postcopy request count: 15
In dst HMP:
(qemu) info status
VM status: running

Verification on x86_64:
Host:
3.10.0-514.el7.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.1.x86_64
Guest:
3.10.0-514.el7.x86_64

Steps:
Same as in the reproduction above.

Actual results:
The migration succeeded and the VM is running.
In src HMP:
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on 
Migration status: completed
dirty sync count: 4
postcopy request count: 46
In dst HMP:
(qemu) info status
VM status: running

So this bug is verified; its status should be changed to VERIFIED.
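
For automated re-testing, the pass/fail check above can be reduced to reading the "Migration status:" line from captured "info migrate" output. The helper below is a minimal sketch: the function name is hypothetical, and how the HMP text gets captured (for example from the "-monitor stdio" stream) is left out.

```c
#include <string.h>

/*
 * Copy the value of the "Migration status:" line from captured
 * "info migrate" text into out. Returns 0 on success, -1 if the
 * line is absent or out has no room.
 */
int migration_status(const char *text, char *out, size_t outlen)
{
    static const char key[] = "Migration status: ";
    const char *p = strstr(text, key);
    size_t n;

    if (p == NULL || outlen == 0) {
        return -1;
    }
    p += sizeof(key) - 1;        /* skip past the key */
    n = strcspn(p, "\r\n");      /* value runs to the end of the line */
    if (n >= outlen) {
        n = outlen - 1;          /* truncate to fit the caller's buffer */
    }
    memcpy(out, p, n);
    out[n] = '\0';
    return 0;
}
```

Applied to the transcripts in this comment, it would yield "failed" for the reproduction and "completed" for both verification runs.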

Comment 9 Qunfang Zhang 2016-12-12 03:01:26 UTC
Setting to VERIFIED according to comment 7.

Comment 11 errata-xmlrpc 2017-01-17 20:10:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0115.html