Red Hat Bugzilla – Bug 1368422
Post-copy migration fails with XBZRLE compression
Last modified: 2017-08-01 23:29:59 EDT
Description of problem:
When I migrate a VM with XBZRLE compression enabled and switch the migration to post-copy mode after several unsuccessful pre-copy iterations, the migration fails and the VM remains paused.

Version-Release number of selected component (if applicable):
2.6.0-20.el7.x86_64

How reproducible:
Most of the time.

Steps to Reproduce:
1. Run a VM:
   virsh create DOMAIN.xml
2. Run a memory-intensive application in the VM.
3. Limit the migration bandwidth so that pre-copy migration cannot complete, e.g.:
   virsh migrate-setspeed DOMAIN 10
4. Migrate the VM with XBZRLE compression and post-copy enabled:
   virsh migrate DOMAIN qemu+tcp://root@HOST/system --verbose --live --compressed --comp-methods xbzrle --postcopy
5. Wait a couple of iterations, then switch to post-copy from another shell:
   virsh migrate-postcopy DOMAIN
6. The migration fails with an error like this:
   qemu-kvm: Unknown combination of migration flags: 0x40 (postcopy mode)
   qemu-kvm: error while loading state section id 2(ram)
   qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -22
   and the VM gets paused.

Actual results:
The migration fails and the VM gets paused.

Expected results:
The migration succeeds.
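For convenience, the reproduction steps above can be scripted from the source host. The following is only a rough sketch: the DOMAIN and HOST values, the backgrounding of virsh migrate (instead of using a second shell), and the 60-second delay before switching to post-copy are illustrative assumptions, not part of the original report.

#!/bin/sh
# Sketch of the reproduction flow; DOMAIN and HOST are placeholders.
DOMAIN=DOMAIN                              # assumed guest name
HOST=HOST                                  # assumed destination host

virsh create ${DOMAIN}.xml                 # start the guest from its XML definition
# ... start a memory-intensive workload inside the guest here ...
virsh migrate-setspeed ${DOMAIN} 10        # cap bandwidth so pre-copy cannot converge
virsh migrate ${DOMAIN} qemu+tcp://root@${HOST}/system \
      --verbose --live --compressed --comp-methods xbzrle --postcopy &
sleep 60                                   # assumed delay: let a few pre-copy iterations run
virsh migrate-postcopy ${DOMAIN}           # switch the running migration to post-copy
wait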
This bug has been verified on both ppc and x86.

Bug reproduced on the PPC platform:

Version-Release number of selected component (if applicable):
Host:
kernel: 3.10.0-558.el7.ppc64le
qemu-kvm-rhev-2.6.0-22.el7.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch
Guest:
3.10.0-558.el7.ppc64le

Steps to Reproduce:
1. Boot a VM on the source host with this qemu command line:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -sandbox off \
    -nodefaults \
    -machine pseries-rhel7.3.0 \
    -vga std \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x4 \
    -chardev socket,id=devorg.qemu.guest_agent.0,path=/tmp/virtio_port-org.qemu.guest_agent.0-20160516-164929-dHQ00mMM,server,nowait \
    -device virtserialport,chardev=devorg.qemu.guest_agent.0,name=org.qemu.guest_agent.0,id=org.qemu.guest_agent.0,bus=virtio_serial_pci0.0 \
    -device nec-usb-xhci,id=usb1,addr=1d.7,multifunction=on,bus=pci.0 \
    -drive file=/root/RHEL.7.3.qcow2,if=none,id=blk1 \
    -device virtio-blk-pci,scsi=off,drive=blk1,id=blk-disk1,bootindex=1 \
    -drive id=drive_cd1,if=none,snapshot=off,aio=native,cache=none,media=cdrom,file=/root/RHEL-7.3-20161019.0-Server-ppc64le-dvd1.iso \
    -device scsi-cd,id=cd1,drive=drive_cd1,bootindex=2 \
    -device virtio-net-pci,mac=9a:7b:7c:7d:7e:71,id=idtlLxAk,vectors=4,netdev=idlkwV8e,bus=pci.0,addr=05 \
    -netdev tap,id=idlkwV8e,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 8G \
    -smp 2 \
    -cpu host \
    -device usb-kbd \
    -device usb-tablet \
    -qmp tcp:0:8881,server,nowait \
    -vnc :1 \
    -msg timestamp=on \
    -rtc base=localtime,clock=vm,driftfix=slew \
    -boot order=cdn,once=c,menu=off,strict=off \
    -monitor stdio \
    -enable-kvm
2. Boot a VM on the destination host with the same qemu command line, appending "-incoming tcp:0:5801".
3. In the guest, run "test", a program that keeps guest memory busy and produces dirty pages during migration (its source is given under "Additional info" below):
   # gcc test.c -o test
   # ./test
4. Set the migration capabilities in HMP and start the migration:
   (qemu) migrate_set_capability xbzrle on
   (qemu) migrate_set_capability postcopy-ram on
   (qemu) migrate -d tcp:10.19.112.39:5801
5. Check the migration status; once dirty pages are being produced, switch to post-copy:
   (qemu) info migrate
   capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
   dirty sync count: 7
   dirty pages rate: 13587 pages
   .......other info....
   (qemu) migrate_start_postcopy

Actual results:
The migration fails and the VM gets paused.
In the source HMP:

(qemu) migrate_start_postcopy
(qemu) 2017-02-13T06:56:00.913043Z qemu-kvm: RP: Sibling indicated error 1
2017-02-13T06:56:01.105488Z qemu-kvm: socket_writev_buffer: Got err=104 for (32768/18446744073709551615)
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on
Migration status: failed
total time: 0 milliseconds
(qemu) info status
VM status: paused (postmigrate)

While in the destination HMP:

(qemu) 2017-02-13T06:56:00.911136Z qemu-kvm: Unknown combination of migration flags: 0x40 (postcopy m)
2017-02-13T06:56:00.911222Z qemu-kvm: error while loading state section id 2(ram)
2017-02-13T06:56:00.911233Z qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -22

Additional info:
(1) The program used in step 3 to keep dirtying guest memory:

# gcc test.c -o test
# ./test

# cat test.c
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

/* Exit once the alarm fires so the test does not run forever. */
void wakeup(int sig)
{
    (void)sig;
    exit(0);
}

int main(void)
{
    signal(SIGALRM, wakeup);
    alarm(120);                         /* stop after 120 seconds */

    /* Allocate 40960 pages of 4096 bytes (160 MB). */
    char *buf = calloc(40960, 4096);
    if (buf == NULL)
        return 1;

    while (1) {
        int i;
        /* Touch one byte every 1 KB (four bytes per 4 KB page),
           so the whole buffer is dirtied on every pass. */
        for (i = 0; i < 40960 * 4; i++)
            buf[i * 4096 / 4]++;
        printf(".");
        fflush(stdout);
    }
    return 0;
}

Bug verified on the PPC platform

Bug is verified in the following version:
Host:
kernel: 3.10.0-558.el7.ppc64le
qemu-kvm-rhev-2.8.0-1.el7.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch
Guest:
3.10.0-558.el7.ppc64le

Steps: the same as for the reproduction above.

Actual results:
The migration succeeded and the VM is running.

In src HMP:
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
Migration status: completed
dirty sync count: 22
postcopy request count: 1492

In dst HMP:
(qemu) info status
VM status: running

Bug verified on the x86 platform

Bug is verified in the following version:
Host:
3.10.0-563.el7.x86_64
qemu-kvm-rhev-2.8.0-1.el7.x86_64
Guest:
3.10.0-514.10.1.el7.x86_64

Steps: the same as for the reproduction above.

Actual results:
The migration succeeded and the VM is running.

In src HMP:
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
Migration status: completed
dirty sync count: 14
postcopy request count: 3010

In dst HMP:
(qemu) info status
VM status: running

So, this bug is fixed.
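Additional note: the same capability setup, migration start, and post-copy switch that were done in HMP above can also be driven through the QMP socket that the qemu command line exposes (-qmp tcp:0:8881,server,nowait). The following is only a rough sketch; the use of nc, the per-command reconnect, and the 60-second delay are illustrative assumptions, not part of the original report (the destination address 10.19.112.39:5801 is the one used in step 4).

#!/bin/sh
# Sketch only: send one QMP command per connection to the monitor socket.
# 127.0.0.1:8881 matches "-qmp tcp:0:8881,server,nowait" on the source host;
# the nc invocation and the timing are assumptions.
qmp() {
    printf '%s\n' '{"execute":"qmp_capabilities"}' "$1" | nc -w 1 127.0.0.1 8881
}

qmp '{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":true},{"capability":"postcopy-ram","state":true}]}}'
qmp '{"execute":"migrate","arguments":{"uri":"tcp:10.19.112.39:5801"}}'
sleep 60    # assumed delay: wait for several pre-copy dirty sync iterations
qmp '{"execute":"migrate-start-postcopy"}'
qmp '{"execute":"query-migrate"}'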
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:2392