Red Hat Bugzilla – Bug 1368422
Post-copy migration fails with XBZRLE compression
Last modified: 2017-08-01 23:29:59 EDT
Description of problem:
When I migrate a VM with XBZRLE compression enabled and switch the migration to post-copy mode after several unsuccessful pre-copy iterations, the migration fails and the VM remains paused.

Version-Release number of selected component (if applicable):
2.6.0-20.el7.x86_64

How reproducible:
Most of the time.

Steps to Reproduce:
1. Run a VM:
   virsh create DOMAIN.xml
2. Run a memory-intensive application in the VM.
3. Limit the migration bandwidth so that pre-copy migration cannot complete, e.g.:
   virsh migrate-setspeed DOMAIN 10
4. Migrate the VM with XBZRLE compression and post-copy enabled:
   virsh migrate DOMAIN qemu+tcp://root@HOST/system --verbose --live --compressed --comp-methods xbzrle --postcopy
5. Wait a couple of iterations, then switch to post-copy from another shell:
   virsh migrate-postcopy DOMAIN
6. The migration fails with an error like this:
   qemu-kvm: Unknown combination of migration flags: 0x40 (postcopy mode)
   qemu-kvm: error while loading state section id 2(ram)
   qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -22
   and the VM gets paused.

Actual results:
The migration fails and the VM gets paused.

Expected results:
The migration succeeds.
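For convenience, the reproduction steps above can be scripted from the source host. The following is only a rough sketch: the DOMAIN and HOST values, the backgrounding of virsh migrate (instead of using a second shell), and the 60-second delay before switching to post-copy are illustrative assumptions, not part of the original report.

#!/bin/sh
# Sketch of the reproduction flow; DOMAIN and HOST are placeholders.
DOMAIN=DOMAIN                              # assumed guest name
HOST=HOST                                  # assumed destination host

virsh create ${DOMAIN}.xml                 # start the guest from its XML definition
# ... start a memory-intensive workload inside the guest here ...
virsh migrate-setspeed ${DOMAIN} 10        # cap bandwidth so pre-copy cannot converge
virsh migrate ${DOMAIN} qemu+tcp://root@${HOST}/system \
      --verbose --live --compressed --comp-methods xbzrle --postcopy &
sleep 60                                   # assumed delay: let a few pre-copy iterations run
virsh migrate-postcopy ${DOMAIN}           # switch the running migration to post-copy
wait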
This bug has been verified on both ppc and x86.

Bug reproduced on the PPC platform:

Version-Release number of selected component (if applicable):
Host:
kernel: 3.10.0-558.el7.ppc64le
qemu-kvm-rhev-2.6.0-22.el7.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch
Guest:
3.10.0-558.el7.ppc64le

Steps to Reproduce:
1. Boot a VM on the source host with this qemu command line:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -sandbox off \
    -nodefaults \
    -machine pseries-rhel7.3.0 \
    -vga std \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x4 \
    -chardev socket,id=devorg.qemu.guest_agent.0,path=/tmp/virtio_port-org.qemu.guest_agent.0-20160516-164929-dHQ00mMM,server,nowait \
    -device virtserialport,chardev=devorg.qemu.guest_agent.0,name=org.qemu.guest_agent.0,id=org.qemu.guest_agent.0,bus=virtio_serial_pci0.0 \
    -device nec-usb-xhci,id=usb1,addr=1d.7,multifunction=on,bus=pci.0 \
    -drive file=/root/RHEL.7.3.qcow2,if=none,id=blk1 \
    -device virtio-blk-pci,scsi=off,drive=blk1,id=blk-disk1,bootindex=1 \
    -drive id=drive_cd1,if=none,snapshot=off,aio=native,cache=none,media=cdrom,file=/root/RHEL-7.3-20161019.0-Server-ppc64le-dvd1.iso \
    -device scsi-cd,id=cd1,drive=drive_cd1,bootindex=2 \
    -device virtio-net-pci,mac=9a:7b:7c:7d:7e:71,id=idtlLxAk,vectors=4,netdev=idlkwV8e,bus=pci.0,addr=05 \
    -netdev tap,id=idlkwV8e,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 8G \
    -smp 2 \
    -cpu host \
    -device usb-kbd \
    -device usb-tablet \
    -qmp tcp:0:8881,server,nowait \
    -vnc :1 \
    -msg timestamp=on \
    -rtc base=localtime,clock=vm,driftfix=slew \
    -boot order=cdn,once=c,menu=off,strict=off \
    -monitor stdio \
    -enable-kvm
2. Boot a VM on the destination host with the same qemu command line, appending "-incoming tcp:0:5801".
3. In the guest, run "test", a program that keeps guest memory busy and produces dirty pages during migration (its source is given under "Additional info" below):
   # gcc test.c -o test
   # ./test
4. Set the migration capabilities in HMP and start the migration:
   (qemu) migrate_set_capability xbzrle on
   (qemu) migrate_set_capability postcopy-ram on
   (qemu) migrate -d tcp:10.19.112.39:5801
5. Check the migration status; once dirty pages are being produced, switch to post-copy:
   (qemu) info migrate
   capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
   dirty sync count: 7
   dirty pages rate: 13587 pages
   .......other info....
   (qemu) migrate_start_postcopy

Actual results:
The migration fails and the VM gets paused.
In the source HMP:

(qemu) migrate_start_postcopy
(qemu) 2017-02-13T06:56:00.913043Z qemu-kvm: RP: Sibling indicated error 1
2017-02-13T06:56:01.105488Z qemu-kvm: socket_writev_buffer: Got err=104 for (32768/18446744073709551615)
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on
Migration status: failed
total time: 0 milliseconds
(qemu) info status
VM status: paused (postmigrate)

While in the destination HMP:

(qemu) 2017-02-13T06:56:00.911136Z qemu-kvm: Unknown combination of migration flags: 0x40 (postcopy m)
2017-02-13T06:56:00.911222Z qemu-kvm: error while loading state section id 2(ram)
2017-02-13T06:56:00.911233Z qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -22

Additional info:
(1) The program used in step 3 to keep dirtying guest memory:

# gcc test.c -o test
# ./test

# cat test.c
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

/* Exit once the alarm fires so the test does not run forever. */
void wakeup(int sig)
{
    (void)sig;
    exit(0);
}

int main(void)
{
    signal(SIGALRM, wakeup);
    alarm(120);                         /* stop after 120 seconds */

    /* Allocate 40960 pages of 4096 bytes (160 MB). */
    char *buf = calloc(40960, 4096);
    if (buf == NULL)
        return 1;

    while (1) {
        int i;
        /* Touch one byte every 1 KB (four bytes per 4 KB page),
           so the whole buffer is dirtied on every pass. */
        for (i = 0; i < 40960 * 4; i++)
            buf[i * 4096 / 4]++;
        printf(".");
        fflush(stdout);
    }
    return 0;
}

Bug verified on the PPC platform

Bug is verified in the following version:
Host:
kernel: 3.10.0-558.el7.ppc64le
qemu-kvm-rhev-2.8.0-1.el7.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch
Guest:
3.10.0-558.el7.ppc64le

Steps: the same as for the reproduction above.

Actual results:
The migration succeeded and the VM is running.

In src HMP:
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
Migration status: completed
dirty sync count: 22
postcopy request count: 1492

In dst HMP:
(qemu) info status
VM status: running

Bug verified on the x86 platform

Bug is verified in the following version:
Host:
3.10.0-563.el7.x86_64
qemu-kvm-rhev-2.8.0-1.el7.x86_64
Guest:
3.10.0-514.10.1.el7.x86_64

Steps: the same as for the reproduction above.

Actual results:
The migration succeeded and the VM is running.

In src HMP:
(qemu) info migrate
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off
Migration status: completed
dirty sync count: 14
postcopy request count: 3010

In dst HMP:
(qemu) info status
VM status: running

So, this bug is fixed.
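Additional note: the same capability setup, migration start, and post-copy switch that were done in HMP above can also be driven through the QMP socket that the qemu command line exposes (-qmp tcp:0:8881,server,nowait). The following is only a rough sketch; the use of nc, the per-command reconnect, and the 60-second delay are illustrative assumptions, not part of the original report (the destination address 10.19.112.39:5801 is the one used in step 4).

#!/bin/sh
# Sketch only: send one QMP command per connection to the monitor socket.
# 127.0.0.1:8881 matches "-qmp tcp:0:8881,server,nowait" on the source host;
# the nc invocation and the timing are assumptions.
qmp() {
    printf '%s\n' '{"execute":"qmp_capabilities"}' "$1" | nc -w 1 127.0.0.1 8881
}

qmp '{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":true},{"capability":"postcopy-ram","state":true}]}}'
qmp '{"execute":"migrate","arguments":{"uri":"tcp:10.19.112.39:5801"}}'
sleep 60    # assumed delay: wait for several pre-copy dirty sync iterations
qmp '{"execute":"migrate-start-postcopy"}'
qmp '{"execute":"query-migrate"}'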
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:2392