Bug 1744530 - Migration failed with enabling postcopy and multifd, qemu crash on destination and guest hang on source end
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.1
Hardware: All
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assignee: Dr. David Alan Gilbert
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks: 1753522 1758964 1771318
 
Reported: 2019-08-22 10:51 UTC by xianwang
Modified: 2021-11-03 13:28 UTC
CC: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-15 07:38:37 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description xianwang 2019-08-22 10:51:03 UTC
Description of problem:
With "postcopy" and "multifd" enabled, migration failed: the guest hung (paused) on the source end and qemu crashed on the destination end.

Version-Release number of selected component (if applicable):
both on qemu4.1 and qemu4.0:
4.18.0-136.el8.ppc64le
qemu-kvm-4.1.0-4.module+el8.1.0+4020+16089f93.ppc64le
and
4.18.0-136.el8.ppc64le
qemu-kvm-4.0.0-6.module+el8.1.0+3736+a2aefea3.ppc64le

How reproducible:
100%

Steps to Reproduce:
1.Boot a guest on source host with qemu cli:
/usr/libexec/qemu-kvm -machine pseries -nodefaults -monitor stdio -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x7 -drive file=/home/xianwang/rhel810-ppc64le-virtio-scsi.qcow2.bak,format=qcow2,if=none,cache=none,id=drive_scsi1,werror=stop,rerror=stop -device scsi-hd,drive=drive_scsi1,id=scsi-disk1,bus=scsi1.0,channel=0,scsi-id=0x6,lun=0x3,bootindex=0

2.Boot incoming mode on destination host with same qemu cli as above appending "-incoming tcp:0:5801"

3.Enable postcopy and multifd on both source and destination end
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate_set_capability multifd  on
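The two HMP lines above map to a single QMP `migrate-set-capabilities` call. A minimal sketch of building that request (the helper function is illustrative, not a QEMU API):

```python
import json

# Illustrative helper (not part of QEMU): build the QMP request equivalent
# to the two HMP migrate_set_capability lines above, enabling both
# capabilities in one call.
def build_set_capabilities(caps):
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {
            "capabilities": [
                {"capability": name, "state": state} for name, state in caps
            ]
        },
    }

cmd = build_set_capabilities([("postcopy-ram", True), ("multifd", True)])
print(json.dumps(cmd))
```

The same JSON must be sent to both the source and destination QMP sockets before migration starts, mirroring step 3.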

4.Do migration on source end, and after the value of "dirty sync count" is greater than 1, switch to postcopy mode
(qemu) migrate -d tcp:10.0.1.69:5801
(qemu) info migrate
Migration status: active
dirty sync count: 5
(qemu) migrate_start_postcopy
(qemu) info migrate
Migration status: postcopy-active
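The "wait for dirty sync count to exceed 1, then switch to postcopy" step can be automated by parsing `info migrate` output; a small sketch (the parsing helper is hypothetical, not a QEMU interface):

```python
import re

# Hypothetical helper: extract "dirty sync count" from HMP "info migrate"
# output, returning 0 when the line is absent (e.g. before migration starts).
def dirty_sync_count(info_migrate_text):
    m = re.search(r"dirty sync count:\s*(\d+)", info_migrate_text)
    return int(m.group(1)) if m else 0

# Sample taken from the step 4 output above.
sample = "Migration status: active\ndirty sync count: 5\n"
print(dirty_sync_count(sample))  # → 5
```

A test script would poll this value in a loop and issue `migrate_start_postcopy` once it exceeds 1, as described in step 4.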

Actual results:
Migration stayed "postcopy-active" for about 20 minutes, stuck at "postcopy request count: 3" (the count never increased), while the destination qemu hung. After about 20 minutes, migration failed on the source and qemu crashed on the destination.

source:
(qemu) qemu-kvm: multifd_send_pages: channel 0 has already quit!
qemu-kvm: multifd_send_pages: channel 0 has already quit!
qemu-kvm: multifd_send_sync_main: multifd_send_pages fail
qemu-kvm: Unable to write to socket: Connection timed out

(qemu) info status 
VM status: paused (postmigrate)
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off release-ram: off return-path: off pause-before-switchover: off multifd: on dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off 
Migration status: failed (Unable to write to socket: Connection timed out)
total time: 0 milliseconds

destination:
(qemu) qemu-kvm: Non-sequential target page 0x7fff8ce88000/(nil)
qemu-kvm: error while loading state section id 1(ram)
qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -22
qemu-kvm: VQ 2 size 0x80 Guest index 0x0 inconsistent with Host index 0x3b84: delta 0xc47c
qemu-kvm: Failed to load virtio-scsi:virtio
qemu-kvm: error while loading state for instance 0x0 of device 'pci@800000020000000:07.0/virtio-scsi'
qemu-kvm: load of migration failed: Operation not permitted

Expected results:
Migration completed and vm works well on destination host

Additional info:

Comment 1 xianwang 2019-08-22 10:58:46 UTC
I will update hardware later after I tried it on x86_64.

Comment 2 xianwang 2019-08-22 11:06:18 UTC
I hit a core dump twice while testing this scenario:
build information and steps are the same as in the bug report; the qemu cli is as follows:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -nodefaults  \
    -machine pseries-rhel8.1.0 \
    -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=0x3 \
    -object iothread,id=iothread0 \
    -chardev socket,id=console0,path=/tmp/console0,server,nowait \
    -device spapr-vty,chardev=console0,reg=0x30000000 \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x5 \
    -device pci-bridge,chassis_nr=1,id=bridge1,bus=pci.0,addr=0x6 \
    -device pci-bridge,chassis_nr=2,id=bridge2,bus=pci.0,addr=0x8 \
    -device virtio-scsi-pci,id=scsi1,bus=bridge1,addr=0x7 \
    -drive file=/home/xianwang/rhel810-ppc64le-virtio-scsi.qcow2.bak,format=qcow2,if=none,cache=none,id=drive_scsi1,werror=stop,rerror=stop \
    -device scsi-hd,drive=drive_scsi1,id=scsi-disk1,bus=scsi1.0,channel=0,scsi-id=0x6,lun=0x3,bootindex=0 \
    -device virtio-scsi-pci,id=scsi_add,bus=pci.0,addr=0x9 \
    -device virtio-net-pci,mac=9a:7b:7c:7d:7e:72,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0,addr=0xa \
    -netdev tap,id=idjlQN53,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 2048,slots=4,maxmem=32G \
    -smp 4 \
    -vga std \
    -vnc :11 \
    -cpu host \
    -device usb-kbd \
    -incoming tcp:0:5801 \
    -device usb-mouse \
    -qmp tcp:0:8881,server,nowait \
    -msg timestamp=on \
    -rtc base=localtime,clock=vm,driftfix=slew  \
    -monitor stdio \
    -boot order=cdn,once=n,menu=on,strict=off \
    -enable-kvm \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xc \
    -device i6300esb,id=wdt0 \
    -watchdog-action pause

result:
on source:
(qemu) info status 
VM status: paused (finish-migrate)
(qemu) 2019-08-22T07:34:46.172349Z qemu-kvm: multifd_send_pages: channel 0 has already quit!
2019-08-22T07:34:46.172386Z qemu-kvm: multifd_send_pages: channel 1 has already quit!
2019-08-22T07:34:46.172400Z qemu-kvm: multifd_send_sync_main: multifd_send_pages fail
2019-08-22T07:34:46.203558Z qemu-kvm: Unable to write to socket: Connection timed out
(qemu) info status 
VM status: paused (postmigrate)
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off release-ram: off return-path: off pause-before-switchover: off multifd: on dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off 
Migration status: failed (Unable to write to socket: Connection timed out)
total time: 0 milliseconds

on destination:
(qemu) 2019-08-22T06:53:04.432056Z qemu-kvm: Non-sequential target page 0x7fff0d316000/0x7fff0d14f000
2019-08-22T06:53:04.432090Z qemu-kvm: error while loading state section id 1(ram)
2019-08-22T06:53:04.432104Z qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -22
2019-08-22T06:53:04.579058Z qemu-kvm: CMD_POSTCOPY_RUN in wrong postcopy state (5)
2019-08-22T06:53:04.579123Z qemu-kvm: postcopy_fault_thread_notify: incrementing failed: Bad file descriptor
2019-08-22T06:53:04.579141Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
boot.sh: line 37: 27019 Segmentation fault      (core dumped) /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox off -nodefaults -machine pseries-rhel8.1.0 -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=0x3 -object iothread,id=iothread0 -chardev socket,id=console0,path=/tmp/console0,server,nowait -device spapr-vty,chardev=console0,reg=0x30000000 -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x5 -device pci-bridge,chassis_nr=1,id=bridge1,bus=pci.0,addr=0x6 -device pci-bridge,chassis_nr=2,id=bridge2,bus=pci.0,addr=0x8 -device virtio-scsi-pci,id=scsi1,bus=bridge1,addr=0x7 -drive file=/home/xianwang/rhel810-ppc64le-virtio-scsi.qcow2.bak,format=qcow2,if=none,cache=none,id=drive_scsi1,werror=stop,rerror=stop -device scsi-hd,drive=drive_scsi1,id=scsi-disk1,bus=scsi1.0,channel=0,scsi-id=0x6,lun=0x3,bootindex=0 -device virtio-scsi-pci,id=scsi_add,bus=pci.0,addr=0x9 -device virtio-net-pci,mac=9a:7b:7c:7d:7e:72,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0,addr=0xa -netdev tap,id=idjlQN53,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -m 2048,slots=4,maxmem=32G -smp 4 -vga std -vnc :11 -cpu host -device usb-kbd -incoming tcp:0:5801 -device usb-mouse -qmp tcp:0:8881,server,nowait -msg timestamp=on -rtc base=localtime,clock=vm,driftfix=slew -monitor stdio -boot order=cdn,once=n,menu=on,strict=off -enable-kvm -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xc -device i6300esb,id=wdt0 -watchdog-action pause

Comment 4 xianwang 2019-08-23 02:19:17 UTC
(In reply to xianwang from comment #1)
> I will update hardware later after I tried it on x86_64.

This issue also exists on the x86_64 platform, so I will update Hardware to "All".
Because "postcopy" migration is an important function and we hit a core dump, I think Severity should also be "high"; please change it if you think that is incorrect.

Build information:
4.18.0-129.el8.x86_64
qemu-kvm-4.1.0-1.module+el8.1.0+3966+4a23dca1.x86_64

Comment 5 Laurent Vivier 2019-08-23 07:02:51 UTC
Looks like BZ 1738451

Comment 6 xianwang 2019-08-23 07:32:11 UTC
(In reply to Laurent Vivier from comment #5)
> Looks like BZ 1738451

At first I also thought it was similar to BZ 1738451, but its error messages mention postcopy, this scenario neither changed the multifd channel count nor executed "migrate_cancel", and this core dump is on the destination end while that bz's core dump is on the source end; moreover, their error messages are different.
So I am not sure whether their root causes are the same; I am reporting this bug to track this issue and this scenario.

Comment 7 Juan Quintela 2019-11-19 14:17:35 UTC
Multifd + postcopy are not supported simultaneously upstream.
We will enable the combination once we support it.

I am doing a patch series upstream that will give one error when you try to enable both capabilities.

Thanks, Juan.
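The mutual-exclusion check such a patch would add might look roughly like this (an illustrative Python sketch of the logic, not QEMU's actual C code):

```python
# Illustrative sketch (not QEMU code): reject enabling postcopy-ram and
# multifd together up front, instead of failing mid-migration as in this bug.
def check_capabilities(caps):
    if caps.get("postcopy-ram") and caps.get("multifd"):
        return (False, "postcopy-ram is not currently compatible with multifd")
    return (True, None)

ok, err = check_capabilities({"postcopy-ram": True, "multifd": True})
print(ok, err)
```

With a check like this, step 3 of the reproducer would fail immediately at `migrate_set_capability` time rather than crashing the destination later.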

Comment 8 Ademar Reis 2020-02-05 23:03:42 UTC
QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks

Comment 9 Juan Quintela 2020-06-15 11:45:43 UTC
Hi

We don't support this combination.
The plan is now to report an error if one tries this combination.

Comment 10 John Ferlan 2020-07-06 20:59:31 UTC
Adding the Triaged keyword and resetting to NEW for placement on the backlog for future assignment (although, reading comment 9, it seems this could be CLOSED as NOTABUG).

Comment 13 RHEL Program Management 2021-03-15 07:38:37 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 14 Li Xiaohui 2021-04-15 13:13:05 UTC
Multifd + postcopy migration works well on RHEL-AV 8.4.0 (kernel-4.18.0-302.el8.x86_64 & qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64);
the vm works well after multifd+postcopy migration.

So I am closing this bz as CurrentRelease.


BTW, do we support multifd+postcopy migration now?

Comment 15 Juan Quintela 2021-11-03 13:28:30 UTC
Just closing the needinfo

