Created attachment 1722434 [details]
logs

Description of problem:
Do migration with copy storage -> cancel the migration during the memory copy phase -> do migration with copy storage again; the second migration fails.

Version-Release number of selected component (if applicable):
libvirt-6.6.0-7.module+el8.3.0+8424+5ea525c5.x86_64
qemu-kvm-5.1.0-13.module+el8.3.0+8424+5ea525c5.x86_64

How reproducible:
100%

Steps to Reproduce:
0. Start a vm with local storage, and create a vm image on the dest host with the same size as the src vm image:
On src:
# virsh start avocado-vt-vm1
# virsh domblklist avocado-vt-vm1
 Target   Source
------------------------------------------------------------------------
 vda      /var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2

On dest:
# qemu-img create -f qcow2 /var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2 10G

1. Do migration with copy storage, and cancel it in the memory copy phase (one way to trigger the cancel is sketched after this comment):
[root@libvirt-rhel-8 ~]# virsh migrate avocado-vt-vm1 qemu+ssh://10.0.139.240/system --live --p2p --verbose --copy-storage-inc
Migration: [ 69 %]error: operation aborted: migration out job: canceled by client

2. Do migration again:
[root@libvirt-rhel-8 ~]# virsh migrate avocado-vt-vm1 qemu+ssh://10.0.139.240/system --live --p2p --verbose --copy-storage-inc
Migration: [ 35 %]error: internal error: qemu unexpectedly closed the monitor: 2020-10-18T03:51:03.805809Z qemu-kvm: -device cirrus-vga,id=video0,bus=pcie.0,addr=0x1: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2020-10-18T03:51:04.984709Z qemu-kvm: check_section_footer: Read section footer failed: -5
2020-10-18T03:51:04.985272Z qemu-kvm: load of migration failed: Invalid argument

Actual results:
The second migration with --copy-storage-inc fails with "load of migration failed: Invalid argument" on the destination.

Expected results:
The second migration succeeds.

Additional info:
1. virsh event output:
Src:
2020-10-18 03:50:36.157+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'domain started'
2020-10-18 03:50:36.558+0000: event 'lifecycle' for domain avocado-vt-vm1: Resumed Unpaused
2020-10-18 03:50:36.558+0000: event 'lifecycle' for domain avocado-vt-vm1: Started Booted
2020-10-18 03:50:56.409+0000: event 'block-job' for domain avocado-vt-vm1: Block Copy for /var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2 ready
2020-10-18 03:50:56.409+0000: event 'block-job-2' for domain avocado-vt-vm1: Block Copy for vda ready
2020-10-18 03:50:56.427+0000: event 'migration-iteration' for domain avocado-vt-vm1: iteration: '1'
2020-10-18 03:50:56.621+0000: event 'block-job' for domain avocado-vt-vm1: Block Copy for /var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2 failed
2020-10-18 03:50:56.621+0000: event 'block-job-2' for domain avocado-vt-vm1: Block Copy for vda failed
2020-10-18 03:51:04.774+0000: event 'migration-iteration' for domain avocado-vt-vm1: iteration: '1'
2020-10-18 03:51:05.084+0000: event 'migration-iteration' for domain avocado-vt-vm1: iteration: '2'
2020-10-18 03:51:05.085+0000: event 'lifecycle' for domain avocado-vt-vm1: Suspended Migrated
2020-10-18 03:51:05.300+0000: event 'block-job' for domain avocado-vt-vm1: Block Copy for /var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2 failed
2020-10-18 03:51:05.300+0000: event 'block-job-2' for domain avocado-vt-vm1: Block Copy for vda failed
2020-10-18 03:51:05.341+0000: event 'lifecycle' for domain avocado-vt-vm1: Resumed Unpaused
2020-10-18 03:51:15.736+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'connected' reason: 'channel event'

Dest:
2020-10-18 03:50:42.473+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'domain started'
2020-10-18 03:50:42.915+0000: event 'lifecycle' for domain avocado-vt-vm1: Started Migrated
2020-10-18 03:50:56.454+0000: event 'lifecycle' for domain avocado-vt-vm1: Stopped Failed
2020-10-18 03:51:03.832+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'domain started'
2020-10-18 03:51:04.284+0000: event 'lifecycle' for domain avocado-vt-vm1: Started Migrated
2020-10-18 03:51:05.037+0000: event 'control-error' for domain avocado-vt-vm1
2020-10-18 03:51:05.260+0000: event 'lifecycle' for domain avocado-vt-vm1: Stopped Failed

2. Not sure whether it has the same root cause as Bug 1459801 - nbd/server.c:nbd_receive_request():L706: read failed when do migration_cancel
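For step 1, one way to trigger the cancellation while the memory copy is in progress is to abort the job from a second shell on the source host, similar to the loop later used in comment 13 (a minimal sketch; the sleep length is only an illustrative choice):

# run on the source host; start the migration in the background, then abort the job
virsh migrate avocado-vt-vm1 qemu+ssh://10.0.139.240/system --live --p2p --verbose --copy-storage-inc &
sleep 5    # let the block copy finish and the memory copy start
virsh domjobabort avocado-vt-vm1
wait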
Created attachment 1722435 [details]
vm xml
Hi, I have done some analysis of the provided libvirtd-src.log: this bz is about storage vm migration, including copy-storage and block-related operations (blockdev-add and blockdev-mirror). Cong, could you help mark the right QA contact and then reproduce this bz? Thanks.
The bug could be reproduced with live migration steps, i.e. using shared storage instead of combining any blockdev-add/mirror steps. Following are the detailed steps:

1. Boot up the vm on the source host:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -sandbox on \
    -machine pseries,max-cpu-compat=power8 \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -m 1024 \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2 \
    -cpu 'host' \
    -chardev socket,id=qmp_id_qmpmonitor1,server,nowait,path=/var/tmp/avocado_1 \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,id=chardev_serial0,server,nowait,path=/var/tmp/avocado_2 \
    -device spapr-vty,id=serial0,reg=0x30000000,chardev=chardev_serial0 \
    -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
    -blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/ngu/rhel830-ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
    -device virtio-net-pci,mac=9a:fa:14:5e:3d:15,id=idLviSMj,netdev=idlrY22o,bus=pci.0,addr=0x5 \
    -netdev tap,id=idlrY22o,vhost=on \
    -vnc :0 \
    -rtc base=utc,clock=host \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -monitor stdio \

2. Boot up the destination qemu process using the same qemu command line as the source one, but append '-incoming defer' at the end.

3. Set migration capabilities on both the source and destination hosts:
SRC: {"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "pause-before-switchover", "state": true}]}}
DST: {"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "late-block-activate", "state": true}]}}

4. Set the incoming port on the destination host:
DST: {"execute": "migrate-incoming","arguments": {"uri": "tcp:[::]:5200"}}

5. Migrate the vm from the source host to the destination host and wait until it reaches the "pre-switchover" status (see the polling sketch after this comment):
SRC: {"execute": "migrate", "arguments": {"uri": "tcp:10.16.200.238:5200"}, "id": "DXcuz3D7"}
SRC: {"execute": "query-migrate"}
{"return": {"expected-downtime": 305, "status": "pre-switchover", "setup-time": 27, "total-time": 387578, "ram": {"total": 1090846720, "postcopy-requests": 0, "dirty-sync-count": 111, "multifd-bytes": 0, "pages-per-second": 34539, "page-size": 4096, "remaining": 27930624, "mbps": 934.542353, "transferred": 10266041346, "duplicate": 614439, "dirty-pages-rate": 33886, "skipped": 0, "normal-bytes": 10240499712, "normal": 2500122}}}

6. Cancel the migration:
{ "execute": "migrate_cancel" }
{"return": {}}
{"timestamp": {"seconds": 1603252232, "microseconds": 344845}, "event": "RESUME"}
{"timestamp": {"seconds": 1603252435, "microseconds": 939974}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
{"timestamp": {"seconds": 1603252436, "microseconds": 17285}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": false, "reason": "No medium found", "operation": "read", "action": "report"}}
{"timestamp": {"seconds": 1603252436, "microseconds": 17335}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": false, "reason": "No medium found", "operation": "read", "action": "report"}}
{"timestamp": {"seconds": 1603252436, "microseconds": 17380}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": false, "reason": "No medium found", "operation": "read", "action": "report"}}
{"timestamp": {"seconds": 1603252436, "microseconds": 17423}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": false, "reason": "No medium found", "operation": "read", "action": "report"}}
{"timestamp": {"seconds": 1603252436, "microseconds": 17464}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": false, "reason": "No medium found", "operation": "read", "action": "report"}}
{"timestamp": {"seconds": 1603252436, "microseconds": 17505}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": false, "reason": "No medium found", "operation": "read", "action": "report"}}
{"timestamp": {"seconds": 1603252436, "microseconds": 17546}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": false, "reason": "No medium found", "operation": "read", "action": "report"}}

7. The bug reproduced:
DST: (qemu) qemu-kvm: check_section_footer: Read section footer failed: -5
qemu-kvm: load of migration failed: Invalid argument

Software versions:
Host kernel: 4.18.0-240.el8.ppc64le
Guest kernel: 4.18.0-240.el8.ppc64le
Qemu: qemu-kvm-5.1.0-14.module+el8.3.0+8438+644aff69.ppc64le
SLOF: SLOF-20200717-1.gite18ddad8.module+el8.3.0+7638+07cf13d2.noarch

So Xiaohui, I am reassigning the bug to you.
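For step 5, a rough way to poll query-migrate until the "pre-switchover" status instead of re-issuing it by hand; a sketch only, assuming no other QMP client is attached to the /var/tmp/avocado_1 socket from the command line above (QMP requires the qmp_capabilities handshake before other commands):

# run on the source host while the migration from step 5 is in progress
while true; do
    out=$( { printf '{"execute":"qmp_capabilities"}\n'; sleep 1; \
             printf '{"execute":"query-migrate"}\n'; sleep 1; } | nc -U /var/tmp/avocado_1 )
    echo "$out" | grep -q '"status": "pre-switchover"' && break
    sleep 2
done
echo "migration reached pre-switchover"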
Hi Nini,
I couldn't reproduce this bz with only a shared storage image. This bz is about "migration fails on the second attempt after cancelling the first migration", so the steps should be as follows: after the first migration fails (like the steps from your comment 5), start the migration again:
1. Boot the guest on the dst host again;
2. Enable late-block-activate on the dst host;
3. Set migration incoming on the dst host;
4. Start the migration on the src host again.

Check whether the migration fails after step 4: if it fails with an error like the one in comment 0, then you have reproduced the bz (a QMP sketch of steps 1-4 follows after this comment).

BTW, after the first migration fails, the HMP monitor on the dst host will quit automatically and print the following output; this is the expected behaviour since we cancelled the migration on the src host:
(qemu) qemu-kvm: check_section_footer: Read section footer failed: -5
qemu-kvm: load of migration failed: Invalid argument
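For concreteness, a sketch of steps 1-4 above using the QMP commands from comment 5 (the host address and port are the ones used there; treat them as placeholders for the actual setup):

1. Restart the destination qemu with the same command line as before, plus '-incoming defer'.
2. DST: {"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "late-block-activate", "state": true}]}}
3. DST: {"execute": "migrate-incoming", "arguments": {"uri": "tcp:[::]:5200"}}
4. SRC: {"execute": "migrate", "arguments": {"uri": "tcp:10.16.200.238:5200"}}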
Sorry for clearing two needinfo flags:
1. Needinfo from Ademar to fjin in comment 3;
2. Needinfo from Ademar to Eric in comment 4.
(In reply to Li Xiaohui from comment #6)
> Hi Nini,
> I didn't reproduce this bz only with shared storage image.
> This bz is about "migration failed with try the second time migration after
> cancel migration", so you should do like following steps:
> when the first time migration failed(like the steps from your comment 5),
> start migration again:
> 1.Boot the guest on dst host again;
> 2.Enable late-block-activate on dst host;
> 3.Set migration incoming on dst host;
> 4.Start migration on src host again;
>
> Check whether migration failed after step 4: if failed, and error like
> comment 0, then you reproduce bz.
>
> BTW the first time migration failed, the hmp on dst host will quit
> automatically and print following information,
> it's expecting action since we cancel migration on src host:
> (qemu) qemu-kvm: check_section_footer: Read section footer failed: -5
> qemu-kvm: load of migration failed: Invalid argument

Ok, thanks Xiaohui. I originally thought the error info in the dst qemu process after cancelling the migration was itself the problem, hence comment #5. Since we don't have a migration-cancel scenario in our existing storage vm migration test cases, I am figuring out the test steps and working on reproducing this.
It's not a recent regression; it is also reproduced with:
libvirt-6.0.0-25.4.module+el8.2.1+8060+c0c58169.x86_64
qemu-kvm-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
When doing the migration the second time, libvirt wrongly thought that "All disk mirrors are ready" and started the memory migration; that is the problem:

28920:2020-10-18 03:51:04.736+0000: 14869: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": {}, "id": "libvirt-420"}]
28949:2020-10-18 03:51:04.736+0000: 13827: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f952803aa20 msg={"execute":"blockdev-add","arguments":{"node-name":"migration-vda-format","read-only":false,"driver":"raw","file":"migration-vda-storage"},"id":"libvirt-421"}
28958:2020-10-18 03:51:04.738+0000: 14869: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": {}, "id": "libvirt-421"}]
28990:2020-10-18 03:51:04.738+0000: 13827: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f952803aa20 msg={"execute":"blockdev-mirror","arguments":{"job-id":"drive-virtio-disk0","device":"libvirt-1-format","target":"migration-vda-format","speed":9223372036853727232,"sync":"top","auto-finalize":true,"auto-dismiss":false},"id":"libvirt-422"}
28999:2020-10-18 03:51:04.745+0000: 14869: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1602993064, "microseconds": 741875}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "drive-virtio-disk0"}}]
29039:2020-10-18 03:51:04.745+0000: 14869: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1602993064, "microseconds": 741971}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive-virtio-disk0"}}]
29105:2020-10-18 03:51:04.761+0000: 14869: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": {}, "id": "libvirt-422"}]
29133:2020-10-18 03:51:04.762+0000: 13827: debug : qemuMigrationSrcNBDStorageCopyReady:551 : All disk mirrors are ready
                                                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
29147:2020-10-18 03:51:04.762+0000: 13827: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f952803aa20 msg={"execute":"query-block-jobs","id":"libvirt-423"}
29156:2020-10-18 03:51:04.763+0000: 14869: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": [{"auto-finalize": true, "io-status": "ok", "device": "drive-virtio-disk0", "auto-dismiss": false, "busy": true, "len": 1712586752, "offset": 0, "status": "running", "paused": false, "speed": 9223372036853727232, "ready": false, "type": "mirror"}], "id": "libvirt-423"}]
29262:2020-10-18 03:51:04.766+0000: 13827: debug : qemuMonitorSendFileHandle:2644 : fdname=migrate fd=40
29263:2020-10-18 03:51:04.766+0000: 13827: debug : qemuMonitorSendFileHandle:2646 : mon:0x7f952803aa20 vm:0x561aae1226a0 fd:38
29272:2020-10-18 03:51:04.766+0000: 13827: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f952803aa20 msg={"execute":"getfd","arguments":{"fdname":"migrate"},"id":"libvirt-424"}
29282:2020-10-18 03:51:04.767+0000: 14869: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": {}, "id": "libvirt-424"}]
29309:2020-10-18 03:51:04.767+0000: 13827: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f952803aa20 msg={"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-425"}
29318:2020-10-18 03:51:04.769+0000: 14869: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1602993064, "microseconds": 768381}, "event": "MIGRATION", "data": {"status": "setup"}}]
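For anyone re-checking this, one way to see the mismatch while the second attempt is still in its copy phase is to query the job state directly on the source host; a minimal sketch using the domain name from this report:

# qemu's view of the mirror job (compare with "ready": false in the log above)
virsh qemu-monitor-command avocado-vt-vm1 --pretty '{"execute":"query-block-jobs"}'
# libvirt's view of the same block copy job
virsh blockjob avocado-vt-vm1 vda --info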
(In reply to Fangge Jin from comment #10)
> When do migration at the second time, libvirt thought that "All disk mirrors
> are ready" wrongly and started to do memory migrate, that's the problem:
[...]
> 29133:2020-10-18 03:51:04.762+0000: 13827: debug :
> qemuMigrationSrcNBDStorageCopyReady:551 : All disk mirrors are ready

Indeed. In blockdev mode we didn't reset the variable which holds the 'ready' state, so the job was wrongly considered done right away.

Fixed upstream:

commit 202d61db4870468a02817e64fefb5ede06b8a59a
Author: Peter Krempa <pkrempa>
Date:   Mon Jan 11 10:42:15 2021 +0100

    qemuBlockJobEventProcess: Always clear 'mirrorState' when a job finishes

    When a block job is terminated we should clear the 'mirrorState' and
    'mirrorJob' variables so that stale values are not present prior to a
    new job.

    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit a09c421e3e5969582ed0a5cf9a132ecbfd9794d2
Author: Peter Krempa <pkrempa>
Date:   Mon Jan 11 10:40:43 2021 +0100

    qemuMigrationSrcNBDStorageCopyReady: Use ready-state of mirror from qemuBlockJobData

    Use the per-job state to determine when the non-shared-storage mirror
    is complete rather than the per-disk definition one. The
    qemuBlockJobData is a newer approach and is always cleared after the
    blockjob is terminated while the 'mirrorState' variable in the
    definition of the disk may be left over. In such case the disk mirror
    would be considered complete prematurely.

    https://bugzilla.redhat.com/show_bug.cgi?id=1889131

    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

v7.0.0-rc1-10-g202d61db48
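For reference, a quick way to check where these commits landed upstream (run inside a libvirt git checkout; the expected output of the first command is the tag string quoted above):

git describe 202d61db4870468a02817e64fefb5ede06b8a59a   # -> v7.0.0-rc1-10-g202d61db48
git describe a09c421e3e5969582ed0a5cf9a132ecbfd9794d2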
Tested on libvirt v7.0.0-rc2-2-g0a6cb05e95 and qemu-5.2.0-4.fc34.x86_64 as follows:

while true; do
    virsh migrate test qemu+ssh://root/system --live --p2p --copy-storage-inc &
    sleep $((1 + $RANDOM % 7))
    virsh domjobabort test
    fg
done

Never hit an error like comment 0 after thousands of loops.
Created attachment 1748904 [details]
VM xml, libvirtd log, qmp log

An error happens when cancelling a migration with --copy-storage-all.

Version:
libvirt-7.0.0-2.module+el8.4.0+9520+ef609c5f.x86_64
qemu-kvm-5.2.0-3.module+el8.4.0+9499+42e58f08.x86_64

Steps:
Start a VM and cancel the migration with --copy-storage-all:
➜ ~ virsh migrate seabios qemu+ssh://root.redhat.com/system --live --p2p --copy-storage-all
^Cerror: End of file while reading data: : Input/output error

The error happens at:
2021-01-20 01:19:50.754+0000: 106219: error : virNetSocketReadWire:1814 : End of file while reading data: : Input/output error

It looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1821277. Please help check whether it is a regression introduced by this bugfix.
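In case the attached libvirtd log needs to be regenerated, the usual debug-log settings are sketched below; this is a commonly used configuration and not specific to this bug, so adjust the filters as needed:

# /etc/libvirt/libvirtd.conf (restart libvirtd afterwards)
log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"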
See comment 17.
No, it's not a regression caused by the bugfix. The relevant context is:

2021-01-20 01:19:50.751+0000: 106219: debug : qemuDomainObjEndJob:1143 : Stopping job: async nested (async=migration out vm=0x7f30bc1aca70 name=seabios)
2021-01-20 01:19:50.751+0000: 106219: debug : qemuDomainObjEnterRemote:5833 : Entering remote (vm=0x7f30bc1aca70 name=seabios)
2021-01-20 01:19:50.751+0000: 106219: debug : virConnectUnregisterCloseCallback:1479 : conn=0x7f30ac00e8d0
2021-01-20 01:19:50.752+0000: 106219: debug : qemuDomainObjExitRemote:5844 : Exited remote (vm=0x7f30bc1aca70 name=seabios)
2021-01-20 01:19:50.754+0000: 106219: error : virNetSocketReadWire:1814 : End of file while reading data: : Input/output error

The error is apparently reported when closing the migration connection on one side, and this bug did not touch that at all. This bugfix was about clearing stale data after blockjob completion and has nothing to do with the remote driver.

The main cause may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1821277, but I will not go into a detailed analysis here. Given that the aforementioned bug is already crowded with comments and you have exact steps to reproduce here, filing a new bug, if you want, might be better than trying to attach the migration case there too.
(In reply to Peter Krempa from comment #19)
> No it's not a regression caused by the bugfix. The relevant context is:
>
> 2021-01-20 01:19:50.751+0000: 106219: debug : qemuDomainObjEndJob:1143 :
> Stopping job: async nested (async=migration out vm=0x7f30bc1aca70
> name=seabios)
> 2021-01-20 01:19:50.751+0000: 106219: debug : qemuDomainObjEnterRemote:5833
> : Entering remote (vm=0x7f30bc1aca70 name=seabios)
> 2021-01-20 01:19:50.751+0000: 106219: debug :
> virConnectUnregisterCloseCallback:1479 : conn=0x7f30ac00e8d0
> 2021-01-20 01:19:50.752+0000: 106219: debug : qemuDomainObjExitRemote:5844 :
> Exited remote (vm=0x7f30bc1aca70 name=seabios)
> 2021-01-20 01:19:50.754+0000: 106219: error : virNetSocketReadWire:1814 :
> End of file while reading data: : Input/output error
>
> The error is reported when closing the migration connection on one side
> apparently and this bug was not touching that at all. This bugfix was
> related to clearing stale data after blockjob completion and has nothing to
> do with the remote driver.
>
> The main cause may be related to
> https://bugzilla.redhat.com/show_bug.cgi?id=1821277 but I will not go into
> detailed analysis here. Given that aforementioned bug is too crowded with
> comments and here you have exact steps to reproduce, filing a new one, if
> you want might be better than trying to attach the migration case there too.

Thank you. New bug: https://bugzilla.redhat.com/show_bug.cgi?id=1918211
Verified on:
libvirt-7.0.0-3.module+el8.4.0+9709+a99efd61.x86_64
qemu-kvm-5.2.0-5.module+el8.4.0+9775+0937c167.x86_64
using the steps from comment 13.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098