Description of problem:
The VM on the target host shuts down unexpectedly when cancelling post-copy at an early stage.

Version-Release number of selected component (if applicable):
libvirt-8.5.0-5.el9.x86_64
qemu-kvm-7.0.0-11.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a post-copy migration with a low post-copy bandwidth (to leave time to cancel it):
# virsh migrate avocado-vt-vm1 qemu+ssh://*/system --live --p2p --verbose --bandwidth 10 --postcopy --timeout-postcopy --timeout 10 --postcopy-bandwidth 5

2. In another terminal, cancel post-copy in a loop:
# while true; do virsh domjobabort avocado-vt-vm1 --postcopy; done

3. After 2 minutes, the migration from step 1 terminates:
Migration: [ 77 %]error: internal error: qemu unexpectedly closed the monitor: 2022-08-26T08:46:01.742879Z qemu-kvm: -device {"driver":"cirrus-vga","id":"video0","bus":"pcie.0","addr":"0x1"}: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2022-08-26T08:46:11.174499Z qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -5

4. Check the guest state on both the source and target host:
source host:
# virsh list
 Id   Name             State
-------------------------------
 5    avocado-vt-vm1   paused

target host:
# virsh list
(no output)

5. Try the migration again:
# virsh migrate avocado-vt-vm1 qemu+ssh://*/system --live --p2p --verbose --bandwidth 10 --postcopy --timeout-postcopy --timeout 10 --postcopy-bandwidth 5
error: Requested operation is not valid: another migration job is already running for domain 'avocado-vt-vm1'

Actual results:
The VM on the target host shuts down unexpectedly when cancelling post-copy at an early stage.

Expected results:
The VM on the target host should be running after post-copy is cancelled.

Additional info:
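For convenience, steps 1 and 2 can be driven from a single script. This is only a rough sketch: the destination URI is a placeholder (the report elides the real host), and the dry-run gating is my addition, not part of the original reproducer.

```python
# Sketch of reproducer steps 1-2 combined; DEST is a placeholder, dry_run
# gating is an assumption added for illustration.
import shlex
import subprocess

VM = "avocado-vt-vm1"
DEST = "qemu+ssh://TARGET/system"  # placeholder for the real target host

# Step 1: slow post-copy migration, leaving time to cancel it
MIGRATE = shlex.split(
    f"virsh migrate {VM} {DEST} --live --p2p --verbose "
    "--bandwidth 10 --postcopy --timeout-postcopy --timeout 10 "
    "--postcopy-bandwidth 5"
)
# Step 2: cancel post-copy in a tight loop while the migration runs
ABORT = shlex.split(f"virsh domjobabort {VM} --postcopy")

def reproduce(dry_run=True):
    """Return the commands (dry run) or actually drive the reproducer."""
    if dry_run:
        return [MIGRATE, ABORT]
    mig = subprocess.Popen(MIGRATE)
    while mig.poll() is None:  # keep aborting until the migration exits
        subprocess.run(ABORT, check=False)
    return mig.returncode
```

Run with `dry_run=False` on a host with both guests configured; within a couple of minutes the migration should fail as described in step 3.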
Indeed, the domjobabort loop caused the post-copy migration to be paused very early. Once migration enters postcopy-active:

2022-08-26 08:46:10.988+0000: 98631: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c54087460 event={"timestamp": {"seconds": 1661503570, "microseconds": 988214}, "event": "MIGRATION", "data": {"status": "postcopy-active"}}

the Perform phase ends and we call Finish on the destination:

2022-08-26 08:46:10.989+0000: 97602: debug : qemuDomainObjSetJobPhase:734 : Setting 'migration out' phase to 'perform3_done'
2022-08-26 08:46:10.989+0000: 97602: debug : qemuMigrationSrcPerformPeer2Peer3:5702 : Finish3 0x7f2c54015190 ret=0

Almost at the same time, migration is paused as requested by domjobabort and the state changes to postcopy-paused:

2022-08-26 08:46:11.006+0000: 97605: info : qemuMonitorSend:887 : QEMU_MONITOR_SEND_MSG: mon=0x7f2c54087460 msg={"execute":"migrate-pause","id":"libvirt-436"}
2022-08-26 08:46:11.174+0000: 98631: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c54087460 event={"timestamp": {"seconds": 1661503571, "microseconds": 174139}, "event": "MIGRATION", "data": {"status": "postcopy-paused"}}

On the destination, the Finish phase is started by:

2022-08-26 08:46:10.989+0000: 25701: debug : virDomainMigrateFinish3Params:5425 : dconn=0x7f2c20002650, params=0x7f2c3c00cd80, nparams=5, cookiein=0x7f2c3c0158d0, cookieinlen=1185, cookieout=0x7f2c4e8018d0, cookieoutlen=0x7f2c4e8018c4, flags=0x8003, cancelled=0
2022-08-26 08:46:10.989+0000: 25701: debug : qemuDomainObjStartJobPhase:765 : Starting phase 'finish3' of 'migration in' job

A bit later migration enters postcopy-active, immediately followed by the failed state:

2022-08-26 08:46:11.174+0000: 27535: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c4805e2f0 event={"timestamp": {"seconds": 1661503571, "microseconds": 174443}, "event": "MIGRATION", "data": {"status": "postcopy-active"}}
2022-08-26 08:46:11.174+0000: 27535: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c4805e2f0 event={"timestamp": {"seconds": 1661503571, "microseconds": 174521}, "event": "MIGRATION", "data": {"status": "failed"}}

and the reason for the failure is reported on QEMU's stderr:

2022-08-26T08:46:11.174499Z qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -5

This all happened even before we called the "cont" QMP command on the destination, so the domain never got a chance to run there.

That said, there are two issues here. The first one is that our Finish API has no way of telling the source that migration failed before the virtual CPUs were resumed. Knowing this would allow the source to simply abort the migration, as the complete state of the domain is still there. Fixing this will require introducing a completely new API for the Finish phase.

The second issue is in QEMU: I think it should enter the postcopy-paused rather than the failed state when migration breaks while in postcopy-active. I don't know if this is fixable on the QEMU side, but you could clone the BZ for them to have a look. And even if QEMU is fixed, we should still fix our part to avoid leaving the domain paused unless it's really necessary.
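The bad destination-side transition can be spotted mechanically in the logs. A minimal sketch, assuming the QEMU_MONITOR_RECV_EVENT line format shown above (the sample log is the two destination events quoted in this comment):

```python
import json
import re

def migration_states(log_text):
    """Extract the MIGRATION status sequence from a libvirtd debug log."""
    states = []
    for line in log_text.splitlines():
        m = re.search(r'event=(\{.*\})\s*$', line)
        if not m:
            continue
        try:
            ev = json.loads(m.group(1))
        except ValueError:
            continue  # event= payload that is not JSON
        if ev.get("event") == "MIGRATION":
            states.append(ev["data"]["status"])
    return states

# The two destination-side events quoted above
LOG = (
    '2022-08-26 08:46:11.174+0000: 27535: info : '
    'qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: '
    'mon=0x7f2c4805e2f0 event={"timestamp": {"seconds": 1661503571, '
    '"microseconds": 174443}, "event": "MIGRATION", '
    '"data": {"status": "postcopy-active"}}\n'
    '2022-08-26 08:46:11.174+0000: 27535: info : '
    'qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: '
    'mon=0x7f2c4805e2f0 event={"timestamp": {"seconds": 1661503571, '
    '"microseconds": 174521}, "event": "MIGRATION", '
    '"data": {"status": "failed"}}\n'
)

states = migration_states(LOG)
# "failed" immediately after "postcopy-active", with no "postcopy-paused"
# in between, is the bad transition described in this comment.
never_ran = states[-2:] == ["postcopy-active", "failed"]
```

Running the same helper over the source-side log shown above would end in postcopy-paused instead, which is exactly the disagreement between the two sides described here.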
The issue on the libvirt side is the same as observed in bug 2121706, which contains an easier reproducer, so I will close this one as a duplicate. Although I'll wait a bit until the QEMU side of this issue is taken care of.
I thought about it a bit more and I think there's little value in fixing the QEMU issue. It would allow the migration to be resumed later, but the domain would still be paused in the meantime, until the migration is resumed. And once we fix this on the libvirt side, the domain on the destination would be killed anyway even if QEMU is fixed.

*** This bug has been marked as a duplicate of bug 2121706 ***