Bug 2121686 - VM on the target host shuts down unexpectedly when cancelling post-copy at an early stage
Summary: VM on the target host shuts down unexpectedly when cancelling post-copy at an early...
Keywords:
Status: CLOSED DUPLICATE of bug 2121706
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.1
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Assignee: Jiri Denemark
QA Contact: Fangge Jin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-26 08:55 UTC by yafu
Modified: 2022-11-30 09:25 UTC (History)
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-30 09:25:23 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-132424 0 None None None 2022-08-26 08:56:05 UTC

Description yafu 2022-08-26 08:55:05 UTC
Description of problem:
The VM on the target host shuts down unexpectedly when cancelling post-copy at an early stage.


Version-Release number of selected component (if applicable):
libvirt-8.5.0-5.el9.x86_64
qemu-kvm-7.0.0-11.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a post-copy migration with a low post-copy bandwidth (to leave time to cancel it):
# virsh migrate avocado-vt-vm1 qemu+ssh://*/system --live --p2p  --verbose    --bandwidth 10 --postcopy --timeout-postcopy --timeout 10 --postcopy-bandwidth 5

2. In another terminal, cancel post-copy in a loop:
# while true; do virsh domjobabort avocado-vt-vm1 --postcopy; done

3. After about 2 minutes, the migration from step 1 terminates:
Migration: [ 77 %]error: internal error: qemu unexpectedly closed the monitor: 2022-08-26T08:46:01.742879Z qemu-kvm: -device {"driver":"cirrus-vga","id":"video0","bus":"pcie.0","addr":"0x1"}: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2022-08-26T08:46:11.174499Z qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -5
  
4. Check the guest state on both the source and target hosts:
source host:
#virsh list
 Id   Name             State
-------------------------------
 5    avocado-vt-vm1   paused

target host:
# virsh list
(no output)

5. Attempt the migration again:
# virsh migrate avocado-vt-vm1 qemu+ssh://*/system --live --p2p  --verbose    --bandwidth 10 --postcopy --timeout-postcopy --timeout 10 --postcopy-bandwidth 5
error: Requested operation is not valid: another migration job is already running for domain 'avocado-vt-vm1'

Actual results:
The VM on the target host shuts down unexpectedly when cancelling post-copy at an early stage.

Expected results:
The VM on the target host should be running after post-copy is cancelled.

Additional info:

Comment 4 Jiri Denemark 2022-11-29 15:19:26 UTC
Indeed, the domjobabort loop caused the post-copy migration to be paused very
early.

Once migration enters postcopy-active:

    2022-08-26 08:46:10.988+0000: 98631: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c54087460 event={"timestamp": {"seconds": 1661503570, "microseconds": 988214}, "event": "MIGRATION", "data": {"status": "postcopy-active"}}

the Perform phase ends and we call Finish on the destination:

    2022-08-26 08:46:10.989+0000: 97602: debug : qemuDomainObjSetJobPhase:734 : Setting 'migration out' phase to 'perform3_done'
    2022-08-26 08:46:10.989+0000: 97602: debug : qemuMigrationSrcPerformPeer2Peer3:5702 : Finish3 0x7f2c54015190 ret=0

almost at the same time migration is paused as requested by domjobabort and
the state changes to postcopy-paused:

    2022-08-26 08:46:11.006+0000: 97605: info : qemuMonitorSend:887 : QEMU_MONITOR_SEND_MSG: mon=0x7f2c54087460 msg={"execute":"migrate-pause","id":"libvirt-436"}
    2022-08-26 08:46:11.174+0000: 98631: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c54087460 event={"timestamp": {"seconds": 1661503571, "microseconds": 174139}, "event": "MIGRATION", "data": {"status": "postcopy-paused"}}


On the destination, the Finish phase is started by

    2022-08-26 08:46:10.989+0000: 25701: debug : virDomainMigrateFinish3Params:5425 : dconn=0x7f2c20002650, params=0x7f2c3c00cd80, nparams=5, cookiein=0x7f2c3c0158d0, cookieinlen=1185, cookieout=0x7f2c4e8018d0, cookieoutlen=0x7f2c4e8018c4, flags=0x8003, cancelled=0
    2022-08-26 08:46:10.989+0000: 25701: debug : qemuDomainObjStartJobPhase:765 : Starting phase 'finish3' of 'migration in' job

A bit later, migration enters postcopy-active, immediately followed by the failed
state:

    2022-08-26 08:46:11.174+0000: 27535: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c4805e2f0 event={"timestamp": {"seconds": 1661503571, "microseconds": 174443}, "event": "MIGRATION", "data": {"status": "postcopy-active"}}
    2022-08-26 08:46:11.174+0000: 27535: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c4805e2f0 event={"timestamp": {"seconds": 1661503571, "microseconds": 174521}, "event": "MIGRATION", "data": {"status": "failed"}}

and the reason for the failure is reported on QEMU's stderr:

    2022-08-26T08:46:11.174499Z qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -5

This all happened even before we called the "cont" QMP command on the destination,
so the domain did not get a chance to run there.
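The MIGRATION state transitions quoted above can be pulled out of a libvirtd debug log with a short script. The following is a minimal stand-alone sketch (the regular expression and the sample lines are based on the log excerpts in this report; the helper name `migration_states` is illustrative, not part of libvirt):

```python
import json
import re

# Matches libvirtd debug-log lines that carry a QEMU event, e.g.
# "... QEMU_MONITOR_RECV_EVENT: mon=0x7f2c4805e2f0 event={...}"
EVENT_RE = re.compile(r'QEMU_MONITOR_RECV_EVENT: mon=\S+ event=(\{.*\})')

def migration_states(log_lines):
    """Yield (timestamp, status) for every MIGRATION event in the log."""
    for line in log_lines:
        m = EVENT_RE.search(line)
        if not m:
            continue
        event = json.loads(m.group(1))
        if event.get("event") != "MIGRATION":
            continue
        ts = event["timestamp"]
        yield ts["seconds"] + ts["microseconds"] / 1e6, event["data"]["status"]

# Sample lines copied from the destination log quoted in this comment:
log = [
    '2022-08-26 08:46:11.174+0000: 27535: info : qemuMonitorJSONIOProcessLine:209 : '
    'QEMU_MONITOR_RECV_EVENT: mon=0x7f2c4805e2f0 event={"timestamp": {"seconds": 1661503571, '
    '"microseconds": 174443}, "event": "MIGRATION", "data": {"status": "postcopy-active"}}',
    '2022-08-26 08:46:11.174+0000: 27535: info : qemuMonitorJSONIOProcessLine:209 : '
    'QEMU_MONITOR_RECV_EVENT: mon=0x7f2c4805e2f0 event={"timestamp": {"seconds": 1661503571, '
    '"microseconds": 174521}, "event": "MIGRATION", "data": {"status": "failed"}}',
]

for ts, status in migration_states(log):
    print(f"{ts:.6f} {status}")
```

Running it over the destination log shows the postcopy-active state followed within a fraction of a second by failed, which is the transition discussed here.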

That said, there are two issues here. The first one is that our Finish API has no
way of telling the source that migration failed before the virtual CPUs were
resumed. Knowing this would allow the source to just abort the migration, since
the complete state of the domain is still there. Fixing this will require
introducing a completely new API for the Finish phase.

The second issue is in QEMU. I think it should enter the postcopy-paused rather
than the failed state when migration breaks in the postcopy-active state. I don't
know if this is fixable on the QEMU side, but you could clone the BZ for them to
have a look.

And even if QEMU is fixed, we should still fix our part to avoid leaving the
domain paused unless it's really necessary.

Comment 5 Jiri Denemark 2022-11-29 15:41:02 UTC
The issue on libvirt side is the same as observed in bug 2121706, which
contains an easier reproducer so I will close this one as a duplicate.
Although I'll wait a bit until the QEMU side of this issue is taken care of.

Comment 6 Jiri Denemark 2022-11-30 09:25:23 UTC
I thought about it a bit more and I think there's little value in fixing the
QEMU issue. It would allow the migration to be resumed later, but the domain
would still be paused, albeit only until the migration is resumed. And once
we fix this on the libvirt side, the domain on the destination would be killed
anyway even if QEMU is fixed.

*** This bug has been marked as a duplicate of bug 2121706 ***

