
Bug 2121686

Summary: VM on the target host shuts down unexpectedly when cancelling post-copy at an early stage
Product: Red Hat Enterprise Linux 9    Reporter: yafu <yafu>
Component: libvirt    Assignee: Jiri Denemark <jdenemar>
libvirt sub component: Live Migration    QA Contact: Fangge Jin <fjin>
Status: CLOSED DUPLICATE    Docs Contact:
Severity: unspecified
Priority: unspecified    CC: fjin, lmen, virt-maint, xuzhang
Version: 9.1    Keywords: Triaged
Target Milestone: rc    Flags: pm-rhel: mirror+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:    Environment:
Last Closed: 2022-11-30 09:25:23 UTC    Type: Bug
Regression: ---    Mount Type: ---
Documentation: ---    CRM:
Verified Versions:    Category: ---
oVirt Team: ---    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---    Target Upstream Version:
Embargoed:

Description yafu 2022-08-26 08:55:05 UTC
Description of problem:
VM on the target host shuts down unexpectedly when cancelling post-copy at an early stage.


Version-Release number of selected component (if applicable):
libvirt-8.5.0-5.el9.x86_64
qemu-kvm-7.0.0-11.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a post-copy migration with a low post-copy bandwidth (to leave time to cancel post-copy):
# virsh migrate avocado-vt-vm1 qemu+ssh://*/system --live --p2p --verbose --bandwidth 10 --postcopy --timeout-postcopy --timeout 10 --postcopy-bandwidth 5

2. In another terminal, cancel post-copy in a loop:
# while true; do virsh domjobabort avocado-vt-vm1 --postcopy; done

3. After about 2 minutes, the migration from step 1 terminates:
Migration: [ 77 %]error: internal error: qemu unexpectedly closed the monitor: 2022-08-26T08:46:01.742879Z qemu-kvm: -device {"driver":"cirrus-vga","id":"video0","bus":"pcie.0","addr":"0x1"}: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
2022-08-26T08:46:11.174499Z qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -5
  
4. Check the guest state on both the source and target hosts:
source host:
# virsh list
 Id   Name             State
-------------------------------
 5    avocado-vt-vm1   paused

target host:
# virsh list
(no output)

5. Try to migrate again:
# virsh migrate avocado-vt-vm1 qemu+ssh://*/system --live --p2p --verbose --bandwidth 10 --postcopy --timeout-postcopy --timeout 10 --postcopy-bandwidth 5
error: Requested operation is not valid: another migration job is already running for domain 'avocado-vt-vm1'
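For convenience, the reproduce steps above can be combined into a single script. This is a dry-run sketch only: by default it just records the commands it would run (set DRY_RUN=0 on a real source host to execute them), since actually reproducing the bug needs two configured hosts. The domain name and flags are taken from the steps above; the '*' in the destination URI is the placeholder kept from the report.

```shell
#!/bin/sh
# Dry-run sketch of the reproducer above.
# DRY_RUN=1 (default): record commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
TRACE=${TRACE:-/tmp/postcopy-repro-trace.log}
: > "$TRACE"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*" | tee -a "$TRACE"
    else
        "$@"
    fi
}

DOM=avocado-vt-vm1
DEST='qemu+ssh://*/system'   # '*' is the placeholder host from the report

# Step 2: cancel post-copy in a loop, in the background
(
    while true; do
        run virsh domjobabort "$DOM" --postcopy
        [ "$DRY_RUN" = "1" ] && break   # one iteration is enough for a dry run
        sleep 0.1
    done
) &
ABORT_PID=$!

# Step 1: slow post-copy migration so the abort loop has time to hit it early
run virsh migrate "$DOM" "$DEST" --live --p2p --verbose \
    --bandwidth 10 --postcopy --timeout-postcopy --timeout 10 \
    --postcopy-bandwidth 5

kill "$ABORT_PID" 2>/dev/null
wait "$ABORT_PID" 2>/dev/null || true
```

On a real pair of hosts the abort loop should hit the migration shortly after it enters post-copy, matching the failure in step 3.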

Actual results:
VM on the target host shuts down unexpectedly when cancelling post-copy at an early stage.

Expected results:
The VM should still be running after post-copy is cancelled.

Additional info:

Comment 4 Jiri Denemark 2022-11-29 15:19:26 UTC
Indeed, the domjobabort loop caused the post-copy migration to be paused very
early.

Once migration enters postcopy-active:

    2022-08-26 08:46:10.988+0000: 98631: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c54087460 event={"timestamp": {"seconds": 1661503570, "microseconds": 988214}, "event": "MIGRATION", "data": {"status": "postcopy-active"}}

the Perform phase ends and we call Finish on the destination:

    2022-08-26 08:46:10.989+0000: 97602: debug : qemuDomainObjSetJobPhase:734 : Setting 'migration out' phase to 'perform3_done'
    2022-08-26 08:46:10.989+0000: 97602: debug : qemuMigrationSrcPerformPeer2Peer3:5702 : Finish3 0x7f2c54015190 ret=0

almost at the same time migration is paused as requested by domjobabort and
the state changes to postcopy-paused:

    2022-08-26 08:46:11.006+0000: 97605: info : qemuMonitorSend:887 : QEMU_MONITOR_SEND_MSG: mon=0x7f2c54087460 msg={"execute":"migrate-pause","id":"libvirt-436"}
    2022-08-26 08:46:11.174+0000: 98631: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c54087460 event={"timestamp": {"seconds": 1661503571, "microseconds": 174139}, "event": "MIGRATION", "data": {"status": "postcopy-paused"}}


On the destination, the Finish phase is started by

    2022-08-26 08:46:10.989+0000: 25701: debug : virDomainMigrateFinish3Params:5425 : dconn=0x7f2c20002650, params=0x7f2c3c00cd80, nparams=5, cookiein=0x7f2c3c0158d0, cookieinlen=1185, cookieout=0x7f2c4e8018d0, cookieoutlen=0x7f2c4e8018c4, flags=0x8003, cancelled=0
    2022-08-26 08:46:10.989+0000: 25701: debug : qemuDomainObjStartJobPhase:765 : Starting phase 'finish3' of 'migration in' job

A bit later, migration enters the postcopy-active state, immediately followed by the
failed state:

    2022-08-26 08:46:11.174+0000: 27535: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c4805e2f0 event={"timestamp": {"seconds": 1661503571, "microseconds": 174443}, "event": "MIGRATION", "data": {"status": "postcopy-active"}}
    2022-08-26 08:46:11.174+0000: 27535: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c4805e2f0 event={"timestamp": {"seconds": 1661503571, "microseconds": 174521}, "event": "MIGRATION", "data": {"status": "failed"}}

and the reason for the failure is reported on QEMU's stderr:

    2022-08-26T08:46:11.174499Z qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -5

All of this happened even before we called the "cont" QMP command on the
destination, so the domain never got a chance to run there.
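The event sequence above can be confirmed by pulling the MIGRATION status events out of the libvirtd debug log. A small sketch, demonstrated here against an embedded sample of the log lines quoted above (the real log path, e.g. /var/log/libvirt/libvirtd.log, depends on the configured log_outputs and debug-level log_filters):

```shell
# Extract MIGRATION status transitions from a libvirtd debug log.
extract_migration_events() {
    grep -o '"event": "MIGRATION".*"status": "[a-z-]*"' "$1" |
        sed 's/.*"status": "\([a-z-]*\)".*/\1/'
}

# Embedded sample of the events quoted above; point the function at the
# real log file on a host with debug logging enabled instead.
cat > /tmp/sample-libvirtd.log <<'EOF'
2022-08-26 08:46:10.988+0000: 98631: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c54087460 event={"timestamp": {"seconds": 1661503570, "microseconds": 988214}, "event": "MIGRATION", "data": {"status": "postcopy-active"}}
2022-08-26 08:46:11.174+0000: 98631: info : qemuMonitorJSONIOProcessLine:209 : QEMU_MONITOR_RECV_EVENT: mon=0x7f2c54087460 event={"timestamp": {"seconds": 1661503571, "microseconds": 174139}, "event": "MIGRATION", "data": {"status": "postcopy-paused"}}
EOF

extract_migration_events /tmp/sample-libvirtd.log
# prints:
# postcopy-active
# postcopy-paused
```

Running this over the full destination log would show the postcopy-active event immediately followed by failed, as in the excerpt above.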

That said, there are two issues here. The first is that our Finish API has no
way of telling the source that migration failed before the virtual CPUs were
resumed. Knowing this would allow the source to simply abort the migration,
since the complete state of the domain is still there. Fixing this will require
introducing a completely new API for the Finish phase.

The second issue is in QEMU. I think it should enter the postcopy-paused state
rather than the failed state when migration breaks while in postcopy-active. I
don't know if this is fixable on the QEMU side, but you could clone this BZ for
them to have a look.

And even if QEMU is fixed, we should still fix our part to avoid leaving the
domain paused unless it's really necessary.

Comment 5 Jiri Denemark 2022-11-29 15:41:02 UTC
The issue on the libvirt side is the same as the one observed in bug 2121706,
which contains an easier reproducer, so I will close this one as a duplicate.
However, I'll wait a bit until the QEMU side of this issue is taken care of.

Comment 6 Jiri Denemark 2022-11-30 09:25:23 UTC
I thought about it a bit more and I think there's little value in fixing the
QEMU issue. It would allow the migration to be resumed later, but the domain
would still be paused, if only until the migration is resumed. And once we fix
this on the libvirt side, the domain on the destination would be killed anyway
even if QEMU were fixed.

*** This bug has been marked as a duplicate of bug 2121706 ***