Bug 1475431 - migration/postcopy: Handle network failures (libvirt)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jiri Denemark
QA Contact: Fangge Jin
URL:
Whiteboard:
Depends On: 1475305
Blocks: 1654722
Reported: 2017-07-26 16:33 UTC by Hai Huang
Modified: 2022-12-05 08:31 UTC (History)

Fixed In Version: libvirt-8.5.0-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1475305
Clones: 1654722
Environment:
Last Closed: 2022-11-15 10:03:03 UTC
Type: Feature Request
Target Upstream Version: 8.5.0
Embargoed:



Links
System                  ID               Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker   LIBVIRTAT-13625  0        None      None    None     2022-10-31 17:01:39 UTC
Red Hat Issue Tracker   LIBVIRTAT-13626  0        None      None    None     2022-10-31 17:01:39 UTC
Red Hat Issue Tracker   LIBVIRTAT-13627  0        None      None    None     2022-10-31 17:01:39 UTC
Red Hat Issue Tracker   LIBVIRTAT-13628  0        None      None    None     2022-10-31 17:01:39 UTC
Red Hat Issue Tracker   LIBVIRTAT-13629  0        None      None    None     2022-10-31 17:01:39 UTC
Red Hat Issue Tracker   LIBVIRTAT-13630  0        None      None    None     2022-10-31 17:01:39 UTC
Red Hat Issue Tracker   LIBVIRTAT-13631  0        None      None    None     2022-10-31 17:01:39 UTC
Red Hat Issue Tracker   LIBVIRTAT-13632  0        None      None    None     2022-10-31 17:01:39 UTC
Red Hat Issue Tracker   LIBVIRTAT-13633  0        None      None    None     2022-10-31 17:01:39 UTC
Red Hat Product Errata  RHSA-2022:8003   0        None      None    None     2022-11-15 10:03:48 UTC

Description Hai Huang 2017-07-26 16:33:46 UTC
+++ This bug was initially created as a clone of Bug #1475305 +++

Description of problem:
A failure of the migration network during the postcopy phase of migration is fatal: we can't restart the source, since the current state is spread between the two hosts. We want to be able to reconnect the migration stream and complete the migration after someone has fixed the network.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Start a migration
2. Switch to postcopy mode
3. Take an axe to the network cable
4. Wait for TCP timeout
5. Replace network cable

Actual results:
Migration fails after the TCP timeout, the destination VM fails, and can't be restarted.

Expected results:
Some way to recover.

Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-07-26 08:19:07 EDT ---

Since this bug report was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.

Comment 6 Jiri Denemark 2018-05-29 22:05:59 UTC
This is now implemented in QEMU by a series of commits ending with

commit d37297dc66202c33f9cafbc48ccae629e7d6dc31
Refs: v2.12.0-571-gd37297dc66
Author:     Peter Xu <peterx>
AuthorDate: Wed May 2 18:47:40 2018 +0800
Commit:     Juan Quintela <quintela>
CommitDate: Tue May 15 22:13:08 2018 +0200

    migration/hmp: add migrate_pause command

    Wrapper for QMP command "migrate-pause".

    Reviewed-by: Dr. David Alan Gilbert <dgilbert>
    Signed-off-by: Peter Xu <peterx>
    Message-Id: <20180502104740.12123-25-peterx>
    Signed-off-by: Juan Quintela <quintela>

Comment 9 Fabian Deutsch 2020-08-17 14:06:13 UTC
Is there actually a plan of how this can be provided by qemu/libvirt?

Comment 11 Jiri Denemark 2020-09-03 12:46:27 UTC
Yes, there is a plan... we can call migrate_recover QMP command to restart the
incoming migration on the destination and resume the migration with
resume=true argument for the migrate QMP command on the source. We'll need to
figure out some details and start implementing it in libvirt.
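
The two QMP calls described above can be sketched as JSON payloads. This is only an illustration of the message shapes, not libvirt's actual implementation: the helper name, the command id, and the recovery URI are hypothetical, and the QMP spelling of the destination command is assumed to be "migrate-recover" (the underscore form is the HMP name). The destination command is built as an out-of-band command, which assumes the "oob" capability was enabled when the monitor was negotiated.

```python
import json


def qmp_command(name, arguments=None, oob=False, cmd_id=None):
    """Build one QMP command as a JSON string (hypothetical helper)."""
    # Out-of-band commands use the "exec-oob" key instead of "execute".
    msg = {"exec-oob" if oob else "execute": name}
    if arguments:
        msg["arguments"] = arguments
    if cmd_id is not None:
        msg["id"] = cmd_id
    return json.dumps(msg)


# Hypothetical address on which the destination listens again.
RECOVERY_URI = "tcp:192.0.2.10:49152"

# Destination: restart the incoming side of the paused migration.
dest_cmd = qmp_command(
    "migrate-recover", {"uri": RECOVERY_URI}, oob=True, cmd_id="recover-1"
)

# Source: resume the outgoing stream towards the same address.
src_cmd = qmp_command("migrate", {"uri": RECOVERY_URI, "resume": True})

print(dest_cmd)
print(src_cmd)
```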

Comment 12 Peter Xu 2020-09-03 15:02:04 UTC
Just in case - the qemu api reference:

https://wiki.qemu.org/Features/PostcopyRecovery

Comment 14 Dr. David Alan Gilbert 2020-09-11 09:32:23 UTC
(In reply to Jiri Denemark from comment #11)
> Yes, there is a plan... we can call migrate_recover QMP command to restart
> the
> incoming migration on the destination and resume the migration with
> resume=true argument for the migrate QMP command on the source. We'll need to
> figure out some details and start implementing it in libvirt.

The tricky bit is to make sure it's all done using OOB commands and that nothing else
in libvirt gets knotted up with non-OOB qmp's that might not be responding.

Dave

Comment 18 John Ferlan 2021-08-10 15:16:25 UTC
There's been no update for a bit and rather than just move to 8.6.0, just moving to the backlog to be re-evaluated during planning.

Comment 19 John Ferlan 2021-09-09 18:30:44 UTC
Bulk update: Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Comment 20 Peter Xu 2022-03-07 03:33:29 UTC
Hi, Jiri - any update on this?

Comment 21 Jiri Denemark 2022-03-07 15:50:36 UTC
It's in progress. The feature works, but there are still a few missing pieces,
such as stats once migration completes. The good thing is my list of
things to do before sending the (rather big) series for review is shrinking,
while it was mostly growing in the past while I was testing it and hitting
cases I did not properly handle.

Comment 22 Peter Xu 2022-03-09 12:39:13 UTC
Thanks, Jiri.  Let me know if there's anything I can do from qemu side.

Comment 23 RHEL Program Management 2022-04-15 07:27:26 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 24 Peter Xu 2022-04-15 13:43:00 UTC
Hold on - I can understand this could be challenging on libvirt side and when to have it is uncertain.  If there's still any possibility it'll be worked out in libvirt, should it be kept OPEN and we just move the target version?

Any more information on how this decision was made?

Thanks.

Comment 25 Dr. David Alan Gilbert 2022-04-21 09:29:42 UTC
Peter: That last message is just the auto-stale bot; no one made any decision, so I assume Jiri is still working on it upstream.

Comment 26 Jaroslav Suchanek 2022-04-21 12:02:38 UTC
Reopening, work in progress.

Comment 27 Jiri Denemark 2022-06-17 13:31:02 UTC
This feature has been implemented in a series of commits ending with

commit cf3842ef08792c5d34051746b9d6f217c26c6647
Refs: v8.4.0-168-gcf3842ef08
Author:     Jiri Denemark <jdenemar>
AuthorDate: Tue May 10 15:20:25 2022 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Tue Jun 7 17:40:20 2022 +0200

    qemu: Enable support for VIR_MIGRATE_POSTCOPY_RESUME

    Since all parts of post-copy recovery have been implemented now, it's
    time to enable the corresponding flag.

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Peter Krempa <pkrempa>
    Reviewed-by: Pavel Hrdina <phrdina>
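
From the virsh side, a recovery attempt might look like the sketch below. It only builds the argument lists and does not run anything. The helper name, the destination URI, and the migration URI are hypothetical, and it assumes that the resume call repeats the flags of the original migration and adds --postcopy-resume, the virsh option corresponding to VIR_MIGRATE_POSTCOPY_RESUME.

```python
import shlex


def migrate_argv(domain, dest_uri, resume=False, extra=()):
    """Build a `virsh migrate` argument list for a post-copy migration."""
    argv = ["virsh", "migrate", "--live", "--postcopy"]
    if resume:
        # Ask libvirt to resume a migration that failed in post-copy.
        argv.append("--postcopy-resume")
    argv += list(extra)
    argv += [domain, dest_uri]
    return argv


# Hypothetical destination and migration URI.
DEST = "qemu+ssh://dst.example.com/system"
MIGRATE_URI = "tcp://dst.example.com:49152"

# 1. Start the migration; 2. switch it to post-copy mode.
start_cmd = migrate_argv("vm1", DEST)
switch_cmd = ["virsh", "migrate-postcopy", "vm1"]

# 3. After the network has been repaired, resume the broken migration.
resume_cmd = migrate_argv(
    "vm1", DEST, resume=True, extra=["--migrateuri", MIGRATE_URI]
)

for cmd in (start_cmd, switch_cmd, resume_cmd):
    print(shlex.join(cmd))
```

Passing each list to subprocess.run() would execute the step; that is left out here so the sketch stays free of side effects.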

Comment 29 Fangge Jin 2022-06-27 03:19:41 UTC
Tested with upstream patches, most test scenarios passed.
There are some small issues, which I will report on gitlab.

Comment 33 Fangge Jin 2022-07-28 10:51:50 UTC
Test with:
libvirt-8.5.0-2.el9.x86_64
qemu-kvm-7.0.0-9.el9.x86_64

Most scenarios passed; some small issues are tracked on gitlab (see comment 30).

* Basic scenarios
   * When to abort migration
      * Abort after migration switches to postcopy and before target vm becomes running
   * How to abort migration
      * domjobabort --postcopy
      * Network issue
         * Libvirt layer: keepalive timeout
            * Migration is not affected, and switches to unattended migration
         * Qemu layer: tcp timeout
         * Both libvirt layer and qemu layer
      * Proxy issue for unix+proxy transport
   * Control path
      * Non-p2p
      * P2p
   * Network data transport
      * Tcp
      * Tls
      * Unix + proxy
* Checkpoints during test
   * Check domjobinfo
   * Check domstate --reason
   * Check event output
   * Check migration port reuse when recovering
   * Check migration port release after recovering finishes
   * Migrate vm back
* Extended scenarios/checkpoints:
   * Resume failed, then resume again
      * E.g. migration port is not opened in firewalld
      * E.g. migration port is occupied by other app
      * E.g. proxy is not fixed
   * Do statistic commands: dommemstat, domstats, domblkinfo during testing
   * Vm IO error
      * During postcopy
      * During postcopy-paused
      * During postcopy-recovering
   * Operations during postcopy migration
      * Restart virtqemud 
      * Kill virtqemud
      * Kill qemu process
      * Poweroff inside vm
   * Operations after postcopy is aborted
      * Restart virtqemud
      * Kill virtqemud
      * Kill qemu process
      * Poweroff inside vm
   * Operations during postcopy recovering
      * Abort migration again 
         * Network issue
            * Tcp timeout
            * Proxy broken
         * User triggered abort (virsh domjobabort --postcopy <domain>)
      * Restart virtqemud
      * Kill virtqemud
      * Kill qemu process
      * Poweroff inside vm
      * etc
   * Operations during unattended migration
      * Abort migration
      * Poweroff inside vm
      * etc
   * Readonly mode:
      * # virsh -r domjobabort vm1 --postcopy
        error: operation forbidden: read only access prevents virDomainAbortJobFlags
   * Resume postcopy when no aborted postcopy migration exists
      * Resume when vm is not migrating at all
      * Resume when migration is running normally
   * Test with --persistent --undefinesource
   * Test with --migrateuri
      * With specified ip, port
      * Notes:
         * From the test result, you need to specify --migrateuri when resuming if you don’t want to use the default one
   * Test with --listen-address
      * Notes:
         * From the test result, you need to specify --listen-address when resuming if you don’t want to use the default one
   * Test with --parallel
   * Test with --dname/--xml/--persistent-xml
   * Set postcopy migration bandwidth during recovering
      * Use --postcopy-bandwidth when resuming
      * virsh migrate-setspeed --postcopy
   * Check guest agent status
* Regression
   * Migrate paused guest
   * Offline migration
   * Precopy migration
      * Non-p2p, p2p
      * Poweroff vm 
      * Kill qemu/libvirtd
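
A checkpoint such as "Check domstate --reason" could be automated along the lines of the sketch below. It assumes the command prints one line of the form "state (reason)"; the helper name is hypothetical and the sample strings are illustrative inputs, not output captured from the test runs above.

```python
import re


def parse_domstate(line):
    """Split a 'state (reason)' line into a (state, reason) tuple."""
    # The state itself may contain spaces (e.g. "shut off"), so match it
    # lazily and treat a trailing parenthesised part as the reason.
    match = re.fullmatch(r"(.+?)(?:\s+\((.*)\))?", line.strip())
    if match is None:
        raise ValueError(f"unrecognised domstate output: {line!r}")
    return match.group(1), match.group(2)


# Illustrative inputs (assumed format, not captured from a real host).
samples = ["paused (post-copy failed)", "shut off (destroyed)", "running"]
for sample in samples:
    print(parse_domstate(sample))
```

A test script could feed the real output of the command into parse_domstate() and compare the reason with the one expected for each phase.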

Comment 34 Fangge Jin 2022-07-28 13:54:22 UTC
Another issue: Bug 2111948 - Postcopy-recover failed if vm I/O error occurred during postcopy-paused status

Comment 37 errata-xmlrpc 2022-11-15 10:03:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: libvirt security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8003

