Bug 1475431
| Summary: | migration/postcopy: Handle network failures (libvirt) | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Hai Huang <hhuang> | |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> | |
| libvirt sub component: | Live Migration | QA Contact: | Fangge Jin <fjin> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | high | |||
| Priority: | high | CC: | chayang, dgilbert, dyuan, dzheng, fdeutsch, fjin, hhan, jdenemar, jen, jsuchane, juzhang, kanderso, kchamart, knoel, laine, lcheng, lmen, mtessun, peterx, qzhang, virt-maint, xuzhang, yalzhang | |
| Version: | unspecified | Keywords: | FutureFeature, Reopened, Triaged | |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ | |
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | libvirt-8.5.0-1.el9 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | 1475305 | |||
| : | 1654722 (view as bug list) | Environment: | ||
| Last Closed: | 2022-11-15 10:03:03 UTC | Type: | Feature Request | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | 8.5.0 | |
| Embargoed: | ||||
| Bug Depends On: | 1475305 | |||
| Bug Blocks: | 1654722 | |||
Description
Hai Huang
2017-07-26 16:33:46 UTC
This is now implemented in QEMU by a series of commits ending with
commit d37297dc66202c33f9cafbc48ccae629e7d6dc31
Refs: v2.12.0-571-gd37297dc66
Author: Peter Xu <peterx>
AuthorDate: Wed May 2 18:47:40 2018 +0800
Commit: Juan Quintela <quintela>
CommitDate: Tue May 15 22:13:08 2018 +0200
migration/hmp: add migrate_pause command
Wrapper for QMP command "migrate-pause".
Reviewed-by: Dr. David Alan Gilbert <dgilbert>
Signed-off-by: Peter Xu <peterx>
Message-Id: <20180502104740.12123-25-peterx>
Signed-off-by: Juan Quintela <quintela>
Is there actually a plan of how this can be provided by qemu/libvirt?

Yes, there is a plan... we can call migrate_recover QMP command to restart the incoming migration on the destination and resume the migration with resume=true argument for the migrate QMP command on the source. We'll need to figure out some details and start implementing it in libvirt.

Just in case - the qemu api reference: https://wiki.qemu.org/Features/PostcopyRecovery

(In reply to Jiri Denemark from comment #11)
> Yes, there is a plan... we can call migrate_recover QMP command to restart
> the incoming migration on the destination and resume the migration with
> resume=true argument for the migrate QMP command on the source. We'll need to
> figure out some details and start implementing it in libvirt.

The tricky bit is to make sure it's all done using OOB commands and that nothing else in libvirt gets knotted up with non-OOB qmp's that might not be responding.

Dave

There's been no update for a bit and rather than just move to 8.6.0, just moving to the backlog to be re-evaluated during planning.

Bulk update: Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Hi, Jiri - any update on this?

It's in progress. The feature works, but there are still a few missing pieces, such as stats once migration completes. The good thing is my list of things to do before sending the (rather big) series for review is shrinking, while it was mostly growing in the past while I was testing it and hitting cases I did not properly handle.

Thanks, Jiri. Let me know if there's anything I can do from qemu side.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Hold on - I can understand this could be challenging on libvirt side and when to have it is uncertain. If there's still any possibility it'll be worked out in libvirt, should it be kept OPEN and we just move the target version? Any more information on how this decision was made? Thanks.

Peter: That last message is just the auto-stale bot; no one made any decision; so I assume Jiri is still working it upstream.

Reopening, work in progress.

This feature has been implemented in a series of commits ending with
commit cf3842ef08792c5d34051746b9d6f217c26c6647
Refs: v8.4.0-168-gcf3842ef08
Author: Jiri Denemark <jdenemar>
AuthorDate: Tue May 10 15:20:25 2022 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Tue Jun 7 17:40:20 2022 +0200
qemu: Enable support for VIR_MIGRATE_POSTCOPY_RESUME
Since all parts of post-copy recovery have been implemented now, it's
time to enable the corresponding flag.
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: Peter Krempa <pkrempa>
Reviewed-by: Pavel Hrdina <phrdina>
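
For readers unfamiliar with the QEMU side, the recovery flow planned earlier in this thread (migrate_recover on the destination, then migrate with resume=true on the source) can be sketched as the two QMP messages below. This is only an illustration, not what libvirt actually sends: the helper names, the `id` values and the example URI are hypothetical, and it assumes the QMP spelling `migrate-recover` and that the `oob` capability was enabled during `qmp_capabilities` negotiation so that `exec-oob` is accepted.

```python
import json


def qmp_command(name, arguments=None, oob=False, cmd_id=None):
    """Serialize one QMP command; 'exec-oob' requests out-of-band execution."""
    key = "exec-oob" if oob else "execute"
    msg = {key: name}
    if arguments:
        msg["arguments"] = arguments
    if cmd_id is not None:
        msg["id"] = cmd_id
    return json.dumps(msg)


def recovery_commands(uri):
    """Return (destination_cmd, source_cmd) for resuming a paused post-copy.

    Hypothetical helper: the destination reopens its incoming channel with
    migrate-recover (sent out-of-band), then the source reconnects with
    migrate and resume=true.
    """
    dest = qmp_command("migrate-recover", {"uri": uri}, oob=True, cmd_id="recover-1")
    src = qmp_command("migrate", {"uri": uri, "resume": True}, cmd_id="resume-1")
    return dest, src


if __name__ == "__main__":
    # Example URI only; 192.0.2.0/24 is a documentation address range.
    dest, src = recovery_commands("tcp:192.0.2.10:49152")
    print("destination:", dest)
    print("source:     ", src)
```

Sending `migrate-recover` out-of-band reflects the concern raised above that the regular QMP command queue may not be responding while migration is stuck; at the libvirt API level the same operation is requested with the VIR_MIGRATE_POSTCOPY_RESUME flag enabled by the commit above.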
Tested with upstream patches, most test scenarios passed. There are some small issues, which I will report on gitlab.

Track the issues on gitlab:
https://gitlab.com/libvirt/libvirt/-/issues/338
https://gitlab.com/libvirt/libvirt/-/issues/337
https://gitlab.com/libvirt/libvirt/-/issues/336
https://gitlab.com/libvirt/libvirt/-/issues/334
https://gitlab.com/libvirt/libvirt/-/issues/333

Test with:
libvirt-8.5.0-2.el9.x86_64
qemu-kvm-7.0.0-9.el9.x86_64

Most scenarios passed, some small issues are tracked on gitlab (see comment30)

* Basic scenarios
  * When to abort migration
    * Abort after migration switches to postcopy and before target vm becomes running
  * How to abort migration
    * Domjobabort --postcopy
    * Network issue
      * Libvirt layer: keepalive timeout
        * Migration is not affected, and switches to unattended migration
      * Qemu layer: tcp timeout
      * Both libvirt layer and qemu layer
      * Proxy issue for unix+proxy transport
  * Control path
    * Non-p2p
    * P2p
  * Network data transport
    * Tcp
    * Tls
    * Unix + proxy
  * Checkpoints during test
    * Check domjobinfo
    * Check domstate --reason
    * Check event output
    * Check migration port reuse when recovering
    * Check migration port release after recovering finishes
  * Migrate vm back
* Extended scenarios/checkpoints:
  * Resume failed, then resume again
    * E.g. migration port is not opened in firewalld
    * E.g. migration port is occupied by other app
    * E.g. proxy is not fixed
  * Do statistic commands: dommemstat, domstats, domblkinfo during testing
  * Vm IO error
    * During postcopy
    * During postcopy-paused
    * During postcopy-recovering
  * Operations during postcopy migration
    * Restart virtqemud
    * Kill virtqemud
    * Kill qemu process
    * Poweroff inside vm
  * Operations after postcopy is aborted
    * Restart virtqemud
    * Kill virtqemud
    * Kill qemu process
    * Poweroff inside vm
  * Operations during postcopy recovering
    * Abort migration again
      * Network issue
        * Tcp timeout
        * Proxy broken
      * User triggered abort (virsh domjobabort --postcopy <domain>)
    * Restart virtqemud
    * Kill virtqemud
    * Kill qemu process
    * Poweroff inside vm
    * etc
  * Operations during unattended migration
    * Abort migration
    * Poweroff inside vm
    * etc
  * Readonly mode:
    * # virsh -r domjobabort vm1 --postcopy
      error: operation forbidden: read only access prevents virDomainAbortJobFlags
  * Resume postcopy when no aborted postcopy migration exists
    * Resume when vm is not migrating at all
    * Resume when migration is running normally
  * Test with --persistent --undefinesource
  * Test with --migrateuri
    * With specified ip, port
    * Notes:
      * From the test result, need to specify --migrateuri when resuming if you don’t want to use the default one
  * Test with --listen-address
    * Notes:
      * From the test result, need to specify --listen-address when resuming if you don’t want to use the default one
  * Test with --parallel
  * Test with --dname/--xml/--persistent-xml
  * Set postcopy migration bandwidth during recovering
    * Use --postcopy-bandwidth when resuming
    * Virsh migrate-setspeed --postcopy
  * Check guest agent status
* Regression
  * Migrate paused guest
  * Offline migration
  * Precopy migration
    * Non-p2p, p2p
  * Poweroff vm
  * Kill qemu/libvirtd

Another issue:
Bug 2111948 - Postcopy-recover failed if vm I/O error occurred during postcopy-paused status

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Low: libvirt security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8003