Bug 2111948
| Summary: | Postcopy-recover failed if vm I/O error occurred during postcopy-paused status | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Fangge Jin <fjin> | ||||
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> | ||||
| libvirt sub component: | Live Migration | QA Contact: | Fangge Jin <fjin> | ||||
| Status: | CLOSED ERRATA | Docs Contact: | |||||
| Severity: | unspecified | ||||||
| Priority: | unspecified | CC: | jdenemar, lcheng, lmen, virt-maint, xuzhang, yafu | ||||
| Version: | 9.1 | Keywords: | AutomationTriaged, Triaged | ||||
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ | ||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | libvirt-9.0.0-1.el9 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2023-05-09 07:26:34 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | 9.0.0 | ||||
| Embargoed: | |||||||
| Attachments: | ||||||
Patches sent upstream for review: https://listman.redhat.com/archives/libvir-list/2022-December/236393.html
Fixed upstream by
commit b92cba67c67551139e5421d97a66620e836a0523
Refs: v8.10.0-202-gb92cba67c6
Author: Jiri Denemark <jdenemar>
AuthorDate: Wed Dec 7 14:46:25 2022 +0100
Commit: Jiri Denemark <jdenemar>
CommitDate: Fri Jan 6 16:17:38 2023 +0100
conf: Drop virDomainJobOperation parameter from virDomainObjIsPostcopy
The parameter was only used to select which states correspond to an
active or failed post-copy migration. But these states are either
applicable to both operations or the check would just paper over a code
bug in case of an impossible combination of state and operation. By
dropping the check we can make the code simpler and also reuse existing
virDomainObjIsFailedPostcopy function and only check for active
post-copy states.
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: Michal Privoznik <mprivozn>
commit 49a57540638aa0898432ace1e016a77006d272af
Refs: v8.10.0-203-g49a5754063
Author: Jiri Denemark <jdenemar>
AuthorDate: Tue Dec 13 16:43:53 2022 +0100
Commit: Jiri Denemark <jdenemar>
CommitDate: Fri Jan 6 16:17:38 2023 +0100
conf: Add job parameter to virDomainObjIsFailedPostcopy
Unused for now, but this will change soon.
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: Michal Privoznik <mprivozn>
commit 7050dad5f92010720cc8e8b7d5c37eaad7696c5e
Refs: v8.10.0-204-g7050dad5f9
Author: Jiri Denemark <jdenemar>
AuthorDate: Thu Dec 15 14:12:43 2022 +0100
Commit: Jiri Denemark <jdenemar>
CommitDate: Fri Jan 6 16:17:38 2023 +0100
qemu: Remember failed post-copy migration in job
When post-copy migration fails, the domain stays running on the
destination with a VIR_DOMAIN_RUNNING_POSTCOPY_FAILED reason. Both the
state and the reason can later be rewritten in case the domain gets
paused for other reasons (such as an I/O error). Thus we need a separate
place to remember the post-copy migration failed to be able to resume
the migration.
https://bugzilla.redhat.com/show_bug.cgi?id=2111948
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: Michal Privoznik <mprivozn>
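The reasoning in the commit messages above can be illustrated with a small toy model. This is only a sketch and not libvirt code: the `Domain` and `Job` classes, the `post_copy_failed` field, the `use_job` switch and all helper functions are hypothetical names made up for illustration, and the model ignores the difference between source and destination hosts. It shows why a check that looks only at the domain state and reason loses track of a failed post-copy migration once an I/O error rewrites them, while a flag stored alongside the job survives.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for libvirt state/reason values; not the real enums.
RUNNING, PAUSED = "running", "paused"
POSTCOPY, POSTCOPY_FAILED, IOERROR = "post-copy", "post-copy failed", "I/O error"


@dataclass
class Job:
    # Hypothetical flag: remembers the failure independently of domain state.
    post_copy_failed: bool = False


@dataclass
class Domain:
    state: str = RUNNING
    reason: str = POSTCOPY
    job: Job = field(default_factory=Job)


def is_failed_postcopy(dom, use_job):
    """Failed post-copy check; use_job toggles the job-based memory."""
    if use_job and dom.job.post_copy_failed:
        return True
    return dom.reason == POSTCOPY_FAILED


def is_postcopy(dom, use_job):
    """Active or failed post-copy, with no per-operation parameter."""
    return is_failed_postcopy(dom, use_job) or dom.reason == POSTCOPY


def fail_postcopy(dom):
    """Post-copy migration fails: set the reason and remember it in the job."""
    dom.state, dom.reason = RUNNING, POSTCOPY_FAILED
    dom.job.post_copy_failed = True


def io_error(dom):
    """An unrelated pause overwrites both the state and the reason."""
    dom.state, dom.reason = PAUSED, IOERROR


dom = Domain()
fail_postcopy(dom)
io_error(dom)
print("state-only check:", is_postcopy(dom, use_job=False))
print("job-aware check: ", is_postcopy(dom, use_job=True))
```

In this model the state-only check no longer recognizes the migration after the I/O error, which corresponds to the "not in post-copy phase" error from the original report, while the job-aware check still does.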
Pre-verify with libvirt-9.0.0-1.el9.x86_64
Steps:
1. Start vm
2. Migrate vm to target host and switch to postcopy
# virsh migrate vm1 qemu+tcp://{target_ip}/system --live --postcopy --undefinesource --persistent --p2p --bandwidth 3 --postcopy-bandwidth 3 --migrateuri tcp://{target_ip}:49153
3. Abort postcopy migration
# virsh domjobabort vm1 --postcopy
4. Trigger an I/O error in the vm.
For example: change ownership of vm to root:root, then do disk I/O in vm.
Event output:
2023-01-17 03:43:03.879+0000: event 'io-error' for domain 'vm1': /nfs/RHEL-9.1-x86_64-latest-ovmf.qcow2.2 (virtio-disk0) report
2023-01-17 03:43:03.879+0000: event 'io-error-reason' for domain 'vm1': /nfs/RHEL-9.1-x86_64-latest-ovmf.qcow2.2 (virtio-disk0) report due to
5. Check domain state with reason:
# virsh domstate vm1 --reason
running (post-copy failed)
6. Resume postcopy migration; it succeeds:
# virsh migrate vm1 qemu+tcp://{target_ip}/system --live --postcopy --undefinesource --persistent --p2p --bandwidth 3 --postcopy-bandwidth 3 --migrateuri tcp://{target_ip}:49153 --postcopy-resume
Event output on target host:
2023-01-17 03:45:43.940+0000: event 'lifecycle' for domain 'vm1': Resumed Post-copy
2023-01-17 03:46:15.658+0000: event 'lifecycle' for domain 'vm1': Resumed Migrated
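For scripting the steps above, the initial and the resume invocation differ only in the trailing `--postcopy-resume` flag. The following is a minimal sketch that builds both command lines; the helper name `build_migrate_argv`, its parameters, and the address `192.0.2.10` are placeholders chosen for illustration, and only flags that appear in the steps above are used. It prints the commands rather than running them.

```python
import shlex


def build_migrate_argv(domain, target_ip, port=49153, bandwidth=3, resume=False):
    """Build the virsh migrate argument list used in the verification steps."""
    argv = [
        "virsh", "migrate", domain, f"qemu+tcp://{target_ip}/system",
        "--live", "--postcopy", "--undefinesource", "--persistent", "--p2p",
        "--bandwidth", str(bandwidth),
        "--postcopy-bandwidth", str(bandwidth),
        "--migrateuri", f"tcp://{target_ip}:{port}",
    ]
    if resume:
        # Recovering a failed post-copy migration reuses the same arguments.
        argv.append("--postcopy-resume")
    return argv


# 192.0.2.10 is a placeholder documentation address, not a real host.
initial = build_migrate_argv("vm1", "192.0.2.10")
resumed = build_migrate_argv("vm1", "192.0.2.10", resume=True)
print(shlex.join(initial))
print(shlex.join(resumed))
```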
Verified with libvirt-9.0.0-4.el9.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (libvirt bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2023:2171
Created attachment 1899963: libvirt and qemu log
Description of problem:
Trigger a vm I/O error while postcopy migration is paused, then try to recover the postcopy migration. Recovery fails with:
error: Requested operation is not valid: migration of domain uefi-5 is not in post-copy phase
Version-Release number of selected component (if applicable):
libvirt-8.5.0-2.el9.x86_64
qemu-kvm-7.0.0-9.el9.x86_64
How reproducible:
100%
Steps to Reproduce:
1. Start vm
2. Migrate vm to target host, and switch to postcopy
# virsh migrate uefi-5 qemu+tcp://******/system --live --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3 --migrateuri tcp://******:49153 --p2p
# virsh migrate-postcopy uefi-5
3. Abort migration:
# virsh domjobabort uefi-5 --postcopy
4. Trigger an I/O error in the vm.
For example: change ownership of vm to root:root, then do disk I/O in vm.
[target host]# virsh domstate uefi-5 --reason
paused (I/O error)
[src host]# virsh domstate uefi-5 --reason
paused (post-copy failed)
5. Try to recover postcopy migration
# virsh migrate uefi-5 qemu+tcp://******/system --live --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3 --migrateuri tcp://******:49153 --p2p --postcopy-resume
error: Requested operation is not valid: migration of domain uefi-5 is not in post-copy phase
Actual results:
As in step 5, postcopy recovery failed.
Expected results:
Step 5 succeeds.
Additional info:
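The two `virsh domstate --reason` outputs in step 4 show the core of the problem: on the target host the reported reason is the I/O error, so the failed post-copy migration is no longer visible there. A minimal sketch for splitting such output into state and reason is shown below; the function name `parse_domstate` is hypothetical, and the `state (reason)` line format is assumed from the samples in this report rather than from any documented guarantee.

```python
import re


def parse_domstate(line):
    """Split a 'state (reason)' line into a (state, reason) tuple."""
    match = re.fullmatch(r"(.+?) \((.+)\)", line.strip())
    if match is None:
        raise ValueError(f"unexpected domstate output: {line!r}")
    return match.group(1), match.group(2)


# Sample lines copied from this report.
target = parse_domstate("paused (I/O error)")
source = parse_domstate("paused (post-copy failed)")

# The reason on the target no longer mentions post-copy at all.
print(target, "post-copy" in target[1])
print(source, "post-copy" in source[1])
```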