Bug 2107817
| Summary: | Postcopy-recover failed when abort and recover postcopy migration with multifd enabled | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Fangge Jin <fjin> | ||||
| Component: | qemu-kvm | Assignee: | Peter Xu <peterx> | ||||
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> | ||||
| Status: | ASSIGNED --- | Docs Contact: | |||||
| Severity: | medium | ||||||
| Priority: | low | CC: | chayang, coli, jinzhao, juzhang, lcheng, leobras, nilal, peterx, quintela, virt-maint, xiaohli, yafu | ||||
| Version: | 9.1 | Keywords: | Triaged | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | Type: | Bug | |||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
I could also reproduce this bug on qemu-kvm-7.0.0-11.el9.x86_64. After the migration is recovered, it switches from postcopy-active to postcopy-paused automatically in about 30 seconds, and can't be recovered:

1) src hmp log:
(qemu) 2022-08-29T06:42:09.462011Z qemu-kvm: failed to save SaveStateEntry with id(name): 1(ram): -5
2022-08-29T06:42:09.462082Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

2) dst hmp log:
(qemu) 2022-08-29T06:42:09.461417Z qemu-kvm: postcopy_place_page: File exists copy host: 0x7f8277e00000 from: 0x7f827000cc18 (size: 4096)
2022-08-29T06:42:09.462035Z qemu-kvm: error while loading state section id 1(ram)
2022-08-29T06:42:09.462585Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

I also reported one postcopy + multifd bug; I don't know whether they're related:
Bug 2106726 - Qemu on destination host crashed if migrate with postcopy and multifd enabled

Nothing obvious found in the logs. Let's mark this as low priority for now. Since I have some other things to look at today (and tomorrow is a public holiday here), I may need to do that next week, sorry. But I'll update when I find something.

I didn't verify, but from a quick check of the code, it seems we'll try to rebuild multifd channels too for postcopy recovery on dest qemu (migration_ioc_process_incoming()); however, that won't work because src doesn't do that (migrate_fd_connect()).

Instead of fixing it, I'm wondering whether we should just disable postcopy with multifd completely, because I can't think of a case where it would be useful, especially before the two are fully supported together. We seem to be tackling more than one issue here, but probably no one is using it.

Xiaohui/Fangge, do you have a use case where enabling these two features together helps in any form?

Hi Peter, I didn't add postcopy + multifd test cases to the migration test plan. I realized we should have supported postcopy + multifd from RHEL 9.1.0, as Juan told me.
So I tested the basic postcopy + multifd migration, and reported the following bug:
Bug 2106726 - Qemu on destination host crashed if migrate with postcopy and multifd enabled
Created attachment 1897633
libvirt log

Description of problem:
Do postcopy migration with multifd enabled -> abort migration -> try to recover migration. Recovery failed.

Version-Release number of selected component (if applicable):
libvirt-8.5.0-1.el9.x86_64
qemu-kvm-7.0.0-8.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a vm
2. Do migration with multifd enabled:
[T1]# virsh migrate vm1 qemu+tcp://dell-per750-04******/system --live --postcopy --bandwidth 10 --auto-converge --p2p --parallel
3. Switch to postcopy before migration completes:
[T2] # virsh migrate-postcopy vm1
4. Abort migration before it completes:
[T2] # virsh domjobabort vm1 --postcopy
[T1] error: operation failed: job 'migration in' failed in post-copy phase
5. Try to recover migration:
[T1]# virsh migrate vm1 qemu+tcp://dell-per750-04******/system --live --postcopy --bandwidth 10 --auto-converge --p2p --parallel --postcopy-resume
error: operation failed: job 'migration in' failed in post-copy phase
[T1]# virsh migrate vm1 qemu+tcp://dell-per750-04******/system --live --postcopy --bandwidth 10 --auto-converge --p2p --parallel --postcopy-resume
error: operation failed: job 'migration in' failed in post-copy phase

Actual results:
As in step 5, postcopy recovery failed.

Expected results:
Postcopy recovery succeeds.

Additional info:
1. Src qemu log:
2022-07-16 10:00:43.147+0000: initiating migration
2022-07-16T10:00:47.810822Z qemu-kvm: Unable to write to socket: Broken pipe
2022-07-16T10:00:47.810955Z qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram): -5
2022-07-16T10:00:47.810995Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16 10:01:02.331+0000: resuming migration
2022-07-16T10:01:02.335987Z qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram): -5
2022-07-16T10:01:02.336050Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16 10:01:04.629+0000: resuming migration
2022-07-16T10:01:04.633004Z qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram): -5
2022-07-16T10:01:04.633047Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

2. Dest qemu log:
2022-07-16T10:00:47.827701Z qemu-kvm: error while loading state section id 2(ram)
2022-07-16T10:00:47.827770Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16T10:01:02.345350Z qemu-kvm: postcopy_place_page: File exists copy host: 0x7f8fa7e00000 from: 0x5569d747c1ec (size: 4096)
2022-07-16T10:01:02.345363Z qemu-kvm: error while loading state section id 2(ram)
2022-07-16T10:01:02.345400Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16T10:01:04.642370Z qemu-kvm: postcopy_place_page: File exists copy host: 0x7f8fa7e00000 from: 0x5569d747c1ec (size: 4096)
2022-07-16T10:01:04.642377Z qemu-kvm: error while loading state section id 2(ram)
2022-07-16T10:01:04.642406Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

3. Src qemu monitor command:
- {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":true},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":true},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":true},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":true},{"capability":"dirty-bitmaps","state":false},{"capability":"return-path","state":true},{"capability":"zero-copy-send","state":false}]},"id":"libvirt-408"}
- {"execute":"migrate-set-parameters","arguments":{"tls-creds":"","tls-hostname":"","max-bandwidth":10485760},"id":"libvirt-409"}
- {"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"resume":false,"uri":"tcp:dell-per750-04.******:49152"},"id":"libvirt-410"}
- {"execute":"migrate-start-postcopy","id":"libvirt-411"}
- {"execute":"migrate-continue","arguments":{"state":"pre-switchover"},"id":"libvirt-413"}
- {"execute":"migrate-pause","id":"libvirt-415"}
- {"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"resume":true,"uri":"tcp:dell-per750-04.******:49153"},"id":"libvirt-417"}
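
For anyone triaging similar logs, here is a minimal sketch (not QEMU or libvirt code) that checks whether a logged migrate-set-capabilities command enables both postcopy-ram and multifd, the combination this bug is about. It assumes the JSON shape shown in the monitor commands above; the function names and the abbreviated example payload are hypothetical, for illustration only.

```python
import json

# Capability names exactly as they appear in the migrate-set-capabilities
# command logged above.
COMBINATION = {"postcopy-ram", "multifd"}


def enabled_capabilities(qmp_command):
    """Return the set of capability names that a migrate-set-capabilities
    command sets to true. Any other command yields an empty set."""
    msg = json.loads(qmp_command)
    if msg.get("execute") != "migrate-set-capabilities":
        return set()
    caps = msg.get("arguments", {}).get("capabilities", [])
    return {c["capability"] for c in caps if c.get("state")}


def uses_postcopy_with_multifd(qmp_command):
    """True if the command enables both postcopy-ram and multifd."""
    return COMBINATION <= enabled_capabilities(qmp_command)


# Hypothetical, abbreviated version of the libvirt-408 command above.
example = json.dumps({
    "execute": "migrate-set-capabilities",
    "arguments": {"capabilities": [
        {"capability": "postcopy-ram", "state": True},
        {"capability": "multifd", "state": True},
        {"capability": "xbzrle", "state": False},
    ]},
    "id": "libvirt-408",
})

if __name__ == "__main__":
    print(uses_postcopy_with_multifd(example))
```

Running the same check over each line of a saved monitor log would flag migrations that were started with both capabilities turned on.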