Bug 2107817

Summary: Postcopy-recover failed when abort and recover postcopy migration with multifd enabled
Product: Red Hat Enterprise Linux 9 Reporter: Fangge Jin <fjin>
Component: qemu-kvmAssignee: Peter Xu <peterx>
qemu-kvm sub component: Live Migration QA Contact: Li Xiaohui <xiaohli>
Status: ASSIGNED --- Docs Contact:
Severity: medium    
Priority: low CC: chayang, coli, jinzhao, juzhang, lcheng, leobras, nilal, peterx, quintela, virt-maint, xiaohli, yafu
Version: 9.1Keywords: Triaged
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
libvirt log none

Description Fangge Jin 2022-07-16 10:14:11 UTC
Created attachment 1897633 [details]
libvirt log

Description of problem:
Do postcopy migration with multifd enabled -> abort migration -> try to recover migration. Recover failed.

Version-Release number of selected component (if applicable):
libvirt-8.5.0-1.el9.x86_64
qemu-kvm-7.0.0-8.el9.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Start a vm

2. Do migration with multifd enabled
   [T1]# virsh migrate vm1 qemu+tcp://dell-per750-04******/system --live --postcopy --bandwidth 10 --auto-converge  --p2p --parallel

3. Switch to postcopy before migration completes:
   [T2] # virsh migrate-postcopy vm1

4. Abort migration before it completes:
   [T2] # virsh domjobabort vm1 --postcopy
   [T1] error: operation failed: job 'migration in' failed in post-copy phase


5. Try to recover migration:
   [T1]# virsh migrate vm1 qemu+tcp://dell-per750-04******/system --live --postcopy --bandwidth 10 --auto-converge  --p2p --parallel --postcopy-resume
error: operation failed: job 'migration in' failed in post-copy phase

   [T1]# virsh migrate vm1 qemu+tcp://dell-per750-04******/system --live --postcopy --bandwidth 10 --auto-converge  --p2p --parallel --postcopy-resume
error: operation failed: job 'migration in' failed in post-copy phase


Actual results:
As step5, postcopy recover failed


Expected results:
postcopy recover succeeds.

Additional info:
1. Src qemu log:
2022-07-16 10:00:43.147+0000: initiating migration
2022-07-16T10:00:47.810822Z qemu-kvm: Unable to write to socket: Broken pipe
2022-07-16T10:00:47.810955Z qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram): -5
2022-07-16T10:00:47.810995Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16 10:01:02.331+0000: resuming migration
2022-07-16T10:01:02.335987Z qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram): -5
2022-07-16T10:01:02.336050Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16 10:01:04.629+0000: resuming migration
2022-07-16T10:01:04.633004Z qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram): -5
2022-07-16T10:01:04.633047Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

2. Dest qemu log:
2022-07-16T10:00:47.827701Z qemu-kvm: error while loading state section id 2(ram)
2022-07-16T10:00:47.827770Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16T10:01:02.345350Z qemu-kvm: postcopy_place_page: File exists copy host: 0x7f8fa7e00000 from: 0x5569d747c1ec (size: 4096)
2022-07-16T10:01:02.345363Z qemu-kvm: error while loading state section id 2(ram)
2022-07-16T10:01:02.345400Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16T10:01:04.642370Z qemu-kvm: postcopy_place_page: File exists copy host: 0x7f8fa7e00000 from: 0x5569d747c1ec (size: 4096)
2022-07-16T10:01:04.642377Z qemu-kvm: error while loading state section id 2(ram)
2022-07-16T10:01:04.642406Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

3. Src qemu monitor command:
- {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":true},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":true},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":true},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":true},{"capability":"dirty-bitmaps","state":false},{"capability":"return-path","state":true},{"capability":"zero-copy-send","state":false}]},"id":"libvirt-408"}
- {"execute":"migrate-set-parameters","arguments":{"tls-creds":"","tls-hostname":"","max-bandwidth":10485760},"id":"libvirt-409"}
- {"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"resume":false,"uri":"tcp:dell-per750-04.******:49152"},"id":"libvirt-410"}
- {"execute":"migrate-start-postcopy","id":"libvirt-411"}
- {"execute":"migrate-continue","arguments":{"state":"pre-switchover"},"id":"libvirt-413"}
- {"execute":"migrate-pause","id":"libvirt-415"}
- {"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"resume":true,"uri":"tcp:dell-per750-04.******:49153"},"id":"libvirt-417"}

Comment 1 Li Xiaohui 2022-08-29 06:56:07 UTC
I could also reproduce this bug on qemu-kvm-7.0.0-11.el9.x86_64.

After recovery the migration, migration will switch from postcopy-active to postcopy-paused by automatically in about 30 seconds, and can't be recovered:
1) src hmp log:
(qemu) 2022-08-29T06:42:09.462011Z qemu-kvm: failed to save SaveStateEntry with id(name): 1(ram): -5
2022-08-29T06:42:09.462082Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

2) dst hmp log:
(qemu) 2022-08-29T06:42:09.461417Z qemu-kvm: postcopy_place_page: File exists copy host: 0x7f8277e00000 from: 0x7f827000cc18 (size: 4096)
2022-08-29T06:42:09.462035Z qemu-kvm: error while loading state section id 1(ram)
2022-08-29T06:42:09.462585Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

Comment 2 Li Xiaohui 2022-08-29 07:03:13 UTC
I also reported one postcopy + multifd bug, I don't know whether they're releated:
Bug 2106726 - Qemu on destination host crashed if migrate with postcopy and multifd enabled

Comment 5 Peter Xu 2022-11-10 16:14:42 UTC
No immediate thing found from the logs.

Hopefully let's mark this as low priority.  Since I have some other things to look at for today (and tomorrow public holiday here..) I may need to do that next week, sorry.  But I'll update when I found something.

Comment 6 Peter Xu 2022-11-14 20:27:00 UTC
I didn't verify, but quickly checking the code, it seems we'll try to rebuild multifd channels too for postcopy recovery on dest qemu (migration_ioc_process_incoming()), however that won't work because src doesn't do that (migrate_fd_connect()).

Instead of fixing it, I'm thinking whether we should just disable postcopy with multifd completely because I can't think of anything that it could be useful, especially before a full support of them.  We seem to be tackling with more than one issues for this but probably no one is using it.

Xiaohui/Fangge, do you have use case that enabling these two features can help in any form?

Comment 7 Li Xiaohui 2022-11-15 04:37:21 UTC
Hi Peter,
I didn't add postcopy + multifd test cases in migration test plan. I realized we should have supported postcopy + multifd from RHEL 9.1.0 which Juan told me.
So I tested the basic postcopy + multifd migration, and reported the following bug:
Bug 2106726 - Qemu on destination host crashed if migrate with postcopy and multifd enabled