Bug 2107817 - Postcopy-recover failed when abort and recover postcopy migration with multifd enabled
Summary: Postcopy-recover failed when abort and recover postcopy migration with multif...
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.1
Hardware: Unspecified
OS: Unspecified
low
medium
Target Milestone: rc
: ---
Assignee: Peter Xu
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-07-16 10:14 UTC by Fangge Jin
Modified: 2023-08-08 07:02 UTC (History)
12 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
libvirt log (218.18 KB, application/x-bzip)
2022-07-16 10:14 UTC, Fangge Jin
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-127934 0 None None None 2022-07-16 10:14:50 UTC

Description Fangge Jin 2022-07-16 10:14:11 UTC
Created attachment 1897633 [details]
libvirt log

Description of problem:
Do postcopy migration with multifd enabled -> abort migration -> try to recover migration. Recover failed.

Version-Release number of selected component (if applicable):
libvirt-8.5.0-1.el9.x86_64
qemu-kvm-7.0.0-8.el9.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Start a vm

2. Do migration with multifd enabled
   [T1]# virsh migrate vm1 qemu+tcp://dell-per750-04******/system --live --postcopy --bandwidth 10 --auto-converge  --p2p --parallel

3. Switch to postcopy before migration completes:
   [T2] # virsh migrate-postcopy vm1

4. Abort migration before it completes:
   [T2] # virsh domjobabort vm1 --postcopy
   [T1] error: operation failed: job 'migration in' failed in post-copy phase


5. Try to recover migration:
   [T1]# virsh migrate vm1 qemu+tcp://dell-per750-04******/system --live --postcopy --bandwidth 10 --auto-converge  --p2p --parallel --postcopy-resume
error: operation failed: job 'migration in' failed in post-copy phase

   [T1]# virsh migrate vm1 qemu+tcp://dell-per750-04******/system --live --postcopy --bandwidth 10 --auto-converge  --p2p --parallel --postcopy-resume
error: operation failed: job 'migration in' failed in post-copy phase


Actual results:
As step5, postcopy recover failed


Expected results:
postcopy recover succeeds.

Additional info:
1. Src qemu log:
2022-07-16 10:00:43.147+0000: initiating migration
2022-07-16T10:00:47.810822Z qemu-kvm: Unable to write to socket: Broken pipe
2022-07-16T10:00:47.810955Z qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram): -5
2022-07-16T10:00:47.810995Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16 10:01:02.331+0000: resuming migration
2022-07-16T10:01:02.335987Z qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram): -5
2022-07-16T10:01:02.336050Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16 10:01:04.629+0000: resuming migration
2022-07-16T10:01:04.633004Z qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram): -5
2022-07-16T10:01:04.633047Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

2. Dest qemu log:
2022-07-16T10:00:47.827701Z qemu-kvm: error while loading state section id 2(ram)
2022-07-16T10:00:47.827770Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16T10:01:02.345350Z qemu-kvm: postcopy_place_page: File exists copy host: 0x7f8fa7e00000 from: 0x5569d747c1ec (size: 4096)
2022-07-16T10:01:02.345363Z qemu-kvm: error while loading state section id 2(ram)
2022-07-16T10:01:02.345400Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
2022-07-16T10:01:04.642370Z qemu-kvm: postcopy_place_page: File exists copy host: 0x7f8fa7e00000 from: 0x5569d747c1ec (size: 4096)
2022-07-16T10:01:04.642377Z qemu-kvm: error while loading state section id 2(ram)
2022-07-16T10:01:04.642406Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

3. Src qemu monitor command:
- {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":true},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":true},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":true},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":true},{"capability":"dirty-bitmaps","state":false},{"capability":"return-path","state":true},{"capability":"zero-copy-send","state":false}]},"id":"libvirt-408"}
- {"execute":"migrate-set-parameters","arguments":{"tls-creds":"","tls-hostname":"","max-bandwidth":10485760},"id":"libvirt-409"}
- {"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"resume":false,"uri":"tcp:dell-per750-04.******:49152"},"id":"libvirt-410"}
- {"execute":"migrate-start-postcopy","id":"libvirt-411"}
- {"execute":"migrate-continue","arguments":{"state":"pre-switchover"},"id":"libvirt-413"}
- {"execute":"migrate-pause","id":"libvirt-415"}
- {"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"resume":true,"uri":"tcp:dell-per750-04.******:49153"},"id":"libvirt-417"}

Comment 1 Li Xiaohui 2022-08-29 06:56:07 UTC
I could also reproduce this bug on qemu-kvm-7.0.0-11.el9.x86_64.

After recovery the migration, migration will switch from postcopy-active to postcopy-paused by automatically in about 30 seconds, and can't be recovered:
1) src hmp log:
(qemu) 2022-08-29T06:42:09.462011Z qemu-kvm: failed to save SaveStateEntry with id(name): 1(ram): -5
2022-08-29T06:42:09.462082Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

2) dst hmp log:
(qemu) 2022-08-29T06:42:09.461417Z qemu-kvm: postcopy_place_page: File exists copy host: 0x7f8277e00000 from: 0x7f827000cc18 (size: 4096)
2022-08-29T06:42:09.462035Z qemu-kvm: error while loading state section id 1(ram)
2022-08-29T06:42:09.462585Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

Comment 2 Li Xiaohui 2022-08-29 07:03:13 UTC
I also reported one postcopy + multifd bug, I don't know whether they're releated:
Bug 2106726 - Qemu on destination host crashed if migrate with postcopy and multifd enabled

Comment 5 Peter Xu 2022-11-10 16:14:42 UTC
No immediate thing found from the logs.

Hopefully let's mark this as low priority.  Since I have some other things to look at for today (and tomorrow public holiday here..) I may need to do that next week, sorry.  But I'll update when I found something.

Comment 6 Peter Xu 2022-11-14 20:27:00 UTC
I didn't verify, but quickly checking the code, it seems we'll try to rebuild multifd channels too for postcopy recovery on dest qemu (migration_ioc_process_incoming()), however that won't work because src doesn't do that (migrate_fd_connect()).

Instead of fixing it, I'm thinking whether we should just disable postcopy with multifd completely because I can't think of anything that it could be useful, especially before a full support of them.  We seem to be tackling with more than one issues for this but probably no one is using it.

Xiaohui/Fangge, do you have use case that enabling these two features can help in any form?

Comment 7 Li Xiaohui 2022-11-15 04:37:21 UTC
Hi Peter,
I didn't add postcopy + multifd test cases in migration test plan. I realized we should have supported postcopy + multifd from RHEL 9.1.0 which Juan told me.
So I tested the basic postcopy + multifd migration, and reported the following bug:
Bug 2106726 - Qemu on destination host crashed if migrate with postcopy and multifd enabled


Note You need to log in before you can comment on or make changes to this bug.