Bug 2111332
| Summary: | Recovering postcopy before network issue is resolved leads to wrong qemu migration status | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Fangge Jin <fjin> |
| Component: | qemu-kvm | Assignee: | Peter Xu <peterx> |
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
| Status: | CLOSED MIGRATED | Docs Contact: | |
| Severity: | medium | Priority: | low |
| CC: | chayang, gveitmic, jinzhao, lcheng, lersek, nilal, peterx, virt-maint, xiaohli, yafu, ymankad | | |
| Version: | 9.1 | Keywords: | CustomerScenariosInitiative, MigratedToJIRA, Triaged |
| Target Milestone: | rc | Target Release: | --- |
| Hardware: | x86_64 | OS: | Linux |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-09-22 13:46:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | libvirt and qemu log (attachment 1899573) | | |
Hi Nitesh, according to Germano's reply in Comment 1, can we fix this bug in RHEL 9.3.0? And can we also backport it to RHEL 9.2.z?

Sorry, I'm on public holiday from Apr 28 to May 3. I will try to reproduce this bug and post the result here.

(In reply to Li Xiaohui from comment #8)
> Sorry, I'm on public holiday from Apr 28 to May 3.
>
> I will try to reproduce this bug and post the result here.

No worries, we knew you would run it when you return. :) Please take your time.

Hi Fangge,
When I try to reproduce with libvirt, I get an error during migration:

[root@hp-dl385g10-14 home]# virsh migrate rhel930 qemu+unix:///system?socket=/tmp/sock --live --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3 --migrateuri unix:///tmp/33333-sock
error: Failed to connect socket to '/tmp/33333-sock': Permission denied

Can you help?

(In reply to Li Xiaohui from comment #10)
> When I try to reproduce with libvirt, I get an error during migration:
> error: Failed to connect socket to '/tmp/33333-sock': Permission denied
>
> Can you help?

You can set SELinux to Permissive mode and try again.
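If the "Permission denied" is indeed an SELinux denial, a minimal sketch of the suggested workaround (the migration command is copied from the comment above; switching to Permissive mode is a temporary diagnostic step, not a production fix):

    # Check the current SELinux mode; Enforcing can block QEMU from
    # connecting to an unlabeled socket under /tmp.
    getenforce

    # Temporarily switch to Permissive mode ("setenforce 1" restores
    # enforcement), then retry the migration.
    setenforce 0
    virsh migrate rhel930 'qemu+unix:///system?socket=/tmp/sock' \
        --live --postcopy --undefinesource --persistent \
        --bandwidth 3 --postcopy-bandwidth 3 \
        --migrateuri unix:///tmp/33333-sock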
Since when has "recovering postcopy" been a thing?

I've just made a serious attempt to read "docs/devel/migration.rst" in the upstream QEMU tree @ 2a6ae6915454, and it says:
> 'Postcopy' migration is a way to deal with migrations that refuse to converge
> (or take too long to converge) its plus side is that there is an upper bound on
> the amount of migration traffic and time it takes, the down side is that during
> the postcopy phase, a failure of *either* side or the network connection causes
> the guest to be lost.
It comes from 2bfdd1c8a6ac ("Add postcopy documentation", 2015-11-10).
"Recovering postcopy" (e.g. after network failure) is not consistent with the statement that the guest is lost.
If a separate "postcopy recovery" feature has been introduced in the meantime, then the developers missed updating the documentation.
What is the (main) RHBZ for "postcopy recovery"?
Related upstream commit: a688d2c1abc7 ("migration: new postcopy-pause state", 2018-05-15). The containing series is probably the one that should have updated the documentation ("guest to be lost"). I *think* this might be fixed once I fix bug 2018404, so I'm setting a dependency accordingly.

(In reply to Laszlo Ersek from comment #13)
> The containing series is probably the one that should have updated the
> documentation ("guest to be lost").

This is my fault (I'll also blame the maintainers for not reminding me :), but yeah, 99% mine). I was trying to remedy that with this series, posted just a few days ago (please feel free to review):

https://lore.kernel.org/all/20230627200222.557529-1-peterx@redhat.com/

We used to have a simple wiki page: https://wiki.qemu.org/Features/PostcopyRecovery

Hi Nitesh, since this bug has customer impact, do we plan to fix it in RHEL 9.3.0? If so, please help set the ITR.

I tried to reproduce this bug in May, but failed to reproduce it through QEMU; it can easily be reproduced through libvirt. I will keep looking for the difference between libvirt and QEMU.

(In reply to Li Xiaohui from comment #16)
> I tried to reproduce this bug in May, but failed to reproduce it through
> QEMU; it can easily be reproduced through libvirt.

Hi Xiaohui, from Laszlo's comment, the fix for Bug 2018404 should also fix this issue. However, I don't know if the commits have been merged upstream yet. Let's check this with Peter.

Peter, do you think we can do this in 9.3? Regarding the customer impact: since this is low priority, we should not rush the changes if they are risky.

(In reply to Nitesh Narayan Lal from comment #17)
> (In reply to Li Xiaohui from comment #16)
> > I will keep looking for the difference between libvirt and QEMU.

Thanks Xiaohui, that'll be very helpful.

> Hi Xiaohui, from Laszlo's comment, the fix for Bug 2018404 should also fix
> this issue. However, I don't know if the commits have been merged upstream
> yet.

It's not landed, and I am not sure this is the same issue. Laszlo, did I miss something, though?

> Peter, do you think we can do this in 9.3?

Given the low priority (which I agree with), I'd say we can wait for Xiaohui's result and just postpone it if it won't make it.

In comment#14 I still thought that being stuck in the postcopy-paused state, due for example to an unreachable IP address, was the problem underlying bug 2018404. Back then my idea was that we should turn that into a hard failure. That would fix the QMP status report (which depended on the "failed" state), and potentially this bug as well -- I thought the "QEMU reports migration is still running" error here was related to the *lost* vmstate change. Meanwhile, though, Peter explained under bug 2018404 that we don't want to fail migration permanently; instead, the socket status/error reporting via QMP should be extended to the "postcopy-paused" state (IIUC).

In that case, the dependency I set here with my comment#14 seems incorrect indeed, so I'm reverting it now.
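For context on the state reporting discussed above: the migration state is directly visible over QMP. A minimal probe sketch, assuming the source QEMU's monitor is exposed on a hypothetical Unix socket /tmp/qmp-src.sock:

    # Negotiate QMP capabilities, then query the migration status; the
    # short sleep keeps the connection open long enough for socat to
    # print the replies.
    { printf '%s\n' \
          '{"execute":"qmp_capabilities"}' \
          '{"execute":"query-migrate"}'
      sleep 1
    } | socat - unix-connect:/tmp/qmp-src.sock

After a network break during postcopy, "status" in the reply reads "postcopy-paused" rather than "failed", which is exactly the distinction the comments above hinge on.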
Hi all, it seems I have reproduced this bug on the QEMU side.

Hosts info: kernel-5.14.0-332.el9.x86_64 && qemu-kvm-8.0.0-11.el9.x86_64
Guest info: kernel-5.14.0-354.el9.x86_64

Test steps:

1. Boot a VM on the src host, and boot a VM with '-incoming defer' appended on the dst host.

2. dst host:
   socat tcp-listen:22222,reuseaddr,fork unix:/var/run/libvirt/virtqemud-sock
   socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock

3. src host:
   socat unix-listen:/tmp/sock,reuseaddr,fork tcp:$dst_host_ip:22222
   socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:$dst_host_ip:33333

4. Enable the postcopy capability on the src and dst hosts:
   {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":true}]}}

5. Set max-bandwidth and max-postcopy-bandwidth to 3M on the src host:
   {"execute":"migrate-set-parameters","arguments":{"max-bandwidth": 3145728}}
   {"execute":"migrate-set-parameters","arguments":{"max-postcopy-bandwidth": 3145728}}

6. Start postcopy migration:
   (1) Set migrate incoming on the dst host:
       {"execute":"migrate-incoming","arguments":{"uri":"unix:/tmp/33333-sock"},"id":"libvirt-419"}
   (2) Start migration on the src host:
       {"execute":"migrate", "arguments":{"uri":"unix:/tmp/33333-sock"}}
   (3) Switch to postcopy while the migration is active:
       {"execute":"migrate-start-postcopy"}

7. While postcopy is active, break the proxy on the dst host:
   [root@dell-per7525-26 ~]# socat tcp-listen:33334,reuseaddr,fork unix:/tmp/33334-sock
   Note: after step 7, postcopy switches to pause status on both the src and dst hosts.

8. Before fixing the proxy, recover postcopy:
   (1) On the dst host:
       {"exec-oob":"migrate-recover","arguments":{"uri":"unix:/tmp/33333-sock"},"id":"libvirt-427"}
   (2) On the src host:
       {"execute":"migrate","arguments":{"uri":"unix:/tmp/33333-sock", "resume":true}}

   After step 8, postcopy is still in pause status on the dst host, but on the src host it is in postcopy-recover status:
   a. dst host:
      {"execute":"query-migrate"}
      {"return": {"status": "postcopy-paused", "socket-address": [{"path": "/tmp/33334-sock", "type": "unix"}]}}
   b. src host:
      {"execute":"query-migrate"}
      {"return": {"status": "postcopy-recover", "setup-time": 215, "downtime": 250, "total-time": 966610, "ram": {"total": 429518430208, "postcopy-requests": 3977, "dirty-sync-count": 2, "multifd-bytes": 0, "pages-per-second": 1387711, "downtime-bytes": 0, "page-size": 4096, "remaining": 426783657984, "postcopy-bytes": 658631688, "mbps": 950.10299999999995, "transferred": 672222633, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 13590945, "duplicate": 504980, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 666374144, "normal": 162689}}}

9. Fix the proxy:
   (1) On the dst host:
       [root@dell-per7525-26 ~]# socat tcp-listen:33334,reuseaddr,fork unix:/tmp/33334-sock
   (2) On the src host:
       [root@dell-per7525-25 ~]# socat unix-listen:/tmp/33334-sock,reuseaddr,fork tcp:$dst_host_ip:33333

10. Recover postcopy again; it fails when resuming migration on the src host:
    (1) dst host:
        {"exec-oob":"migrate-recover","arguments":{"uri":"unix:/tmp/33333-sock"},"id":"libvirt-427"}
        {"timestamp": {"seconds": 1692705986, "microseconds": 442910}, "event": "MIGRATION", "data": {"status": "setup"}}
        {"return": {}, "id": "libvirt-427"}
    (2) src host:
        {"execute":"migrate","arguments":{"uri":"unix:/tmp/33333-sock", "resume":true}}
        {"error": {"class": "GenericError", "desc": "Cannot resume if there is no paused migration"}}
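For reference, a consolidated sketch of the recovery attempt in step 8, driving both QMP monitors from the shell. The monitor socket paths /tmp/qmp-dst.sock and /tmp/qmp-src.sock are hypothetical stand-ins for however the monitors are actually exposed; the QMP commands themselves are the ones from the steps above:

    # Send one QMP command (after capability negotiation) to a monitor
    # socket and print the replies; the sleep keeps the connection open
    # long enough to collect the response.
    qmp() {  # usage: qmp <monitor-socket> <json-command>
        { printf '%s\n' '{"execute":"qmp_capabilities"}' "$2"
          sleep 1
        } | socat - "unix-connect:$1"
    }

    # Step 8 (1): re-arm the destination for an incoming recovery channel.
    qmp /tmp/qmp-dst.sock \
        '{"execute":"migrate-recover","arguments":{"uri":"unix:/tmp/33333-sock"}}'

    # Step 8 (2): resume the paused migration from the source side.
    qmp /tmp/qmp-src.sock \
        '{"execute":"migrate","arguments":{"uri":"unix:/tmp/33333-sock","resume":true}}'

(The thread uses "exec-oob" for migrate-recover; migrate-recover also accepts a normal in-band "execute", which keeps this sketch simpler.)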
If the proxy is fixed before step 8 in Comment 20, then postcopy recovery succeeds and the VM works well after the postcopy migration.

Thanks for reproducing, Xiaohui!

This failure seems by design so far. The problem is that we only killed the dst proxy process, not the src one, so the src ports are all fine and the src QEMU has no way to notice the breakage, except that it will find the channel blocked on a full send buffer that never gets flushed, stuck in the sendmsg()/write() syscall. If we try to recover via the src's old port (through the old proxy), it should just proceed with no problem (this is what happens after step 8): we then see the src in the RECOVER state (because it does have a connection now), but the dst in PAUSED (because it never got a signal from the src, due to the interrupted dst proxy).

Here the admin sent the wrong command: the admin should not recover the channel before knowing the channel is valid. So the admin is responsible for fixing it by himself/herself. We only need to guarantee that it is fixable without crashing the VM at any point.

Xiaohui, could you try one more step between steps 8 and 9: kill the old src proxy process, to see whether the src QEMU goes back to the PAUSED state (from RECOVER)? That process should be:

$ socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:$dst_host_ip:33333

Then try recovering again using the new channel.

Thanks,
Peter
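A sketch of the extra check Peter asks for: make sure the stale source-side proxy is really gone before retrying the recovery. The pkill pattern below is one plausible way to target only the migration-channel proxy; the socat command lines are the ones from the reproducer:

    # List all running socat proxies with their full command lines.
    pgrep -fa socat

    # Kill only the proxy fronting the migration channel; the monitor
    # forwarding proxy (port 22222) is left alone.
    pkill -f 'unix-listen:/tmp/33333-sock'

    # Confirm it is gone before attempting migrate-recover again.
    pgrep -fa socat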
(In reply to Peter Xu from comment #22)
> Xiaohui, could you try one more step between steps 8 and 9: kill the old
> src proxy process, to see whether the src QEMU goes back to the PAUSED
> state (from RECOVER)?

After step 8, I killed the old src proxy process and waited for more than 30 minutes; the migration status of the src QEMU is still postcopy-recover. So I hit the same issue after steps 9 and 10.

I also found that the src proxy process reports an error after step 8 (2):

******************************************************************************
^C[root@dell-per7525-25 home]# socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:10.73.2.82:33333
2023/08/23 01:00:26 socat[356265] E connect(5, AF=2 10.73.2.82:33333, 16): Connection refused

But if we kill both the src proxy and the dst proxy in step 7:

7. While postcopy is active, break the proxy on the dst and src hosts:
   [root@dell-per7525-26 ~]# socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock
   [root@dell-per7525-25 home]# socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:$dst_host_ip:33333

then after step 8, postcopy is in pause status on both the src and dst hosts, postcopy recovery succeeds, and the VM works well after steps 9 and 10.

(In reply to Li Xiaohui from comment #23)
> After step 8, I killed the old src proxy process and waited for more than
> 30 minutes,

Are you sure the src proxy process is dead for real? E.g. "pgrep -fa socat" should have it gone.

One more thing to try is to trigger "migrate-pause" on the src manually, to kick the non-working channel out. As long as we can get the src out of the RECOVER stage and back to PAUSED for another shot, then I think it's fine.

(In reply to Peter Xu from comment #24)
> Are you sure the src proxy process is dead for real? E.g. "pgrep -fa socat"
> should have it gone.

Yes, I'm sure.

Before killing the src proxy process:
[root@dell-per7525-25 ~]# pgrep -fa socat
378838 socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:10.73.2.82:33333
378841 socat unix-listen:/tmp/sock,reuseaddr,fork tcp:.73.2.82:22222

After that:
[root@dell-per7525-25 ~]# pgrep -fa socat
378841 socat unix-listen:/tmp/sock,reuseaddr,fork tcp:.73.2.82:22222
[root@dell-per7525-25 ~]#

> One more thing to try is to trigger "migrate-pause" on the src manually,
> to kick the non-working channel out.

This also fails:
{"execute":"migrate-pause"}
{"error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}}

(In reply to Li Xiaohui from comment #25)
> Before killing the src proxy process:
> [root@dell-per7525-25 ~]# pgrep -fa socat
> 378838 socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:10.73.2.82:33333
>
> After that:
> [root@dell-per7525-25 ~]# pgrep -fa socat
> 378841 socat unix-listen:/tmp/sock,reuseaddr,fork tcp:.73.2.82:22222

It seems we cannot reliably kick it out of the "recover" stage indeed.
> This also fails:
> {"execute":"migrate-pause"}
> {"error": {"class": "GenericError", "desc": "migrate-pause is currently
> only supported during postcopy-active state"}}

We probably also want to allow migrate-pause to interrupt the RECOVER stage as well (see the sketch after this comment thread). Xiaohui, I'll prepare something soon for you to test; it may need to wait until tomorrow.

So I've prepared a patchset which should be able to fix this; I sent a version here:

[PATCH 0/9] migration: Better error handling in rp thread, allow failures in recover
https://lore.kernel.org/r/20230829214235.69309-1-peterx@redhat.com

Xiaohui, I wanted to build a brew for you, but I found that the backport is not super straightforward. I figured it should be easier to just leave this until the next release (9.4), because the probability of a customer hitting this should be ~0 (postcopy, with one network failure, then another network failure during recovery, or wrong operations); but I do agree this is a real bug, we should fix it, and I'll follow up on the issue upstream.

Changing the ITR to 9.4 to reflect the fact that we don't need it for 9.3. Xiaohui, let me know if you have any concerns.

Thanks!
Peter

(In reply to Peter Xu from comment #27)
> Changing the ITR to 9.4 to reflect the fact that we don't need it for 9.3.
> Xiaohui, let me know if you have any concerns.

No objection. Thank you.

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.
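A sketch of the operator flow the patchset above is intended to enable. The command names are existing QMP commands, but allowing migrate-pause while in postcopy-recover is precisely what the series proposes; released QEMU still rejects it, as shown in the thread. This reuses the hypothetical qmp() helper sketched after Comment 20:

    # Kick the dead channel out of postcopy-recover, dropping the src
    # back to postcopy-paused (rejected by current QEMU outside
    # postcopy-active; the series above would permit it).
    qmp /tmp/qmp-src.sock '{"execute":"migrate-pause"}'

    # Then retry recovery over a known-good channel, as in step 8.
    qmp /tmp/qmp-src.sock \
        '{"execute":"migrate","arguments":{"uri":"unix:/tmp/33333-sock","resume":true}}'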
Created attachment 1899573 [details]
libvirt and qemu log

Description of problem:
Do a postcopy migration with unix+proxy transport and break the proxy; the postcopy migration fails. Trying to recover the migration before the proxy is fixed fails, as expected. Then fix the proxy and try to recover the migration again; it still fails, and reports:
"error: Requested operation is not valid: QEMU reports migration is still running"

Version-Release number of selected component (if applicable):
libvirt-8.5.0-2.el9.x86_64
qemu-kvm-7.0.0-9.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a VM.
2. Set up a proxy between the src and dest hosts:
   1) On the dest host:
   # socat tcp-listen:22222,reuseaddr,fork unix:/var/run/libvirt/virtqemud-sock
   # socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock
   2) On the src host:
   # socat unix-listen:/tmp/sock,reuseaddr,fork tcp:<dest_host>:22222
   # socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:<dest_host>:33333
3. Migrate the VM to the other host with unix transport:
   # virsh migrate uefi qemu+unix:///system?socket=/tmp/sock --live --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3 --migrateuri unix:///tmp/33333-sock
4. Switch the migration to postcopy:
   # virsh migrate-postcopy uefi
5. Break the proxy; the migration fails immediately:
   # socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock
6. Try to recover the postcopy migration; it fails as expected:
   # virsh migrate uefi qemu+unix:///system?socket=/tmp/sock --live --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3 --migrateuri unix:///tmp/33333-sock --postcopy-resume
   error: operation failed: job 'migration in' failed in post-copy phase
7. Fix the proxy.
8. Try to recover the postcopy migration; it fails unexpectedly:
   # virsh migrate uefi qemu+unix:///system?socket=/tmp/sock --live --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3 --migrateuri unix:///tmp/33333-sock --postcopy-resume
   error: Requested operation is not valid: QEMU reports migration is still running
9. Try to abort the migration:
   # virsh domjobabort uefi --postcopy
   error: internal error: unable to execute QEMU command 'migrate-pause': migrate-pause is currently only supported during postcopy-active state

Actual results:
Postcopy recovery fails in step 8.

Expected results:
Postcopy recovery succeeds in step 8.

Additional info:
Cannot reproduce with tcp transport.
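For convenience, a hedged helper consolidating the dest-host proxy handling from the steps above. The script name and its start/break/fix interface are hypothetical; the socat command lines are taken verbatim from the reproducer:

    #!/bin/sh
    # proxy.sh (hypothetical): manage the dest-host proxies for the reproducer.
    # Usage: proxy.sh start|break|fix
    case "$1" in
    start)
        # Step 2.1: proxy the libvirt daemon socket and the migration socket.
        socat tcp-listen:22222,reuseaddr,fork unix:/var/run/libvirt/virtqemud-sock &
        socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock &
        ;;
    break)
        # Step 5: kill only the migration proxy; the control channel on
        # port 22222 stays up, so virsh keeps working.
        pkill -f 'tcp-listen:33333'
        ;;
    fix)
        # Step 7: bring the migration proxy back.
        socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock &
        ;;
    *)
        echo "usage: $0 start|break|fix" >&2
        exit 1
        ;;
    esac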