Bug 2111332
| Summary: | Recovering postcopy before network issue is resolved leads to wrong qemu migration status | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Fangge Jin <fjin> |
| Component: | qemu-kvm | Assignee: | Peter Xu <peterx> |
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
| Status: | CLOSED MIGRATED | Docs Contact: | |
| Severity: | medium | Priority: | low |
| CC: | chayang, gveitmic, jinzhao, lcheng, lersek, nilal, peterx, virt-maint, xiaohli, yafu, ymankad | | |
| Version: | 9.1 | Keywords: | CustomerScenariosInitiative, MigratedToJIRA, Triaged |
| Target Milestone: | rc | Target Release: | --- |
| Hardware: | x86_64 | OS: | Linux |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-09-22 13:46:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | libvirt and qemu log (attachment 1899573) | | |
Hi Nitesh, according to Germano's reply in Comment 1, can we fix this bug in RHEL 9.3.0? And can we also backport it to RHEL 9.2.z?

Sorry, I'm on public holiday from Apr 28 to May 3. I will try to reproduce this bug and post the result here.

(In reply to Li Xiaohui from comment #8)
> Sorry, I'm on public holiday from Apr 28 to May 3.
>
> I will try to reproduce this bug and post the result here.

No worries, we knew you would run it when you return. :) Please take your time.

Hi Fangge,
When I try to reproduce with libvirt, I get an error during migration:

[root@hp-dl385g10-14 home]# virsh migrate rhel930 qemu+unix:///system?socket=/tmp/sock --live --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3 --migrateuri unix:///tmp/33333-sock
error: Failed to connect socket to '/tmp/33333-sock': Permission denied

Can you help?

(In reply to Li Xiaohui from comment #10)
> When I try to reproduce with libvirt, I get an error during migration:
> error: Failed to connect socket to '/tmp/33333-sock': Permission denied
>
> Can you help?

You can set SELinux to Permissive mode and try again.
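If the "Permission denied" is indeed an SELinux denial, a minimal sketch of the suggested workaround (the migration command is copied from the comment above; switching to Permissive mode is a temporary diagnostic step, not a production fix):

    # Check the current SELinux mode; Enforcing can block QEMU from
    # connecting to an unlabeled socket under /tmp.
    getenforce

    # Temporarily switch to Permissive mode ("setenforce 1" restores
    # enforcement), then retry the migration.
    setenforce 0
    virsh migrate rhel930 'qemu+unix:///system?socket=/tmp/sock' \
        --live --postcopy --undefinesource --persistent \
        --bandwidth 3 --postcopy-bandwidth 3 \
        --migrateuri unix:///tmp/33333-sock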
Since when has "recovering postcopy" been a thing?

I've just made a serious attempt to read "docs/devel/migration.rst" in the upstream QEMU tree @ 2a6ae6915454, and it says:
> 'Postcopy' migration is a way to deal with migrations that refuse to converge
> (or take too long to converge) its plus side is that there is an upper bound on
> the amount of migration traffic and time it takes, the down side is that during
> the postcopy phase, a failure of *either* side or the network connection causes
> the guest to be lost.
It comes from 2bfdd1c8a6ac ("Add postcopy documentation", 2015-11-10).
"Recovering postcopy" (e.g. after network failure) is not consistent with the statement that the guest is lost.
If a separate "postcopy recovery" feature has been introduced in the meantime, then the developers missed updating the documentation.
What is the (main) RHBZ for "postcopy recovery"?
Related upstream commit: a688d2c1abc7 ("migration: new postcopy-pause state", 2018-05-15). The containing series is probably the one that should have updated the documentation ("guest to be lost"). I *think* this might be fixed once I fix bug 2018404, so I'm setting a dependency accordingly.

(In reply to Laszlo Ersek from comment #13)
> The containing series is probably the one that should have updated the
> documentation ("guest to be lost").

This is my fault (I'll also blame the maintainers for not reminding me :), but yeah, 99% mine). I was trying to remedy that with this series, posted just a few days ago (please feel free to review):

https://lore.kernel.org/all/20230627200222.557529-1-peterx@redhat.com/

We used to have a simple wiki page: https://wiki.qemu.org/Features/PostcopyRecovery

Hi Nitesh, since this bug has customer impact, do we plan to fix it in RHEL 9.3.0? If so, please help set the ITR.

I tried to reproduce this bug in May, but failed to reproduce it through QEMU; it can easily be reproduced through libvirt. I will keep looking for the difference between libvirt and QEMU.

(In reply to Li Xiaohui from comment #16)
> I tried to reproduce this bug in May, but failed to reproduce it through
> QEMU; it can easily be reproduced through libvirt.

Hi Xiaohui, from Laszlo's comment, the fix for Bug 2018404 should also fix this issue. However, I don't know if the commits have been merged upstream yet. Let's check this with Peter.

Peter, do you think we can do this in 9.3? Regarding the customer impact: since this is low priority, we should not rush the changes if they are risky.

(In reply to Nitesh Narayan Lal from comment #17)
> (In reply to Li Xiaohui from comment #16)
> > I will keep looking for the difference between libvirt and QEMU.

Thanks Xiaohui, that'll be very helpful.

> Hi Xiaohui, from Laszlo's comment, the fix for Bug 2018404 should also fix
> this issue. However, I don't know if the commits have been merged upstream
> yet.

It's not landed, and I am not sure this is the same issue. Laszlo, did I miss something, though?

> Peter, do you think we can do this in 9.3?

Given the low priority (which I agree with), I'd say we can wait for Xiaohui's result and just postpone it if it won't make it.

In comment#14 I still thought that being stuck in the postcopy-paused state, due for example to an unreachable IP address, was the problem underlying bug 2018404. Back then my idea was that we should turn that into a hard failure. That would fix the QMP status report (which depended on the "failed" state), and potentially this bug as well -- I thought the "QEMU reports migration is still running" error here was related to the *lost* vmstate change. Meanwhile, though, Peter explained under bug 2018404 that we don't want to fail migration permanently; instead, the socket status/error reporting via QMP should be extended to the "postcopy-paused" state (IIUC).

In that case, the dependency I set here with my comment#14 seems incorrect indeed, so I'm reverting it now.
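For context on the state reporting discussed above: the migration state is directly visible over QMP. A minimal probe sketch, assuming the source QEMU's monitor is exposed on a hypothetical Unix socket /tmp/qmp-src.sock:

    # Negotiate QMP capabilities, then query the migration status; the
    # short sleep keeps the connection open long enough for socat to
    # print the replies.
    { printf '%s\n' \
          '{"execute":"qmp_capabilities"}' \
          '{"execute":"query-migrate"}'
      sleep 1
    } | socat - unix-connect:/tmp/qmp-src.sock

After a network break during postcopy, "status" in the reply reads "postcopy-paused" rather than "failed", which is exactly the distinction the comments above hinge on.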
Hi all, it seems I have reproduced this bug on the QEMU side.

Hosts info: kernel-5.14.0-332.el9.x86_64 && qemu-kvm-8.0.0-11.el9.x86_64
Guest info: kernel-5.14.0-354.el9.x86_64

Test steps:

1. Boot a VM on the src host, and boot a VM with '-incoming defer' appended on the dst host.

2. dst host:
   socat tcp-listen:22222,reuseaddr,fork unix:/var/run/libvirt/virtqemud-sock
   socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock

3. src host:
   socat unix-listen:/tmp/sock,reuseaddr,fork tcp:$dst_host_ip:22222
   socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:$dst_host_ip:33333

4. Enable the postcopy capability on the src and dst hosts:
   {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":true}]}}

5. Set max-bandwidth and max-postcopy-bandwidth to 3M on the src host:
   {"execute":"migrate-set-parameters","arguments":{"max-bandwidth": 3145728}}
   {"execute":"migrate-set-parameters","arguments":{"max-postcopy-bandwidth": 3145728}}

6. Start postcopy migration:
   (1) Set migrate incoming on the dst host:
       {"execute":"migrate-incoming","arguments":{"uri":"unix:/tmp/33333-sock"},"id":"libvirt-419"}
   (2) Start migration on the src host:
       {"execute":"migrate", "arguments":{"uri":"unix:/tmp/33333-sock"}}
   (3) Switch to postcopy while the migration is active:
       {"execute":"migrate-start-postcopy"}

7. While postcopy is active, break the proxy on the dst host:
   [root@dell-per7525-26 ~]# socat tcp-listen:33334,reuseaddr,fork unix:/tmp/33334-sock
   Note: after step 7, postcopy switches to pause status on both the src and dst hosts.

8. Before fixing the proxy, recover postcopy:
   (1) On the dst host:
       {"exec-oob":"migrate-recover","arguments":{"uri":"unix:/tmp/33333-sock"},"id":"libvirt-427"}
   (2) On the src host:
       {"execute":"migrate","arguments":{"uri":"unix:/tmp/33333-sock", "resume":true}}

   After step 8, postcopy is still in pause status on the dst host, but on the src host it is in postcopy-recover status:
   a. dst host:
      {"execute":"query-migrate"}
      {"return": {"status": "postcopy-paused", "socket-address": [{"path": "/tmp/33334-sock", "type": "unix"}]}}
   b. src host:
      {"execute":"query-migrate"}
      {"return": {"status": "postcopy-recover", "setup-time": 215, "downtime": 250, "total-time": 966610, "ram": {"total": 429518430208, "postcopy-requests": 3977, "dirty-sync-count": 2, "multifd-bytes": 0, "pages-per-second": 1387711, "downtime-bytes": 0, "page-size": 4096, "remaining": 426783657984, "postcopy-bytes": 658631688, "mbps": 950.10299999999995, "transferred": 672222633, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 13590945, "duplicate": 504980, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 666374144, "normal": 162689}}}

9. Fix the proxy:
   (1) On the dst host:
       [root@dell-per7525-26 ~]# socat tcp-listen:33334,reuseaddr,fork unix:/tmp/33334-sock
   (2) On the src host:
       [root@dell-per7525-25 ~]# socat unix-listen:/tmp/33334-sock,reuseaddr,fork tcp:$dst_host_ip:33333

10. Recover postcopy again; it fails when resuming migration on the src host:
    (1) dst host:
        {"exec-oob":"migrate-recover","arguments":{"uri":"unix:/tmp/33333-sock"},"id":"libvirt-427"}
        {"timestamp": {"seconds": 1692705986, "microseconds": 442910}, "event": "MIGRATION", "data": {"status": "setup"}}
        {"return": {}, "id": "libvirt-427"}
    (2) src host:
        {"execute":"migrate","arguments":{"uri":"unix:/tmp/33333-sock", "resume":true}}
        {"error": {"class": "GenericError", "desc": "Cannot resume if there is no paused migration"}}
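For reference, a consolidated sketch of the recovery attempt in step 8, driving both QMP monitors from the shell. The monitor socket paths /tmp/qmp-dst.sock and /tmp/qmp-src.sock are hypothetical stand-ins for however the monitors are actually exposed; the QMP commands themselves are the ones from the steps above:

    # Send one QMP command (after capability negotiation) to a monitor
    # socket and print the replies; the sleep keeps the connection open
    # long enough to collect the response.
    qmp() {  # usage: qmp <monitor-socket> <json-command>
        { printf '%s\n' '{"execute":"qmp_capabilities"}' "$2"
          sleep 1
        } | socat - "unix-connect:$1"
    }

    # Step 8 (1): re-arm the destination for an incoming recovery channel.
    qmp /tmp/qmp-dst.sock \
        '{"execute":"migrate-recover","arguments":{"uri":"unix:/tmp/33333-sock"}}'

    # Step 8 (2): resume the paused migration from the source side.
    qmp /tmp/qmp-src.sock \
        '{"execute":"migrate","arguments":{"uri":"unix:/tmp/33333-sock","resume":true}}'

(The thread uses "exec-oob" for migrate-recover; migrate-recover also accepts a normal in-band "execute", which keeps this sketch simpler.)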
If the proxy is fixed before step 8 in Comment 20, then postcopy recovery succeeds and the VM works well after the postcopy migration.

Thanks for reproducing, Xiaohui!

This failure seems by design so far. The problem is that we only killed the dst proxy process, not the src one, so the src ports are all fine and the src QEMU has no way to notice the breakage, except that it will find the channel blocked on a full send buffer that never gets flushed, stuck in the sendmsg()/write() syscall. If we try to recover via the src's old port (through the old proxy), it should just proceed with no problem (this is what happens after step 8): we then see the src in the RECOVER state (because it does have a connection now), but the dst in PAUSED (because it never got a signal from the src, due to the interrupted dst proxy).

Here the admin sent the wrong command: the admin should not recover the channel before knowing the channel is valid. So the admin is responsible for fixing it by himself/herself. We only need to guarantee that it is fixable without crashing the VM at any point.

Xiaohui, could you try one more step between steps 8 and 9: kill the old src proxy process, to see whether the src QEMU goes back to the PAUSED state (from RECOVER)? That process should be:

$ socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:$dst_host_ip:33333

Then try recovering again using the new channel.

Thanks,
Peter
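A sketch of the extra check Peter asks for: make sure the stale source-side proxy is really gone before retrying the recovery. The pkill pattern below is one plausible way to target only the migration-channel proxy; the socat command lines are the ones from the reproducer:

    # List all running socat proxies with their full command lines.
    pgrep -fa socat

    # Kill only the proxy fronting the migration channel; the monitor
    # forwarding proxy (port 22222) is left alone.
    pkill -f 'unix-listen:/tmp/33333-sock'

    # Confirm it is gone before attempting migrate-recover again.
    pgrep -fa socat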
(In reply to Peter Xu from comment #22)
> Xiaohui, could you try one more step between steps 8 and 9: kill the old
> src proxy process, to see whether the src QEMU goes back to the PAUSED
> state (from RECOVER)?

After step 8, I killed the old src proxy process and waited for more than 30 minutes; the migration status of the src QEMU is still postcopy-recover. So I hit the same issue after steps 9 and 10.

I also found that the src proxy process reports an error after step 8 (2):

******************************************************************************
^C[root@dell-per7525-25 home]# socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:10.73.2.82:33333
2023/08/23 01:00:26 socat[356265] E connect(5, AF=2 10.73.2.82:33333, 16): Connection refused

But if we kill both the src proxy and the dst proxy in step 7:

7. While postcopy is active, break the proxy on the dst and src hosts:
   [root@dell-per7525-26 ~]# socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock
   [root@dell-per7525-25 home]# socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:$dst_host_ip:33333

then after step 8, postcopy is in pause status on both the src and dst hosts, postcopy recovery succeeds, and the VM works well after steps 9 and 10.

(In reply to Li Xiaohui from comment #23)
> After step 8, I killed the old src proxy process and waited for more than
> 30 minutes,

Are you sure the src proxy process is dead for real? E.g. "pgrep -fa socat" should have it gone.

One more thing to try is to trigger "migrate-pause" on the src manually, to kick the non-working channel out. As long as we can get the src out of the RECOVER stage and back to PAUSED for another shot, then I think it's fine.

(In reply to Peter Xu from comment #24)
> Are you sure the src proxy process is dead for real? E.g. "pgrep -fa socat"
> should have it gone.

Yes, I'm sure.

Before killing the src proxy process:
[root@dell-per7525-25 ~]# pgrep -fa socat
378838 socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:10.73.2.82:33333
378841 socat unix-listen:/tmp/sock,reuseaddr,fork tcp:.73.2.82:22222

After that:
[root@dell-per7525-25 ~]# pgrep -fa socat
378841 socat unix-listen:/tmp/sock,reuseaddr,fork tcp:.73.2.82:22222
[root@dell-per7525-25 ~]#

> One more thing to try is to trigger "migrate-pause" on the src manually,
> to kick the non-working channel out.

This also fails:
{"execute":"migrate-pause"}
{"error": {"class": "GenericError", "desc": "migrate-pause is currently only supported during postcopy-active state"}}

(In reply to Li Xiaohui from comment #25)
> Before killing the src proxy process:
> [root@dell-per7525-25 ~]# pgrep -fa socat
> 378838 socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:10.73.2.82:33333
>
> After that:
> [root@dell-per7525-25 ~]# pgrep -fa socat
> 378841 socat unix-listen:/tmp/sock,reuseaddr,fork tcp:.73.2.82:22222

It seems we cannot reliably kick it out of the "recover" stage indeed.
> This also fails:
> {"execute":"migrate-pause"}
> {"error": {"class": "GenericError", "desc": "migrate-pause is currently
> only supported during postcopy-active state"}}

We probably also want to allow migrate-pause to interrupt the RECOVER stage as well (see the sketch after this comment thread). Xiaohui, I'll prepare something soon for you to test; it may need to wait until tomorrow.

So I've prepared a patchset which should be able to fix this; I sent a version here:

[PATCH 0/9] migration: Better error handling in rp thread, allow failures in recover
https://lore.kernel.org/r/20230829214235.69309-1-peterx@redhat.com

Xiaohui, I wanted to build a brew for you, but I found that the backport is not super straightforward. I figured it should be easier to just leave this until the next release (9.4), because the probability of a customer hitting this should be ~0 (postcopy, with one network failure, then another network failure during recovery, or wrong operations); but I do agree this is a real bug, we should fix it, and I'll follow up on the issue upstream.

Changing the ITR to 9.4 to reflect the fact that we don't need it for 9.3. Xiaohui, let me know if you have any concerns.

Thanks!
Peter

(In reply to Peter Xu from comment #27)
> Changing the ITR to 9.4 to reflect the fact that we don't need it for 9.3.
> Xiaohui, let me know if you have any concerns.

No objection. Thank you.

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.
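A sketch of the operator flow the patchset above is intended to enable. The command names are existing QMP commands, but allowing migrate-pause while in postcopy-recover is precisely what the series proposes; released QEMU still rejects it, as shown in the thread. This reuses the hypothetical qmp() helper sketched after Comment 20:

    # Kick the dead channel out of postcopy-recover, dropping the src
    # back to postcopy-paused (rejected by current QEMU outside
    # postcopy-active; the series above would permit it).
    qmp /tmp/qmp-src.sock '{"execute":"migrate-pause"}'

    # Then retry recovery over a known-good channel, as in step 8.
    qmp /tmp/qmp-src.sock \
        '{"execute":"migrate","arguments":{"uri":"unix:/tmp/33333-sock","resume":true}}'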
Created attachment 1899573 [details]
libvirt and qemu log

Description of problem:
Do a postcopy migration with unix+proxy transport and break the proxy; the postcopy migration fails. Trying to recover the migration before the proxy is fixed fails, as expected. Then fix the proxy and try to recover the migration again; it still fails, and reports:
"error: Requested operation is not valid: QEMU reports migration is still running"

Version-Release number of selected component (if applicable):
libvirt-8.5.0-2.el9.x86_64
qemu-kvm-7.0.0-9.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a VM.
2. Set up a proxy between the src and dest hosts:
   1) On the dest host:
   # socat tcp-listen:22222,reuseaddr,fork unix:/var/run/libvirt/virtqemud-sock
   # socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock
   2) On the src host:
   # socat unix-listen:/tmp/sock,reuseaddr,fork tcp:<dest_host>:22222
   # socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:<dest_host>:33333
3. Migrate the VM to the other host with unix transport:
   # virsh migrate uefi qemu+unix:///system?socket=/tmp/sock --live --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3 --migrateuri unix:///tmp/33333-sock
4. Switch the migration to postcopy:
   # virsh migrate-postcopy uefi
5. Break the proxy; the migration fails immediately:
   # socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock
6. Try to recover the postcopy migration; it fails as expected:
   # virsh migrate uefi qemu+unix:///system?socket=/tmp/sock --live --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3 --migrateuri unix:///tmp/33333-sock --postcopy-resume
   error: operation failed: job 'migration in' failed in post-copy phase
7. Fix the proxy.
8. Try to recover the postcopy migration; it fails unexpectedly:
   # virsh migrate uefi qemu+unix:///system?socket=/tmp/sock --live --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3 --migrateuri unix:///tmp/33333-sock --postcopy-resume
   error: Requested operation is not valid: QEMU reports migration is still running
9. Try to abort the migration:
   # virsh domjobabort uefi --postcopy
   error: internal error: unable to execute QEMU command 'migrate-pause': migrate-pause is currently only supported during postcopy-active state

Actual results:
Postcopy recovery fails in step 8.

Expected results:
Postcopy recovery succeeds in step 8.

Additional info:
Cannot reproduce with tcp transport.
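For convenience, a hedged helper consolidating the dest-host proxy handling from the steps above. The script name and its start/break/fix interface are hypothetical; the socat command lines are taken verbatim from the reproducer:

    #!/bin/sh
    # proxy.sh (hypothetical): manage the dest-host proxies for the reproducer.
    # Usage: proxy.sh start|break|fix
    case "$1" in
    start)
        # Step 2.1: proxy the libvirt daemon socket and the migration socket.
        socat tcp-listen:22222,reuseaddr,fork unix:/var/run/libvirt/virtqemud-sock &
        socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock &
        ;;
    break)
        # Step 5: kill only the migration proxy; the control channel on
        # port 22222 stays up, so virsh keeps working.
        pkill -f 'tcp-listen:33333'
        ;;
    fix)
        # Step 7: bring the migration proxy back.
        socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock &
        ;;
    *)
        echo "usage: $0 start|break|fix" >&2
        exit 1
        ;;
    esac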