Bug 2111332 - Recovering postcopy before network issue is resolved leads to wrong qemu migration status [NEEDINFO]
Summary: Recovering postcopy before network issue is resolved leads to wrong qemu migration status
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.1
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Peter Xu
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-27 05:15 UTC by Fangge Jin
Modified: 2023-08-15 12:31 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:
nilal: needinfo? (xiaohli)


Attachments
libvirt and qemu log (166.25 KB, application/x-bzip)
2022-07-27 05:15 UTC, Fangge Jin


Links
Red Hat Issue Tracker RHELPLAN-129181 (last updated 2022-07-27 05:15:53 UTC)

Description Fangge Jin 2022-07-27 05:15:34 UTC
Created attachment 1899573 [details]
libvirt and qemu log

Description of problem:
Do a postcopy migration with unix+proxy transport and break the proxy, so the postcopy migration fails. Trying to recover the migration before the proxy is fixed fails as expected. Then fix the proxy and try to recover the migration again; it still fails and says:
"error: Requested operation is not valid: QEMU reports migration is still running"

Version-Release number of selected component (if applicable):
libvirt-8.5.0-2.el9.x86_64
qemu-kvm-7.0.0-9.el9.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Start a vm

2. Set up proxy between src and dest host
1) On dest host:
   # socat tcp-listen:22222,reuseaddr,fork unix:/var/run/libvirt/virtqemud-sock
   # socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock
2) On src host:
   # socat unix-listen:/tmp/sock,reuseaddr,fork tcp:<dest_host>:22222
   # socat unix-listen:/tmp/33333-sock,reuseaddr,fork tcp:<dest_host>:33333

3. Migrate the vm to the other host with unix transport:
   # virsh migrate uefi qemu+unix:///system?socket=/tmp/sock --live  --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3   --migrateuri unix:///tmp/33333-sock

4. Switch migration to postcopy:
   # virsh migrate-postcopy uefi

5. Break the proxy by stopping the socat process below (on the dest host); migration fails immediately:
   # socat tcp-listen:33333,reuseaddr,fork unix:/tmp/33333-sock

6. Try to recover the postcopy migration; it fails as expected:
   # virsh migrate uefi qemu+unix:///system?socket=/tmp/sock --live  --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3   --migrateuri unix:///tmp/33333-sock --postcopy-resume
   error: operation failed: job 'migration in' failed in post-copy phase

7. Fix the proxy

8. Try to recover the postcopy migration again; it fails unexpectedly:
   # virsh migrate uefi qemu+unix:///system?socket=/tmp/sock --live  --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3   --migrateuri unix:///tmp/33333-sock --postcopy-resume
error: Requested operation is not valid: QEMU reports migration is still running

9. Try to abort the migration:
   # virsh domjobabort uefi --postcopy
   error: internal error: unable to execute QEMU command 'migrate-pause': migrate-pause is currently only supported during postcopy-active state
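
For reference, QEMU's own view of the migration can be checked at this point (assuming the domain is still named uefi; exact output may differ between versions):
   # virsh domstate uefi --reason
   # virsh qemu-monitor-command uefi --pretty '{"execute": "query-migrate"}'
The first command should show the domain paused with a post-copy related reason on the src host; the "status" field in the query-migrate reply is the migration state QEMU itself reports, e.g. "postcopy-paused" or "postcopy-recover".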


Actual results:
postcopy recovery failed in step 8

Expected results:
postcopy recovery succeeds in step 8

Additional info:
Can't reproduce with tcp transport.
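
For comparison, a tcp-transport variant of the same migration (where the problem does not reproduce) would look roughly like this; the port number is only an example:
   # virsh migrate uefi qemu+ssh://<dest_host>/system --live  --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3 --migrateuri tcp://<dest_host>:49152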

Comment 3 Li Xiaohui 2023-04-28 10:51:34 UTC
Hi Nitesh, 
According to Germano's reply in Comment 1, can we fix this bug in RHEL 9.3.0, and also backport it to RHEL 9.2.z?

Comment 8 Li Xiaohui 2023-05-05 11:15:19 UTC
Sorry, I'm on Public Holiday from Apr 28 to May 3.

I would try to reproduce this bug and show the result here

Comment 9 Nitesh Narayan Lal 2023-05-05 14:11:57 UTC
(In reply to Li Xiaohui from comment #8)
> Sorry, I'm on Public Holiday from Apr 28 to May 3.
> 
> I would try to reproduce this bug and show the result here

No worries, we knew you would run it when you return. :)
Please take your time.

Comment 10 Li Xiaohui 2023-05-15 14:00:28 UTC
Hi Fangge, 
When I try to reproduce with libvirt, I get an error during migration:
[root@hp-dl385g10-14 home]# virsh migrate rhel930 qemu+unix:///system?socket=/tmp/sock --live  --postcopy --undefinesource --persistent --bandwidth 3 --postcopy-bandwidth 3   --migrateuri unix:///tmp/33333-sock
error: Failed to connect socket to '/tmp/33333-sock': Permission denied


Can you help?

Comment 11 Fangge Jin 2023-05-16 01:43:08 UTC
(In reply to Li Xiaohui from comment #10)
> Hi Fangge, 
> When I try to reproduce with libvirt, I get an error during migration:
> [root@hp-dl385g10-14 home]# virsh migrate rhel930
> qemu+unix:///system?socket=/tmp/sock --live  --postcopy --undefinesource
> --persistent --bandwidth 3 --postcopy-bandwidth 3   --migrateuri
> unix:///tmp/33333-sock
> error: Failed to connect socket to '/tmp/33333-sock': Permission denied
> 
> 
> Can you help?

You can set SELinux to permissive mode and try again.
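
For example (assuming the "Permission denied" really comes from an SELinux denial), on both hosts:
   # ausearch -m avc -ts recent | grep 33333-sock
   # setenforce 0
The first command confirms the AVC denial (it needs auditd running); the second switches SELinux to permissive mode until the next reboot. Then retry the migration.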

Comment 12 Laszlo Ersek 2023-07-05 09:28:58 UTC
Since when has "recovering postcopy" been a thing?

I've just made a serious attempt to read "docs/devel/migration.rst" in the upstream QEMU tree @ 2a6ae6915454, and it says:

> 'Postcopy' migration is a way to deal with migrations that refuse to converge
> (or take too long to converge) its plus side is that there is an upper bound on
> the amount of migration traffic and time it takes, the down side is that during
> the postcopy phase, a failure of *either* side or the network connection causes
> the guest to be lost.

It comes from 2bfdd1c8a6ac ("Add postcopy documentation", 2015-11-10).

"Recovering postcopy" (e.g. after network failure) is not consistent with the statement that the guest is lost.

If a separate "postcopy recovery" feature has been added in the meantime, then the developers missed updating the documentation.

What is the (main) RHBZ for "postcopy recovery"?

Comment 13 Laszlo Ersek 2023-07-05 10:29:09 UTC
Related upstream commit:

a688d2c1abc7 ("migration: new postcopy-pause state", 2018-05-15)

The containing series is probably the one that should have updated the documentation ("guest to be lost").

Comment 14 Laszlo Ersek 2023-07-05 13:05:34 UTC
I *think* this might be fixed once I fix bug 2018404, so setting a dependency accordingly.

Comment 15 Peter Xu 2023-07-05 13:42:54 UTC
(In reply to Laszlo Ersek from comment #13)
> Related upstream commit:
> 
> a688d2c1abc7 ("migration: new postcopy-pause state", 2018-05-15)
> 
> The containing series is probably the one that should have updated the
> documentation ("guest to be lost").

This is my fault (I'll also blame the maintainers for not reminding me :), but yeah, 99% mine).

I was trying to remedy that with this (also posted just a few days ago; please feel free to review):

https://lore.kernel.org/all/20230627200222.557529-1-peterx@redhat.com/

We used to have a simple wiki page:

https://wiki.qemu.org/Features/PostcopyRecovery
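
For context, the recovery flow that page describes is roughly the following at the QMP level (exact arguments may vary between QEMU versions); once the source has entered postcopy-paused and the network is back:
   (dest) {"execute": "migrate-recover", "arguments": {"uri": "unix:/tmp/33333-sock"}}
   (src)  {"execute": "migrate", "arguments": {"uri": "unix:/tmp/33333-sock", "resume": true}}
Libvirt drives essentially the same sequence when virsh migrate is invoked with --postcopy-resume.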

Comment 16 Li Xiaohui 2023-07-13 08:50:30 UTC
Hi Nitesh, since this bug has a customer impact, do we plan to fix this bug on RHEL 9.3.0? If so, please help set the ITR.


I tried to reproduce this bug in May, but failed to reproduce it through qemu. It can easily be reproduced through libvirt.
I will go on to find the difference between libvirt and qemu.

Comment 17 Nitesh Narayan Lal 2023-07-13 09:19:56 UTC
(In reply to Li Xiaohui from comment #16)
> Hi Nitesh, since this bug has a customer impact, do we plan to fix this bug
> on RHEL 9.3.0? If so, please help set the ITR.
> 
> 
> I tried to reproduce this bug in May, but failed to reproduce it through
> qemu. It can easily be reproduced through libvirt.
> I will go on to find the difference between libvirt and qemu.

Hi Xiaohui, From Laszlo's comment, the fix for Bug 2018404 should also fix this issue. However, I don't know if the commits have been merged upstream yet.
Let's check this with Peter.
Peter, do you think we can do this in 9.3?

Wrt customer impact, since this is a low priority, we should not rush the changes if they are risky.

Comment 18 Peter Xu 2023-07-17 18:51:40 UTC
(In reply to Nitesh Narayan Lal from comment #17)
> (In reply to Li Xiaohui from comment #16)
> > Hi Nitesh, since this bug has a customer impact, do we plan to fix this bug
> > on RHEL 9.3.0? If so, please help set the ITR.
> > 
> > 
> > I tried to reproduce this bug in May, but failed to reproduce it through
> > qemu. It can easily be reproduced through libvirt.
> > I will go on to find the difference between libvirt and qemu.

Thanks Xiaohui, that'll be very helpful.

> 
> Hi Xiaohui, From Laszlo's comment, the fix for Bug 2018404 should also fix
> this issue. However, I don't know if the commits have been merged upstream
> yet.

It hasn't landed yet, and I am not sure this is the same issue.  Laszlo, did I miss something, though?

> Let's check this with Peter.
> Peter, do you think we can do this in 9.3?

Given the low priority (which I agree with), I'd say we can wait for Xiaohui's result and just postpone it if it won't make it.

> 
> Wrt customer impact, since this is a low priority, we should not rush the
> changes if they are risky.

Comment 19 Laszlo Ersek 2023-07-20 13:43:02 UTC
In comment#14, I still thought that being stuck in postcopy-paused state, due to an unreachable IP address for example, was the problem underlying bug 2018404. Back then my idea was that we should turn that into a hard failure. That would fix the QMP status report (which depended on "failed" state), and potentially this bug as well -- I thought "QEMU reports migration is still running" here was related to the *lost* vmstate change.

In the meantime, though, Peter explained that we didn't want to fail the migration permanently under bug 2018404; instead, the socket status/error reporting via QMP should be extended to the "postcopy-paused" state (IIUC). In that case the dependency I set here in comment#14 is indeed incorrect, so I'm reverting it now.

