Bug 2209614 - No multifd threads during multifd migration after recovering from network packet loss
Summary: No multifd threads during multifd migration after recovering from network packet loss
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.3
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Nitesh Narayan Lal
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-24 09:00 UTC by Li Xiaohui
Modified: 2023-08-08 10:05 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-158039 0 None None None 2023-05-24 09:03:57 UTC

Description Li Xiaohui 2023-05-24 09:00:30 UTC
Description of problem:
In an environment with network packet loss, start multifd migration, then recover from the packet loss; multifd migration continues and succeeds, but no multifd threads are found.


Version-Release number of selected component (if applicable):
hosts info: kernel-5.14.0-316.el9.x86_64 && qemu-kvm-8.0.0-4.el9.x86_64
guest info: kernel-5.14.0-312.el9.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Start a VM on the source host
2. Start a VM on the destination host, appending '-incoming defer'
3. Enable the multifd capability on the src and dst hosts (see the QMP sketch after these steps)
4. Create network packet loss on the src host:
# tc qdisc add dev switch root netem loss 40%
5. Migrate the VM from the src to the dst host
6. Recover from the network packet loss, then let multifd migration continue:
# tc qdisc delete dev switch root netem loss 40%
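For reference, a minimal QMP sketch of steps 3 and 5, assuming a QMP monitor on each host; the port (4444) and the <dst-ip> placeholder are illustrative and not taken from this report:
On both hosts, enable multifd (and optionally set the channel count):
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "multifd", "state": true}]}}
{"execute": "migrate-set-parameters", "arguments": {"multifd-channels": 2}}
On the dst host (started with '-incoming defer'), open the incoming port:
{"execute": "migrate-incoming", "arguments": {"uri": "tcp:0:4444"}}
On the src host, start the migration:
{"execute": "migrate", "arguments": {"uri": "tcp:<dst-ip>:4444"}}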


Actual results:
After network recovery, no multifd threads are found while multifd migration is active:
Note: multifd migration still succeeds, and the VM works well after migration.
2023-05-24-04:26:28: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:28: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:28: 0
2023-05-24-04:26:28: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "JytSD1w4"}
2023-05-24-04:26:28: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 325809, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 7030912, "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096, "remaining": 4301185024, "postcopy-bytes": 0, "mbps": 0.21188531903454091, "transferred": 7049591, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 6995968, "normal": 1708}}, "id": "JytSD1w4"}
...
2023-05-24-04:26:49: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:49: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:49: 0
2023-05-24-04:26:49: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "5vxpJ2gL"}
2023-05-24-04:26:49: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 346983, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 7030912, "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096, "remaining": 4301185024, "postcopy-bytes": 0, "mbps": 0.21188531903454091, "transferred": 7049591, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 6995968, "normal": 1708}}, "id": "5vxpJ2gL"}
2023-05-24-04:26:50: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:50: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:50: 0
2023-05-24-04:26:50: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "RQ4QbnvV"}
2023-05-24-04:26:50: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 347991, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 7030912, "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096, "remaining": 4301185024, "postcopy-bytes": 0, "mbps": 0.21188531903454091, "transferred": 7049591, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 6995968, "normal": 1708}}, "id": "RQ4QbnvV"}
2023-05-24-04:26:51: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:51: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:51: 0
2023-05-24-04:26:51: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "Qprg16x2"}
2023-05-24-04:26:51: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 349000, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 122946176, "pages-per-second": 27974, "downtime-bytes": 0, "page-size": 4096, "remaining": 3087269888, "postcopy-bytes": 0, "mbps": 919.48127731092438, "transferred": 125377224, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 2431048, "duplicate": 270097, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 123076608, "normal": 30048}}, "id": "Qprg16x2"}
2023-05-24-04:26:52: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:52: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:52: 0
2023-05-24-04:26:52: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "AxMYJNt4"}
2023-05-24-04:26:52: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 350008, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 241992064, "pages-per-second": 28698, "downtime-bytes": 0, "page-size": 4096, "remaining": 2968485888, "postcopy-bytes": 0, "mbps": 943.2609655172414, "transferred": 244423464, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 2431400, "duplicate": 270121, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 241762304, "normal": 59024}}, "id": "AxMYJNt4"}
...
2023-05-24-04:27:08: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:08: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:08: 0
2023-05-24-04:27:08: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "CrLrH587"}
2023-05-24-04:27:08: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 366150, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 2141654272, "pages-per-second": 40800, "downtime-bytes": 0, "page-size": 4096, "remaining": 667672576, "postcopy-bytes": 0, "mbps": 1101.2626399999999, "transferred": 2144981949, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 3327677, "duplicate": 369462, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2135674880, "normal": 521405}}, "id": "CrLrH587"}
2023-05-24-04:27:09: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:09: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:09: 0
2023-05-24-04:27:09: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "RwwQ4my6"}
2023-05-24-04:27:09: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 367160, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 2262454528, "pages-per-second": 27312, "downtime-bytes": 0, "page-size": 4096, "remaining": 528969728, "postcopy-bytes": 0, "mbps": 881.67276800000002, "transferred": 2265822489, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 3367961, "duplicate": 373922, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2256109568, "normal": 550808}}, "id": "RwwQ4my6"}
2023-05-24-04:27:10: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:10: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:10: 0
2023-05-24-04:27:10: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "RdHaR1Fs"}
2023-05-24-04:27:10: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 368169, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 2, "multifd-bytes": 2376975936, "pages-per-second": 27512, "downtime-bytes": 0, "page-size": 4096, "remaining": 53698560, "postcopy-bytes": 0, "mbps": 904.28323966942162, "transferred": 2381207903, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4231967, "duplicate": 469906, "dirty-pages-rate": 21, "skipped": 0, "normal-bytes": 2370240512, "normal": 578672}}, "id": "RdHaR1Fs"}
2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:11: 0
2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "uR4McgzU"}
2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return": {"status": "completed", "setup-time": 1060, "downtime": 156, "total-time": 368569, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 4, "multifd-bytes": 2409007744, "pages-per-second": 41974, "downtime-bytes": 41598, "page-size": 4096, "remaining": 0, "postcopy-bytes": 0, "mbps": 52.540865252279538, "transferred": 2413308246, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4258904, "duplicate": 477508, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2401693696, "normal": 586351}}, "id": "uR4McgzU"}


Expected results:
Besides multifd migration succeeding and the VM working well after migration, we should also find 2 multifd threads while multifd migration is active.
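As a minimal sketch of that check (the PIDs are placeholders; 2 is the default value of multifd-channels):
src # cat /proc/<src-qemu-pid>/task/*/comm | grep -c multifdsend
dst # cat /proc/<dst-qemu-pid>/task/*/comm | grep -c multifdrecv
Both counts are expected to be 2 while the migration is in the active phase.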


Additional info:

Comment 1 John Ferlan 2023-06-20 16:05:04 UTC
Nitesh - sending this untriaged one your way since it has had no assignee for quite some time.

Comment 2 Juan Quintela 2023-07-24 14:19:09 UTC
This is normal behaviour.

When using migration (not postcopy), if we have a big problem on the destination (in this case network packet loss), we just drop the destination QEMU, and once the user fixes the problem, they should launch the QEMU on the destination again.

The only thing that we should improve is that if the channels die, the whole qemu should also die.

This is not a regression, and I think we can drop it or move it to the next version.

Comment 3 Juan Quintela 2023-07-24 14:23:41 UTC
What is happening here is:

create destination qemu
create source qemu
stop the network
launch migration command
   it gets lost because the network is dropping packets
fix the network
   the packets don't reappear (they are lost), so migration never starts on the destination.

The right thing to do at this point would have been (see the QMP sketch below):

migrate_cancel
relaunch migrate command
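For illustration, a minimal QMP sketch of those two steps, assuming the same destination URI as the original migrate command (the URI below is a placeholder):
{"execute": "migrate_cancel"}
{"execute": "migrate", "arguments": {"uri": "tcp:<dst-ip>:4444"}}
Per comment 2, the destination QEMU may also need to be relaunched (with '-incoming') before the migrate command is re-issued.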

Comment 4 Juan Quintela 2023-07-24 14:29:31 UTC
> 2023-05-24-04:27:10: [root@host ~]# cat /proc/73671/task/*/comm | grep
> multifdsend | wc -l
> 2023-05-24-04:27:10: 0
> 2023-05-24-04:27:10: Host(10.73.2.80) Sending qmp command   : {"execute":
> "query-migrate", "id": "RdHaR1Fs"}
> 2023-05-24-04:27:10: Host(10.73.2.80) Responding qmp command: {"return":
> {"expected-downtime": 300, "status": "active", "setup-time": 1060,
> "total-time": 368169, "ram": {"total": 4316667904, "postcopy-requests": 0,
> "dirty-sync-count": 2, "multifd-bytes": 2376975936, "pages-per-second":
> 27512, "downtime-bytes": 0, "page-size": 4096, "remaining": 53698560,
> "postcopy-bytes": 0, "mbps": 904.28323966942162, "transferred": 2381207903,
> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4231967, "duplicate":w
> 469906, "dirty-pages-rate": 21, "skipped": 0, "normal-bytes": 2370240512,
> "normal": 578672}}, "id": "RdHaR1Fs"}
> 2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst
> host. -----
> 2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep
> multifdsend | wc -l 

Could you double check that 73671 is the right process id?
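As a minimal sketch of one way to cross-check this (assuming the guest was started via qemu-kvm; <pid> below is a placeholder):
Find the source QEMU PID and its command line:
# pgrep -af qemu-kvm
List the thread names for that PID (multifd sender threads appear as "multifdsend_<n>"):
# ps -T -p <pid> -o tid,comm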

> 2023-05-24-04:27:11: 0
> 2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command   : {"execute":
> "query-migrate", "id": "uR4McgzU"}
> 2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return":
> {"status": "completed", "setup-time": 1060, "downtime": 156, "total-time":
> 368569, "ram": {"total": 4316667904, "postcopy-requests": 0,
> "dirty-sync-count": 4, "multifd-bytes": 2409007744, 

Because we are clearly using multifd, otherwise this value should be 0.

> "pages-per-second":
> 41974, "downtime-bytes": 41598, "page-size": 4096, "remaining": 0,
> "postcopy-bytes": 0, "mbps": 52.540865252279538, "transferred": 2413308246,
> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4258904, "duplicate":
> 477508, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2401693696,
> "normal": 586351}}, "id": "uR4McgzU"}

Comment 5 Li Xiaohui 2023-07-25 06:44:15 UTC
(In reply to Juan Quintela from comment #2)
> This is normal behaviour.
> 
> When using migration (not postcopy)  If we have a big problem on destination
> (in this case network loss) we just drop the destination qemu, and if user
> fix the problem, it should launch the qemu on destination again.
> 
> The only thing that we should improve is that if the channels die, the whole
> qemu should also die.
> 
> This is not a regression and I think we can drop it or move it next version.

Hi Juan, in a previous related bug, we fixed an error so that multifd migration can continue after network recovery until it succeeds,
see https://bugzilla.redhat.com/show_bug.cgi?id=2137740#c8
Bug 2137740 - Multifd migration fails under a weak network/socket ordering race


Adding a needinfo for Peter here because he fixed that bug.
Peter, please add comments if you have any.

Comment 6 Li Xiaohui 2023-07-25 08:48:16 UTC
(In reply to Juan Quintela from comment #4)
...
> > 2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst
> > host. -----
> > 2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep
> > multifdsend | wc -l 
> 
> Could you double check that 73671 is the right process id?

It should be right because we use automation to obtain the process id and monitor the multifdsend threads.

> 
> > 2023-05-24-04:27:11: 0
> > 2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command   : {"execute":
> > "query-migrate", "id": "uR4McgzU"}
> > 2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return":
> > {"status": "completed", "setup-time": 1060, "downtime": 156, "total-time":
> > 368569, "ram": {"total": 4316667904, "postcopy-requests": 0,
> > "dirty-sync-count": 4, "multifd-bytes": 2409007744, 
> 
> Because we are clearly using multifd, otherwise this value should be 0.

Comment 7 Juan Quintela 2023-07-26 09:21:24 UTC
(In reply to Li Xiaohui from comment #6)
> (In reply to Juan Quintela from comment #4)
> ...
> > > 2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst
> > > host. -----
> > > 2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep
> > > multifdsend | wc -l 
> > 
> > Could you double check that 73671 is the right process id?
> 
> It should be right because we use automation to obtain the process id and
> monitor the multifdsend threads.

Then can you try to reproduce?

It makes zero sense because we have:

- no multifd_send threads
- all data is sent through multifd threads

So at least one of these two things needs to be false.

This is the first output that you show

> After network recovery, no multifd threads found when multifd migration is active:
> Notes: multifd migration still succeeds, and VM works well after migration.
> 2023-05-24-04:26:28: ----- Check the multifd threads on the src and dst host. -----
> 2023-05-24-04:26:28: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l

No multifd send threads.

> 2023-05-24-04:26:28: 0
> 2023-05-24-04:26:28: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "JytSD1w4"}
> 2023-05-24-04:26:28: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300,

> "status": "active",

Migration is still active.


> "setup-time": 1060, "total-time": 325809, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, 
> "multifd-bytes": 7030912,

7MB transferred through multifd

> "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096,
> "remaining": 4301185024,

Still a lot of data to transfer.

> "postcopy-bytes": 0,
> "mbps": 0.21188531903454091

Nothing is really being transferred. 0.2 MB/s is basically stopped.

> "transferred": 7049591,

basically everything is sent through multifd (as it should):

7049591 - 7030912 = 18679 bytes, i.e. ~18 KB sent through the main channel.

>  "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0,
> "normal-bytes": 6995968,
> "normal": 1708}}, "id": "JytSD1w4"}

We go to the last one:

> 2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command   : {"execute":
> "query-migrate", "id": "uR4McgzU"}
> 2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return":
> {"status": "completed",

migration finished.
>  "setup-time": 1060, "downtime": 156, "total-time":
> 368569, "ram": {"total": 4316667904, "postcopy-requests": 0,
> "dirty-sync-count": 4, "multifd-bytes": 2409007744, 

Lots of multifd traffic (as it should)

> "pages-per-second":
> 41974, "downtime-bytes": 41598, "page-size": 4096, "remaining": 0,
> "postcopy-bytes": 0, "mbps": 52.540865252279538,

52MB/s network speed, reasonable

> "transferred": 2413308246,
> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4258904, "duplicate":
> 477508, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2401693696,
> "normal": 586351}}, "id": "uR4McgzU"}

So I am really at a loss as to why you don't find the multifd_send threads. That is why I am asking whether there is a possibility of a testing error.

Comment 8 Juan Quintela 2023-07-26 09:51:30 UTC
(In reply to Li Xiaohui from comment #5)
> (In reply to Juan Quintela from comment #2)
> > This is normal behaviour.
> > 
> > When using migration (not postcopy)  If we have a big problem on destination
> > (in this case network loss) we just drop the destination qemu, and if user
> > fix the problem, it should launch the qemu on destination again.
> > 
> > The only thing that we should improve is that if the channels die, the whole
> > qemu should also die.
> > 
> > This is not a regression and I think we can drop it or move it next version.
> 
> Hi Juan, in one previous relevant bug, we fix one error and support multifd
> migration to go on after network recovery till it succeeds,
> see https://bugzilla.redhat.com/show_bug.cgi?id=2137740#c8
> Bug 2137740 - Multifd migration fails under a weak network/socket ordering
> race

That is a different bug.

The problem there was that the channels (due to the network dropping packets) were created in the wrong order.
As I explained in the previous comment, here we have:
- you are not able to get the multifd threads
- we are transferring all data through multifd threads

Thanks, Juan.
> 
> Add one needinfo to Peter here because he fixed that bug.
> Peter, please add comments if you have.

Comment 9 Peter Xu 2023-07-26 16:13:44 UTC
Xiaohui, I see that the discussion is going on and Juan seems to have answered. I'm removing the needinfo for now. If you still need me for anything, just do that again or drop me a message. Thanks!

