Bug 2209614 - No multifd threads during multifd migration after recovering from network packet loss
Summary: No multifd threads during multifd migration after recovering from network packet loss
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.3
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Nitesh Narayan Lal
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-24 09:00 UTC by Li Xiaohui
Modified: 2023-08-08 10:05 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-158039 0 None None None 2023-05-24 09:03:57 UTC

Description Li Xiaohui 2023-05-24 09:00:30 UTC
Description of problem:
In an environment with network packet loss, start multifd migration, then recover from the packet loss; multifd migration continues and succeeds, but no multifd threads are found.


Version-Release number of selected component (if applicable):
hosts info: kernel-5.14.0-316.el9.x86_64 && qemu-kvm-8.0.0-4.el9.x86_64
guest info: kernel-5.14.0-312.el9.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Start a VM on the source host
2. Start a VM on the destination host, appending '-incoming defer'
3. Enable the multifd capability on the src and dst hosts (see the QMP sketch after these steps)
4. Create network packet loss on the src host:
# tc qdisc add dev switch root netem loss 40%
5. Migrate the VM from the src to the dst host
6. Recover from the network packet loss, then let multifd migration continue:
# tc qdisc delete dev switch root netem loss 40%
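For reference, a minimal QMP sketch of steps 3 and 5, assuming a QMP monitor on each host; the port (4444) and the <dst-ip> placeholder are illustrative and not taken from this report:
On both hosts, enable multifd (and optionally set the channel count):
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "multifd", "state": true}]}}
{"execute": "migrate-set-parameters", "arguments": {"multifd-channels": 2}}
On the dst host (started with '-incoming defer'), open the incoming port:
{"execute": "migrate-incoming", "arguments": {"uri": "tcp:0:4444"}}
On the src host, start the migration:
{"execute": "migrate", "arguments": {"uri": "tcp:<dst-ip>:4444"}}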


Actual results:
After network recovery, no multifd threads are found while multifd migration is active:
Note: multifd migration still succeeds, and the VM works well after migration.
2023-05-24-04:26:28: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:28: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:28: 0
2023-05-24-04:26:28: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "JytSD1w4"}
2023-05-24-04:26:28: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 325809, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 7030912, "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096, "remaining": 4301185024, "postcopy-bytes": 0, "mbps": 0.21188531903454091, "transferred": 7049591, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 6995968, "normal": 1708}}, "id": "JytSD1w4"}
...
2023-05-24-04:26:49: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:49: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:49: 0
2023-05-24-04:26:49: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "5vxpJ2gL"}
2023-05-24-04:26:49: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 346983, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 7030912, "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096, "remaining": 4301185024, "postcopy-bytes": 0, "mbps": 0.21188531903454091, "transferred": 7049591, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 6995968, "normal": 1708}}, "id": "5vxpJ2gL"}
2023-05-24-04:26:50: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:50: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:50: 0
2023-05-24-04:26:50: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "RQ4QbnvV"}
2023-05-24-04:26:50: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 347991, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 7030912, "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096, "remaining": 4301185024, "postcopy-bytes": 0, "mbps": 0.21188531903454091, "transferred": 7049591, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 6995968, "normal": 1708}}, "id": "RQ4QbnvV"}
2023-05-24-04:26:51: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:51: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:51: 0
2023-05-24-04:26:51: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "Qprg16x2"}
2023-05-24-04:26:51: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 349000, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 122946176, "pages-per-second": 27974, "downtime-bytes": 0, "page-size": 4096, "remaining": 3087269888, "postcopy-bytes": 0, "mbps": 919.48127731092438, "transferred": 125377224, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 2431048, "duplicate": 270097, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 123076608, "normal": 30048}}, "id": "Qprg16x2"}
2023-05-24-04:26:52: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:52: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:52: 0
2023-05-24-04:26:52: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "AxMYJNt4"}
2023-05-24-04:26:52: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 350008, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 241992064, "pages-per-second": 28698, "downtime-bytes": 0, "page-size": 4096, "remaining": 2968485888, "postcopy-bytes": 0, "mbps": 943.2609655172414, "transferred": 244423464, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 2431400, "duplicate": 270121, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 241762304, "normal": 59024}}, "id": "AxMYJNt4"}
...
2023-05-24-04:27:08: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:08: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:08: 0
2023-05-24-04:27:08: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "CrLrH587"}
2023-05-24-04:27:08: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 366150, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 2141654272, "pages-per-second": 40800, "downtime-bytes": 0, "page-size": 4096, "remaining": 667672576, "postcopy-bytes": 0, "mbps": 1101.2626399999999, "transferred": 2144981949, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 3327677, "duplicate": 369462, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2135674880, "normal": 521405}}, "id": "CrLrH587"}
2023-05-24-04:27:09: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:09: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:09: 0
2023-05-24-04:27:09: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "RwwQ4my6"}
2023-05-24-04:27:09: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 367160, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 2262454528, "pages-per-second": 27312, "downtime-bytes": 0, "page-size": 4096, "remaining": 528969728, "postcopy-bytes": 0, "mbps": 881.67276800000002, "transferred": 2265822489, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 3367961, "duplicate": 373922, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2256109568, "normal": 550808}}, "id": "RwwQ4my6"}
2023-05-24-04:27:10: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:10: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:10: 0
2023-05-24-04:27:10: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "RdHaR1Fs"}
2023-05-24-04:27:10: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 368169, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 2, "multifd-bytes": 2376975936, "pages-per-second": 27512, "downtime-bytes": 0, "page-size": 4096, "remaining": 53698560, "postcopy-bytes": 0, "mbps": 904.28323966942162, "transferred": 2381207903, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4231967, "duplicate": 469906, "dirty-pages-rate": 21, "skipped": 0, "normal-bytes": 2370240512, "normal": 578672}}, "id": "RdHaR1Fs"}
2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:11: 0
2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "uR4McgzU"}
2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return": {"status": "completed", "setup-time": 1060, "downtime": 156, "total-time": 368569, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 4, "multifd-bytes": 2409007744, "pages-per-second": 41974, "downtime-bytes": 41598, "page-size": 4096, "remaining": 0, "postcopy-bytes": 0, "mbps": 52.540865252279538, "transferred": 2413308246, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4258904, "duplicate": 477508, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2401693696, "normal": 586351}}, "id": "uR4McgzU"}


Expected results:
Besides multifd migration succeeding and the VM working well after migration, we should also find 2 multifd threads while multifd migration is active.
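As a minimal sketch of that check (the PIDs are placeholders; 2 is the default value of multifd-channels):
src # cat /proc/<src-qemu-pid>/task/*/comm | grep -c multifdsend
dst # cat /proc/<dst-qemu-pid>/task/*/comm | grep -c multifdrecv
Both counts are expected to be 2 while the migration is in the active phase.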


Additional info:

Comment 1 John Ferlan 2023-06-20 16:05:04 UTC
Nitesh - sending this untriaged one your way since it has had no assignee for quite some time.

Comment 2 Juan Quintela 2023-07-24 14:19:09 UTC
This is normal behaviour.

When using migration (not postcopy), if we have a big problem on the destination (in this case network packet loss), we just drop the destination QEMU, and once the user fixes the problem, they should launch the QEMU on the destination again.

The only thing that we should improve is that if the channels die, the whole qemu should also die.

This is not a regression, and I think we can drop it or move it to the next version.

Comment 3 Juan Quintela 2023-07-24 14:23:41 UTC
What is happening here is:

create destination qemu
create source qemu
stop the network
launch migration command
   it gets lost because the network is dropping packets
fix the network
   the packets don't reappear (they are lost), so migration never starts on the destination.

The right thing to do at this point would have been (see the QMP sketch below):

migrate_cancel
relaunch migrate command
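For illustration, a minimal QMP sketch of those two steps, assuming the same destination URI as the original migrate command (the URI below is a placeholder):
{"execute": "migrate_cancel"}
{"execute": "migrate", "arguments": {"uri": "tcp:<dst-ip>:4444"}}
Per comment 2, the destination QEMU may also need to be relaunched (with '-incoming') before the migrate command is re-issued.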

Comment 4 Juan Quintela 2023-07-24 14:29:31 UTC
> 2023-05-24-04:27:10: [root@host ~]# cat /proc/73671/task/*/comm | grep
> multifdsend | wc -l
> 2023-05-24-04:27:10: 0
> 2023-05-24-04:27:10: Host(10.73.2.80) Sending qmp command   : {"execute":
> "query-migrate", "id": "RdHaR1Fs"}
> 2023-05-24-04:27:10: Host(10.73.2.80) Responding qmp command: {"return":
> {"expected-downtime": 300, "status": "active", "setup-time": 1060,
> "total-time": 368169, "ram": {"total": 4316667904, "postcopy-requests": 0,
> "dirty-sync-count": 2, "multifd-bytes": 2376975936, "pages-per-second":
> 27512, "downtime-bytes": 0, "page-size": 4096, "remaining": 53698560,
> "postcopy-bytes": 0, "mbps": 904.28323966942162, "transferred": 2381207903,
> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4231967, "duplicate":w
> 469906, "dirty-pages-rate": 21, "skipped": 0, "normal-bytes": 2370240512,
> "normal": 578672}}, "id": "RdHaR1Fs"}
> 2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst
> host. -----
> 2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep
> multifdsend | wc -l 

Could you double check that 73671 is the right process id?
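As a minimal sketch of one way to cross-check this (assuming the guest was started via qemu-kvm; <pid> below is a placeholder):
Find the source QEMU PID and its command line:
# pgrep -af qemu-kvm
List the thread names for that PID (multifd sender threads appear as "multifdsend_<n>"):
# ps -T -p <pid> -o tid,comm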

> 2023-05-24-04:27:11: 0
> 2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command   : {"execute":
> "query-migrate", "id": "uR4McgzU"}
> 2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return":
> {"status": "completed", "setup-time": 1060, "downtime": 156, "total-time":
> 368569, "ram": {"total": 4316667904, "postcopy-requests": 0,
> "dirty-sync-count": 4, "multifd-bytes": 2409007744, 

Because we are clearly using multifd, otherwise this value should be 0.

> "pages-per-second":
> 41974, "downtime-bytes": 41598, "page-size": 4096, "remaining": 0,
> "postcopy-bytes": 0, "mbps": 52.540865252279538, "transferred": 2413308246,
> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4258904, "duplicate":
> 477508, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2401693696,
> "normal": 586351}}, "id": "uR4McgzU"}

Comment 5 Li Xiaohui 2023-07-25 06:44:15 UTC
(In reply to Juan Quintela from comment #2)
> This is normal behaviour.
> 
> When using migration (not postcopy)  If we have a big problem on destination
> (in this case network loss) we just drop the destination qemu, and if user
> fix the problem, it should launch the qemu on destination again.
> 
> The only thing that we should improve is that if the channels die, the whole
> qemu should also die.
> 
> This is not a regression and I think we can drop it or move it next version.

Hi Juan, in a previous related bug, we fixed an error so that multifd migration can continue after network recovery until it succeeds,
see https://bugzilla.redhat.com/show_bug.cgi?id=2137740#c8
Bug 2137740 - Multifd migration fails under a weak network/socket ordering race


Adding a needinfo for Peter here because he fixed that bug.
Peter, please add comments if you have any.

Comment 6 Li Xiaohui 2023-07-25 08:48:16 UTC
(In reply to Juan Quintela from comment #4)
...
> > 2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst
> > host. -----
> > 2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep
> > multifdsend | wc -l 
> 
> Could you double check that 73671 is the right process id?

It should be right because we use automation to obtain the process id and monitor the multifdsend threads.

> 
> > 2023-05-24-04:27:11: 0
> > 2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command   : {"execute":
> > "query-migrate", "id": "uR4McgzU"}
> > 2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return":
> > {"status": "completed", "setup-time": 1060, "downtime": 156, "total-time":
> > 368569, "ram": {"total": 4316667904, "postcopy-requests": 0,
> > "dirty-sync-count": 4, "multifd-bytes": 2409007744, 
> 
> Because we are clearly using multifd, otherwise this value should be 0.

Comment 7 Juan Quintela 2023-07-26 09:21:24 UTC
(In reply to Li Xiaohui from comment #6)
> (In reply to Juan Quintela from comment #4)
> ...
> > > 2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst
> > > host. -----
> > > 2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep
> > > multifdsend | wc -l 
> > 
> > Could you double check that 73671 is the right process id?
> 
> It should be right because we use automation to obtain the process id and
> monitor the multifdsend threads.

Then can you try to reproduce?

It makes zero sense because we have:

- no multifd_send threads
- all data is sent through multifd threads

So at least one of these two things needs to be false.

This is the first output that you show

> After network recovery, no multifd threads found when multifd migration is active:
> Notes: multifd migration still succeeds, and VM works well after migration.
> 2023-05-24-04:26:28: ----- Check the multifd threads on the src and dst host. -----
> 2023-05-24-04:26:28: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l

No multifd send threads.

> 2023-05-24-04:26:28: 0
> 2023-05-24-04:26:28: Host(10.73.2.80) Sending qmp command   : {"execute": "query-migrate", "id": "JytSD1w4"}
> 2023-05-24-04:26:28: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300,

> "status": "active",

Migration is still active.


> "setup-time": 1060, "total-time": 325809, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, 
> "multifd-bytes": 7030912,

7MB transferred through multifd

> "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096,
> "remaining": 4301185024,

Still a lot of data to transfer.

> "postcopy-bytes": 0,
> "mbps": 0.21188531903454091

Nothing is really being transferred. 0.2 MB/s is basically stopped.

> "transferred": 7049591,

basically everything is sent through multifd (as it should):

7049591 - 7030912 = 18679 bytes, i.e. ~18 KB sent through the main channel.

>  "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0,
> "normal-bytes": 6995968,
> "normal": 1708}}, "id": "JytSD1w4"}

We go to the last one:

> 2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command   : {"execute":
> "query-migrate", "id": "uR4McgzU"}
> 2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return":
> {"status": "completed",

migration finished.
>  "setup-time": 1060, "downtime": 156, "total-time":
> 368569, "ram": {"total": 4316667904, "postcopy-requests": 0,
> "dirty-sync-count": 4, "multifd-bytes": 2409007744, 

Lots of multifd traffic (as it should)

> "pages-per-second":
> 41974, "downtime-bytes": 41598, "page-size": 4096, "remaining": 0,
> "postcopy-bytes": 0, "mbps": 52.540865252279538,

52MB/s network speed, reasonable

> "transferred": 2413308246,
> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4258904, "duplicate":
> 477508, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2401693696,
> "normal": 586351}}, "id": "uR4McgzU"}

So I am really at a loss as to why you don't find the multifd_send threads. That is why I am asking whether there is a possibility of a testing error.

Comment 8 Juan Quintela 2023-07-26 09:51:30 UTC
(In reply to Li Xiaohui from comment #5)
> (In reply to Juan Quintela from comment #2)
> > This is normal behaviour.
> > 
> > When using migration (not postcopy)  If we have a big problem on destination
> > (in this case network loss) we just drop the destination qemu, and if user
> > fix the problem, it should launch the qemu on destination again.
> > 
> > The only thing that we should improve is that if the channels die, the whole
> > qemu should also die.
> > 
> > This is not a regression and I think we can drop it or move it next version.
> 
> Hi Juan, in one previous relevant bug, we fix one error and support multifd
> migration to go on after network recovery till it succeeds,
> see https://bugzilla.redhat.com/show_bug.cgi?id=2137740#c8
> Bug 2137740 - Multifd migration fails under a weak network/socket ordering
> race

That is a different bug.

The problem there was that the channels (due to the network dropping packets) were created in the wrong order.
As I explained in the previous comment, here we have:
- you are not able to get the multifd threads
- we are transferring all data through multifd threads

Thanks, Juan.
> 
> Add one needinfo to Peter here because he fixed that bug.
> Peter, please add comments if you have.

Comment 9 Peter Xu 2023-07-26 16:13:44 UTC
Xiaohui, I see that the discussion is going on and Juan seems to have answered. I'm removing the needinfo for now. If you still need me for anything, just do that again or drop me a message. Thanks!

