Bug 2137740 - Multifd migration fails under a weak network/socket ordering race
Summary: Multifd migration fails under a weak network/socket ordering race
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: qemu-kvm
Version: 8.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Peter Xu
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks: 2169732
 
Reported: 2022-10-26 03:34 UTC by Li Xiaohui
Modified: 2023-05-25 09:56 UTC
CC List: 13 users

Fixed In Version: qemu-kvm-6.2.0-31.module+el8.8.0+18188+901de023
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2169732 (view as bug list)
Environment:
Last Closed: 2023-05-16 08:16:35 UTC
Type: ---
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/rhel/src/qemu-kvm qemu-kvm merge_requests 258 0 None None None 2023-02-14 16:26:23 UTC
Red Hat Issue Tracker RHELPLAN-137605 0 None None None 2022-10-26 03:42:32 UTC
Red Hat Product Errata RHSA-2023:2757 0 None None None 2023-05-16 08:17:42 UTC

Description Li Xiaohui 2022-10-26 03:34:04 UTC
Description of problem:
If network packet loss occurs between the source and destination hosts, multifd migration fails with an error:
"qemu-kvm: failed to receive packet via multifd channel 0: multifd: received packet magic 5145564d expected 11223344"


Version-Release number of selected component (if applicable):
Any qemu version should hit this issue; I will try to reproduce it later.
Someone from KubeVirt has already hit this issue on qemu-kvm-6.2.0-5.module_el8.6.0+1087+b42c8331.


How reproducible:
50% 


Steps to Reproduce:
1. See the Description of problem above.

Actual results:
Multifd migration fails.


Expected results:
Multifd migration succeeds


Additional info:
This issue should also happen on RHEL 9; I will clone this bug for RHEL 9 after reproducing it.

Comment 1 Li Xiaohui 2022-10-27 03:04:35 UTC
Tried to reproduce this bug on qemu-kvm-6.2.0-11.module+el8.6.0+16538+01ea313d.6.x86_64:
1. When the guest is running on the source host, enable the multifd capability on source and destination;
2. Before migration, create network packet loss on the source host:
# tc qdisc add dev switch root netem loss 40%
3. Then start migrating the guest from the source host to the destination host.


After step 3, I get errors like the ones below on the source and destination HMP:
(1)src hmp: (qemu) info 2022-10-26T11:34:41.372470Z qemu-kvm: multifd_send_pages: channel 0 has already quit!
(2)dst hmp: (qemu) 2022-10-26T11:34:41.362459Z qemu-kvm: failed to receive packet via multifd channel 1: multifd: received packet magic 5145564d expected 11223344

The source QMP and HMP hang when executing the two QMP commands below, and I also can't log in to the guest through the console:
{"execute":"qmp_capabilities", "arguments":{"enable":["oob"]}}
{"return": {}}
{"execute":"query-migrate"}

{"execute": "query-status"}


Given the above test results, did I reproduce the bug mentioned by Itamar?

Comment 2 Leonardo Bras 2022-11-08 05:40:08 UTC
OK, from previous debugging experience, it looks like the dst qemu is reading some packet expecting it to be the header, but due to packet loss it is not, and it breaks the migration (as expected?). If there is some packet loss, not all data is reaching the destination.

I mean, of course some packets can be lost in TCP transmission, but they should be re-sent by the TCP stack.

Please help me understand:
- What is getting tested here, exactly? 
- Is migration code supposed to re-send data if there is any packet lost? 
- What is the command used to start migration?

Comment 3 Dr. David Alan Gilbert 2022-11-09 09:17:10 UTC
(In reply to Leonardo Bras from comment #2)
> OK, from previous debugging experience, it looks like the dst qemu is
> reading some packet expecting it to be the header, but due to packet loss
> it is not, and it breaks the migration (as expected?). If there is some
> packet loss, not all data is reaching the destination.
> 
> I mean, of course some packets can be lost in TCP transmission, but they
> should be re-sent by the TCP stack.
> 
> Please help me understand:
> - What is getting tested here, exactly? 
> - Is migration code supposed to re-send data if there is any packet lost? 

This is an ordering race rather than an actual loss of data.  There are multiple socket connections happening: the 'main' socket and then multiple sockets for multifd.  The existing code makes the incorrect assumption that the 'main' socket will connect first (and send its 5145564d header), followed by the 'multifd' sockets (with their 11223344 header).
The artificial packet loss delays the opening of the main socket, so a multifd socket connects first, then the main connection comes along.
The error above is the multifd code expecting to receive a multifd header but actually receiving the main socket header.
Peter has been looking at making that more robust.

> - What is the command used to start migration?
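
To illustrate the direction described in this comment, the destination could avoid relying on connection order by looking at the first bytes of each incoming connection before deciding which handler owns it. The sketch below is hypothetical and is not the actual QEMU patch: the function name, the use of MSG_PEEK on a plain TCP socket, and the byte-order handling are illustrative assumptions only.

#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

#define MAIN_STREAM_MAGIC 0x5145564dU   /* "QEVM", main migration stream */
#define MULTIFD_MAGIC     0x11223344U   /* multifd channel header */

enum channel_kind { CHANNEL_MAIN, CHANNEL_MULTIFD, CHANNEL_UNKNOWN };

enum channel_kind classify_incoming_channel(int fd)
{
    uint32_t magic;

    /* MSG_PEEK leaves the bytes in the socket buffer, so whichever handler
     * the connection is routed to still reads the full header itself.
     * A real implementation would retry until all 4 bytes have arrived. */
    if (recv(fd, &magic, sizeof(magic), MSG_PEEK) != (ssize_t)sizeof(magic)) {
        return CHANNEL_UNKNOWN;
    }
    magic = ntohl(magic);   /* assumes the headers are sent in network byte order */

    if (magic == MAIN_STREAM_MAGIC) {
        return CHANNEL_MAIN;      /* route to the main migration stream handler */
    }
    if (magic == MULTIFD_MAGIC) {
        return CHANNEL_MULTIFD;   /* hand off to a multifd receive thread */
    }
    return CHANNEL_UNKNOWN;
}

With something like this, a connection carrying the "QEVM" magic is treated as the main channel no matter when it arrives, so the artificial delay introduced by the packet loss no longer confuses the mapping of channels.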

Comment 7 Yanan Fu 2023-02-17 02:58:01 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 8 Li Xiaohui 2023-02-23 11:56:26 UTC
Verified this bug on kernel-4.18.0-472.el8.x86_64 and qemu-kvm-6.2.0-31.module+el8.8.0+18188+901de023.x86_64.

Background:
1. When the guest is running on the source host, enable the multifd capability on source and destination;
2. Before migration, create network packet loss on the source host:
# tc qdisc add dev switch root netem loss 40%
3. Then start migrating the guest from the source host to the destination host.


After step 3, migration is active. Migration progress is very slow and migration can't converge.
But the guest still works well, there are no errors from qemu, and qemu doesn't hang.

Then test the scenarios below:
1) Cancel migration when multifd migration can't converge during the network packet loss:
Result: Migration is cancelled successfully, and the VM works well on the src host.
Below is some info after cancelling migration.

src hmp:
(qemu) migrate_cancel 
(qemu) 2023-02-23T06:37:36.993227Z qemu-kvm: multifd_send_pages: channel 1 has already quit!
2023-02-23T06:37:36.993316Z qemu-kvm: multifd_send_sync_main: multifd_send_pages fail
2023-02-23T06:37:36.993326Z qemu-kvm: failed to save SaveStateEntry with id(name): 1(ram): -1
2023-02-23T06:37:36.995036Z qemu-kvm: Unable to write to socket: Broken pipe

dst hmp:
(qemu) 2023-02-23T06:37:36.757048Z qemu-kvm: check_section_footer: Read section footer failed: -5
2023-02-23T06:37:36.758076Z qemu-kvm: load of migration failed: Invalid argument

2) Remove the network packet loss, then continue multifd migration:
# tc qdisc delete dev switch root netem loss 40%
Result: Migration succeeds, and the VM works well after migration.

Comment 9 Li Xiaohui 2023-02-23 12:11:29 UTC
Hi Peter, I tested this bug in Comment 8. Do you think the above results are expected with our fix?

Comment 10 Peter Xu 2023-02-23 15:15:06 UTC
(In reply to Li Xiaohui from comment #9)
> Hi Peter, I tested this bug in Comment 8, do you think the above results are
> expected as our fix?

Yes, I think so.  Thanks!

Comment 11 Li Xiaohui 2023-02-24 03:54:01 UTC
Thank you.


So marking this bug verified per Comment 8. Will add a case in Polarion later.

Comment 15 errata-xmlrpc 2023-05-16 08:16:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2757

