This bug has been migrated to another issue tracking site. It has been closed here and may no longer be being monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry; the email creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will also be available in a blue banner at the top of the page informing you that the bug has been migrated.
Bug 2178376 - The migration port is not released if it is immediately reused for recovering postcopy
Summary: The migration port is not released if it is immediately reused for recovering postcopy
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.2
Hardware: All
OS: Linux
medium
low
Target Milestone: rc
: ---
Assignee: Cédric Le Goater
QA Contact: Li Xiaohui
Docs Contact: Daniel Vozenilek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-15 02:51 UTC by Li Xiaohui
Modified: 2023-09-22 13:10 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Recovering an interrupted post-copy VM migration might fail
If a post-copy migration of a virtual machine (VM) is interrupted and then immediately resumed on the same incoming port, the migration might fail with the following error: `Address already in use`
To work around this problem, wait at least 10 seconds before resuming the post-copy migration or switch to another port for migration recovery.
Clone Of:
Environment:
Last Closed: 2023-09-22 13:10:45 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 2096143 0 medium CLOSED The migration port is not released if use it again for recovering postcopy migration 2023-09-06 20:02:36 UTC
Red Hat Issue Tracker RHEL-7096 0 None Migrated None 2023-09-22 13:10:37 UTC
Red Hat Issue Tracker RHELPLAN-151856 0 None None None 2023-03-15 02:53:36 UTC

Description Li Xiaohui 2023-03-15 02:51:59 UTC
Description of problem:
The migration port is not released if it is immediately reused for recovering postcopy


Version-Release number of selected component (if applicable):
hosts info: kernel-5.14.0-284.el9.s390x && qemu-kvm-7.2.0-12.el9_2.s390x
guest info: kernel-5.14.0-284.el9.s390x


How reproducible:
2/3


Steps to Reproduce:
1. Boot a guest on the src host with the QEMU command [1]
2. Run stressapptest in the src guest
# stressapptest -M 400 -s 100000 > /dev/null &
3. Boot a guest on the dst host, appending "-incoming defer"
4. Enable postcopy-ram on both src and dst hosts, and set max-postcopy-bandwidth to 5M on the src host
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "postcopy-ram", "state": true}]}, "id": "kcR6zlRt"}
{"execute": "migrate-set-parameters", "arguments": {"max-postcopy-bandwidth": 5242880}, "id": "qJO8RB7x"}
5. Migrate the guest to the dst host; during migration, switch to postcopy mode
dst qmp: 
{"execute": "migrate-incoming", "arguments": {"uri": "tcp:[::]:4000"}, "id": "beuGfoRj"}

src qmp: 
{"execute": "migrate", "arguments": {"uri": "tcp:10.0.160.22:4000", "blk": false, "inc": false, "detach": true, "resume": false}, "id": "wAeA93G3"}
{"execute": "query-migrate", "id": "Ux4M0tUE"}
{"return": {"expected-downtime": 300, "status": "active", "setup-time": 389, "total-time": 401, "ram": {"total": 4294967296, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 0, "pages-per-second": 0, "downtime-bytes": 0, "page-size": 4096, "remaining": 4294914048, "postcopy-bytes": 0, "mbps": 0, "transferred": 16495, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 20599, "duplicate": 7, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 16384, "normal": 4}}, "id": "Ux4M0tUE"}
{"execute": "migrate-start-postcopy", "id": "DeHSiUGi"}
{"execute": "query-migrate", "id": "hgC8wXzd"}
{"return": {"expected-downtime": 300, "status": "postcopy-active", "setup-time": 389, "total-time": 500, "ram": {"total": 4294967296, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 0, "pages-per-second": 8113, "downtime-bytes": 0, "page-size": 4096, "remaining": 4281475072, "postcopy-bytes": 0, "mbps": 264.52334975369456, "transferred": 13424415, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 13424415, "duplicate": 23, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 13398016, "normal": 3271}}, "id": "hgC8wXzd"}

dst qmp: 
{"execute": "query-migrate", "id": "1IvIYeLw"}
{"return": {"status": "postcopy-active", "socket-address": [{"port": "4000", "ipv6": true, "host": "::", "type": "inet"}]}, "id": "1IvIYeLw"}

6. During the postcopy phase, pause the migration
src qmp:
{"execute": "migrate-pause", "id": "ujcerLtX"}
{"execute": "query-migrate", "id": "GOOG9F7q"}
{"return": {"expected-downtime": 300, "status": "postcopy-paused", "setup-time": 389, "total-time": 7839, "ram": {"total": 4294967296, "postcopy-requests": 85, "dirty-sync-count": 2, "multifd-bytes": 0, "pages-per-second": 1370, "downtime-bytes": 0, "page-size": 4096, "remaining": 4237402112, "postcopy-bytes": 39311230, "mbps": 42.032879999999999, "transferred": 52735645, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 13424415, "duplicate": 1213, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 52621312, "normal": 12847}}, "id": "GOOG9F7q"}

dst qmp:
{"execute": "query-migrate", "id": "l9XIQfO0"}
{"return": {"status": "postcopy-paused", "socket-address": [{"port": "4000", "ipv6": true, "host": "::", "type": "inet"}]}, "id": "l9XIQfO0"}

7. Recover the postcopy migration immediately after it enters postcopy-paused, reusing the same migration port (4000)
dst qmp:
{"exec-oob": "migrate-recover", "arguments": {"uri": "tcp:[::]:4000"}, "id": "243Y88Wk"}
{"id": "243Y88Wk", "error": {"class": "GenericError", "desc": "Failed to find an available port: Address already in use"}}


Actual results:
As in step 7, migrate-recover fails with "Address already in use".


Expected results:
The postcopy migration recovers successfully.


Additional info:
1. If we wait 10 seconds after step 6, step 7 usually succeeds (see the sketch below).
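
For illustration only, a minimal sketch of this wait-then-recover workaround driven over the destination QMP monitor (not the automation actually used here). It assumes the dst QMP server is reachable on TCP port 3333, as in the "-qmp tcp:0:3333" option of the QEMU command [1]; the host name and command id are placeholders.

import json, socket, time

def qmp_command(f, cmd):
    # Send one QMP command and return its reply, skipping async events.
    f.write(json.dumps(cmd) + "\n")
    f.flush()
    while True:
        reply = json.loads(f.readline())
        if "event" not in reply:
            return reply

sock = socket.create_connection(("dst-host.example.com", 3333))  # placeholder host
f = sock.makefile("rw")
json.loads(f.readline())                                   # QMP greeting
qmp_command(f, {"execute": "qmp_capabilities",
                "arguments": {"enable": ["oob"]}})         # negotiate OOB so exec-oob is allowed

time.sleep(10)   # workaround: give the kernel time to release port 4000

print(qmp_command(f, {"exec-oob": "migrate-recover",
                      "arguments": {"uri": "tcp:[::]:4000"},
                      "id": "recover-after-wait"}))
sock.close()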

Comment 1 Li Xiaohui 2023-03-15 02:55:44 UTC
Qemu command [1]:
/usr/libexec/qemu-kvm  \
-name "mouse-vm" \
-sandbox on \
-machine s390-ccw-virtio,memory-backend=mem-machine_mem \
-cpu host \
-nodefaults  \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server=on,wait=off \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server=on,wait=off \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device '{"driver":"virtio-scsi-ccw","id":"virtio_scsi_ccw0","devno":"fe.0.0000"}' \
-device '{"driver":"scsi-hd","id":"image1","device_id":"drive-image1","drive":"drive_image1","bus":"virtio_scsi_ccw0.0","bootindex":0,"write-cache":"on"}' \
-device '{"driver":"virtio-net-ccw","mac":"9a:fc:6e:3c:a6:f0","id":"net0","netdev":"tap0","devno":"fe.0.0001"}' \
-device '{"driver":"virtio-mouse-ccw","id":"input_mouse1","devno":"fe.0.0002"}' \
-device '{"driver":"virtio-keyboard-ccw","id":"input_keyboard1","devno":"fe.0.0003"}' \
-device '{"driver":"virtio-balloon-ccw","id":"balloon0","devno":"fe.0.0004"}' \
-blockdev '{"driver":"file","auto-read-only":true,"discard":"unmap","aio":"threads","cache":{"direct":true,"no-flush":false},"filename":"/mnt/xiaohli/rhel920-s390x-virtio-scsi.qcow2","node-name":"drive_sys1"}' \
-blockdev '{"driver":"qcow2","node-name":"drive_image1","read-only":false,"cache":{"direct":true,"no-flush":false},"file":"drive_sys1"}' \
-netdev tap,id=tap0,vhost=on \
-m 4096 \
-object '{"qom-type":"memory-backend-ram","id":"mem-machine_mem","size":4294967296}' \
-smp 1,maxcpus=1,cores=1,threads=1,sockets=1 \
-nographic  \
-rtc base=utc,clock=host,driftfix=slew \
-boot strict=on \
-enable-kvm  \
-qmp tcp:0:3333,server=on,wait=off \
-qmp tcp:0:9999,server=on,wait=off \
-qmp tcp:0:9888,server=on,wait=off \
-serial tcp:0:4444,server=on,wait=off \
-monitor stdio \

Comment 2 Li Xiaohui 2023-03-15 02:58:37 UTC
Also tried this case on x86, repeating it 300 times; the bug didn't reproduce there.

Comment 3 Li Xiaohui 2023-03-15 07:14:00 UTC
We can refer to https://bugzilla.redhat.com/show_bug.cgi?id=2096143, which fixed a similar issue in RHEL 9.1.

Comment 4 Thomas Huth 2023-03-16 14:31:57 UTC
Li, do you know whether this is a regression? I.e. did this used to work in the past? If so, do you know which qemu-kvm version was still OK?

Comment 5 Thomas Huth 2023-03-16 14:33:39 UTC
Peter, are you aware of any recent issues that could trigger this problem?

Comment 6 Peter Xu 2023-03-16 14:43:51 UTC
(In reply to Thomas Huth from comment #5)
> Peter, are you aware of any recent issues that could trigger this problem?

Yes we used to have bz2096143 (also quoted in comment 3) and we fixed it for all archs.

For this one it seems more like something special in s390 where the socket can be released slightly slower than the other archs?  Per Xiaohui's comment above, it doesn't happen on other archs (e.g. x86), and also on s390 the port will be released if we wait for 10 secs.

Copying Jason too, in case he has any input from the network kernel stack point of view on why this could be happening, or whether there's an easy way to debug / justify why the port is released more slowly. I don't remember any arch-specific code in this path.

Whatever the cause, the workaround is extremely easy even if a customer triggers this - just wait 10 seconds. Not so bad for postcopy recovery, since we don't expect to use it unless there is already a network failure.

Comment 7 Li Xiaohui 2023-03-19 11:29:19 UTC
(In reply to Thomas Huth from comment #4)
> Li, do you know whether this is a regression? I.e. did this used to work in
> the past? If so, do you know which qemu-kvm version was still OK?

I have tried qemu-kvm-6.2.0-13.el9 and qemu-kvm-7.0.0-13.el9.s390x; both hit this bug on s390x.

Comment 14 Li Xiaohui 2023-06-06 10:02:27 UTC
Hi Peter,
I tried the test case for this bug on x86 with RHEL 9.3 and also hit the issue (kernel-5.14.0-316.el9.x86_64 & qemu-kvm-8.0.0-4.el9.x86_64), but only once; retrying 10 times didn't reproduce it. I will keep tracking this bug on x86.

So I believe this bug applies to all hardware.

Comment 15 Peter Xu 2023-06-06 12:50:40 UTC
Xiaohui,

Thanks for further providing the results on x86.  Maybe it's not really a bug, but just that the system will take time to shut off a port, and before that reusing is probably just not possible on the same one.  Then it seems for some reason s390 is just even slower on releasing so it's easier to reproduce.

Here for qmp "migrate-pause" it does a shutdown() syscall on src socket.  I'd boldly guess the port can be released even later if the network was interrupted not as graceful.
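
Not QEMU's code, just a minimal Python illustration of the generic condition behind the error in step 7: a second listener cannot bind to a port while the kernel still holds another socket bound to it (shown here with the first listener simply left open; after a shutdown/close the kernel can likewise keep the port busy for a while).

import errno, socket

PORT = 4000   # the port the recovery tried to reuse

# Stand-in for the original incoming/recovery listener on the destination.
first = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
first.bind(("::", PORT))
first.listen(1)

# Binding a second listener on the same port fails until the first one
# is fully released - the same condition migrate-recover reported.
second = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
try:
    second.bind(("::", PORT))
except OSError as e:
    assert e.errno == errno.EADDRINUSE
    print(e)      # "[Errno 98] Address already in use" on Linux
finally:
    second.close()
    first.close()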

Daniel, according to above comment, do we still have chance to touch up the document update, removing the s390 specific wordings?

Thanks!
Peter

Comment 16 Daniel Vozenilek 2023-06-06 14:40:46 UTC
Hi Peter, that shouldn't be a problem, we can update the release notes pretty much at any point if there is some correction.
So, just keeping it general like this should work, right?

----
.Recovering an interrupted post-copy VM migration might fail
If a post-copy migration of a virtual machine (VM) is interrupted and then immediately resumed on the same incoming port, the migration might fail with the following error: `Address already in use`

To work around this problem, wait at least 10 seconds before resuming the post-copy migration or switch to another port for migration recovery.
----
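
For illustration, the note's second workaround (switching to another port) is just the same migrate-recover call with a new URI. A fragment building on the hypothetical qmp_command()/f sketch added after the bug description, with port 4001 assumed to be free:

# Port 4001 is an assumption; the interrupted migration was using 4000.
reply = qmp_command(f, {"exec-oob": "migrate-recover",
                        "arguments": {"uri": "tcp:[::]:4001"},
                        "id": "recover-alt-port"})
assert "error" not in reply, reply["error"]["desc"]

The source side would then reconnect by issuing its "migrate" command again with "resume": true and the new port in the URI.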

Thanks for letting me know about it!
Daniel

Comment 17 Peter Xu 2023-06-06 15:08:00 UTC
ACK, thanks!

Comment 18 Daniel Vozenilek 2023-06-06 15:45:28 UTC
As per the previous discussion, I updated the release note text to be generic and removed the mention of any specific architecture.

Comment 19 Li Xiaohui 2023-06-08 06:21:22 UTC
(In reply to Peter Xu from comment #15)
> Xiaohui,
> 
> Thanks for further providing the results on x86.  Maybe it's not really a
> bug, but just that the system will take time to shut off a port, and before
> that reusing is probably just not possible on the same one.  Then it seems
> for some reason s390 is just even slower on releasing so it's easier to
> reproduce.

Thank you Peter.

Then shall we update the case in our test plan to add some sleep time if we think it's not really a bug?
How do you think if I add 10 seconds for it?

> 
> Here for qmp "migrate-pause" it does a shutdown() syscall on src socket. 
> I'd boldly guess the port can be released even later if the network was
> interrupted not as graceful.
> 
> Daniel, according to above comment, do we still have chance to touch up the
> document update, removing the s390 specific wordings?
> 
> Thanks!
> Peter

Comment 20 Li Xiaohui 2023-06-08 06:24:48 UTC
Updating the Hardware field to All, since this bug also reproduces on x86.

Comment 21 Peter Xu 2023-06-08 14:33:29 UTC
(In reply to Li Xiaohui from comment #19)
> Then shall we update the case in our test plan to add some sleep time if we
> think it's not really a bug?
> How do you think if I add 10 seconds for it?

A delay makes sense. Depending on whether you want to make it more efficient, instead of waiting 10s you can make a maximum timeout (say, 60sec?), but retry every 2 sec?
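
A sketch of that suggestion for the test plan; qmp_command below stands for whatever hypothetical helper sends one QMP command to the destination monitor and returns the parsed reply.

import time

def recover_with_retry(qmp_command, uri="tcp:[::]:4000",
                       timeout=60.0, interval=2.0):
    # Retry migrate-recover every `interval` seconds until it is accepted
    # or `timeout` seconds have passed.
    deadline = time.monotonic() + timeout
    while True:
        reply = qmp_command({"exec-oob": "migrate-recover",
                             "arguments": {"uri": uri}, "id": "recover"})
        if "error" not in reply:
            return reply                       # recovery accepted
        busy = "Address already in use" in reply["error"]["desc"]
        if not busy or time.monotonic() >= deadline:
            raise RuntimeError(reply["error"]["desc"])
        time.sleep(interval)                   # port not released yet; retry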

Comment 22 Li Xiaohui 2023-06-09 01:21:28 UTC
(In reply to Peter Xu from comment #21)
> (In reply to Li Xiaohui from comment #19)
> > Then shall we update the case in our test plan to add some sleep time if we
> > think it's not really a bug?
> > How do you think if I add 10 seconds for it?
> 
> A delay makes sense. Depending on whether you want to make it more
> efficient, instead of waiting 10s you can make a maximum timeout (say,
> 60sec?), but retry every 2 sec?

Good suggestion. I will do so. Thanks

Comment 23 RHEL Program Management 2023-09-22 13:08:53 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 24 RHEL Program Management 2023-09-22 13:10:45 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

