Bug 2097652 - The migration port is not released if use it again for recovering postcopy migration
Summary: The migration port is not released if use it again for recovering postcopy mi...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: qemu-kvm
Version: 8.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Peter Xu
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On: 2096143
Blocks: 2089955
 
Reported: 2022-06-16 08:49 UTC by Li Xiaohui
Modified: 2022-11-08 09:45 UTC
CC List: 13 users

Fixed In Version: qemu-kvm-6.2.0-17.module+el8.7.0+15924+b11d8c3f
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2096143
Environment:
Last Closed: 2022-11-08 09:20:10 UTC
Type: ---
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/rhel/src/qemu-kvm qemu-kvm merge_requests 195 0 None None None 2022-06-20 21:19:17 UTC
Red Hat Issue Tracker RHELPLAN-125435 0 None None None 2022-06-16 08:49:41 UTC
Red Hat Product Errata RHSA-2022:7472 0 None None None 2022-11-08 09:21:14 UTC

Description Li Xiaohui 2022-06-16 08:49:14 UTC
+++ This bug was initially created as a clone of Bug #2096143 +++

Description of problem:
Start a migration using one migration port and switch to postcopy, then hit a network issue so that the postcopy migration becomes paused. After fixing the network issue, use the same migration port on the destination to recover the postcopy migration; the port is not released and cannot be used again:
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.130.222:1234"}}
{"timestamp": {"seconds": 1655039272, "microseconds": 206642}, "event": "MIGRATION", "data": {"status": "setup"}}
{"error": {"class": "GenericError", "desc": "Failed to find an available port: Address already in use"}}


Version-Release number of selected component (if applicable):
hosts: kernel-5.14.0-105.el9.x86_64 && qemu-kvm-7.0.0-5.el9.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Boot a guest on the src host;
2. Boot a guest on the dst host with the same command line as on the src host, appending '-incoming defer';
3. Enable QMP capabilities with 'oob' on the src and dst hosts:
{"execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }
4. Enable the postcopy-ram capability on the src and dst hosts:
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":true}]}}
5. Set the postcopy bandwidth on the src host:
{"execute":"migrate-set-parameters","arguments":{"max-postcopy-bandwidth": 5242880}}
6. Set up incoming migration on the dst host:
{"execute":"migrate-incoming","arguments":{"uri":"tcp:[::]:1234"}}
7. Start migration on the src host and, once it is active, switch to postcopy mode:
{"execute":"migrate", "arguments":{"uri":"tcp:192.168.130.222:1234"}}
{"execute":"migrate-start-postcopy"}
8. Bring down the migration network:
# nmcli connection down enp96s0f1np1 
9. Recover the network on the dst host:
# ifconfig enp96s0f1np1 192.168.130.222 netmask 255.255.255.0
10. Recover the postcopy migration on the dst host (a scripted sketch of this flow follows the steps):
(dst qemu) {"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.130.222:1234"}}
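
For reference, the destination-side steps above can be scripted over a QMP socket. The following is only a minimal illustrative sketch, not part of the original report: it assumes the dst QEMU also exposes a QMP monitor at the hypothetical UNIX socket /tmp/dst-qmp.sock (e.g. started with '-qmp unix:/tmp/dst-qmp.sock,server=on,wait=off'), and reuses the address and port from the report.

#!/usr/bin/env python3
# Sketch only: drive the destination-side QMP steps of the reproduction.
# The socket path is a placeholder; the URI matches the report.
import json
import socket

QMP_SOCK = "/tmp/dst-qmp.sock"            # hypothetical '-qmp unix:...' monitor
RECOVER_URI = "tcp:192.168.130.222:1234"  # same port as the incoming URI

def qmp_connect(path):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    f = s.makefile("rw")
    json.loads(f.readline())              # consume the QMP greeting
    return s, f

def qmp_cmd(f, cmd, args=None, oob=False):
    msg = {"exec-oob" if oob else "execute": cmd}
    if args:
        msg["arguments"] = args
    f.write(json.dumps(msg) + "\n")
    f.flush()
    while True:                           # skip async events (e.g. MIGRATION)
        reply = json.loads(f.readline())
        if "return" in reply or "error" in reply:
            return reply

sock, f = qmp_connect(QMP_SOCK)
# Step 3: negotiate capabilities with OOB enabled
print(qmp_cmd(f, "qmp_capabilities", {"enable": ["oob"]}))
# Step 4: enable postcopy-ram
print(qmp_cmd(f, "migrate-set-capabilities",
              {"capabilities": [{"capability": "postcopy-ram", "state": True}]}))
# Step 6: listen for the incoming migration on port 1234
print(qmp_cmd(f, "migrate-incoming", {"uri": "tcp:[::]:1234"}))
# Steps 7-9 happen on the src host / network; once postcopy is paused:
# Step 10: recover on the same port (this is where the report hits
# "Failed to find an available port: Address already in use")
print(qmp_cmd(f, "migrate-recover", {"uri": RECOVER_URI}, oob=True))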


Actual results:
Hit an error when executing step 10:
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.130.222:1234"}}
{"timestamp": {"seconds": 1655039272, "microseconds": 206642}, "event": "MIGRATION", "data": {"status": "setup"}}
{"error": {"class": "GenericError", "desc": "Failed to find an available port: Address already in use"}}

Found that the migration port is not released after the first use, so it fails to be reused for recovering the postcopy migration:
[root@dell-per740xd-03 ipv4]# netstat -tunap|grep 1234
tcp6       0      0 :::1234                 :::*                    LISTEN      292903/qemu-kvm  
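
The same symptom can be confirmed outside of QEMU: while the stale listener above still holds the port, any attempt to bind it again fails with EADDRINUSE, which matches the error migrate-recover returns. A minimal sketch for illustration only, assuming the port from this report:

import errno
import socket

s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
try:
    s.bind(("::", 1234))   # the port the paused migration still holds
    print("port 1234 is free")
except OSError as e:
    # Expected while the bug is present: errno.EADDRINUSE
    print("bind failed:", e, "(EADDRINUSE)" if e.errno == errno.EADDRINUSE else "")
finally:
    s.close()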


Expected results:
The migration port can be used again to recover the postcopy migration.


Additional info:
Will confirm whether RHEL 8.7.0 can reproduce this and whether it is a regression.

--- Additional comment from Peter Xu on 2022-06-13 21:44:33 CST ---

Xiaohui, a few questions/points:

1. What's the network issue you were encountering?
2. Have you tried waiting for some longer time before sending the "migrate-recover" command?  E.g., 10 min.
   I remember TCP has some timeout mechanism for some kind of network failures, I don't remember the details but I think 10 min will cover that.
3. Yes it'll be great if you could figure out whether it's a regression (as you mentioned in the "Additional Info").

Thanks,

--- Additional comment from Li Xiaohui on 2022-06-13 22:41:46 CST ---

(In reply to Peter Xu from comment #1)
> Xiaohui, a few questions/points:
> 
> 1. What's the network issue you were encountering?

I tried two methods to make an active postcopy migration go into paused status:
1) introduce a network issue, e.g. bring down the migration network on the destination host:
# nmcli connection down enp96s0f1np1 
2) use '{"exec-oob":"migrate-pause"}' to pause the postcopy manually on the source host.

Both ways hit the bug.

> 2. Have you tried waiting for some longer time before sending the
> "migrate-recover" command?  E.g., 10 min.
>    I remember TCP has some timeout mechanism for some kind of network
> failures, I don't remember the details but I think 10 min will cover that.

I didn't try waiting a long time such as 10 minutes. If needed, I can do it.
But I remember that when we bring down the network, we need to wait around 30 minutes for the TCP timeout before postcopy becomes paused.
For network recovery, I don't think we need to wait (I also didn't wait in previous tests, but they passed and didn't hit such a bug).

> 3. Yes it'll be great if you could figure out whether it's a regression (as
> you mentioned in the "Additional Info").

I will try tomorrow.

> 
> Thanks,

--- Additional comment from Peter Xu on 2022-06-13 23:50:27 CST ---

(In reply to Li Xiaohui from comment #2)
> (In reply to Peter Xu from comment #1)
> > Xiaohui, a few questions/points:
> > 
> > 1. What's the network issue you were encountering?
> 
> I tried two methods to make an active postcopy migration go into paused status:
> 1) introduce a network issue, e.g. bring down the migration network on the destination host:
> # nmcli connection down enp96s0f1np1 
> 2) use '{"exec-oob":"migrate-pause"}' to pause the postcopy manually on the
> source host.
> 
> Both ways hit the bug.

Oh that's weird, then obviously I cannot reproduce this with upstream QEMU because I use 2) a lot.

> 
> > 2. Have you tried waiting for some longer time before sending the
> > "migrate-recover" command?  E.g., 10 min.
> >    I remember TCP has some timeout mechanism for some kind of network
> > failures, I don't remember the details but I think 10 min will cover that.
> 
> I didn't try waiting a long time such as 10 minutes. If needed, I can do it.
> But I remember that when we bring down the network, we need to wait around 30
> minutes for the TCP timeout before postcopy becomes paused.
> For network recovery, I don't think we need to wait (I also didn't wait in
> previous tests, but they passed and didn't hit such a bug).

Yes, migrate-pause is the safe/fast/clean way to close it, and IIUC it does not require any major timeout. But I think I used to try failing the network in strange ways, and in some conditions I could end up with a dead TCP port that requires the timeout; I don't think that's as long as 30 min, but I may have misremembered.

So maybe bringing down the NIC can hit it; I didn't really check. Anyway, if it can recover after a few tens of minutes, then at least it means it's not QEMU that leaked it. Not urgently needed to verify, as..

> 
> > 3. Yes it'll be great if you could figure out whether it's a regression (as
> > you mentioned in the "Additional Info").
> 
> I will try tomorrow.

.. this could help us to figure out something already, since I'd hope rhel8 doesn't have this issue.

--- Additional comment from Li Xiaohui on 2022-06-14 21:10:37 CST ---

Also hit this bug on RHEL 8.7.0 (kernel-4.18.0-393.el8.x86_64 && qemu-kvm-6.2.0-14.module+el8.7.0+15289+26b4351e.x86_64) with the {"exec-oob":"migrate-pause"} method:

After waiting 2 hours after postcopy paused and trying again with the same migration port, still hit the error:
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:$dst_host_ip:1234"}}
{"error": {"class": "GenericError", "desc": "Failed to find an available port: Address already in use"}}

--- Additional comment from Li Xiaohui on 2022-06-14 21:42:55 CST ---

Didn't hit this bug on same hosts with qemu-kvm-6.0.0-33.module+el8.5.0+13740+349232b6.2.x86_64. So it should be a regression bug.


Shall I clone this bug for RHEL 8.7.0?

--- Additional comment from Peter Xu on 2022-06-16 07:06:15 CST ---

(In reply to Li Xiaohui from comment #5)
> Didn't hit this bug on same hosts with
> qemu-kvm-6.0.0-33.module+el8.5.0+13740+349232b6.2.x86_64. So it should be a
> regression bug.

How about rhel8.6?  

Would it be easy for you to figure out which package got it regressed first (aka, an initial bisection)?

> 
> Shall I clone this bug for RHEL 8.7.0?

Feel free to.  Thanks,

--- Additional comment from Li Xiaohui on 2022-06-16 16:45:54 CST ---

(In reply to Peter Xu from comment #6)
> (In reply to Li Xiaohui from comment #5)
> > Didn't hit this bug on same hosts with
> > qemu-kvm-6.0.0-33.module+el8.5.0+13740+349232b6.2.x86_64. So it should be a
> > regression bug.
> 
> How about rhel8.6?  

qemu-kvm-6.1.0-1.module+el8.6.0+12535+4e2af250.x86_64 introduced this bug. Can't reproduce it on or before qemu-kvm-6.0.0-29.module+el8.6.0+12490+ec3e565c.x86_64.

> 
> Would it be easy for you to figure out which package got it regressed first
> (aka, an initial bisection)?

See above.

> 
> > 
> > Shall I clone this bug for RHEL 8.7.0?
> 
> Feel free to.  Thanks,

Thanks

Comment 3 Camilla Conte 2022-07-12 13:07:36 UTC
Fix included in qemu-kvm-6.2.0-17.el8

Fixed by merge request 'migration: Allow migrate-recover to run multiple times' ( https://gitlab.com/redhat/rhel/src/qemu-kvm/qemu-kvm/-/merge_requests/195 )

Comment 7 Li Xiaohui 2022-07-12 15:26:10 UTC
Hi Camilla, why don't we have "qemu-kvm-6.2.0-17.el8" in "Fixed In Version"?

Comment 8 Yanan Fu 2022-07-13 09:14:23 UTC
QE bot (pre verify): Set 'Verified:Tested,SanityOnly' as the gating/tier1 tests pass.

Comment 9 Li Xiaohui 2022-07-15 11:59:54 UTC
Verified this bug on hosts (kernel-4.18.0-408.el8.x86_64 & qemu-kvm-6.2.0-17.module+el8.7.0+15924+b11d8c3f.x86_64); it passes.


Marking as verified per the test result, and removing 'SanityOnly' from 'Verified' since we have test steps to reproduce the bug.
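
As a rough illustration of what the verification checks (not the exact QE procedure), the hypothetical qmp_cmd() helper sketched under the reproduction steps can be reused: on the fixed build, migrate-recover on the original port should return an empty success reply instead of the "Address already in use" GenericError.

# Sketch only, reusing the hypothetical qmp_cmd() helper from the earlier sketch.
reply = qmp_cmd(f, "migrate-recover",
                {"uri": "tcp:192.168.130.222:1234"}, oob=True)
# On qemu-kvm-6.2.0-17 and later this is expected to be {"return": {}};
# on affected builds it was {"error": {"class": "GenericError", ...}}.
assert "error" not in reply, reply["error"]["desc"]
print("migrate-recover accepted the original port again")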

Comment 11 errata-xmlrpc 2022-11-08 09:20:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7472

