Bug 2096143 - The migration port is not released when it is reused to recover postcopy migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Peter Xu
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks: 2091421 2097652
 
Reported: 2022-06-13 03:23 UTC by Li Xiaohui
Modified: 2022-11-15 10:19 UTC (History)
CC List: 13 users

Fixed In Version: qemu-kvm-7.0.0-8.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2097652 (view as bug list)
Environment:
Last Closed: 2022-11-15 09:54:42 UTC
Type: ---
Target Upstream Version:
Embargoed:




Links
- Gitlab: redhat/centos-stream/src qemu-kvm merge request 104, "migration: Allow migrate-recover to run multiple times" (opened; last updated 2022-06-20 18:49:11 UTC)
- Red Hat Issue Tracker: RHELPLAN-125056 (last updated 2022-06-13 03:27:21 UTC)
- Red Hat Product Errata: RHSA-2022:7967 (last updated 2022-11-15 09:55:16 UTC)

Internal Links: 2178376

Description Li Xiaohui 2022-06-13 03:23:36 UTC
Description of problem:
Start migration using one migration port and switch to postcopy. On hitting a network issue, postcopy becomes paused. After fixing the network issue, reuse the migration port on the destination to recover the postcopy migration: the port is not released and cannot be used again:
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.130.222:1234"}}
{"timestamp": {"seconds": 1655039272, "microseconds": 206642}, "event": "MIGRATION", "data": {"status": "setup"}}
{"error": {"class": "GenericError", "desc": "Failed to find an available port: Address already in use"}}


Version-Release number of selected component (if applicable):
hosts: kernel-5.14.0-105.el9.x86_64 && qemu-kvm-7.0.0-5.el9.x86_64


How reproducible:
100%


Steps to Reproduce:
1.Boot a guest on src host;
2.Boot a guest with same commands as src host but append '-incoming defer' on dst host;
3.Enable qmp capabilities on src and dst host with 'oob':
{"execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }
4.Enable postcopy-ram on src and dst host:
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":true}]}}
5.Set postcopy speed on src host:
{"execute":"migrate-set-parameters","arguments":{"max-postcopy-bandwidth": 5242880}}
6.Set migrate incoming on dst host:
{"execute":"migrate-incoming","arguments":{"uri":"tcp:[::]:1234"}}
7.Start migration, when it's active, change into postcopy mode on src host:
{"execute":"migrate", "arguments":{"uri":"tcp:192.168.130.222:1234"}}
{"execute":"migrate-start-postcopy"}
8.Down migration network:
# nmcli connection down enp96s0f1np1 
9.Recover network on dst host:
# ifconfig enp96s0f1np1 192.168.130.222 netmask 255.255.255.0
10.Recover postcopy migration on dst host:
(dst qemu) {"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.130.222:1234"}}
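The QMP calls in the steps above can be generated programmatically. A minimal Python sketch (standard library only) that builds the payloads for steps 3-10; the helper name `qmp_cmd` is illustrative, and the addresses are the example values from the steps, not real endpoints:

```python
import json

def qmp_cmd(name, arguments=None, oob=False):
    """Build a QMP command payload; out-of-band commands use "exec-oob"."""
    cmd = {"exec-oob" if oob else "execute": name}
    if arguments is not None:
        cmd["arguments"] = arguments
    return json.dumps(cmd)

URI = "tcp:192.168.130.222:1234"  # example values from the reproduction steps

steps = [
    qmp_cmd("qmp_capabilities", {"enable": ["oob"]}),
    qmp_cmd("migrate-set-capabilities",
            {"capabilities": [{"capability": "postcopy-ram", "state": True}]}),
    qmp_cmd("migrate-set-parameters", {"max-postcopy-bandwidth": 5242880}),
    qmp_cmd("migrate-incoming", {"uri": "tcp:[::]:1234"}),   # on dst
    qmp_cmd("migrate", {"uri": URI}),                         # on src
    qmp_cmd("migrate-start-postcopy"),                        # on src
    qmp_cmd("migrate-recover", {"uri": URI}, oob=True),       # step 10, on dst
]

for line in steps:
    print(line)
```

Feeding each line to the respective QMP monitor socket reproduces the flow; only `migrate-recover` is sent out-of-band.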


Actual results:
Hit error when execute step 10:
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.130.222:1234"}}
{"timestamp": {"seconds": 1655039272, "microseconds": 206642}, "event": "MIGRATION", "data": {"status": "setup"}}
{"error": {"class": "GenericError", "desc": "Failed to find an available port: Address already in use"}}

The migration port is not released after its first use, so reusing it to recover the postcopy migration fails:
[root@dell-per740xd-03 ipv4]# netstat -tunap|grep 1234
tcp6       0      0 :::1234                 :::*                    LISTEN      292903/qemu-kvm  
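The leaked LISTEN socket above is ordinary EADDRINUSE behavior: a second bind to a port fails while an earlier listener still holds it, and succeeds once that listener is closed. A standalone Python illustration of the symptom (not QEMU code; `can_listen` is a hypothetical helper):

```python
import socket

def can_listen(port):
    """Try to bind-and-listen on 127.0.0.1:port; True on success."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        s.listen()
        return True
    except OSError:          # EADDRINUSE while the old listener is alive
        return False
    finally:
        s.close()

# Simulate the leaked migration listener.
leaked = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
leaked.bind(("127.0.0.1", 0))      # let the kernel pick a free port
leaked.listen()
port = leaked.getsockname()[1]

print(can_listen(port))   # False: "Address already in use", like migrate-recover
leaked.close()            # what releasing the stale listener achieves
print(can_listen(port))   # True: the port can be reused
```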


Expected results:
can use the migration port again


Additional info:
Will confirm whether this reproduces on RHEL 8.7.0, i.e. whether it is a regression.

Comment 1 Peter Xu 2022-06-13 13:44:33 UTC
Xiaohui, a few questions/points:

1. What's the network issue you were encountering?
2. Have you tried waiting for some longer time before sending the "migrate-recover" command?  E.g., 10 min.
   I remember TCP has some timeout mechanism for some kind of network failures, I don't remember the details but I think 10 min will cover that.
3. Yes it'll be great if you could figure out whether it's a regression (as you mentioned in the "Additional Info").

Thanks,

Comment 2 Li Xiaohui 2022-06-13 14:41:46 UTC
(In reply to Peter Xu from comment #1)
> Xiaohui, a few questions/points:
> 
> 1. What's the network issue you were encountering?

I tried two methods to move postcopy from active to paused status:
1) Introduce a network issue, e.g. take down the migration network on the destination host:
# nmcli connection down enp96s0f1np1
2) Use '{"exec-oob":"migrate-pause"}' to pause postcopy manually on the source host.

Both ways hit the bug.

> 2. Have you tried waiting for some longer time before sending the
> "migrate-recover" command?  E.g., 10 min.
>    I remember TCP has some timeout mechanism for some kind of network
> failures, I don't remember the details but I think 10 min will cover that.

I didn't try waiting a long time such as 10 minutes; I can do that if needed.
But I remember that when we take down the network, we need to wait around 30 minutes for the TCP timeout before postcopy pauses.
For network recovery, I don't think we need to wait (I also didn't wait in previous tests; they passed and didn't hit this bug).

> 3. Yes it'll be great if you could figure out whether it's a regression (as
> you mentioned in the "Additional Info").

I will try tomorrow.

> 
> Thanks,

Comment 3 Peter Xu 2022-06-13 15:50:27 UTC
(In reply to Li Xiaohui from comment #2)
> (In reply to Peter Xu from comment #1)
> > Xiaohui, a few questions/points:
> > 
> > 1. What's the network issue you were encountering?
> 
> I tried two methods to make postcopy active into pause status:
> 1) make network issue like down the migration network on destination host: 
> # nmcli connection down enp96s0f1np1 
> 2) use '{"exec-oob":"migrate-pause"}' to pause the postcopy by manual on
> source host.
> 
> Above two ways all hit the bug.

Oh, that's weird; then obviously I cannot reproduce this with upstream QEMU, because I use method 2) a lot.

> 
> > 2. Have you tried waiting for some longer time before sending the
> > "migrate-recover" command?  E.g., 10 min.
> >    I remember TCP has some timeout mechanism for some kind of network
> > failures, I don't remember the details but I think 10 min will cover that.
> 
> I didn't tried to wait long time such as 10 mins. If need I can do it.
> But I rem when we down the network, we need to wait such as 30 mins to tcp
> timeout so that we can get postcopy pause. 
> For network recovery, I don't think we need wait(I also didn't wait in
> previous tests but they pass and didn't hit such a bug)

Yes, migrate-pause is the safe/fast/clean way to close it, and IIUC it does not require any major timeout.  But I used to try failing the network in strange ways, and under some conditions I could get a dead TCP port that requires the timeout; I don't think that's as long as 30 min, but I may be misremembering.

So maybe turning down the NIC may hit it; I didn't really check.  Anyway, if it can recover after a few tens of minutes, then at least it means it's not QEMU that leaked the port.  Not urgently needed to verify, as..

> 
> > 3. Yes it'll be great if you could figure out whether it's a regression (as
> > you mentioned in the "Additional Info").
> 
> I will try tmr.

.. this could help us to figure out something already, since I'd hope rhel8 doesn't have this issue.

Comment 4 Li Xiaohui 2022-06-14 13:10:37 UTC
Also hit this bug on RHEL 8.7.0 (kernel-4.18.0-393.el8.x86_64 && qemu-kvm-6.2.0-14.module+el8.7.0+15289+26b4351e.x86_64) with the {"exec-oob":"migrate-pause"} method:

After waiting 2 hours after postcopy paused, trying the same migration port again still hits the error:
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:$dst_host_ip:1234"}}
{"error": {"class": "GenericError", "desc": "Failed to find an available port: Address already in use"}}

Comment 5 Li Xiaohui 2022-06-14 13:42:55 UTC
Didn't hit this bug on the same hosts with qemu-kvm-6.0.0-33.module+el8.5.0+13740+349232b6.2.x86_64, so it should be a regression.


Shall I clone this bug for RHEL 8.7.0?

Comment 6 Peter Xu 2022-06-15 23:06:15 UTC
(In reply to Li Xiaohui from comment #5)
> Didn't hit this bug on same hosts with
> qemu-kvm-6.0.0-33.module+el8.5.0+13740+349232b6.2.x86_64. So it should be a
> regression bug.

How about rhel8.6?  

Would it be easy for you to figure out which package got it regressed first (aka, an initial bisection)?

> 
> Shall I clone this bug for RHEL 8.7.0?

Feel free to.  Thanks,

Comment 7 Li Xiaohui 2022-06-16 08:45:54 UTC
(In reply to Peter Xu from comment #6)
> (In reply to Li Xiaohui from comment #5)
> > Didn't hit this bug on same hosts with
> > qemu-kvm-6.0.0-33.module+el8.5.0+13740+349232b6.2.x86_64. So it should be a
> > regression bug.
> 
> How about rhel8.6?  

qemu-kvm-6.1.0-1.module+el8.6.0+12535+4e2af250.x86_64 introduced this bug. Can't reproduce on/before qemu-kvm-6.0.0-29.module+el8.6.0+12490+ec3e565c.x86_64.

> 
> Would it be easy for you to figure out which package got it regressed first
> (aka, an initial bisection)?

See above.

> 
> > 
> > Shall I clone this bug for RHEL 8.7.0?
> 
> Feel free to.  Thanks,

Thanks

Comment 8 Peter Xu 2022-06-16 15:10:16 UTC
Thanks, Xiaohui.

I think it's because of this commit, introduced between v6.0.0 and v6.1.0:

a59136f3b1 migration/socket: Close the listener at the end

And the latest upstream doesn't hit that because there's already a fix:

08401c0426 migration: Allow migrate-recover to run multiple times
(though we may also need to fetch some of its deps)

I'll cook a rhel9 build soon to verify.
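Conceptually, the fix allows migrate-recover to run multiple times by releasing any previous listening socket before binding again; without that, the stale listener from the first attempt keeps the port and every retry fails with EADDRINUSE. A hypothetical Python sketch of that pattern (class and method names are illustrative, not QEMU's):

```python
import socket

class RecoverListener:
    """Illustrative: re-create the listening socket on every recover attempt."""

    def __init__(self, host, port):
        self.host, self.port = host, port
        self.sock = None

    def recover(self):
        # Key idea of the fix: drop the stale listener first, then rebind.
        if self.sock is not None:
            self.sock.close()
            self.sock = None
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind((self.host, self.port))
        s.listen()
        self.sock = s

# Grab a free port for the demo, then recover twice on it.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

l = RecoverListener("127.0.0.1", port)
l.recover()
l.recover()   # succeeds again: the previous listener was released
l.sock.close()
```

With the buggy behavior (never closing the old listener), the second `recover()` would raise "Address already in use".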

Comment 10 Li Xiaohui 2022-06-17 07:24:41 UTC
I have downloaded the qemu-kvm scratch build from Comment 9 and will test it next week, because I'm on PTO today.

Comment 11 Li Xiaohui 2022-06-20 08:06:52 UTC
Retested this bug according to the Description; migrate-recover succeeds with qemu-img-7.0.0-6.el9.postcopy_port_reset.x86_64:
1. After postcopy is paused, check migration port 1234 on the dst host; it still exists:
[root@lenovo-sr630-01 ~]# netstat -tunap | grep 1234
tcp        0      0 10.73.178.67:1234       0.0.0.0:*               LISTEN      2145614/qemu-kvm   
2. Recover the postcopy migration; migration port 1234 is relaunched successfully on the dst host:
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:$dst_host_ip:1234"}}
{"return": {}}
3.Start postcopy again on src host:
{"execute":"migrate", "arguments":{"uri":"tcp:$dst_host_ip:1234", "resume":true}}
{"return": {}}

Comment 12 Peter Xu 2022-06-20 18:04:44 UTC
Thanks, Xiaohui.  I'll prepare a MR soon.

Comment 16 Li Xiaohui 2022-06-23 04:10:58 UTC
Note: I can reproduce this bug on s390x (kernel-5.14.0-114.el9.s390x && qemu-kvm-7.0.0-6.el9.s390x), so it's not an x86-only bug.

Comment 17 Peter Xu 2022-06-23 13:11:28 UTC
(In reply to Li Xiaohui from comment #16)
> Note: I can reproduce this bug on s390x (kernel-5.14.0-114.el9.s390x &&
> qemu-kvm-7.0.0-6.el9.s390x), so it's not x86 only bug

Yes, it's a generic QEMU bug.

Comment 19 Li Xiaohui 2022-07-06 10:47:49 UTC
Verified this bug on the latest RHEL 9.1.0 (kernel-5.14.0-121.el9.x86_64 & qemu-kvm-7.0.0-8.el9.x86_64).


Test steps are the same as Comment 11. After recovering postcopy, when the migration status is postcopy-active, port 1234 (the migration incoming port) looks like this on the destination host:
# netstat -tunap | grep 1234
tcp        0      0 10.73.130.69:1234       0.0.0.0:*               LISTEN      38502/qemu-kvm      
tcp        0      0 10.73.130.69:1234       10.73.130.67:37408      ESTABLISHED 38502/qemu-kvm 

Postcopy recovery succeeds with the reused migration port, and migration finishes successfully.


So I will mark this bug as verified per the above test results once it reaches Verified:Tested.


Peter, can we adjust the postcopy recovery test case?
1) Current scenario: use a real network issue to make postcopy pause. It's not very convenient for testing and is not automated.
2) I want to update the scenario to use the '{"exec-oob":"migrate-pause"}' QMP command to make postcopy pause. That will help testing and automation.

Comment 20 Peter Xu 2022-07-07 15:13:36 UTC
(In reply to Li Xiaohui from comment #19)
> Peter, can we adjust the postcopy recovery case:
> 1) current scenario: use real network issue to make postcopy pause. it's not
> much convenient for test and it's not automated
> 2) I want to update the scenario, use '{"exec-oob":"migrate-pause"}' qmp
> command to make postcopy pause. It will help test and automation.

I think it's okay to use "migrate-pause" to trigger it for the automated test.  Note that you don't even need an OOB message for the pause command; only the "migrate-recover" command needs to be sent as an OOB message.

It's just that "migrate-pause" cleanly shuts down the sockets, so it's a rather friendly failure.  Some other automatable options: (1) use one "nc" process to forward the migration, then kill the "nc" process at any time with SIGKILL, or (2) take down the migration port using nmcli.

But we can leave any option for later.
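The "nc" relay in option (1) can also be approximated in Python: a one-shot TCP relay whose process can be SIGKILLed at any time to simulate an abrupt mid-migration failure. A sketch under that assumption (`relay_once` and `_pipe` are illustrative names, not part of any test suite):

```python
import socket
import threading

def _pipe(src, dst):
    """Copy bytes src -> dst until EOF, then half-close dst's write side."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def relay_once(listen_sock, target):
    """Accept one connection on listen_sock and proxy it to target.

    SIGKILLing the process running this relay drops the stream abruptly,
    much like killing an 'nc' forwarder mid-migration."""
    conn, _ = listen_sock.accept()
    upstream = socket.create_connection(target)
    t = threading.Thread(target=_pipe, args=(conn, upstream), daemon=True)
    t.start()
    _pipe(upstream, conn)   # relay the return direction in this thread
    t.join()
    conn.close()
    upstream.close()
```

The source would migrate to the relay's listening port instead of the destination directly; killing the relay then forces postcopy into the paused state on demand.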

Comment 21 Yanan Fu 2022-07-11 06:30:26 UTC
QE bot (pre-verify): set 'Verified:Tested,SanityOnly' as the gating/tier1 tests pass.

Comment 22 Li Xiaohui 2022-07-11 06:50:52 UTC
We have stable reproduction steps for this bug, so remove 'SanityOnly' from Verified.

Comment 23 Li Xiaohui 2022-07-13 13:11:29 UTC
Hi Camilla, can you help move the bug to ON_QA according to Comment 21?

I have finished verifying this bug (see Comment 19); I'm waiting to mark it as Verified once the bug is attached to an errata.

Comment 26 Li Xiaohui 2022-07-14 04:02:16 UTC
Marked verified per Comment 19.

Comment 28 errata-xmlrpc 2022-11-15 09:54:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7967

