This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to receive updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry; the e-mail creates a ServiceNow ticket with Red Hat.

Migrated Bugzilla bugs will be moved to status "CLOSED", resolution "MIGRATED", and have "MigratedToJIRA" set in "Keywords". The link to the successor Jira issue can be found under "Links", has a little "two-footprint" icon next to it, and directs you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link is also available in a blue banner at the top of the page informing you that the bug has been migrated.
Bug 2210788 - postcopy/postcopy-preempt can't recover after handling a network failure
Summary: postcopy/postcopy-preempt can't recover after handling a network failure
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Peter Xu
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-29 13:51 UTC by Li Xiaohui
Modified: 2023-09-22 17:54 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-22 17:54:30 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links:
Red Hat Issue Tracker RHEL-7539 (Status: Migrated, Last Updated: 2023-09-22 17:54:26 UTC)
Red Hat Issue Tracker RHELPLAN-158444 (Last Updated: 2023-05-29 13:53:12 UTC)

Description Li Xiaohui 2023-05-29 13:51:24 UTC
Description of problem:
postcopy-preempt can't recover after handling a network failure


Version-Release number of selected component (if applicable):
RHEL 9.3.0 (kernel-5.14.0-316.el9.x86_64 && qemu-kvm-8.0.0-4.el9.x86_64)


How reproducible:
100%


Steps to Reproduce:
1. Boot a VM on the src host with the qemu cmd [1], and run stressapptest in the VM:
# stressapptest -M 10000 -s 10000000
2. Boot a VM on the dst host, appending '-incoming defer'
3. Enable the postcopy and postcopy-preempt capabilities on both the src and dst hosts, and set max-postcopy-bandwidth to 5M on the src host
4. Migrate the guest from the src host to the dst host; during migration, switch the migration into the postcopy phase
(dst host) # {"execute":"migrate-incoming","arguments":{"uri":"tcp:[::]:1234"}}
(src host) # {"execute":"migrate", "arguments":{"uri":"tcp:192.168.130.26:1234"}}

5. During the postcopy phase, take the migration network down on the dst host
# ip -f inet addr delete 192.168.130.26/24 dev ibs2f0
6. Recover the network on the dst host
# ifconfig ibs2f0 192.168.130.26 netmask 255.255.255.0
7. Recover the postcopy migration
(dst host) # {"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.130.26:1234"}}

(src host) # {"execute":"migrate", "arguments":{"uri":"tcp:192.168.130.26:1234", "resume":true}}
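Steps 3 and 4 rely on QMP commands that the report doesn't spell out. A minimal sketch of what they might look like follows; the qmp_send stub, the transport, and the byte math are my assumptions, only the capability and parameter names come from QEMU's QMP schema:

```shell
#!/bin/sh
# Sketch of the QMP traffic behind steps 3-4. qmp_send is a stub that just
# prints the JSON; on a real host it would write to the monitor socket
# (e.g. piping through nc to one of the -qmp TCP ports from comment 1).
qmp_send() { printf '%s\n' "$1"; }

# Step 3 (both hosts): enable the postcopy-ram and postcopy-preempt capabilities
CAPS='{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":true},{"capability":"postcopy-preempt","state":true}]}}'

# Step 3 (src host only): max-postcopy-bandwidth takes bytes/sec, so 5M = 5242880
BW='{"execute":"migrate-set-parameters","arguments":{"max-postcopy-bandwidth":5242880}}'

# Step 4 (src host, after migration has started): flip precopy into postcopy
START='{"execute":"migrate-start-postcopy"}'

qmp_send "$CAPS"
qmp_send "$BW"
qmp_send "$START"
```

In a real run, CAPS would be sent to both QMP monitors before migration starts, while BW and START go only to the source.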


Actual results:
After step 7, postcopy-preempt recovers to postcopy-active for a moment (no more than about 1 minute), then automatically changes back to postcopy-paused.
I tried to recover the migration again, but it can't recover successfully.
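One way to timestamp the status flip reported above is to poll query-migrate once a second after issuing the resume and log each transition. A sketch follows; the get_status stub returns a canned sequence standing in for real QMP polling (a real run would extract .status from query-migrate responses, e.g. with jq), and the timings are illustrative, not from the report:

```shell
#!/bin/sh
# Canned status source: active for the first 3 polls, then paused.
# Replace with a real query-migrate call against the source QMP monitor.
get_status() {
  case $1 in
    1|2|3) echo postcopy-active ;;
    *)     echo postcopy-paused ;;
  esac
}

# Print a line every time the migration status changes.
watch_transitions() {
  last=""
  for t in 1 2 3 4 5; do
    s=$(get_status "$t")
    if [ "$s" != "$last" ]; then
      echo "t=${t}s status=$s"
      last=$s
    fi
  done
}

watch_transitions
```

With the canned sequence this prints two lines, one for the initial postcopy-active state and one for the drop to postcopy-paused.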


Expected results:
Postcopy-preempt migration recovers after the network failure is handled.


Additional info:

Comment 1 Li Xiaohui 2023-05-29 13:53:54 UTC
Qemu cmd:
/usr/libexec/qemu-kvm  \
-name "mouse-vm" \
-sandbox on \
-machine q35,memory-backend=pc.ram,pflash0=drive_ovmf_code,pflash1=drive_ovmf_vars \
-cpu EPYC-Milan \
-nodefaults  \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server=on,wait=off \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server=on,wait=off \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device '{"driver":"pcie-root-port","id":"root0","multifunction":true,"bus":"pcie.0","addr":"0x2","chassis":1}' \
-device '{"driver":"pcie-root-port","id":"root1","port":11,"addr":"0x2.0x1","bus":"pcie.0","chassis":2}' \
-device '{"driver":"pcie-root-port","id":"root2","port":12,"addr":"0x2.0x2","bus":"pcie.0","chassis":3}' \
-device '{"driver":"pcie-root-port","id":"root3","port":13,"addr":"0x2.0x3","bus":"pcie.0","chassis":4}' \
-device '{"driver":"pcie-root-port","id":"root4","port":14,"addr":"0x2.0x4","bus":"pcie.0","chassis":5}' \
-device '{"driver":"pcie-root-port","id":"root5","port":15,"addr":"0x2.0x5","bus":"pcie.0","chassis":6}' \
-device '{"driver":"pcie-root-port","id":"root6","port":16,"addr":"0x2.0x6","bus":"pcie.0","chassis":7}' \
-device '{"driver":"pcie-root-port","id":"root7","port":17,"addr":"0x2.0x7","bus":"pcie.0","chassis":8}' \
-device '{"driver":"pcie-root-port","id":"extra_root0","multifunction":true,"bus":"pcie.0","addr":"0x3","chassis":21}' \
-device '{"driver":"pcie-root-port","id":"extra_root1","port":21,"addr":"0x3.0x1","bus":"pcie.0","chassis":22}' \
-device '{"driver":"pcie-root-port","id":"extra_root2","port":22,"addr":"0x3.0x2","bus":"pcie.0","chassis":23}' \
-device '{"driver":"nec-usb-xhci","id":"usb1","bus":"root0","addr":"0x0"}' \
-device '{"driver":"virtio-scsi-pci","id":"virtio_scsi_pci0","bus":"root1","addr":"0x0"}' \
-device '{"driver":"scsi-hd","id":"image1","device_id":"drive-image1","drive":"drive_image1","bus":"virtio_scsi_pci0.0","channel":0,"scsi-id":0,"lun":0,"bootindex":0,"write-cache":"on"}' \
-device '{"driver":"virtio-net-pci","mac":"9a:8a:8b:8c:8d:8e","id":"net0","netdev":"tap0","bus":"root2","addr":"0x0"}' \
-device '{"driver":"usb-tablet","id":"usb-tablet1","bus":"usb1.0","port":"1"}' \
-device '{"driver":"virtio-balloon-pci","id":"balloon0","bus":"root3","addr":"0x0"}' \
-device '{"driver":"VGA","id":"video0","vgamem_mb":16,"bus":"pcie.0","addr":"0x1"}' \
-blockdev '{"driver":"file","auto-read-only":true,"discard":"unmap","aio":"threads","cache":{"direct":true,"no-flush":false},"filename":"/mnt/xiaohli/rhel930-64-virtio-scsi-ovmf.qcow2","node-name":"drive_sys1"}' \
-blockdev '{"driver":"qcow2","node-name":"drive_image1","read-only":false,"cache":{"direct":true,"no-flush":false},"file":"drive_sys1"}' \
-blockdev '{"node-name":"file_ovmf_code","driver":"file","filename":"/usr/share/OVMF/OVMF_CODE.secboot.fd","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"drive_ovmf_code","driver":"raw","read-only":true,"file":"file_ovmf_code"}' \
-blockdev '{"node-name":"file_ovmf_vars","driver":"file","filename":"/mnt/xiaohli/rhel930-64-virtio-scsi-ovmf.qcow2_VARS.fd","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"drive_ovmf_vars","driver":"raw","read-only":false,"file":"file_ovmf_vars"}' \
-netdev tap,id=tap0,vhost=on \
-m 20480 \
-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":21474836480}' \
-smp 40,maxcpus=40,cores=20,threads=1,sockets=2 \
-vnc :10 \
-rtc base=utc,clock=host \
-boot menu=off,strict=off,order=cdn,once=c \
-enable-kvm  \
-qmp tcp:0:3333,server=on,wait=off \
-qmp tcp:0:9999,server=on,wait=off \
-qmp tcp:0:9888,server=on,wait=off \
-serial tcp:0:4444,server=on,wait=off \
-monitor stdio \
-msg timestamp=on

Comment 2 Peter Xu 2023-05-30 16:06:03 UTC
Hi, Xiaohui,

(In reply to Li Xiaohui from comment #0)
> 5.During postcopy phase, down migration network on dst host
> # ip -f inet addr delete 192.168.130.26/24 dev ibs2f0

Two questions:

(1) what happens with vanilla postcopy?
(2) is this something special in how network failure happens?  E.g. IIRC you used to use migrate-pause, so would that also trigger this bug?

Thanks,
Peter

Comment 3 Li Xiaohui 2023-05-31 02:20:29 UTC
(In reply to Peter Xu from comment #2)
> Hi, Xiaohui,
> 
> (In reply to Li Xiaohui from comment #0)
> > 5.During postcopy phase, down migration network on dst host
> > # ip -f inet addr delete 192.168.130.26/24 dev ibs2f0
> 
> Two questions:
> 
> (1) what happens with vanilla postcopy?

It works well; vanilla postcopy recovers after the network failure is handled.

> (2) is this something special in how network failure happens?  E.g. IIRC you
> used to use migrate-pause, so would that also trigger this bug?

Yes, it seems specific to how the network failure happens.
Postcopy-preempt works well when using migrate-pause.


Sorry, I forgot to update the results for the above two questions in this bug.
I did it in https://bugzilla.redhat.com/show_bug.cgi?id=2046606#c25

Peter, can you also check https://bugzilla.redhat.com/show_bug.cgi?id=2046606#c25 to see if the test results of postcopy-preempt are ok?

Comment 4 Peter Xu 2023-05-31 14:40:52 UTC
(In reply to Li Xiaohui from comment #3)
> (In reply to Peter Xu from comment #2)
> > Hi, Xiaohui,
> > 
> > (In reply to Li Xiaohui from comment #0)
> > > 5.During postcopy phase, down migration network on dst host
> > > # ip -f inet addr delete 192.168.130.26/24 dev ibs2f0
> > 
> > Two questions:
> > 
> > (1) what happens with vanilla postcopy?
> 
> It works well, vanilla postcopy recover after handle network failure.
> 
> > (2) is this something special in how network failure happens?  E.g. IIRC you
> > used to use migrate-pause, so would that also trigger this bug?
> 
> Yes, it should be special in network failure.
> Postcopy-preempt works well when use migrate-pause.

I see, thanks.  I'll have a look soonish.

> 
> Sorry I forget to update the results of the above two questions in this bug.
> I did it in https://bugzilla.redhat.com/show_bug.cgi?id=2046606#c25
> 
> Peter, can you also check
> https://bugzilla.redhat.com/show_bug.cgi?id=2046606#c25 to see if the test
> results of postcopy-preempt is ok?

Yes, I'll reply there.

Comment 5 Peter Xu 2023-06-14 20:07:47 UTC
(In reply to Li Xiaohui from comment #0)
> 5.During postcopy phase, down migration network on dst host
> # ip -f inet addr delete 192.168.130.26/24 dev ibs2f0
> 6. Recover network on dst host
> # ifconfig ibs2f0 192.168.130.26 netmask 255.255.255.0

[...]

> I tried to recover migration again, can't recover successfully.

Two more follow up questions:

1. What's the error when reaching here?

2. Have you started NetworkManager (systemctl status NetworkManager)?  When this happens, are you _sure_ the IP is still there?  Asking because if NetworkManager is active, I _think_ the IP added by your ifconfig cmd can be erased soon, because it can race with NetworkManager.  So if you want to use ifconfig (rather than nmcli) you'd need to make sure NetworkManager is off (or, more reliably, just use nmcli).
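For completeness, a sketch of what an nmcli-based equivalent of steps 5 and 6 might look like; the connection profile name ibs2f0 is an assumption (it often matches the device name, but check with nmcli connection show), and the profile is assumed to use static addressing (ipv4.method manual):

```shell
# Simulate the failure: remove the migration IP through NetworkManager,
# so nmcli's view of the profile stays consistent (no race with ifconfig)
nmcli connection modify ibs2f0 -ipv4.addresses 192.168.130.26/24
nmcli connection up ibs2f0

# Recover: add the address back the same way instead of using ifconfig
nmcli connection modify ibs2f0 +ipv4.addresses 192.168.130.26/24
nmcli connection up ibs2f0
```

Because the change goes through NetworkManager itself, the restored address can't be silently erased the way an address set behind its back with ifconfig can.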

Comment 6 Li Xiaohui 2023-07-06 06:32:47 UTC
Hi Peter, 
Sorry for the late reply. I tried this bug on the latest RHEL 9.3 (kernel-5.14.0-332.el9.x86_64 && qemu-kvm-8.0.0-6.el9.x86_64) and got some different results.


(In reply to Peter Xu from comment #5)
> (In reply to Li Xiaohui from comment #0)
> > 5.During postcopy phase, down migration network on dst host
> > # ip -f inet addr delete 192.168.130.26/24 dev ibs2f0
> > 6. Recover network on dst host
> > # ifconfig ibs2f0 192.168.130.26 netmask 255.255.255.0
> 
> [...]
> 
> > I tried to recover migration again, can't recover successfully.
> 
> Two more follow up questions:
> 
> 1. What's the error when reaching here?

Postcopy-preempt can always be recovered, but after it switches back to active, within no more than a minute it switches to postcopy-paused automatically. I tried 4 times to recover the postcopy-preempt migration, but it ends up postcopy-paused every time.


> 
> 2. Have you started NetworkManager (systemctl status NetworkManager)?  When
> this happens, are you _sure_ the IP is still there?  Asking because if
> NetworkManager is active, I _think_ the IP added by your ifconfig cmd can be
> erased soon because it can be racy with NetworkManager.  So if you want to
> use ifconfig (rather than nmcli) you'd need to make sure NetworkManager off
> (or more reliably - just use nmcli).

To be clear, I checked the NetworkManager service; it's active by default on the src and dst hosts.
I did some tests to confirm whether the NetworkManager service would affect the IP added by the ifconfig cmd while NetworkManager is active.

Keeping NetworkManager active, I pinged the src host IP (192.168.130.25) overnight from the dst host. It didn't lose any packets:
[root@dell-per7525-26 home]# ping 192.168.130.25
...
--- 192.168.130.25 ping statistics ---
44212 packets transmitted, 44212 received, 0% packet loss, time 45272066ms
rtt min/avg/max/mdev = 0.089/0.250/0.801/0.118 ms

Comment 7 Li Xiaohui 2023-08-14 04:57:22 UTC
Hi Peter, 
I also hit this bug when testing vanilla postcopy on qemu-kvm-8.0.0-11.el9.x86_64.

So I have changed the bug summary to cover both vanilla postcopy and postcopy-preempt.

Comment 8 Peter Xu 2023-08-14 20:00:56 UTC
(In reply to Li Xiaohui from comment #7)
> Hi Peter, 
> I also hit this bug when testing vanilla postcopy on
> qemu-kvm-8.0.0-11.el9.x86_64.
> 
> Then I change the bug summary both for postcopy and postcopy

Xiaohui, does it also happen 100% on postcopy non-preempt mode?

I'd still suggest we don't use the "ip" command when NetworkManager is enabled.  Could you try disabling the NetworkManager service, or use nmcli to set up the new IP addresses (rather than the "ip" command)?

Comment 9 RHEL Program Management 2023-09-22 17:52:26 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 10 RHEL Program Management 2023-09-22 17:54:30 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

