Bug 2038087 - Sometimes the destination has no multifd recv threads when doing multifd migration
Summary: Sometimes the destination has no multifd recv threads when doing multifd migration
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.0
Hardware: aarch64
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Juan Quintela
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks: 1924294
 
Reported: 2022-01-07 08:57 UTC by Li Xiaohui
Modified: 2023-07-06 05:45 UTC (History)
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-04 12:55:00 UTC
Type: Bug
Target Upstream Version:
Embargoed:
xiaohli: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-107070 0 None None None 2022-01-07 09:00:47 UTC

Description Li Xiaohui 2022-01-07 08:57:26 UTC
Description of problem:
Set multifd channels to 4 on both src and dst, then start a multifd migration; while the migration is active, cancel it.
Then change multifd channels to 2 and restart the multifd migration; sometimes there are no multifd recv threads on the dst host.
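
For reference, this is roughly the QMP sequence used (a minimal sketch, not the exact test harness; the port 4000 and the dst address are placeholders):
# on the src and dst QMP monitors:
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "multifd", "state": true}]}}
{"execute": "migrate-set-parameters", "arguments": {"multifd-channels": 4}}
# on dst (started with -incoming defer):
{"execute": "migrate-incoming", "arguments": {"uri": "tcp:[::]:4000"}}
# on src:
{"execute": "migrate", "arguments": {"uri": "tcp:<dst-ip>:4000"}}
{"execute": "migrate_cancel"}
# then set "multifd-channels" to 2 on both sides and start the migration again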


Version-Release number of selected component (if applicable):
hosts info: kernel-5.14.0-39.el9.aarch64 & qemu-kvm-6.2.0-1.el9.aarch64
guest info: kernel-5.14.0-39.el9.aarch64


How reproducible:
1/3


Steps to Reproduce:
1. As in the description; the rhel_186122 case repeats the scenario 3 times.


Actual results:
Sometimes the rhel_186122 case fails on the 1st or the 3rd iteration; please check the log:
http://10.0.136.47/xiaohli/bug/bz_2038087/short_debug.log

And when checking the migration data while the migration is active, we can see no data transferred for 10 minutes; the multifd migration seems to hang (see the polling sketch after the log):
2022-01-07-02:24:47: Host(10.19.241.87) Sending qmp command   : {"execute": "query-migrate", "id": "buybq8v7"}
2022-01-07-02:24:47: Host(10.19.241.87) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 4, "total-time": 282, "ram": {"total": 4429328384, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 1053952, "pages-per-second": 0, "page-size": 4096, "remaining": 4426407936, "mbps": 0, "transferred": 1056927, "duplicate": 329, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1568768, "normal": 383}}, "id": "buybq8v7"}
......
2022-01-07-02:34:46: Host(10.19.241.87) Sending qmp command   : {"execute": "query-migrate", "id": "is63KwAi"}
2022-01-07-02:34:47: Host(10.19.241.87) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 4, "total-time": 598390, "ram": {"total": 4429328384, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 1053952, "pages-per-second": 0, "page-size": 4096, "remaining": 4426407936, "mbps": 0, "transferred": 1056927, "duplicate": 329, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1568768, "normal": 383}}, "id": "is63KwAi"}
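
The stall can be spotted with a polling loop roughly like this (a sketch only; 3333 is the QMP TCP port from my qemu command line, and nc/timeout/grep are assumed to be available on the src host):
while true; do
    printf '%s\n' '{"execute": "qmp_capabilities"}' '{"execute": "query-migrate"}' \
        | timeout 5 nc localhost 3333 \
        | grep -o '"transferred": [0-9]*'    # this value should keep growing while RAM is being sent
    sleep 30
done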


Expected results:
Multifd migration succeeds after migrate_cancel and changing the multifd channels.


Additional info:
Will try more times on x86 to confirm whether this issue only happens on ARM.

Comment 1 Li Xiaohui 2022-01-07 09:03:24 UTC
Qemu cmdline:
/usr/libexec/qemu-kvm  \
-name "mouse-vm",debug-threads=on \
-sandbox on \
-machine virt,gic-version=host,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars,memory-backend=mach-virt.ram \
-cpu host \
-nodefaults  \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server=on,wait=off \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server=on,wait=off \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
-device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=6 \
-device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x2,chassis=7 \
-device pcie-root-port,id=pcie_extra_root_port_1,addr=0x2.0x1,bus=pcie.0,chassis=8 \
-device pcie-root-port,id=pcie_extra_root_port_2,addr=0x2.0x2,bus=pcie.0,chassis=9 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,bus=pcie-root-port-0,addr=0x0 \
-device virtio-gpu-pci,id=video0,max_outputs=1,bus=pcie-root-port-1,addr=0x0 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-device usb-kbd,id=usb-kbd1,bus=usb1.0,port=2 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-3,addr=0x0 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0,write-cache=on \
-device virtio-net-pci,mac=9a:0a:71:f3:69:7d,rombar=0,id=idv2eapv,netdev=tap0,bus=pcie-root-port-4,addr=0x0 \
-device virtio-balloon-pci,id=balloon0,bus=pcie-root-port-5,addr=0x0 \
-blockdev driver=file,auto-read-only=on,discard=unmap,aio=threads,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel900-aarch64-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,read-only=off,cache.direct=on,cache.no-flush=off,file=drive_sys1 \
-blockdev node-name=file_aavmf_code,driver=file,filename=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_code,driver=raw,read-only=on,file=file_aavmf_code \
-blockdev node-name=file_aavmf_vars,driver=file,filename=/mnt/nfs/rhel900-aarch64-virtio-scsi.qcow2.fd,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_vars,driver=raw,read-only=off,file=file_aavmf_vars \
-netdev tap,id=tap0,vhost=on \
-m 4096 \
-object memory-backend-ram,id=mach-virt.ram,size=4096M \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-vnc :10 \
-rtc base=utc,clock=host,driftfix=slew \
-enable-kvm  \
-qmp tcp:0:3333,server=on,wait=off \
-qmp tcp:0:9999,server=on,wait=off \
-qmp tcp:0:9888,server=on,wait=off \
-serial tcp:0:4444,server=on,wait=off \
-monitor stdio \
-msg timestamp=on \

Comment 2 Li Xiaohui 2022-01-10 03:12:32 UTC
Didn't hit this bug when repeating the rhel_186122 case 30 times on x86 with the same host and guest versions.
So I mark this problem as ARM-only for the time being.

Comment 3 Li Xiaohui 2022-01-13 03:27:52 UTC
(In reply to Li Xiaohui from comment #0)

Comment 4 Luiz Capitulino 2022-01-13 21:26:51 UTC
I discussed this BZ with Gavin (who usually works on migration for ARM), and we think it would be better for a migration engineer to take a first look at what could be happening that's arch-specific.

Comment 5 Li Xiaohui 2022-01-14 04:15:25 UTC
Having tested some related scenarios on ARM, I found that we also hit this bug when doing a basic multifd migration (without the migrate_cancel test and without changing the multifd channels) many times, as in steps 1)-5) below, with a reproduction rate of 2/20. So I would update the bug description (a command sketch for step 5) follows the list):
1) Boot a guest on the src host;
2) Boot a guest on the dst host with '-incoming defer';
3) Enable the multifd capability on the src and dst hosts;
4) Start multifd migration;
5) While multifd migration is active, check the multifd channels on the src and dst hosts.
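
A rough way to do the check in step 5) on each host (a sketch; as far as I can tell multifdsend_*/multifdrecv_* are the names QEMU gives the multifd send/recv threads, and pidof assumes a single qemu-kvm process per host):
# src should show one multifdsend_* thread per channel, dst one multifdrecv_* thread per channel
ps -T -p "$(pidof qemu-kvm)" | grep -E 'multifd(send|recv)'
# the configured channel count can be cross-checked over QMP:
{"execute": "query-migrate-parameters"}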


And QEMU sometimes seems to hit a core dump on ARM and x86 with RHEL 9:
1) On ARM, hit a QEMU core dump at a rate of 1/20.
2) On x86, hit a QEMU core dump at a rate of 1/100.
As I haven't configured QEMU core dumps, no core dump file was generated. I will try to collect a QEMU core dump file to confirm the crash during multifd migration (a sketch of the setup follows).
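
What I plan to use to capture the dump (a sketch, assuming systemd-coredump is in use on RHEL 9):
cat /proc/sys/kernel/core_pattern   # should pipe cores to systemd-coredump
ulimit -c unlimited                 # in the shell that launches qemu-kvm, only needed if core_pattern is a plain file
coredumpctl list qemu-kvm           # after a crash, confirm the core was captured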

Comment 6 Juan Quintela 2022-02-08 13:05:45 UTC
Hi

This makes more sense (that it fails on both x86 and ARM); multifd doesn't have any arch-specific code.
I have a fix for ARM because it has some issues, but that should not explain a difference between ARM and x86.

It would be great if you could get the coredump and the stacktrace.

Will give you a new brew build to test.

Later, Juan.

Comment 13 Li Xiaohui 2022-11-13 13:00:45 UTC
Hi Meirav, 

I can still reproduce this bug on the latest RHEL 9.2.0 (kernel-5.14.0-191.el9.aarch64 & qemu-kvm-7.1.0-4.el9.aarch64) at a 6/100 rate.

Comment 16 Juan Quintela 2023-05-08 14:37:29 UTC
Hi Xiaohui,
Could you share one coredump (just the stack traces) where it fails?

Can you apply this to the coredumps (with debuginfo packages installed):


 thread apply all backtrace

On both ARM and x86_64 when it fails?
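
Something along these lines should produce it (a sketch, assuming systemd-coredump caught the crash and the debuginfo packages are installed):
coredumpctl -o /tmp/qemu.core dump qemu-kvm
gdb -batch -ex 'thread apply all backtrace' /usr/libexec/qemu-kvm /tmp/qemu.core > qemu-bt.txt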

Thanks, Juan.

Comment 18 Li Xiaohui 2023-07-02 09:30:48 UTC
(In reply to Juan Quintela from comment #16)

Hi Juan,
I tried more than 300 times on the latest RHEL 9.3 (kernel-5.14.0-327.el9.aarch64+64k && qemu-kvm-8.0.0-5.el9.aarch64), but did not hit this bug, and QEMU didn't core dump.


I think this bug has already been fixed. I suggest closing this bug as CURRENTRELEASE; what do you think?

Comment 19 Nitesh Narayan Lal 2023-07-04 12:55:00 UTC
Closing this BZ and clearing the needinfo for Juan as he is on PTO.
We can always re-open if it is reproduced again.

