Bug 2038087 - Sometimes the destination has no multifd recv threads when doing multifd migration
Summary: Sometimes the destination has no multifd recv threads when doing multifd migration
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.0
Hardware: aarch64
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Juan Quintela
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks: 1924294
 
Reported: 2022-01-07 08:57 UTC by Li Xiaohui
Modified: 2023-07-06 05:45 UTC (History)
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-04 12:55:00 UTC
Type: Bug
Target Upstream Version:
Embargoed:
xiaohli: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-107070 0 None None None 2022-01-07 09:00:47 UTC

Description Li Xiaohui 2022-01-07 08:57:26 UTC
Description of problem:
Set multifd channels to 4 on both src and dst, then start a multifd migration; while the migration is active, cancel it.
Then change multifd channels to 2 and restart the multifd migration; sometimes there are no multifd recv threads on the dst host.
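
For reference, this is roughly the QMP sequence used (a minimal sketch, not the exact test harness; the port 4000 and the dst address are placeholders):
# on the src and dst QMP monitors:
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "multifd", "state": true}]}}
{"execute": "migrate-set-parameters", "arguments": {"multifd-channels": 4}}
# on dst (started with -incoming defer):
{"execute": "migrate-incoming", "arguments": {"uri": "tcp:[::]:4000"}}
# on src:
{"execute": "migrate", "arguments": {"uri": "tcp:<dst-ip>:4000"}}
{"execute": "migrate_cancel"}
# then set "multifd-channels" to 2 on both sides and start the migration again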


Version-Release number of selected component (if applicable):
hosts info: kernel-5.14.0-39.el9.aarch64 & qemu-kvm-6.2.0-1.el9.aarch64
guest info: kernel-5.14.0-39.el9.aarch64


How reproducible:
1/3


Steps to Reproduce:
1. As in the description; the rhel_186122 case repeats the scenario 3 times.


Actual results:
Sometimes the rhel_186122 case fails on the 1st or the 3rd iteration; please check the log:
http://10.0.136.47/xiaohli/bug/bz_2038087/short_debug.log

And when checking the migration data while the migration is active, we can see no data transferred for 10 minutes; the multifd migration seems to hang (see the polling sketch after the log):
2022-01-07-02:24:47: Host(10.19.241.87) Sending qmp command   : {"execute": "query-migrate", "id": "buybq8v7"}
2022-01-07-02:24:47: Host(10.19.241.87) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 4, "total-time": 282, "ram": {"total": 4429328384, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 1053952, "pages-per-second": 0, "page-size": 4096, "remaining": 4426407936, "mbps": 0, "transferred": 1056927, "duplicate": 329, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1568768, "normal": 383}}, "id": "buybq8v7"}
......
2022-01-07-02:34:46: Host(10.19.241.87) Sending qmp command   : {"execute": "query-migrate", "id": "is63KwAi"}
2022-01-07-02:34:47: Host(10.19.241.87) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 4, "total-time": 598390, "ram": {"total": 4429328384, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 1053952, "pages-per-second": 0, "page-size": 4096, "remaining": 4426407936, "mbps": 0, "transferred": 1056927, "duplicate": 329, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1568768, "normal": 383}}, "id": "is63KwAi"}
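
The stall can be spotted with a polling loop roughly like this (a sketch only; 3333 is the QMP TCP port from my qemu command line, and nc/timeout/grep are assumed to be available on the src host):
while true; do
    printf '%s\n' '{"execute": "qmp_capabilities"}' '{"execute": "query-migrate"}' \
        | timeout 5 nc localhost 3333 \
        | grep -o '"transferred": [0-9]*'    # this value should keep growing while RAM is being sent
    sleep 30
done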


Expected results:
Multifd migration succeeds after migrate_cancel and changing the multifd channels.


Additional info:
Will try more times on x86 to confirm whether this issue only happens on ARM.

Comment 1 Li Xiaohui 2022-01-07 09:03:24 UTC
Qemu cmdline:
/usr/libexec/qemu-kvm  \
-name "mouse-vm",debug-threads=on \
-sandbox on \
-machine virt,gic-version=host,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars,memory-backend=mach-virt.ram \
-cpu host \
-nodefaults  \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server=on,wait=off \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server=on,wait=off \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
-device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=6 \
-device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x2,chassis=7 \
-device pcie-root-port,id=pcie_extra_root_port_1,addr=0x2.0x1,bus=pcie.0,chassis=8 \
-device pcie-root-port,id=pcie_extra_root_port_2,addr=0x2.0x2,bus=pcie.0,chassis=9 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,bus=pcie-root-port-0,addr=0x0 \
-device virtio-gpu-pci,id=video0,max_outputs=1,bus=pcie-root-port-1,addr=0x0 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-device usb-kbd,id=usb-kbd1,bus=usb1.0,port=2 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-3,addr=0x0 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0,write-cache=on \
-device virtio-net-pci,mac=9a:0a:71:f3:69:7d,rombar=0,id=idv2eapv,netdev=tap0,bus=pcie-root-port-4,addr=0x0 \
-device virtio-balloon-pci,id=balloon0,bus=pcie-root-port-5,addr=0x0 \
-blockdev driver=file,auto-read-only=on,discard=unmap,aio=threads,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel900-aarch64-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,read-only=off,cache.direct=on,cache.no-flush=off,file=drive_sys1 \
-blockdev node-name=file_aavmf_code,driver=file,filename=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_code,driver=raw,read-only=on,file=file_aavmf_code \
-blockdev node-name=file_aavmf_vars,driver=file,filename=/mnt/nfs/rhel900-aarch64-virtio-scsi.qcow2.fd,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_vars,driver=raw,read-only=off,file=file_aavmf_vars \
-netdev tap,id=tap0,vhost=on \
-m 4096 \
-object memory-backend-ram,id=mach-virt.ram,size=4096M \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-vnc :10 \
-rtc base=utc,clock=host,driftfix=slew \
-enable-kvm  \
-qmp tcp:0:3333,server=on,wait=off \
-qmp tcp:0:9999,server=on,wait=off \
-qmp tcp:0:9888,server=on,wait=off \
-serial tcp:0:4444,server=on,wait=off \
-monitor stdio \
-msg timestamp=on \

Comment 2 Li Xiaohui 2022-01-10 03:12:32 UTC
Didn't hit this bug when repeating the rhel_186122 case 30 times on x86 with the same host and guest versions.
So I mark this problem as ARM-only for the time being.

Comment 3 Li Xiaohui 2022-01-13 03:27:52 UTC
(In reply to Li Xiaohui from comment #0)

Comment 4 Luiz Capitulino 2022-01-13 21:26:51 UTC
I discussed this BZ with Gavin (who usually works on migration for ARM), and we think it would be better for a migration engineer to take a first look at what could be happening that's arch-specific.

Comment 5 Li Xiaohui 2022-01-14 04:15:25 UTC
Having tested some related scenarios on ARM, I found that we also hit this bug when doing a basic multifd migration (without the migrate_cancel test and without changing the multifd channels) many times, as in steps 1)-5) below, with a reproduction rate of 2/20. So I would update the bug description (a command sketch for step 5) follows the list):
1) Boot a guest on the src host;
2) Boot a guest on the dst host with '-incoming defer';
3) Enable the multifd capability on the src and dst hosts;
4) Start multifd migration;
5) While multifd migration is active, check the multifd channels on the src and dst hosts.
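
A rough way to do the check in step 5) on each host (a sketch; as far as I can tell multifdsend_*/multifdrecv_* are the names QEMU gives the multifd send/recv threads, and pidof assumes a single qemu-kvm process per host):
# src should show one multifdsend_* thread per channel, dst one multifdrecv_* thread per channel
ps -T -p "$(pidof qemu-kvm)" | grep -E 'multifd(send|recv)'
# the configured channel count can be cross-checked over QMP:
{"execute": "query-migrate-parameters"}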


And QEMU sometimes seems to hit a core dump on ARM and x86 with RHEL 9:
1) On ARM, hit a QEMU core dump at a rate of 1/20.
2) On x86, hit a QEMU core dump at a rate of 1/100.
As I haven't configured QEMU core dumps, no core dump file was generated. I will try to collect a QEMU core dump file to confirm the crash during multifd migration (a sketch of the setup follows).
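
What I plan to use to capture the dump (a sketch, assuming systemd-coredump is in use on RHEL 9):
cat /proc/sys/kernel/core_pattern   # should pipe cores to systemd-coredump
ulimit -c unlimited                 # in the shell that launches qemu-kvm, only needed if core_pattern is a plain file
coredumpctl list qemu-kvm           # after a crash, confirm the core was captured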

Comment 6 Juan Quintela 2022-02-08 13:05:45 UTC
Hi

This makes more sense (that it fails on both x86 and ARM); multifd doesn't have any arch-specific code.
I have a fix for ARM because it has some issues, but that should not explain a difference between ARM and x86.

It would be great if you could get the coredump and the stacktrace.

Will give you a new brew build to test.

Later, Juan.

Comment 13 Li Xiaohui 2022-11-13 13:00:45 UTC
Hi Meirav, 

I can still reproduce this bug on the latest RHEL 9.2.0 (kernel-5.14.0-191.el9.aarch64 & qemu-kvm-7.1.0-4.el9.aarch64) at a 6/100 rate.

Comment 16 Juan Quintela 2023-05-08 14:37:29 UTC
Hi Xiaohui,
Could you share one coredump (just the stack traces) where it fails?

Can you apply this to the coredumps (with debuginfo packages installed):


 thread apply all backtrace

On both ARM and x86_64 when it fails?
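
Something along these lines should produce it (a sketch, assuming systemd-coredump caught the crash and the debuginfo packages are installed):
coredumpctl -o /tmp/qemu.core dump qemu-kvm
gdb -batch -ex 'thread apply all backtrace' /usr/libexec/qemu-kvm /tmp/qemu.core > qemu-bt.txt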

Thanks, Juan.

Comment 18 Li Xiaohui 2023-07-02 09:30:48 UTC
(In reply to Juan Quintela from comment #16)

Hi Juan,
I tried more than 300 times on the latest RHEL 9.3 (kernel-5.14.0-327.el9.aarch64+64k && qemu-kvm-8.0.0-5.el9.aarch64), but did not hit this bug, and QEMU didn't core dump.


I think this bug has already been fixed. I suggest closing this bug as CURRENTRELEASE; what do you think?

Comment 19 Nitesh Narayan Lal 2023-07-04 12:55:00 UTC
Closing this BZ and clearing the needinfo for Juan as he is on PTO.
We can always re-open if it is reproduced again.

