Bug 2038087
| Summary: | Sometimes the destination (dst) has no multifd recv threads when doing multifd migration | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Li Xiaohui <xiaohli> |
| Component: | qemu-kvm | Assignee: | Juan Quintela <quintela> |
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | chayang, gshan, leobras, lijin, mdean, nilal, peterx, quintela, smitterl, virt-maint |
| Version: | 9.0 | Keywords: | Triaged |
| Target Milestone: | rc | Flags: | xiaohli: needinfo- |
| Target Release: | --- | ||
| Hardware: | aarch64 | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-07-04 12:55:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1924294 | ||
Description (Li Xiaohui, 2022-01-07 08:57:26 UTC)
Qemu cmdline:

    /usr/libexec/qemu-kvm \
        -name "mouse-vm",debug-threads=on \
        -sandbox on \
        -machine virt,gic-version=host,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars,memory-backend=mach-virt.ram \
        -cpu host \
        -nodefaults \
        -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server=on,wait=off \
        -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server=on,wait=off \
        -mon chardev=qmp_id_qmpmonitor1,mode=control \
        -mon chardev=qmp_id_catch_monitor,mode=control \
        -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
        -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
        -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
        -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
        -device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
        -device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=6 \
        -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x2,chassis=7 \
        -device pcie-root-port,id=pcie_extra_root_port_1,addr=0x2.0x1,bus=pcie.0,chassis=8 \
        -device pcie-root-port,id=pcie_extra_root_port_2,addr=0x2.0x2,bus=pcie.0,chassis=9 \
        -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0,addr=0x0 \
        -device virtio-gpu-pci,id=video0,max_outputs=1,bus=pcie-root-port-1,addr=0x0 \
        -device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
        -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
        -device usb-kbd,id=usb-kbd1,bus=usb1.0,port=2 \
        -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-3,addr=0x0 \
        -device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0,write-cache=on \
        -device virtio-net-pci,mac=9a:0a:71:f3:69:7d,rombar=0,id=idv2eapv,netdev=tap0,bus=pcie-root-port-4,addr=0x0 \
        -device virtio-balloon-pci,id=balloon0,bus=pcie-root-port-5,addr=0x0 \
        -blockdev driver=file,auto-read-only=on,discard=unmap,aio=threads,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel900-aarch64-virtio-scsi.qcow2,node-name=drive_sys1 \
        -blockdev driver=qcow2,node-name=drive_image1,read-only=off,cache.direct=on,cache.no-flush=off,file=drive_sys1 \
        -blockdev node-name=file_aavmf_code,driver=file,filename=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw,auto-read-only=on,discard=unmap \
        -blockdev node-name=drive_aavmf_code,driver=raw,read-only=on,file=file_aavmf_code \
        -blockdev node-name=file_aavmf_vars,driver=file,filename=/mnt/nfs/rhel900-aarch64-virtio-scsi.qcow2.fd,auto-read-only=on,discard=unmap \
        -blockdev node-name=drive_aavmf_vars,driver=raw,read-only=off,file=file_aavmf_vars \
        -netdev tap,id=tap0,vhost=on \
        -m 4096 \
        -object memory-backend-ram,id=mach-virt.ram,size=4096M \
        -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
        -vnc :10 \
        -rtc base=utc,clock=host,driftfix=slew \
        -enable-kvm \
        -qmp tcp:0:3333,server=on,wait=off \
        -qmp tcp:0:9999,server=on,wait=off \
        -qmp tcp:0:9888,server=on,wait=off \
        -serial tcp:0:4444,server=on,wait=off \
        -monitor stdio \
        -msg timestamp=on

I didn't hit this bug when repeating the rhel_186122 case 30 times on x86 with the same host and guest versions, so I am setting this problem to ARM-only for the time being.
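Editor's note: the symptom named in the summary (no multifd receive threads on the destination) can be checked from the destination host's shell while migration is active. A minimal sketch, assuming the destination QEMU was started with `-incoming` as in the test, that `debug-threads=on` from the `-name` option above is in effect, and that the receive threads use QEMU's usual `multifdrecv_<n>` naming (the pgrep pattern and thread-name prefix are assumptions here, not taken from the log):

```bash
# Hypothetical: find the destination QEMU, assumed to have been started with
# "-incoming"; adjust the pattern to match the actual command line.
DST_PID=$(pgrep -f 'qemu-kvm.*-incoming' | head -n1)

# List the thread names of that QEMU process. With -name ...,debug-threads=on
# the multifd receive threads are expected to appear as multifdrecv_0, _1, ...
cat /proc/"$DST_PID"/task/*/comm | sort | uniq -c

# Count only the multifd receive threads; in the failing case this prints 0
# even though multifd-channels was set to a non-zero value.
cat /proc/"$DST_PID"/task/*/comm | grep -c '^multifdrecv'
```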
(In reply to Li Xiaohui from comment #0)
> Description of problem:
> Set multifd channels to 4 on src & dst, then start multifd migration; while migration is active, cancel it.
> Then change multifd channels to 2 and restart multifd migration; sometimes there are no multifd recv threads on the dst host.
>
> Version-Release number of selected component (if applicable):
> hosts info: kernel-5.14.0-39.el9.aarch64 & qemu-kvm-6.2.0-1.el9.aarch64
> guest info: kernel-5.14.0-39.el9.aarch64
>
> How reproducible:
> 1/3
>
> Steps to Reproduce:
> 1. As in the description; the sequence is repeated 3 times in case rhel_186122.
>
> Actual results:
> Sometimes the rhel_186122 case fails on the 1st or the 3rd run, please check the log:
> http://10.0.136.47/xiaohli/bug/bz_2038087/short_debug.log
>
> Checking the migration data while migration is active shows no data transferred in 10 minutes; the multifd migration seems to hang:
> 2022-01-07-02:24:47: Host(10.19.241.87) Sending qmp command : {"execute": "query-migrate", "id": "buybq8v7"}
> 2022-01-07-02:24:47: Host(10.19.241.87) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 4, "total-time": 282, "ram": {"total": 4429328384, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 1053952, "pages-per-second": 0, "page-size": 4096, "remaining": 4426407936, "mbps": 0, "transferred": 1056927, "duplicate": 329, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1568768, "normal": 383}}, "id": "buybq8v7"}
> ......
> 2022-01-07-02:34:46: Host(10.19.241.87) Sending qmp command : {"execute": "query-migrate", "id": "is63KwAi"}
> 2022-01-07-02:34:47: Host(10.19.241.87) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 4, "total-time": 598390, "ram": {"total": 4429328384, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 1053952, "pages-per-second": 0, "page-size": 4096, "remaining": 4426407936, "mbps": 0, "transferred": 1056927, "duplicate": 329, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 1568768, "normal": 383}}, "id": "is63KwAi"}
>
> Expected results:
> Multifd migration succeeds after migrate_cancel and changing the multifd channels.
>
> Additional info:
> Will try more times on x86 to confirm whether this issue only happens on ARM.

I discussed this BZ with Gavin (who usually works on migration for ARM) and we think it would be better for a migration engineer to take a first look at what could be happening that is arch-specific.

Having tested some related scenarios on ARM, I found that doing basic multifd migration (without the migrate-cancel test and without changing the multifd channels) many times, as in steps 1)-5) below, also hits this bug, at a recurrence rate of 2/20. So I would update the bug description:
1) Boot a guest on the src host;
2) Boot a guest on the dst host with '-incoming defer';
3) Enable the multifd capability on the src and dst hosts;
4) Start multifd migration;
5) While multifd migration is active, check the multifd channels on the src and dst hosts.

QEMU also sometimes seems to hit a core dump, on both ARM and x86 with RHEL 9:
1) on ARM, a qemu core dump at a 1/20 rate;
2) on x86, a qemu core dump at a 1/100 rate.
As qemu core dumps were not configured, no core dump file was generated. I will try to collect a qemu core dump file to confirm that qemu core dumps during multifd migration.

Hi,
this makes more sense (that it fails on both x86 and ARM); multifd doesn't have any arch-specific code. I have a fix for ARM because they have some issues, but it should not explain the difference between ARM and x86. It would be great if you could get the coredump and the stacktrace. I will give you a new brew build to test.
Later, Juan.
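Editor's note: for reference, a minimal QMP sketch of the reproduction sequence described above (enable multifd, migrate with 4 channels, cancel, retry with 2 channels). It assumes the source QEMU's TCP QMP socket on port 3333 from the command line in the description; the `qmp` helper, `DST_HOST`, and the destination URI are hypothetical placeholders, and the same capability/parameter commands would also have to be issued on the destination before it starts listening:

```bash
# Hypothetical helper: open a fresh QMP session on the source QEMU's TCP QMP
# socket (port 3333 in the command line above), negotiate capabilities and
# send one command. "timeout" simply closes the connection after the reply.
qmp() {
    printf '{"execute":"qmp_capabilities"}\n%s\n' "$1" | timeout 2 nc localhost 3333
}

# Enable multifd and start with 4 channels (repeat on the destination).
qmp '{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"multifd","state":true}]}}'
qmp '{"execute":"migrate-set-parameters","arguments":{"multifd-channels":4}}'

# Start the migration; DST_HOST:4000 is a placeholder for the URI the
# destination (started with -incoming defer) is listening on.
qmp '{"execute":"migrate","arguments":{"uri":"tcp:DST_HOST:4000"}}'

# While the migration is active, cancel it, drop to 2 channels and retry.
qmp '{"execute":"migrate_cancel"}'
qmp '{"execute":"migrate-set-parameters","arguments":{"multifd-channels":2}}'
qmp '{"execute":"migrate","arguments":{"uri":"tcp:DST_HOST:4000"}}'

# Poll progress; in the failing runs the "ram" counters in the reply stop
# advancing, as in the query-migrate output quoted above.
qmp '{"execute":"query-migrate"}'
```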
Hi Meirav,
I can still reproduce this bug on the latest RHEL 9.2.0 (kernel-5.14.0-191.el9.aarch64 & qemu-kvm-7.1.0-4.el9.aarch64), at a 6/100 rate.

Hi Xiaohui,
could you share one coredump (just the stack traces) where it fails? Can you apply this to the coredumps (with the debuginfo packages installed):

thread apply all backtrace

on both ARM and x86_64 when it fails?
Thanks, Juan.

(In reply to Juan Quintela from comment #16)
> Hi Xiaohui
> Could you share one coredump (just the stack traces) where it fails?
>
> Can you apply this to the coredumps (with debuginfo packages installed):
>
> thread apply all backtrace
>
> on both ARM and x86_64 when it fails?
>
> Thanks, Juan.

Hi Juan,
I tried more than 300 times on the latest RHEL 9.3 (kernel-5.14.0-327.el9.aarch64+64k && qemu-kvm-8.0.0-5.el9.aarch64) but did not hit this bug, and qemu did not core dump. I think this bug has already been fixed; I suggest closing it as CURRENTRELEASE. How about you?

Closing this BZ and clearing the needinfo for Juan as he is on PTO. We can always re-open if it is reproduced again.
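Editor's note: a hedged sketch of how the backtraces requested above could have been collected on a RHEL 9 host with systemd-coredump. Package and binary names are the usual ones and the pgrep pattern is an assumption; adjust for the exact qemu-kvm build:

```bash
# Make sure core dumps are captured (RHEL 9 routes them to systemd-coredump
# by default) and that debug symbols are available for readable backtraces.
ulimit -c unlimited
dnf debuginfo-install -y qemu-kvm

# After a crash, locate the dump and open the most recent one in gdb:
coredumpctl list qemu-kvm
coredumpctl gdb qemu-kvm
# then at the (gdb) prompt run:
#   thread apply all backtrace

# For a hung (not crashed) destination QEMU, the same backtraces can be
# taken from the live process instead:
gdb -p "$(pgrep -f 'qemu-kvm.*-incoming' | head -n1)" \
    -batch -ex 'thread apply all backtrace'
```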