Description of problem:
The destination QEMU dumps core when doing RDMA migration with a Mellanox IB QDR card.

Version-Release number of selected component (if applicable):
host version:
qemu-kvm-3.1.0-4.module+el8+2681+819ab34d.x86_64
kernel-4.18.0-60.el8.x86_64
seabios-1.11.1-3.module+el8+2529+a9686a4d.x86_64
virtio-win-prewhql-0.1-163
Guest: Win2019

How reproducible:
5/5

Steps to Reproduce:
1. Boot the guest on the source host.
2. Boot the guest on the destination host, listening for an RDMA migration:
   -incoming rdma:0:5555
3. On the source, set the migration transfer speed:
   (qemu) migrate_set_speed 40G
4. On both source and destination, enable rdma-pin-all:
   (qemu) migrate_set_capability rdma-pin-all on
5. Start the migration:
   (qemu) migrate -d rdma:192.168.0.21:5555

Actual results:
After step 5, QEMU dumps core on the destination.

Expected results:
After step 5, there is no core dump and the guest works well on the destination.

Additional info:
(1) pc + SeaBIOS does not hit this issue.
(2) The guest was booted with this command line:
/usr/libexec/qemu-kvm \
    -M q35,accel=kvm,kernel-irqchip=split \
    -device intel-iommu,intremap=on \
    -cpu Haswell-noTSX,enforce \
    -nodefaults -rtc base=utc \
    -m 4G \
    -smp 2,sockets=2,cores=1,threads=1 \
    -enable-kvm \
    -uuid 990ea161-6b67-47b2-b803-19fb01d30d12 \
    -k en-us \
    -nodefaults \
    -boot menu=on \
    -qmp tcp:0:6667,server,nowait \
    -vga qxl \
    -device pcie-root-port,bus=pcie.0,id=root0,slot=1 \
    -object secret,id=sec0,data=redhat \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root0 \
    -blockdev driver=luks,cache.direct=off,cache.no-flush=on,file.filename=/mnt/back.qcow2,node-name=my_disk,file.driver=file,key-secret=sec0 \
    -device scsi-hd,drive=my_disk,bus=virtio_scsi_pci0.0 \
    -device pcie-root-port,bus=pcie.0,id=root1,slot=2 \
    -device virtio-net-pci,netdev=tap10,mac=9a:6a:6b:6c:6d:6e,bus=root1 -netdev tap,id=tap10 \
    -device pcie-root-port,bus=pcie.0,id=root2,slot=3 \
    -device nec-usb-xhci,id=usb1,bus=root2 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -monitor stdio \
    -vnc :10 \
I can't reproduce this bug with the slow train build.
host version:
qemu-kvm-2.12.0-57.module+el8+2683+02b3b955.x86_64
kernel-4.18.0-60.el8.x86_64
seabios-1.11.1-3.module+el8+2529+a9686a4d.x86_64
Hi,
Does this happen only for Windows guests, or does it also happen on a Linux guest?
Please attach a full backtrace for crashing bugs.
Thanks.
(In reply to Dr. David Alan Gilbert from comment #4)
> Hi,
> Does this happen only for windows guests, or does it also happen on a
> Linux guest?

No, it also happens on a Linux guest (RHEL 8 guest).

> Please attach a full backtrace for crashing bugs.

Backtrace:
(gdb) bt
#0  0x00007f72318bbfcc in rdma_get_cm_event.part () from /lib64/librdmacm.so.1
#1  0x00005617cd22ced4 in rdma_cm_poll_handler ()
#2  0x00005617cd340d22 in aio_dispatch_handlers ()
#3  0x00005617cd34162c in aio_dispatch ()
#4  0x00005617cd33e1d2 in aio_ctx_dispatch ()
#5  0x00007f7231f3989d in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#6  0x00005617cd3408a8 in main_loop_wait ()
#7  0x00005617cd133e99 in main_loop ()
#8  0x00005617ccff43f4 in main ()
Yes, reproduced here going 7->8 on virtlab 414->413:

/usr/libexec/qemu-kvm -M pc-q35-rhel7.6.0,accel=kvm,kernel-irqchip=split \
    -device intel-iommu,intremap=on -cpu host -m 4G -smp 2 -enable-kvm -vga qxl \
    -device pcie-root-port,bus=pcie.0,id=root0,slot=1 \
    -object secret,id=sec0,data=redhat \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root0 \
    -drive if=none,file=/home/vms/f27.qcow2,cache=none,id=disk \
    -device scsi-hd,drive=disk,bus=virtio_scsi_pci0.0 \
    -monitor stdio

(gdb) bt full
#0  0x00007ffff7034fcc in rdma_get_cm_event.part () at /lib64/librdmacm.so.1
#1  0x0000555555a75ed4 in rdma_cm_poll_handler (opaque=0x7fffe806b010) at migration/rdma.c:3236
        rdma = 0x7fffe806b010
        ret = <optimized out>
        cm_event = 0x5555564cc3e0
        mis = 0x5555564e1ee0
(gdb) p mis->state
$3 = 8

which I think is 'completed'.

(gdb) p rdma->channel
$5 = (struct rdma_event_channel *) 0x0
This broke somewhere between 3.0.0 and 3.1.0 upstream.
git bisect says:
6ef3771c0d070e8f16e12f21e4fbf1ec6459eff6 fails (double check)
6c97ec5f5ad6f65f8a6a9be044c2b875972406e4 good (double check)

and I've double checked them, so this points to:

6ef3771c0d070e8f16e12f21e4fbf1ec6459eff6 is the first bad commit
commit 6ef3771c0d070e8f16e12f21e4fbf1ec6459eff6
Author: Xiao Guangrong <xiaoguangrong>
Date:   Tue Aug 21 16:10:23 2018 +0800

    migration: drop the return value of do_compress_ram_page

    It is not used and cleans the code up a little

    Reviewed-by: Peter Xu <peterx>
    Signed-off-by: Xiao Guangrong <xiaoguangrong>
    Reviewed-by: Juan Quintela <quintela>
    Signed-off-by: Juan Quintela <quintela>

but the patch looks fine to me. Hmm.
This has nothing to do with where the bisect ended up. It's a race, so many unrelated changes can shift its timing, which makes the bisect result invalid. Fix posted upstream:
Subject: [PATCH] migration/rdma: unegister fd handler
Merged upstream as fbbaacab2758cb3f32a07524710533b1d6422be4
Defining ITR as 8.0.0.0; please change this if it's not accurate.
Fix included in qemu-kvm-3.1.0-9.module+el8+2731+e40e7b84
Verified with qemu-kvm-3.1.0-15.module+el8+2792+e33e01a0: the guest works well after RDMA migration, and there is no core dump.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1293