Bug 1666601 - [q35] dst qemu core dumped when do rdma migration with Mellanox IB QDR card
Summary: [q35] dst qemu core dumped when do rdma migration with Mellanox IB QDR card
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.0
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Dr. David Alan Gilbert
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-01-16 07:47 UTC by Yiqian Wei
Modified: 2019-11-12 00:14 UTC

Fixed In Version: qemu-kvm-3.1.0-9.module+el8+2731+e40e7b84
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-29 16:05:29 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2019:1293  0        None      None    None     2019-05-29 16:05:53 UTC

Description Yiqian Wei 2019-01-16 07:47:14 UTC
Description of problem:

The destination (dst) qemu dumps core during an rdma migration with a Mellanox IB QDR card.

Version-Release number of selected component (if applicable):
host version:
qemu-kvm-3.1.0-4.module+el8+2681+819ab34d.x86_64
kernel-4.18.0-60.el8.x86_64
seabios-1.11.1-3.module+el8+2529+a9686a4d.x86_64
virtio-win-prewhql-0.1-163
Guest: Win2019

How reproducible:
5/5

Steps to Reproduce:
1. Boot the guest on the src host.

2. Boot the guest on the dst host, listening for an incoming rdma migration:

  -incoming rdma:0:5555

3. On src, set the migration transfer speed:

(qemu) migrate_set_speed 40G

4. On both src and dst, enable rdma-pin-all:

(qemu) migrate_set_capability rdma-pin-all on

5. Start the migration from src:

(qemu) migrate -d rdma:192.168.0.21:5555
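
The same steps could also be driven over the QMP socket that the reproducer command line opens with -qmp tcp:0:6667. The sketch below only builds the message sequence. It assumes the QMP commands migrate_set_speed, migrate-set-capabilities and migrate as they existed in QEMU 3.1; the helper names (parse_size, src_commands, dst_commands, send_all) are hypothetical and not part of this report, and the conversion of 40G to bytes assumes binary suffixes.

```python
import json
import socket


def parse_size(text):
    """Convert an HMP-style size with a suffix, such as '40G', to bytes."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}
    suffix = text[-1].upper()
    if suffix not in units:
        # Assumption: this sketch only accepts explicit suffixes.
        raise ValueError("expected a K/M/G/T suffix: %r" % text)
    return int(float(text[:-1]) * units[suffix])


def pin_all_command():
    """Step 4: enable rdma-pin-all (sent on both src and dst)."""
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {
            "capabilities": [{"capability": "rdma-pin-all", "state": True}]
        },
    }


def src_commands(uri, speed="40G"):
    """Steps 3 to 5 as QMP messages for the source side."""
    return [
        {"execute": "qmp_capabilities"},  # leave QMP negotiation mode
        {"execute": "migrate_set_speed",
         "arguments": {"value": parse_size(speed)}},
        pin_all_command(),
        {"execute": "migrate", "arguments": {"uri": uri}},
    ]


def dst_commands():
    """Step 4 as QMP messages for the destination side."""
    return [{"execute": "qmp_capabilities"}, pin_all_command()]


def send_all(host, port, commands):
    """Send each command over a QMP TCP socket and collect the replies."""
    replies = []
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rw", encoding="utf-8", newline="\n")
        json.loads(stream.readline())  # discard the QMP greeting banner
        for command in commands:
            stream.write(json.dumps(command) + "\n")
            stream.flush()
            while True:
                message = json.loads(stream.readline())
                # Skip asynchronous events; keep the command's reply.
                if "return" in message or "error" in message:
                    replies.append(message)
                    break
    return replies
```

For example, send_all("localhost", 6667, src_commands("rdma:192.168.0.21:5555")) would replay steps 3 to 5 on the source, provided the QMP socket is reachable at that address.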

Actual results:
After step 5, qemu dumps core on dst.

Expected results:
After step 5, there is no core dump and the guest works well on dst.

Additional info:
(1) pc + seabios does not hit this issue.
(2) The guest was booted with the following command:
/usr/libexec/qemu-kvm \
-M q35,accel=kvm,kernel-irqchip=split \
-device intel-iommu,intremap=on \
-cpu Haswell-noTSX,enforce \
-nodefaults -rtc base=utc \
-m 4G \
-smp 2,sockets=2,cores=1,threads=1 \
-enable-kvm \
-uuid 990ea161-6b67-47b2-b803-19fb01d30d12 \
-k en-us \
-nodefaults \
-boot menu=on \
-qmp tcp:0:6667,server,nowait \
-vga qxl \
-device pcie-root-port,bus=pcie.0,id=root0,slot=1 \
-object secret,id=sec0,data=redhat \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root0 \
-blockdev driver=luks,cache.direct=off,cache.no-flush=on,file.filename=/mnt/back.qcow2,node-name=my_disk,file.driver=file,key-secret=sec0 \
-device scsi-hd,drive=my_disk,bus=virtio_scsi_pci0.0 \
-device pcie-root-port,bus=pcie.0,id=root1,slot=2 \
-device virtio-net-pci,netdev=tap10,mac=9a:6a:6b:6c:6d:6e,bus=root1 -netdev tap,id=tap10 \
-device pcie-root-port,bus=pcie.0,id=root2,slot=3 \
-device nec-usb-xhci,id=usb1,bus=root2 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-monitor stdio \
-vnc :10

Comment 3 Yiqian Wei 2019-01-17 06:32:12 UTC
I can't reproduce this bug with the slow train build (qemu-kvm 2.12).

host version:
qemu-kvm-2.12.0-57.module+el8+2683+02b3b955.x86_64
kernel-4.18.0-60.el8.x86_64
seabios-1.11.1-3.module+el8+2529+a9686a4d.x86_64

Comment 4 Dr. David Alan Gilbert 2019-01-17 14:53:26 UTC
Hi,
  Does this happen only for Windows guests, or does it also happen on a Linux guest?
  Please attach a full backtrace for crashing bugs.

Thanks.

Comment 5 Yiqian Wei 2019-01-18 03:00:28 UTC
(In reply to Dr. David Alan Gilbert from comment #4)
> Hi,
>   Does this happen only for Windows guests, or does it also happen on a
> Linux guest?

    No, it also happens on a Linux guest (RHEL 8 guest).

>   Please attach a full backtrace for crashing bugs.

backtrace:
(gdb) bt
#0  0x00007f72318bbfcc in rdma_get_cm_event.part () from /lib64/librdmacm.so.1
#1  0x00005617cd22ced4 in rdma_cm_poll_handler ()
#2  0x00005617cd340d22 in aio_dispatch_handlers ()
#3  0x00005617cd34162c in aio_dispatch ()
#4  0x00005617cd33e1d2 in aio_ctx_dispatch ()
#5  0x00007f7231f3989d in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#6  0x00005617cd3408a8 in main_loop_wait ()
#7  0x00005617cd133e99 in main_loop ()
#8  0x00005617ccff43f4 in main ()

Comment 6 Dr. David Alan Gilbert 2019-01-18 12:29:11 UTC
Yes, reproduced here going 7->8 on virtlab 414->413:

/usr/libexec/qemu-kvm -M pc-q35-rhel7.6.0,accel=kvm,kernel-irqchip=split -device intel-iommu,intremap=on -cpu host -m 4G -smp 2 -enable-kvm -vga qxl -device pcie-root-port,bus=pcie.0,id=root0,slot=1 -object secret,id=sec0,data=redhat -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root0 -drive if=none,file=/home/vms/f27.qcow2,cache=none,id=disk -device scsi-hd,drive=disk,bus=virtio_scsi_pci0.0 -monitor stdio

(gdb) bt full
#0  0x00007ffff7034fcc in rdma_get_cm_event.part () at /lib64/librdmacm.so.1
#1  0x0000555555a75ed4 in rdma_cm_poll_handler (opaque=0x7fffe806b010) at migration/rdma.c:3236
        rdma = 0x7fffe806b010
        ret = <optimized out>
        cm_event = 0x5555564cc3e0
        mis = 0x5555564e1ee0

(gdb) p mis->state
$3 = 8
which I think is 'completed'
(gdb) p rdma->channel
$5 = (struct rdma_event_channel *) 0x0
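
The guess that state 8 means 'completed' can be checked against the MigrationStatus enum. The sketch below assumes the member order of that enum in upstream QEMU 3.1's qapi/migration.json, written from memory; verify it against the source tree of the build under test before relying on it. The status_name helper is hypothetical.

```python
# Assumed member order of MigrationStatus in QEMU 3.1 (qapi/migration.json).
# The numeric value of each status is its index in this list.
MIGRATION_STATUS = [
    "none",
    "setup",
    "cancelling",
    "cancelled",
    "active",
    "postcopy-active",
    "postcopy-paused",
    "postcopy-recover",
    "completed",
    "failed",
    "colo",
    "pre-switchover",
    "device",
]


def status_name(value):
    """Map a raw mis->state integer to its assumed status name."""
    if 0 <= value < len(MIGRATION_STATUS):
        return MIGRATION_STATUS[value]
    return "unknown(%d)" % value


print(status_name(8))
```

Under that assumption the incoming migration had already finished when the handler fired. If the debug info includes the enum type, casting in gdb with p (MigrationStatus) mis->state should print the symbolic name directly.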

Comment 7 Dr. David Alan Gilbert 2019-01-18 15:43:16 UTC
Broke somewhere between 3.0.0 and 3.1.0 upstream

Comment 8 Dr. David Alan Gilbert 2019-01-18 17:06:22 UTC
git bisect says:

6ef3771c0d070e8f16e12f21e4fbf1ec6459eff6 fails (double check)
6c97ec5f5ad6f65f8a6a9be044c2b875972406e4 good (double check)

and I've double checked them; so this points to:
6ef3771c0d070e8f16e12f21e4fbf1ec6459eff6 is the first bad commit
commit 6ef3771c0d070e8f16e12f21e4fbf1ec6459eff6
Author: Xiao Guangrong <xiaoguangrong>
Date:   Tue Aug 21 16:10:23 2018 +0800
 
    migration: drop the return value of do_compress_ram_page
    
    It is not used and cleans the code up a little
 
    Reviewed-by: Peter Xu <peterx>
    Signed-off-by: Xiao Guangrong <xiaoguangrong>
    Reviewed-by: Juan Quintela <quintela>
    Signed-off-by: Juan Quintela <quintela>

but the patch looks fine to me. hmm.
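
When a bisect lands on a commit that looks harmless, the failure may be intermittent, and then a single passing run does not prove a commit good. One hedge is to repeat the reproducer several times at each bisect step. The sketch below is a hypothetical helper for a script used with git bisect run; the run_once callable and the attempt count are assumptions, and it relies on the git bisect run exit-code convention (0 for good, 1 for bad, 125 for skip).

```python
GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def classify(run_once, attempts=10):
    """Repeat a flaky reproducer and return a bisect exit code.

    run_once() is a hypothetical callable returning True when the
    migration completed cleanly, False when the destination crashed,
    and None when the commit could not be built or tested.
    """
    for _ in range(attempts):
        result = run_once()
        if result is None:
            return SKIP
        if result is False:
            # A single crash is enough to mark the commit bad.
            return BAD
    # Only call a commit good after every attempt passed.
    return GOOD
```

A wrapper script would end with sys.exit(classify(run_once)). More attempts lower the chance of marking a racy commit good by accident, at the cost of a slower bisect; they cannot remove that chance entirely.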

Comment 10 Dr. David Alan Gilbert 2019-01-22 17:49:01 UTC
This has nothing to do with where the bisect ended up. It's a race, so many unrelated changes can affect whether it triggers, which makes the bisect result invalid. Fix posted upstream:

Subject: [PATCH] migration/rdma: unregister fd handler
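
The backtrace in comment 5 and the NULL rdma->channel in comment 6 are consistent with an fd handler that outlives the object it polls. The following is a toy model of that pattern and of the unregister-before-destroy remedy that the patch subject names. It is not QEMU code; EventLoop, Channel, Connection and cleanup are hypothetical names used only for illustration.

```python
class Channel:
    """Stand-in for an event channel bound to a file descriptor."""

    def __init__(self, fd):
        self.fd = fd

    def get_event(self):
        return "event on fd %d" % self.fd


class EventLoop:
    """Minimal fd-to-callback registry, loosely like a main loop."""

    def __init__(self):
        self.handlers = {}

    def set_fd_handler(self, fd, callback):
        # Passing None removes the handler for that fd.
        if callback is None:
            self.handlers.pop(fd, None)
        else:
            self.handlers[fd] = callback

    def dispatch(self, fd):
        callback = self.handlers.get(fd)
        return callback() if callback is not None else None


class Connection:
    def __init__(self, loop, fd):
        self.loop = loop
        self.channel = Channel(fd)
        loop.set_fd_handler(fd, self.poll_handler)

    def poll_handler(self):
        # Dereferences self.channel; fails if cleanup already cleared it.
        return self.channel.get_event()

    def cleanup(self, unregister):
        fd = self.channel.fd
        if unregister:
            # Remove the handler before the channel goes away.
            self.loop.set_fd_handler(fd, None)
        self.channel = None


# Without unregistering, a late dispatch hits the cleared channel.
loop = EventLoop()
conn = Connection(loop, 5)
conn.cleanup(unregister=False)
try:
    loop.dispatch(5)
    stale_call_failed = False
except AttributeError:
    stale_call_failed = True

# With unregistering, the late dispatch is a harmless no-op.
safe_loop = EventLoop()
safe_conn = Connection(safe_loop, 5)
safe_conn.cleanup(unregister=True)
safe_result = safe_loop.dispatch(5)
```

In the model the stale callback raises an AttributeError where the C code would dereference a NULL pointer. Whether the handler runs after cleanup depends on timing, which fits the observation that unrelated commits change whether the crash appears.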

Comment 14 Dr. David Alan Gilbert 2019-01-24 15:24:50 UTC
Merged upstream as fbbaacab2758cb3f32a07524710533b1d6422be4

Comment 16 Danilo de Paula 2019-01-24 16:56:51 UTC
Defining ITR as 8.0.0.0

please change this in case it's not accurate.

Comment 17 Danilo de Paula 2019-01-29 14:10:23 UTC
Fix included in qemu-kvm-3.1.0-9.module+el8+2731+e40e7b84

Comment 18 Yumei Huang 2019-02-19 06:48:56 UTC
Verified with:
qemu-kvm-3.1.0-15.module+el8+2792+e33e01a0

The guest works well after rdma migration, with no core dump.

Comment 20 errata-xmlrpc 2019-05-29 16:05:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1293

