Bug 1029743

Summary: qemu-kvm core dump after hot plug/unplug 82576 PF about 100 times
Product: Red Hat Enterprise Linux 7 Reporter: Xu Han <xuhan>
Component: qemu-kvmAssignee: Alex Williamson <alex.williamson>
Status: CLOSED CURRENTRELEASE QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: medium    
Version: 7.0CC: acathrow, areis, bdas, chayang, hhuang, juzhang, mazhang, michen, virt-maint, xfu, xuhan
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: qemu-kvm-1.5.3-22.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-06-13 10:11:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Xu Han 2013-11-13 05:24:25 UTC
Description of problem:
Test three times.
The first time occurred after 115th hot plug.
The second time occurred after 112th hot plug.
The third time occurred after 98th hot plug.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-1.5.3-18.el7.x86_64
kernel-3.10.0-48.el7.x86_64

How reproducible:
3/3

Steps to Reproduce:
1. boot guest.
# /usr/libexec/qemu-kvm -nodefaults -M pc -m 4G -cpu Opteron_G3 -smp 4,cores=2,threads=2,sockets=1 -boot menu=on -monitor stdio -rtc base=localtime,clock=host,driftfix=slew -qmp tcp:0:5555,server,nowait -vga qxl -drive file=/home/win2008r2.qcow2_v3,format=qcow2,id=guest-img,if=none,werror=stop,rerror=stop -device virtio-blk-pci,scsi=off,drive=guest-img,id=os-disk -spice disable-ticketing,port=5931 -device virtio-balloon,id=balloon -netdev tap,id=hostnet0,script=/etc/ovs-ifup,downscript=/etc/ovs-ifdown -device e1000,netdev=hostnet0,id=net0,mac=00:1a:4a:42:0b:00 -monitor unix:/tmp/guest-sock,server,nowait

2. do hot plug/unplug PF in a loop via hmp
# cat start.sh
for i in $(seq 500)
do
    echo "device_add vfio-pci,host=01:00.0,id=pf0" | nc -U /tmp/guest-sock
    sleep 10
    echo "device_del pf0" | nc -U /tmp/guest-sock
    sleep 10
done

Actual results:
after , qemu-kvm core dump.
qemu-kvm: vfio: Error: event_notifier_init failed
qemu-kvm: /builddir/build/BUILD/qemu-1.5.3/kvm-all.c:967: kvm_irqchip_commit_routes: Assertion `ret == 0' failed.

Expected results:
can do repeatedly hot plug/unplug with no error.

Additional info:
(gdb) bt
#0  0x00007ffff30db999 in raise () from /lib64/libc.so.6
#1  0x00007ffff30dd0a8 in abort () from /lib64/libc.so.6
#2  0x00007ffff30d4906 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ffff30d49b2 in __assert_fail () from /lib64/libc.so.6
#4  0x00005555557c0c20 in kvm_irqchip_commit_routes (s=<optimized out>)
    at /usr/src/debug/qemu-1.5.3/kvm-all.c:967
#5  0x00005555557c0ee1 in kvm_irqchip_add_msi_route (s=0x555556502460, msg=...)
    at /usr/src/debug/qemu-1.5.3/kvm-all.c:1178
#6  0x00005555557ac5e9 in vfio_msix_vector_do_use (pdev=0x555556755c00, nr=5, 
    msg=msg@entry=0x7fffeab06a00, 
    handler=handler@entry=0x5555557ab0b0 <vfio_msi_interrupt>)
    at /usr/src/debug/qemu-1.5.3/hw/misc/vfio.c:660
#7  0x00005555557ac78e in vfio_msix_vector_use (pdev=<optimized out>, 
    nr=<optimized out>, msg=...)
    at /usr/src/debug/qemu-1.5.3/hw/misc/vfio.c:715
#8  0x00005555556ba27a in msix_set_notifier_for_vector (vector=5, 
    dev=0x555556755c00) at hw/pci/msix.c:508
#9  msix_set_vector_notifiers (dev=dev@entry=0x555556755c00, 
    use_notifier=use_notifier@entry=0x5555557ac760 <vfio_msix_vector_use>, 
    release_notifier=release_notifier@entry=0x5555557ab350 <vfio_msix_vector_release>, poll_notifier=poll_notifier@entry=0x0) at hw/pci/msix.c:535
#10 0x00005555557aff91 in vfio_enable_msix (vdev=0x555556755c00)
    at /usr/src/debug/qemu-1.5.3/hw/misc/vfio.c:792
#11 vfio_pci_write_config (pdev=0x555556755c00, addr=114, val=32777, len=2)
    at /usr/src/debug/qemu-1.5.3/hw/misc/vfio.c:2022
#12 0x00005555557c4142 in access_with_adjusted_size (addr=addr@entry=2, 
    value=value@entry=0x7fffeab06b58, size=2, access_size_min=<optimized out>, 
    access_size_max=<optimized out>, 
    access=access@entry=0x5555557c4700 <memory_region_write_accessor>, 
    opaque=opaque@entry=0x5555566b6888)
    at /usr/src/debug/qemu-1.5.3/memory.c:364
#13 0x00005555557c5617 in memory_region_iorange_write (
    iorange=<optimized out>, offset=2, width=2, data=32777)
    at /usr/src/debug/qemu-1.5.3/memory.c:439
#14 0x00005555557c2ef5 in kvm_handle_io (count=1, size=2, direction=1, 
    data=<optimized out>, port=3326)
    at /usr/src/debug/qemu-1.5.3/kvm-all.c:1500
#15 kvm_cpu_exec (env=env@entry=0x55555669a250)
    at /usr/src/debug/qemu-1.5.3/kvm-all.c:1649
#16 0x000055555576e6f5 in qemu_kvm_cpu_thread_fn (arg=0x55555669a250)
    at /usr/src/debug/qemu-1.5.3/cpus.c:793
#17 0x00007ffff6055de3 in start_thread () from /lib64/libpthread.so.0
#18 0x00007ffff319c1ad in clone () from /lib64/libc.so.6

Comment 1 Xu Han 2013-11-13 05:27:31 UTC
Found call trace in host dmesg.

[11150.023832] ------------[ cut here ]------------
[11150.023843] WARNING: at net/core/dev.c:5018 rollback_registered_many+0x1e2/0x210()
[11150.023845] Modules linked in: tun igbvf vfio_pci vfio_iommu_type1 vfio xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables openvswitch vxlan ip_tunnel gre sg ipmi_devintf dcdbas kvm_amd kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd microcode serio_raw pcspkr amd64_edac_mod edac_mce_amd fam15h_power k10temp edac_core sp5100_tco i2c_piix4 igb
[11150.023888]  ipmi_si ptp bnx2 pps_core ipmi_msghandler dca shpchp acpi_power_meter mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sd_mod crc_t10dif crct10dif_common sr_mod cdrom ata_generic pata_acpi mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm ahci drm mptsas pata_atiixp libahci scsi_transport_sas crc32c_intel mptscsih libata i2c_core mptbase dm_mirror dm_region_hash dm_log dm_mod
[11150.023918] CPU: 1 PID: 4947 Comm: qemu-kvm Not tainted 3.10.0-48.el7.x86_64 #1
[11150.023920] Hardware name: Dell Inc. PowerEdge R415/0YFWR8, BIOS 1.8.6 12/06/2011
[11150.023921]  0000000000000009 ffff8807616d3b38 ffffffff815abfff ffff8807616d3b70
[11150.023926]  ffffffff810589d1 ffff8808226b4000 ffff8807616d3bb8 ffff8807616d3bb8
[11150.023930]  ffff8808287006c0 0000000000000001 ffff8807616d3b80 ffffffff81058aaa
[11150.023933] Call Trace:
[11150.023940]  [<ffffffff815abfff>] dump_stack+0x19/0x1b
[11150.023946]  [<ffffffff810589d1>] warn_slowpath_common+0x61/0x80
[11150.023949]  [<ffffffff81058aaa>] warn_slowpath_null+0x1a/0x20
[11150.023951]  [<ffffffff8149eaf2>] rollback_registered_many+0x1e2/0x210
[11150.023954]  [<ffffffff8149eb51>] rollback_registered+0x31/0x40
[11150.023956]  [<ffffffff8149fc08>] unregister_netdevice_queue+0x48/0x90
[11150.023961]  [<ffffffffa065a31e>] __tun_detach+0x10e/0x2a0 [tun]
[11150.023964]  [<ffffffffa065a4dd>] tun_chr_close+0x2d/0x50 [tun]
[11150.023968]  [<ffffffff8119f3a9>] __fput+0xe9/0x270
[11150.023971]  [<ffffffff8119f65e>] ____fput+0xe/0x10
[11150.023987]  [<ffffffff8107b544>] task_work_run+0xc4/0xe0
[11150.024004]  [<ffffffff8105e68b>] do_exit+0x2bb/0xa30
[11150.024017]  [<ffffffff8105ee7f>] do_group_exit+0x3f/0xa0
[11150.024028]  [<ffffffff8106dab0>] get_signal_to_deliver+0x1d0/0x6e0
[11150.024033]  [<ffffffff81012408>] do_signal+0x48/0x5e0
[11150.024037]  [<ffffffff81012a11>] do_notify_resume+0x71/0xc0
[11150.024040]  [<ffffffff815bba52>] int_signal+0x12/0x17
[11150.024043] ---[ end trace c68788361608cdef ]---

Comment 3 Xu Han 2013-11-13 05:45:38 UTC
Call trace in Comment 1 caused by openvswitch.

Comment 4 Xu Han 2013-11-14 01:56:27 UTC
Test with 82576 VF.

Result:
After 333th hot plug, qemu-kvm core dump.

Comment 5 Alex Williamson 2013-11-14 02:38:24 UTC
I don't think comment 1 is related to the problem described in comment 0.  Have you seen comment 1 consistently in relation to the core dump?  My guess is that we're leaking irq routes somewhere in qemu-kvm and after enough hotplugs we ask for too many, kvm says no, and due to the spectacular error handling in qemu-kvm, we give up with an assert.  Please confirm whether or not the host dmesg in comment 1 is repeatable in relation to the original problem.

Comment 6 Xu Han 2013-11-14 03:02:51 UTC
Sorry, my mistake. As what I said in Comment 3, this call trace actually caused by some error of openvswitch, and was not related to this issue. So please ignore it.

Comment 7 Alex Williamson 2013-11-14 03:13:02 UTC
Sorry, I skimmed right over comment 3.  Thanks.

Comment 8 Alex Williamson 2013-11-20 21:00:20 UTC
Can QE provide access to a system that produces this?  I'm unable to reproduce in my environment.

Comment 12 Alex Williamson 2013-11-22 19:35:19 UTC
Patch posted upstream:

http://lists.nongnu.org/archive/html/qemu-devel/2013-11/msg02767.html

Comment 17 mazhang 2013-12-25 06:03:15 UTC
Reproduced this bug.

Host:
qemu-kvm-rhev-1.5.3-18.el7.x86_64
kernel-3.10.0-61.el7.x86_64

Guest:
win2008r2-64

Cli:
gdb --args /usr/libexec/qemu-kvm -nodefaults -M pc -m 4G -cpu Opteron_G1 -smp 4,cores=2,threads=2,sockets=1 -boot menu=on -monitor stdio -rtc base=localtime,clock=host,driftfix=slew -qmp tcp:0:5555,server,nowait -vga qxl -drive file=/home/win2008r2-64.qcow2,format=qcow2,id=guest-img,if=none,werror=stop,rerror=stop -device virtio-blk-pci,scsi=off,drive=guest-img,id=os-disk -spice disable-ticketing,port=5900 -device virtio-balloon,id=balloon -monitor unix:/tmp/guest-sock,server,nowait

Steps:
Start qemu-kvm, do hotplug/hotunplug pf.
[root@dell-per415-03 ~]# cat hotplug.sh 
#!/bin/bash
for i in $(seq 500)
do
    echo "device_add vfio-pci,host=01:00.0,id=pf0" | nc -U /tmp/guest-sock
    sleep 1
    echo "device_del pf0" | nc -U /tmp/guest-sock
    sleep 5
done


Result:
Qemu-kvm will aborted.

qemu-kvm: /builddir/build/BUILD/qemu-1.5.3/kvm-all.c:967: kvm_irqchip_commit_routes: Assertion `ret == 0' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffeab07700 (LWP 17852)]
0x00007ffff30db979 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff30db979 in raise () from /lib64/libc.so.6
#1  0x00007ffff30dd088 in abort () from /lib64/libc.so.6
#2  0x00007ffff30d48e6 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ffff30d4992 in __assert_fail () from /lib64/libc.so.6
#4  0x00005555557c0c20 in kvm_irqchip_commit_routes ()
#5  0x00005555557c0ee1 in kvm_irqchip_add_msi_route ()
#6  0x00005555557ac5e9 in vfio_msix_vector_do_use ()
#7  0x00005555557ac78e in vfio_msix_vector_use ()
#8  0x00005555556ba27a in msix_set_vector_notifiers ()
#9  0x00005555557aff91 in vfio_pci_write_config ()
#10 0x00005555557c4142 in access_with_adjusted_size ()
#11 0x00005555557c5617 in memory_region_iorange_write ()
#12 0x00005555557c2ef5 in kvm_cpu_exec ()
#13 0x000055555576e6f5 in qemu_kvm_cpu_thread_fn ()
#14 0x00007ffff6055de3 in start_thread () from /lib64/libpthread.so.0
#15 0x00007ffff319c16d in clone () from /lib64/libc.so.6

Comment 18 mazhang 2013-12-25 07:11:52 UTC
Verified this bug with qemu-kvm-1.5.3-28.el7.x86_64

Host:
qemu-kvm-1.5.3-28.el7.x86_64
kernel-3.10.0-61.el7.x86_64

Guest:
win2008r2-64

Steps:
Hotplug/hotunplug PF 500 times, guest works well, no crash.

So this bug has been fixed.

Comment 20 Ludek Smid 2014-06-13 10:11:56 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.