Bug 738519

Summary: Core dump when hotplug/hotunplug usb controller more than 1000 times
Product: Red Hat Enterprise Linux 6 Reporter: FuXiangChun <xfu>
Component: qemu-kvm Assignee: Alex Williamson <alex.williamson>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.2 CC: acathrow, juzhang, knoel, michen, minovotn, mkenneth, qzhou, shu, tburke, virt-maint
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: qemu-kvm-0.12.1.2-2.231.el6 Doc Type: Bug Fix
Doc Text:
Cause: Running a guest and then hot-plugging and hot-unplugging a USB controller more than 1000 times. Consequence: qemu-kvm dumps core. Fix: Unregistering of MMIO BARs was implemented. Previously the BARs were registered but never unregistered, which caused a leak. Result: qemu-kvm keeps running, and USB controller hot-plug and hot-unplug keep working.
Last Closed: 2012-06-20 11:34:24 UTC Type: ---

Description FuXiangChun 2011-09-15 04:48:30 UTC
Description of problem:
Boot a guest, then hot-plug and hot-unplug a USB controller more than 1000 times. qemu-kvm dumps core.

Version-Release number of selected component (if applicable):
host info:
# uname -r
2.6.32-191.el6.x86_64
# rpm -qa|grep kvm
qemu-kvm-0.12.1.2-2.188.el6.x86_64

guest info:
rhel6.2 (64 bit)

How reproducible:
always

Steps to Reproduce:
1. Unbind a USB controller on the host.
2. Boot the guest without a USB controller:
/usr/libexec/qemu-kvm  -m 4G -smp 4 -netdev tap,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:94:a3:8b -uuid 7c73a852-c316-4d61-b913-9dde17367a30  -drive file=/dev/migrate/data2,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,drive=drive-virtio-disk0,id=virtio-blk-pci0 -boot c -spice disable-ticketing,port=5911  -vga qxl -qmp tcp:0:6666,server,nowait 

3. Hot-plug and hot-unplug the USB controller 2000 times:
  (1) device_add driver=pci-assign host=00:1d.0 id=usb100 iommu=1
  (2) device_del id=usb100
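
The loop in step 3 can be scripted. The following is a minimal sketch, not the script used for this report. `send` is a hypothetical callable that delivers one monitor command line to the guest's monitor (the transport is left out), and the cycle count and delay are illustrative parameters. The command strings are the ones from step 3.

```python
import time


def hotplug_cycles(send, cycles, delay=5.0, sleep=time.sleep,
                   host="00:1d.0", dev_id="usb100"):
    """Hot-plug and hot-unplug an assigned device `cycles` times.

    `send` is a hypothetical callable taking one monitor command line;
    it stands in for whatever transport reaches the monitor.
    Returns the number of completed plug/unplug cycles.
    """
    add_cmd = "device_add driver=pci-assign host=%s id=%s iommu=1" % (host, dev_id)
    del_cmd = "device_del id=%s" % dev_id
    done = 0
    for _ in range(cycles):
        send(add_cmd)
        sleep(delay)   # pause after the hot-plug request
        send(del_cmd)
        sleep(delay)   # pause after the hot-unplug request
        done += 1
    return done


if __name__ == "__main__":
    # Dry run: print the commands instead of sending them anywhere.
    hotplug_cycles(print, cycles=2, delay=0.0)
```

Passing `sleep` and `send` as parameters keeps the loop independent of any particular monitor connection and makes it easy to dry-run.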

Actual results:
qemu-kvm dumps core.

Expected results:
The guest keeps working.

Additional info:

Backtrace:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7037700 (LWP 31970)]
0x0000000000470cc4 in slow_bar_readl (opaque=0x2157298, addr=44) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/device-assignment.c:195

(gdb) bt
#0  0x0000000000470cc4 in slow_bar_readl (opaque=0x2157298, addr=44) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/device-assignment.c:195
#1  0x00000000004eca2c in cpu_physical_memory_rw (addr=<value optimized out>, buf=<value optimized out>, len=4, is_write=0) at /usr/src/debug/qemu-kvm-0.12.1.2/exec.c:3546
#2  0x000000000042bd1c in handle_mmio (env=0x10903b0) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:868
#3  kvm_run (env=0x10903b0) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1020
#4  0x000000000042c009 in kvm_cpu_exec (env=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1699
#5  0x000000000042ce5f in kvm_main_loop_cpu (_env=0x10903b0) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1968
#6  ap_main_loop (_env=0x10903b0) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2018
#7  0x000000340f6077e1 in start_thread () from /lib64/libpthread.so.0
#8  0x000000340eee578d in clone () from /lib64/libc.so.6

Comment 2 Gerd Hoffmann 2011-09-15 14:26:06 UTC
"device_add driver=pci-assign host=00:1d.0 id=usb100 iommu=1"

That looks more like a PCI passthrough issue than a USB emulation issue, reassigning ...

Comment 3 FuXiangChun 2011-09-16 10:39:19 UTC
I added a 5-second sleep between hot-plug and hot-unplug, and another 5-second sleep before each hot-plug, but qemu-kvm still dumps core.

Comment 4 Alex Williamson 2011-09-20 19:03:49 UTC
Can you please provide output of 'sudo lspci -vvv -s 1d.0' to identify the USB device being used?

Comment 5 FuXiangChun 2011-09-21 01:42:50 UTC
(In reply to comment #4)
> Can you please provide output of 'sudo lspci -vvv -s 1d.0' to identify the USB
> device being used?

# lspci -vvv -s 00:1a.0
00:1a.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 04) (prog-if 20 [EHCI])
	Subsystem: Dell Device 0498
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at dad70000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Debug port: BAR=1 offset=00a0
	Capabilities: [98] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-
	Kernel driver in use: ehci_hcd

Comment 6 Alex Williamson 2011-09-21 02:23:02 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > Can you please provide output of 'sudo lspci -vvv -s 1d.0' to identify the USB
> > device being used?
> 
> # lspci -vvv -s 00:1a.0
> 00:1a.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family

Comment 0 indicates device 00:1d.0 is being used, can you please confirm which device caused the problem, or maybe they both can trigger the bug?  Thanks.

Comment 7 FuXiangChun 2011-09-21 02:45:35 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > Can you please provide output of 'sudo lspci -vvv -s 1d.0' to identify the USB
> > > device being used?
> > 
> > # lspci -vvv -s 00:1a.0
> > 00:1a.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family
> 
> Comment 0 indicates device 00:1d.0 is being used, can you please confirm which
> device caused the problem, or maybe they both can trigger the bug?  Thanks.

Sorry, I just confirmed it again: device 00:1d.0 is the one being used.

# lspci -vvv -s 00:1d.0
00:1d.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 04) (prog-if 20 [EHCI])
    Subsystem: Dell Device 0498
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 17
    Region 0: Memory at dad50000 (32-bit, non-prefetchable) [size=1K]
    Capabilities: [50] Power Management version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Debug port: BAR=1 offset=00a0
    Capabilities: [98] PCI Advanced Features
        AFCap: TP+ FLR+
        AFCtrl: FLR-
        AFStatus: TP-
    Kernel driver in use: ehci_hcd

Comment 10 Alex Williamson 2012-02-06 21:00:00 UTC
Please re-test with this qemu-kvm rpm:

https://brewweb.devel.redhat.com/taskinfo?taskID=4012380

I was able to reproduce the result, but not the exact scenario you describe in comment 0.  The bug I found is a resource leak that results in a segfault once we overflow an internal resource.  The inconsistency with your report is that this will occur at ~500 hotplug/unplug operations, not 1000 or 2000 as indicated here.  Were you only able to get these high counts when not using a sleep between each hotplug and hotunplug operation?  In comment 3 you indicate you added a sleep 5 for each, did you then get a failure after approximately 500 operations?
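
To illustrate the failure mode described above, here is a toy model of a fixed-size handler table that fills up when slots are claimed on hot-plug but never released on hot-unplug. It is a sketch under stated assumptions, not the qemu-kvm code: the class, the function names, and the table size of 512 are hypothetical, with the size chosen only to land near the ~500 figure mentioned in this comment.

```python
class SlotTable:
    """Fixed-size table of handler slots (toy model).

    The capacity is an assumption made for illustration only.
    """

    def __init__(self, capacity=512):
        self.used = [False] * capacity

    def register(self):
        """Claim a free slot; return its index, or -1 if the table is full."""
        for i, busy in enumerate(self.used):
            if not busy:
                self.used[i] = True
                return i
        return -1

    def unregister(self, index):
        """Release a previously claimed slot."""
        self.used[index] = False


def cycles_until_full(table, max_cycles, unregister_on_unplug):
    """Simulate plug/unplug cycles of a device that needs one slot.

    Returns the 1-based cycle at which registration first fails,
    or None if all `max_cycles` cycles succeed.
    """
    for cycle in range(1, max_cycles + 1):
        slot = table.register()        # hot-plug: claim a slot
        if slot < 0:
            return cycle               # table exhausted
        if unregister_on_unplug:
            table.unregister(slot)     # hot-unplug releases the slot
        # otherwise the slot is never released: this is the leak
    return None
```

With `unregister_on_unplug=False` the model runs out of slots after a number of cycles equal to the table size; with `True` the same slot is reused and the loop never fails.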

Comment 11 FuXiangChun 2012-02-13 09:38:33 UTC
(In reply to comment #10)
> Please re-test with this qemu-kvm rpm:
> 
> https://brewweb.devel.redhat.com/taskinfo?taskID=4012380
> 
> I was able to reproduce the result, but not the exact scenario you describe in
> comment 0.  The bug I found is a resource leak that results in a segfault once
> we overflow an internal resource.  The inconsistency with your report is that
> this will occur at ~500 hotplug/unplug operations, not 1000 or 2000 as
> indicated here.  Were you only able to get these high counts when not using a
> sleep between each hotplug and hotunplug operation?  In comment 3 you indicate
> you added a sleep 5 for each, did you then get a failure after approximately
> 500 operations?

Sorry for the late reply. I can only reproduce this bug on a Sandy Bridge host, so I will get access to one and re-test as soon as possible.

Comment 12 FuXiangChun 2012-02-16 05:30:07 UTC
(In reply to comment #10)
> Please re-test with this qemu-kvm rpm:
> 
> https://brewweb.devel.redhat.com/taskinfo?taskID=4012380
> 
> I was able to reproduce the result, but not the exact scenario you describe in
> comment 0.  The bug I found is a resource leak that results in a segfault once
> we overflow an internal resource.  The inconsistency with your report is that
> this will occur at ~500 hotplug/unplug operations, not 1000 or 2000 as
> indicated here.  Were you only able to get these high counts when not using a
> sleep between each hotplug and hotunplug operation?  In comment 3 you indicate
> you added a sleep 5 for each, did you then get a failure after approximately
> 500 operations?

Test scenarios:
1. I re-tested this bug with the qemu-kvm build below. Result: qemu-kvm works well (no core dump).
 https://brewweb.devel.redhat.com/taskinfo?taskID=4012380

2. Without a sleep between each hot-plug and hot-unplug operation, the bug can sometimes (not 100% of the time) be reproduced; it still takes about 1000 hot-plug/hot-unplug operations to reproduce.

Comment 13 Alex Williamson 2012-02-16 05:47:04 UTC
(In reply to comment #12)
> 
> 2.without sleep between each hotplug and hotunplug operation
>  sometimes(not 100%) can reproduce it,it still need to hotplug/unhotplug about
> 1000 times when reproducing.

This is not a realistic usage scenario test.  PCI device hotplug occurs asynchronous to the device_del command, so you could very well be trying to add the device back before it's been removed.  All hotplug testing should currently be done with a delay between each operation.
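
Because removal completes asynchronously, a test loop could poll for the device to disappear before re-adding it, as an alternative to a fixed delay. The sketch below is only an illustration of that idea: `is_present` is a hypothetical callable that reports whether the device is still attached (how it queries the guest or the monitor is not shown), and the timeout and interval values are arbitrary.

```python
import time


def wait_until_removed(is_present, timeout=30.0, interval=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until `is_present()` returns False or `timeout` seconds pass.

    `is_present` is a hypothetical check supplied by the caller.
    Returns True if the device disappeared in time, False otherwise.
    """
    deadline = clock() + timeout
    while is_present():
        if clock() >= deadline:
            return False
        sleep(interval)
    return True
```

A caller would issue `device_del`, call `wait_until_removed(...)`, and only issue the next `device_add` if it returns True.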

Comment 14 FuXiangChun 2012-02-16 07:36:36 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > 
> > 2.without sleep between each hotplug and hotunplug operation
> >  sometimes(not 100%) can reproduce it,it still need to hotplug/unhotplug about
> > 1000 times when reproducing.
> 
> This is not a realistic usage scenario test.  PCI device hotplug occurs
> asynchronous to the device_del command, so you could very well be trying to add
> the device back before it's been removed.  All hotplug testing should currently
> be done with a delay between each operation.

With a delay of 1 or 2 seconds between operations, testing gives the same result (about 1000 times).

Comment 16 Alex Williamson 2012-02-16 13:20:40 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > (In reply to comment #12)
> > > 
> > > 2.without sleep between each hotplug and hotunplug operation
> > >  sometimes(not 100%) can reproduce it,it still need to hotplug/unhotplug about
> > > 1000 times when reproducing.
> > 
> > This is not a realistic usage scenario test.  PCI device hotplug occurs
> > asynchronous to the device_del command, so you could very well be trying to add
> > the device back before it's been removed.  All hotplug testing should currently
> > be done with a delay between each operation.
> 
> if delay 1 second or 2 seconds between each operation. testing get the same
> result(about 1000 times).

Is it also a segfault?  Can you run in gdb and provide the backtrace to see if it's the same as Comment 0?

Comment 19 FuXiangChun 2012-02-17 01:28:51 UTC
(In reply to comment #16)
> (In reply to comment #14)
> > (In reply to comment #13)
> > > (In reply to comment #12)
> > > > 
> > > > 2.without sleep between each hotplug and hotunplug operation
> > > >  sometimes(not 100%) can reproduce it,it still need to hotplug/unhotplug about
> > > > 1000 times when reproducing.
> > > 
> > > This is not a realistic usage scenario test.  PCI device hotplug occurs
> > > asynchronous to the device_del command, so you could very well be trying to add
> > > the device back before it's been removed.  All hotplug testing should currently
> > > be done with a delay between each operation.
> > 
> > if delay 1 second or 2 seconds between each operation. testing get the same
> > result(about 1000 times).
> 
> Is it also a segfault?  Can you run in gdb and provide the backtrace to see if
> it's the same as Comment 0?

Sorry that my previous comments were confusing. To clarify: with your build, everything works well after 1000 hot-plug/hot-unplug cycles.

Comment 20 FuXiangChun 2012-02-17 09:46:10 UTC
Verified this bug with qemu-kvm-0.12.1.2-2.231.el6.
qemu-kvm and the guest work well.

So this bug is fixed.

Comment 22 Michal Novotny 2012-05-03 17:38:28 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
Running a guest and then hot-plugging and hot-unplugging a USB controller more than 1000 times.

Consequence:
qemu-kvm dumps core.

Fix:
Unregistering of MMIO BARs was implemented. Previously the BARs were registered but never unregistered, which caused a leak.

Result:
qemu-kvm keeps running, and USB controller hot-plug and hot-unplug keep working.

Comment 23 errata-xmlrpc 2012-06-20 11:34:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0746.html