Bug 513317

Summary: PCI passthrough with KVM guest causes libvirtd to die
Product: Red Hat Enterprise Linux 5 Reporter: zhanghaiyan <yoyzhang>
Component: libvirt Assignee: Daniel Berrangé <berrange>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: low    
Version: 5.4 CC: berrange, llim, markmc, mshao, sghosh, veillard, virt-maint, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-02 09:22:39 UTC Type: ---
Attachments:
Description        Flags
gdb.log            none
test1-kvm.xml      none
test1.log          none
nodedev-list       none
nodedev-dumpxml    none

Description zhanghaiyan 2009-07-23 02:05:26 UTC
Description of problem:
PCI passthrough to a KVM guest fails and can cause libvirtd to die.

Version-Release number of selected component (if applicable):
- libvirt-0.6.3-15.el5
- xen-3.0.3-90.el5
- kvm-83-90.el5
- rhel-5.4 (2.6.18-158.el5)

How reproducible:
100%

Steps to Reproduce:
1. # virsh nodedev-dettach pci_8086_10bd
Device pci_8086_10bd dettached

2. # virsh nodedev-reset pci_8086_10bd
Device pci_8086_10bd reset

3. # virsh edit demo
Domain demo XML configuration edited.

        <hostdev mode='subsystem' type='pci' managed='no'>
          <source>
           <address bus='0x00' slot='0x25' function='0x00'/>
          </source>
        </hostdev>

4. # virsh start demo
error: Failed to start domain demo
error: server closed connection

5. # virsh
error: unable to connect to '/var/run/libvirt/libvirt-sock': Connection refused
error: failed to connect to the hypervisor

6. # service libvirtd status
libvirtd dead but pid file exists

Actual results:
PCI passthrough fails and causes libvirtd to die.

Expected results:
PCI passthrough succeeds.

Additional info:

Comment 1 Daniel Berrangé 2009-07-23 10:03:44 UTC
I can't reproduce this. Can you attempt to capture a stack trace:

- Install libvirt-debuginfo RPM
- Run 'service libvirtd start'
- Run 'ps -auxfw | grep libvirtd' to find the PID of the libvirtd process
- Start 'gdb'
- In the gdb console, type 'attach <PID-OF-LIBVIRTD>' and then 'cont'


Now in another console attempt to run your test to make libvirtd crash.
When it crashes, go back to the GDB console and type

 'thread apply all backtrace'

And then upload all the data from that as an attachment to this bug.
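
For repeated attempts, the attach / cont / backtrace sequence above could also be driven non-interactively. The sketch below only builds the gdb command line. It assumes a gdb that accepts `-batch`, `-p` and `-ex`, and the helper name `gdb_backtrace_argv` is made up for illustration; it is not part of libvirt or gdb.

```python
import shlex


def gdb_backtrace_argv(pid):
    """Build a gdb command line that attaches to `pid`, continues until the
    process stops (for example on a crash), then prints every thread's
    backtrace."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "gdb", "-batch",
        "-p", str(pid),
        "-ex", "cont",
        "-ex", "thread apply all backtrace",
    ]


# 12345 is a placeholder PID, not a value taken from this report.
print(" ".join(shlex.quote(arg) for arg in gdb_backtrace_argv(12345)))
```

Redirecting the output of that command to a file would give something suitable for attaching here.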


Can you also provide the output of

 'virsh nodedev-list --tree'

And 

  'virsh nodedev-dumpxml pci_8086_10bd'

And finally, the full XML config of the guest, and any /var/log/libvirt/qemu/demo.log that may exist

Comment 2 zhanghaiyan 2009-07-24 05:21:57 UTC
The test result is now a little different: after step 4 (# virsh start demo),
the command hangs.

Attached gdb.log
         nodedev-list
         nodedev-dumpxml
         test1-kvm.xml
         test1.log

Comment 3 zhanghaiyan 2009-07-24 05:23:42 UTC
Created attachment 354972
gdb.log

Comment 4 zhanghaiyan 2009-07-24 05:24:12 UTC
Created attachment 354973
test1-kvm.xml

Comment 5 zhanghaiyan 2009-07-24 05:24:33 UTC
Created attachment 354974
test1.log

Comment 6 zhanghaiyan 2009-07-24 05:25:02 UTC
Created attachment 354975
nodedev-list

Comment 7 zhanghaiyan 2009-07-24 05:25:26 UTC
Created attachment 354976
nodedev-dumpxml

Comment 8 Mark McLoughlin 2009-07-24 11:23:37 UTC
excerpt from stack trace:

#0  0x000000386747268e in free () from /lib64/libc.so.6
#1  0x0000003a66e1865c in virFree (ptrptr=<value optimized out>)
    at memory.c:177
#2  0x0000003a66e1890a in pciReadDeviceID (dev=<value optimized out>, 
    id_name=<value optimized out>) at pci.c:839
#3  0x0000003a66e18a01 in pciGetDevice (conn=<value optimized out>, 
    domain=<value optimized out>, bus=<value optimized out>, 
    slot=<value optimized out>, function=<value optimized out>) at pci.c:875
#4  0x0000000000420d15 in qemudStartVMDaemon (conn=0x5e30f50, 
    driver=0x5d8f950, vm=0x5e31090, migrateFrom=0x0, stdin_fd=-1)
    at qemu_driver.c:1251

Comment 9 Daniel Berrangé 2009-07-24 11:28:44 UTC
Ah, that will probably be this upstream bug fix

http://libvirt.org/git/?p=libvirt.git;a=commit;h=4a7acedd3c59a6a750576cb8680bc3f08fe0b52c


IIRC it triggers if you configure a PCI device that does not actually exist on the host
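
Whether a given PCI address exists on the host can be checked before starting the guest. A minimal sketch, assuming a Linux host with sysfs mounted at /sys; the function names are hypothetical and this is not how libvirt itself performs the check:

```python
import os


def pci_sysfs_path(domain, bus, slot, function):
    """Return the sysfs entry Linux uses for a PCI address.

    sysfs names PCI devices DDDD:BB:SS.F, with every field in hexadecimal.
    """
    return "/sys/bus/pci/devices/%04x:%02x:%02x.%x" % (domain, bus, slot, function)


def pci_device_exists(domain, bus, slot, function):
    """True if the host kernel currently exposes a device at this address."""
    return os.path.exists(pci_sysfs_path(domain, bus, slot, function))


# Placeholder address; substitute the values from the guest's <hostdev> element.
print(pci_sysfs_path(0, 0, 0x1f, 2), pci_device_exists(0, 0, 0x1f, 2))
```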

Comment 10 Daniel Berrangé 2009-07-24 12:17:20 UTC
Yep, confirmed here

The device being attached is in slot 25 (decimal):

<device>
  <name>pci_8086_10bd</name>
  <parent>computer</parent>
  <capability type='pci'>
    <domain>0</domain>
    <bus>0</bus>
    <slot>25</slot>
    <function>0</function>
    <product id='0x10bd'>82566DM-2 Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
  </capability>
</device>


The guest XML has been configured with slot 0x25 (hexadecimal), which is slot 37 in decimal, not slot 25:

    <source>
        <address bus='0x00' slot='0x25' function='0x00'/>
    </source>

It is a shame that the node-device XML prints decimal rather than hexadecimal when dumping XML, but that's life. When reading XML, both the domain and node-device parsers accept any number base.


So changing the XML to

    <source>
        <address bus='00' slot='25' function='0'/>
    </source>

should avoid the crash, but clearly we should still fix this.
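
To make the base mix-up concrete, here is a small sketch. It uses Python's `int(text, 0)` as a stand-in for a parser that infers the base from the prefix; this is an assumption for illustration, not libvirt's own parsing code, and Python rejects some inputs (such as '025') that a C parser might read as octal.

```python
def parse_pci_field(text):
    """Parse an address attribute, inferring the base from its prefix."""
    return int(text, 0)


nodedev_slot = parse_pci_field("25")    # decimal, as nodedev-dumpxml prints it
guest_slot = parse_pci_field("0x25")    # hex, as written in the guest XML

print(nodedev_slot, guest_slot)         # prints: 25 37
print(hex(nodedev_slot))                # prints: 0x19
```

So slot='25' and slot='0x19' name the same device, while slot='0x25' names slot 37.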

Comment 11 zhanghaiyan 2009-07-27 09:32:40 UTC
I tried with this XML:
   <source>
        <address bus='00' slot='25' function='0'/>
   </source>

Yes, PCI passthrough now succeeds.
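
To avoid converting by hand, the address could be derived from the node-device XML directly. A minimal sketch using the (abridged) nodedev-dumpxml output from comment 10; the function name `hostdev_address` is hypothetical, and it assumes the bus, slot and function elements are printed in decimal as they are in this report:

```python
import xml.etree.ElementTree as ET

# Abridged from the nodedev-dumpxml output quoted in comment 10.
NODEDEV_XML = """<device>
  <name>pci_8086_10bd</name>
  <capability type='pci'>
    <domain>0</domain>
    <bus>0</bus>
    <slot>25</slot>
    <function>0</function>
  </capability>
</device>"""


def hostdev_address(nodedev_xml):
    """Turn decimal node-device PCI fields into a hex <address/> element."""
    cap = ET.fromstring(nodedev_xml).find("capability[@type='pci']")
    if cap is None:
        raise ValueError("no PCI capability in node device XML")
    # The fields are read as decimal, then written back out as hex.
    bus, slot, function = (int(cap.findtext(tag)) for tag in ("bus", "slot", "function"))
    return "<address bus='0x%02x' slot='0x%02x' function='0x%02x'/>" % (bus, slot, function)


print(hostdev_address(NODEDEV_XML))
# prints: <address bus='0x00' slot='0x19' function='0x00'/>
```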

Comment 13 Daniel Veillard 2009-07-28 16:03:26 UTC
libvirt-0.6.3-17.el5 has been built in dist-5E-qu-candidate with the fix

Daniel

Comment 16 Yewei Shao 2009-07-29 07:32:09 UTC
Verified on libvirt-0.6.3-15.el5, cannot reproduce this bug

Comment 17 zhanghaiyan 2009-07-29 07:33:17 UTC
Update comment #16.
Verified on libvirt-0.6.3-17.el5, cannot reproduce this bug

Comment 19 errata-xmlrpc 2009-09-02 09:22:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1269.html