Bug 1245004 - VFIO device hotplug fails and dumpxml does not update for hot unplug
Summary: VFIO device hotplug fails and dumpxml does not update for hot unplug
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.2
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Andrea Bolognani
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: RHEV3.6PPC 1277183 1277184
 
Reported: 2015-07-21 02:25 UTC by Dan Zheng
Modified: 2016-02-21 11:07 UTC (History)
13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-19 06:48:55 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:2202 0 normal SHIPPED_LIVE libvirt bug fix and enhancement update 2015-11-19 08:17:58 UTC

Description Dan Zheng 2015-07-21 02:25:10 UTC
Description of problem:
Two problems:
1. Dumpxml does not get updated when hot-unplugging a physical PCI device.
2. PF hotplug to the guest fails.


Version-Release number of selected component (if applicable):
libvirt-1.2.17-2.el7.ppc64le
qemu-kvm-rhev-2.3.0-9.el7.ppc64le
kernel-3.10.0-292.el7.ppc64le

0. Prepare the environment on host
#modprobe vfio
#modprobe vfio_spapr_eeh
#modprobe vfio_iommu_spapr_tce
#modprobe vfio_pci
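The four modprobe calls above can be wrapped in a small helper so a failure is noticed immediately (a sketch; the MODPROBE override hook is an addition for dry runs, not part of the original steps):

```shell
# Load the VFIO modules used for spapr (POWER) PCI passthrough, in the
# order listed in the report. MODPROBE defaults to the real modprobe but
# can be overridden (e.g. MODPROBE=echo) for a dry run without root.
MODPROBE=${MODPROBE:-modprobe}
load_vfio_modules() {
    for m in vfio vfio_spapr_eeh vfio_iommu_spapr_tce vfio_pci; do
        $MODPROBE "$m" || { echo "failed to load $m" >&2; return 1; }
    done
}
```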

1. Start a guest with 3 PCI devices passed through to it.

# virsh start dzhengvm2
Domain dzhengvm2 started

# virsh dumpxml dzhengvm2|grep hostdev -A5
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0002' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0002' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0002' bus='0x01' slot='0x00' function='0x3'/>
      </source>
      <alias name='hostdev2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>

2. Check within the guest; the three PCI devices are displayed.
[root@localhost ~]# lspci
00:00.0 USB controller: Apple Inc. KeyLargo/Intrepid USB
00:01.0 Communication controller: Red Hat, Inc Virtio console
00:03.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon
00:04.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:05.0 Unclassified device [00ff]: Red Hat, Inc Virtio RNG
00:06.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
00:07.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
00:08.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
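To compare the guest's lspci view against the host-side assignments, the source addresses can be pulled out of the dumpxml output with a short pipeline (a sketch; the here-doc stands in for the real `virsh dumpxml dzhengvm2` output):

```shell
# Extract host PCI source addresses from <hostdev> entries, printing them
# in the dddd:bb:ss.f form that qemu's "host=" option uses.
# The here-doc stands in for: virsh dumpxml dzhengvm2
xml=$(cat <<'EOF'
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0002' bus='0x01' slot='0x00' function='0x0'/>
  </source>
  <alias name='hostdev0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0002' bus='0x01' slot='0x00' function='0x1'/>
  </source>
  <alias name='hostdev1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</hostdev>
EOF
)
# Only the <address> directly under <source> is the host-side address;
# the grep filter skips the guest-side <address type='pci' .../> lines.
addrs=$(echo "$xml" | grep -A1 '<source>' | sed -n \
  "s/.*address domain='0x\([0-9a-f]*\)' bus='0x\([0-9a-f]*\)' slot='0x\([0-9a-f]*\)' function='0x\([0-9a-f]*\)'.*/\1:\2:\3.\4/p")
echo "$addrs"   # 0002:01:00.0 and 0002:01:00.1, one per line
```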

3. Check guest log

2015-07-20T07:49:55.781878Z qemu-kvm: -device vfio-pci,host=0002:01:00.0,id=hostdev0,bus=pci.0,addr=0x6: Failed to create KVM VFIO device: No such device

Note: a bug has already been filed for this. See https://bugzilla.redhat.com/show_bug.cgi?id=1237034

4. Unplug one PCI device pci_0002_01_00_1
unplugPF.xml:
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
     <address domain='0x0002' bus='0x01' slot='0x00' function='0x1'/>
  </source>
</hostdev>

# virsh detach-device dzhengvm2 unplugPF.xml
Device detached successfully

5. Check within the guest; the device has already been removed.
[root@localhost ~]# lspci
00:00.0 USB controller: Apple Inc. KeyLargo/Intrepid USB
00:01.0 Communication controller: Red Hat, Inc Virtio console
00:03.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon
00:04.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:05.0 Unclassified device [00ff]: Red Hat, Inc Virtio RNG
00:06.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
00:08.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)

6. On the host, check the dumpxml; the <hostdev> entry for this device is not removed.
# virsh dumpxml dzhengvm2 |grep hostdev -A5
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0002' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0002' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0002' bus='0x01' slot='0x00' function='0x3'/>
      </source>
      <alias name='hostdev2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>

7. Check the qemu command line; the detached device's information is still included in it.
-device vfio-pci,host=0002:01:00.0,id=hostdev0,bus=pci.0,addr=0x6 
***-device vfio-pci,host=0002:01:00.1,id=hostdev1,bus=pci.0,addr=0x7*** 
-device vfio-pci,host=0002:01:00.3,id=hostdev2,bus=pci.0,addr=0x8


8. Attach another PCI device; it fails.
hotplug.xml:
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
     <address domain='0x0002' bus='0x01' slot='0x00' function='0x4'/>
  </source>
</hostdev>
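Since unplugPF.xml and hotplug.xml differ only in the <address> line, the snippet can be generated rather than hand-edited (a sketch; make_hostdev_xml is a hypothetical helper, not part of the reported steps):

```shell
# Emit a minimal <hostdev> snippet for virsh attach-device/detach-device.
# Arguments: domain bus slot function (0x-prefixed hex, as in the report).
make_hostdev_xml() {
    cat <<EOF
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
     <address domain='$1' bus='$2' slot='$3' function='$4'/>
  </source>
</hostdev>
EOF
}
# e.g. regenerate hotplug.xml for function 0x4:
#   make_hostdev_xml 0x0002 0x01 0x00 0x4 > hotplug.xml
```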

# virsh attach-device dzhengvm2 hotplug.xml
error: Failed to attach device from hotplug.xml
error: Requested operation is not valid: PCI device 0002:01:00.0 is in use by driver QEMU, domain dzhengvm2

9. Insert the hotplug.xml content into the guest XML and restart the guest, then check within the guest; the device is passed through.
[root@localhost ~]# lspci
00:06.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
00:07.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
00:08.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
00:09.0 Fibre Channel: Emulex Corporation OneConnect FCoE Initiator (Lancer) (rev 10)



Actual results:
The <hostdev> entry for the detached device is not removed from the dumpxml, and PF hotplug does not work.

Expected results:
The corresponding <hostdev> entry for the detached PCI device should be removed from the dumpxml, and PF hotplug should work.

Additional information:
# virsh nodedev-dumpxml pci_0002_01_00_1

<device>
  <name>pci_0002_01_00_1</name>
  <path>/sys/devices/pci0002:00/0002:00:00.0/0002:01:00.1</path>
  <parent>pci_0002_00_00_0</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>2</domain>
    <bus>1</bus>
    <slot>0</slot>
    <function>1</function>
    <product id='0xe220'>OneConnect NIC (Lancer)</product>
    <vendor id='0x10df'>Emulex Corporation</vendor>
    <iommuGroup number='1'>
      <address domain='0x0002' bus='0x01' slot='0x00' function='0x0'/>
      <address domain='0x0002' bus='0x01' slot='0x00' function='0x1'/>
      <address domain='0x0002' bus='0x01' slot='0x00' function='0x2'/>
      <address domain='0x0002' bus='0x01' slot='0x00' function='0x3'/>
      <address domain='0x0002' bus='0x01' slot='0x00' function='0x4'/>
      <address domain='0x0002' bus='0x01' slot='0x00' function='0x5'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' speed='8' width='8'/>
      <link validity='sta' speed='8' width='8'/>
    </pci-express>
  </capability>
</device>
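The nodedev name pci_0002_01_00_1 is the PCI address 0002:01:00.1 with a pci_ prefix and the separators replaced by underscores; a small sketch of the conversion in both directions (the helper names are illustrative):

```shell
# virsh nodedev names encode the PCI address dddd:bb:ss.f as
# pci_dddd_bb_ss_f; convert in both directions.
pci_to_nodedev() { printf 'pci_%s\n' "$1" | tr ':.' '__'; }
nodedev_to_pci() {
    echo "$1" | sed -e 's/^pci_//' -e 's/_/:/' -e 's/_/:/' -e 's/_/./'
}
pci_to_nodedev 0002:01:00.1      # -> pci_0002_01_00_1
nodedev_to_pci pci_0002_01_00_1  # -> 0002:01:00.1
```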

Comment 1 Dan Zheng 2015-07-21 02:26:33 UTC
The problems occur only on ppc64le; everything works well on x86.

Comment 3 Dan Zheng 2015-07-27 01:35:31 UTC
The dumped XML also does not get updated when I hot-unplug a disk that was configured in the guest XML before starting it. It doesn't happen when hotplugging/hot-unplugging a disk on a running guest.

Do this problem and the one in this BZ share the same root cause?

Comment 4 Dan Zheng 2015-07-28 08:11:39 UTC
Please ignore comment 3.

Comment 5 Dan Zheng 2015-07-28 08:18:36 UTC
The dumped XML also does not get updated when I hot-unplug a disk.
Steps:
1. Configure a disk in XML and start the guest
 <disk type='file' device='disk'>
      <driver name='qemu' type='raw' iothread='3'/>
      <source file='/var/lib/virt_test/images/test3.img'/>
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
2. The disk can be listed via fdisk in the guest.
3. Unplug the disk vdb.
# virsh detach-disk dzhengvm1 vdb
Disk detached successfully
4. Check the dumpxml; the <disk> entry is not removed, but the disk can no longer be listed via fdisk in the guest.
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' iothread='1'/>
      <source file='/var/lib/virt_test/images/test2.qcow2'/>
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>

The problem is very similar to this bug. Could you help confirm whether the root cause is the same?

Comment 6 Karen Noel 2015-09-03 18:14:29 UTC
Dan, please do not report two problems in one BZ:

> Two problems:
> 1.Dumpxml does not get updated when hot-unplug a physical PCI device.
> 2.PF hotplug to guest fails.

This BZ is for #1, the dumpxml update problem.

Can you confirm if the dumpxml update bug is Power specific? Or, does it also happen on X86_64?

Can you please file another BZ for #2 the PF hotplug problem, or indicate if it has already been reported (with BZ #)? It might be related to Bug 1259556 or 1241886. 

Please retest with the latest qemu-kvm-rhev before opening a new BZ. Thanks!

Comment 7 Dan Zheng 2015-09-06 10:27:42 UTC
(In reply to Karen Noel from comment #6)
> Dan, Please do not report 2 problems in one BZ:
> 
> > Two problems:
> > 1.Dumpxml does not get updated when hot-unplug a physical PCI device.
> > 2.PF hotplug to guest fails.
> 
> This BZ is for #1 the dumpxml update problem.  
> 
> Can you confirm if the dumpxml update bug is Power specific? Or, does it
> also happen on X86_64?
> 
> Can you please file another BZ for #2 the PF hotplug problem, or indicate if
> it has already been reported (with BZ #)? It might be related to Bug 1259556
> or 1241886. 
> 
> Please retest with the latest qemu-kvm-rhev before opening a new BZ. Thanks!

This problem occurs only on PPC. See comment 1.

I tried to reproduce the two problems today, but some new problems blocked me. I will update this bug once I am able to draw a conclusion.

Comment 8 Dan Zheng 2015-09-07 09:12:48 UTC
The current test result is that the guest gets paused after unplugging a PCI device and an unrecoverable error is detected.

Packages: 
  kernel-3.10.0-313.el7.ppc64le
  qemu-kvm-rhev-2.3.0-22.el7.ppc64le
  libvirt-daemon-1.2.17-7.el7.ppc64le
Guest: kernel-3.10.0-313.el7.ppc64le


Steps:
0. Prepare the environment on host
#modprobe vfio
#modprobe vfio_spapr_eeh
#modprobe vfio_iommu_spapr_tce
#modprobe vfio_pci

1. Start a guest with 3 PCI devices passed through to it.

# virsh start dzhengvm2
Domain dzhengvm2 started

# virsh dumpxml dzhengvm2|grep hostdev -A5
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0002' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0002' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0002' bus='0x01' slot='0x00' function='0x3'/>
      </source>
      <alias name='hostdev2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </hostdev>


2. Check within the guest; the three PCI devices are displayed.
[root@localhost ~]# lspci
...
00:08.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
00:09.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
00:0a.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)



3. Unplug one PCI device pci_0002_01_00_3
unplugPF.xml:
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
     <address domain='0x0002' bus='0x01' slot='0x00' function='0x3'/>
  </source>
</hostdev>

# virsh detach-device dzhengvm2 unplugPF.xml
Device detached successfully

4. Check guest state
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 21    d1                             paused

5. Check qemu log

2015-09-06T08:19:44.390589Z qemu-kvm: vfio_err_notifier_handler(0002:01:00.1) Unrecoverable error detected.  Please collect any data possible and then kill the guest
2015-09-06T08:19:44.413427Z qemu-kvm: vfio_err_notifier_handler(0002:01:00.0) Unrecoverable error detected.  Please collect any data possible and then kill the guest


This result differs from the original one in the description of this bug, and also blocks me from reproducing the problems in this bug.


Should I file a new bug for this issue?

Comment 9 Karen Noel 2015-09-09 12:02:37 UTC
(In reply to Dan Zheng from comment #8)
> The current test result is that the guest gets paused after unplugging a
> PCI device and an unrecoverable error is detected.
> [...]
> Should I file a new bug for this issue?

Yes, please file a new BZ. This BZ is for the XML update issue.

Comment 10 Laine Stump 2015-09-09 15:08:07 UTC
Note that libvirt will not remove a device from the status XML until it has been notified (via a DEVICE_DELETED event from qemu) that the guest has released it. So if something "goes wrong" with the device detach, it is expected that the device will still be listed in the status. This is done on purpose: if the guest doesn't properly support hot unplug, the device_del will never be 100% complete, and it won't be safe/useful to re-plug the host device into another guest, or to plug another host device into the same slot in the guest.

There could be several reasons for an incomplete device deletion:

1) guest OS isn't running, so it can't process the device_del, so qemu is never notified.

2) guest OS doesn't support (or has buggy support for) hot unplug, so qemu is never notified.

3) qemu binary doesn't support (or has buggy support for) hot unplug, causing guest to encounter an error during processing of the hot unplug request (so qemu is never notified)

4) guest OS finished responding to unplug request, and notifies qemu, but the qemu binary doesn't support (or has buggy support for) processing of the notification from the guest and generating the DEVICE_DELETED event.

(NB: libvirt's virDomainDetachDeviceFlags() API will wait up to 5 seconds for the DEVICE_DELETED event (if it's supported by qemu); on timeout it will still report success, but of course not remove the device from the status XML. It can't report failure because it's entirely possible that the guest is just slow, and the operation could still complete. If the operation completes asynchronously, the device will be removed from the status at that time.)

5) libvirt is receiving the DEVICE_DELETED event, but isn't processing it properly.

To see if libvirt is receiving the DEVICE_DELETED event, you can run this command as root (from the libvirt source main directory); it will show all exchanges between libvirt and all qemu processes:

   stap examples/systemtap/qemu-monitor.stp 

If DEVICE_DELETED never shows up at libvirt, you will need to look further down the stack for the source of the problem.

(Again, note that DEVICE_DELETED is asynchronous, so virsh may report that the device is detached even though the detach hasn't yet completed.)
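Because the event is asynchronous, a test can poll the status XML instead of trusting virsh's immediate return. A generic retry-loop sketch (the virsh command line and the 5-second budget in the commented usage are illustrative, matching the timeout described above):

```shell
# Retry CMD (a snippet that succeeds while the entry is still present)
# once per second until it fails, i.e. until the entry is gone.
# Returns 0 if it disappeared within TIMEOUT seconds, 1 otherwise.
wait_until_gone() {
    cmd=$1; timeout=${2:-5}; i=0
    while [ "$i" -lt "$timeout" ]; do
        eval "$cmd" || return 0
        sleep 1; i=$((i + 1))
    done
    return 1
}
# Intended use for this bug (illustrative):
#   wait_until_gone "virsh dumpxml dzhengvm2 | grep -q 0002:01:00.1" 5
```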

Comment 11 David Gibson 2015-09-09 22:24:25 UTC
At least some of the problems here should go away with the fix for bug 1259556.  That's currently in POST state, but there is a brewed test qemu you could try.

Comment 12 David Gibson 2015-09-09 22:30:23 UTC
To expand on that: the original test scenario has, AFAICT, emulated and VFIO devices mixed on the same guest PCI host bridge. Before the fix for bug 1259556, that's not a valid setup in qemu.

Unfortunately, that bad scenario won't be reported immediately by a qemu error on startup, or even (depending on the exact devices) by an error during guest boot. Instead you'll just get failures when the VFIO devices attempt DMA; that's probably what caused the "Unrecoverable error" messages.

This is the stuff I discussed with some of you in Seattle: correctly setting up VFIO is rather more complicated than on x86, because of the need for multiple virtual host bridges. Fixing bug 1259556, removing the limitations in qemu, is probably going to be much simpler than implementing the necessary workaround logic in libvirt (and possibly higher layers).

Comment 15 Dan Zheng 2015-10-10 06:04:03 UTC
Tested with the packages below:
libvirt-1.2.17-13.el7.ppc64le
qemu-kvm-rhev-2.3.0-29.el7.ppc64le
kernel-3.10.0-320.el7.ppc64le
Guest kernel: kernel-3.10.0-313.el7.ppc64le

0. Detach pci_0003_09_00_0 from the host.
# virsh nodedev-dumpxml pci_0003_09_00_0
<device>
  <name>pci_0003_09_00_0</name>
  <path>/sys/devices/pci0003:00/0003:00:00.0/0003:01:00.0/0003:02:09.0/0003:09:00.0</path>
  <parent>pci_0003_02_09_0</parent>
  <driver>
    <name>tg3</name>
  </driver>
  <capability type='pci'>
    <domain>3</domain>
    <bus>9</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x1657'>NetXtreme BCM5719 Gigabit Ethernet PCIe</product>
    <vendor id='0x14e4'>Broadcom Corporation</vendor>
    <iommuGroup number='1'>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' speed='2.5' width='4'/>
      <link validity='sta' speed='2.5' width='4'/>
    </pci-express>
  </capability>
</device>
# virsh nodedev-detach pci_0003_09_00_0
# virsh nodedev-reset pci_0003_09_00_0
1. Start the guest with 3 host PCI devices; the guest is running.

     <hostdev mode='subsystem' type='pci' managed='yes'>
       <driver name='vfio'/>
       <source>
         <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
       </source>
     </hostdev>
     <hostdev mode='subsystem' type='pci' managed='yes'>
       <driver name='vfio'/>
       <source>
         <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
       </source>
     </hostdev>
     <hostdev mode='subsystem' type='pci' managed='yes'>
       <driver name='vfio'/>
       <source>
         <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
       </source>
     </hostdev>
2. Check that the PCI devices are displayed in the guest; they are.
3. Detach a PCI device from the guest.
unplug.xml:
     <hostdev mode='subsystem' type='pci' managed='yes'>
       <driver name='vfio'/>
       <source>
         <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
       </source>
     </hostdev>
# virsh detach-device virt-tests-vm1 unplug.xml
...
Then the host crashes. Below is the console log. It is related to the kernel bug 1256718.

[137316.987994] vfio-pci 0003:09:00.1: No device request channel registered, blocked until released by user
[137316.988156] tg3 0003:09:00.1: enabling device (0400 -> 0402)
[137317.206901] tg3 0003:09:00.1: Using 32-bit DMA via iommu
[137317.207018] Unable to handle kernel paging request for data at address 0x00000000
[137317.207085] Faulting instruction address: 0xc0000000004bb51c
[137317.207143] Oops: Kernel access of bad area, sig: 11 [#1]
[137317.207188] SMP NR_CPUS=2048 NUMA PowerNV
[137317.207245] Modules linked in: vfio_pci vfio_iommu_spapr_tce vfio_spapr_eeh vfio xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vhost_net vhost macvtap macvlan tun loop bridge stp llc kvm_hv kvm ses enclosure sg rtc_opal shpchp powernv_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom crc_t10dif crct10dif_generic crct10dif_common ipr tg3 libata ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
[137317.208041] CPU: 0 PID: 92901 Comm: kworker/0:2 Not tainted 3.10.0-322.el7.ppc64le #1
[137317.208112] Workqueue: events work_for_cpu_fn
[137317.208168] task: c000000fe191e910 ti: c000000fde5d0000 task.ti: c000000fde5d0000
[137317.208235] NIP: c0000000004bb51c LR: c0000000004b7040 CTR: 0000000000000000
[137317.208302] REGS: c000000fde5d34e0 TRAP: 0300   Not tainted  (3.10.0-322.el7.ppc64le)
[137317.208368] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 84482828  XER: 00000000
[137317.208535] CFAR: c000000000009368 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 0 
GPR00: c0000000004dca74 c000000fde5d3760 c0000000011230b0 0000000000000000 
GPR04: ffffffffffffffff 0000000000000000 0000000000000000 0000000000000000 
GPR08: ffffffffffffffc0 ffffffffffffffff 0000000000000000 0000000000000000 
GPR12: 0000000000002200 c000000007b20000 c000000000110538 0000000000000000 
GPR16: 0000000020000000 0000000000000000 0000000000000000 0000000000000001 
GPR20: 0000000000000000 0000000000000010 00000000ffffffff 0000000000000000 
GPR24: 00000000000fffff 0000000000000010 fffffffffffffff0 000000000000000f 
GPR28: 0000000000000000 ffffffffffffffff ffffffffffffffff c000000fe571cc80 
[137317.209422] NIP [c0000000004bb51c] find_next_zero_bit+0x2c/0x110
[137317.209479] LR [c0000000004b7040] bitmap_find_next_zero_area+0x70/0x100
[137317.209534] Call Trace:
[137317.209559] [c000000fde5d3760] [c000000fde5d3800] 0xc000000fde5d3800 (unreliable)
[137317.209638] [c000000fde5d37c0] [c0000000004dca74] iommu_area_alloc+0x154/0x1a0
[137317.209718] [c000000fde5d3830] [c0000000000452dc] iommu_range_alloc+0x18c/0x410
[137317.209796] [c000000fde5d38f0] [c00000000004739c] iommu_alloc_coherent+0x13c/0x280
[137317.209874] [c000000fde5d3990] [c000000000044f5c] dma_iommu_alloc_coherent+0x3c/0x60
[137317.209954] [c000000fde5d39b0] [d000000019a32b1c] tg3_test_dma+0x8c/0xc00 [tg3]
[137317.210033] [c000000fde5d3ab0] [d000000019a41b7c] tg3_init_one.part.90+0x90c/0x1870 [tg3]
[137317.210171] [c000000fde5d3b90] [c00000000050a834] local_pci_probe+0x64/0x130
[137317.210306] [c000000fde5d3c20] [c0000000000fef40] work_for_cpu_fn+0x30/0x50
[137317.210440] [c000000fde5d3c50] [c000000000103cec] process_one_work+0x1dc/0x680
[137317.210595] [c000000fde5d3cf0] [c00000000010457c] worker_thread+0x3ec/0x500
[137317.210728] [c000000fde5d3d80] [c00000000011061c] kthread+0xec/0x100
[137317.210862] [c000000fde5d3e30] [c00000000000a474] ret_from_kernel_thread+0x5c/0x68
[137317.211019] Instruction dump:
[137317.211107] 60420000 7fa52040 409c00dc 78a606a1 78a7d182 78e71f24 78a50664 7d433a14 
[137317.211328] 7d252050 40820080 79280665 41820038 <e90a0000> 394a0008 2fa8ffff 419e0018 
[137317.211554] ---[ end trace 0dde6fc0e0713c7b ]---
[137317.212856] 
[137317.212904] Sending IPI to other CPUs
[137317.213977] IPI complete
[    0.000000] OPAL V3 detected !
[    0.000000] Using PowerNV machine description
[    0.000000] Page sizes from device-tree:
[    0.000000] base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
[    0.000000] base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7

Comment 16 David Gibson 2015-10-11 23:44:40 UTC
The host crash doesn't look related to bug 1256718: that bug causes guest crashes, not host crashes, and the symptoms don't look similar either.

The crash is in the iommu code, which suggests that the VFIO code hasn't properly cleaned up the iommu infrastructure state, leading to a crash after the device is back in the guest.

Can you please file a new bug for the host crash, and get a vmcore if you're able to.

Comment 17 Dan Zheng 2015-10-12 03:03:50 UTC
Filed bug 1270636 to track the above issue.

Comment 18 Dan Zheng 2015-10-13 02:24:21 UTC
Tested with the packages below:
libvirt-1.2.17-13.el7.ppc64le
qemu-kvm-rhev-2.3.0-29.el7.ppc64le
kernel-3.10.0-322.el7.ppc64le
Guest kernel: kernel-3.10.0-322.el7.ppc64le


1. Detach device pci_0003_09_00_0 from the host.
# virsh nodedev-dumpxml pci_0003_09_00_0
<device>
  <name>pci_0003_09_00_0</name>
  <path>/sys/devices/pci0003:00/0003:00:00.0/0003:01:00.0/0003:02:09.0/0003:09:00.0</path>
  <parent>pci_0003_02_09_0</parent>
  <driver>
    <name>tg3</name>
  </driver>
  <capability type='pci'>
    <domain>3</domain>
    <bus>9</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x1657'>NetXtreme BCM5719 Gigabit Ethernet PCIe</product>
    <vendor id='0x14e4'>Broadcom Corporation</vendor>
    <iommuGroup number='1'>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
      <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' speed='2.5' width='4'/>
      <link validity='sta' speed='2.5' width='4'/>
    </pci-express>
  </capability>
</device>
# virsh nodedev-detach pci_0003_09_00_0
...Successful.
# virsh nodedev-reset pci_0003_09_00_0
...Successful.

2. Start the guest with 3 host PCI devices; the guest is running.

     <hostdev mode='subsystem' type='pci' managed='yes'>
       <driver name='vfio'/>
       <source>
         <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
       </source>
     </hostdev>
     <hostdev mode='subsystem' type='pci' managed='yes'>
       <driver name='vfio'/>
       <source>
         <address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
       </source>
     </hostdev>
     <hostdev mode='subsystem' type='pci' managed='yes'>
       <driver name='vfio'/>
       <source>
         <address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
       </source>
     </hostdev>
3. Check that the PCI devices are displayed in the guest; they are.
4. Detach/attach a PCI device from/to the guest.
unplug.xml:
     <hostdev mode='subsystem' type='pci' managed='yes'>
       <driver name='vfio'/>
       <source>
         <address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
       </source>
     </hostdev>
# virsh detach-device virt-tests-vm1 unplug.xml ---> using pci_0003_09_00_1
Successful.
# virsh attach-device virt-tests-vm1 unplug.xml ---> using pci_0003_09_00_0
Successful.
5. Check the dumpxml of the guest; it does get updated.
6. Check lspci within the guest; it does get updated.
7. Repeat steps 4-6 with other PCI devices in the same iommu group, such as pci_0003_09_00_3 and pci_0003_09_00_2; it works as expected, except for the unexpected guest crash and reboot, which is tracked by bug 1270636.

In addition, detach-disk/attach-disk of a virtio disk (the scenario from comment 5) was also tested and works as expected.

So I mark this as VERIFIED, since the dumpxml is updated in the detach/attach scenarios.

Comment 20 errata-xmlrpc 2015-11-19 06:48:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html

