Description of problem:
libvirt's management of host PCI devices which can be assigned to guests using VFIO can lead to host kernel drivers and the vfio-pci driver simultaneously binding to devices in the same IOMMU group. That's an invalid state, which can lead to a host crash (see bug 1270636). IMO, that's a host kernel bug, but apparently it's not one that will be fixed soon.

Version-Release number of selected component (if applicable):
libvirt-daemon-1.2.17-13.el7.ppc64le

How reproducible:
100%

Steps to Reproduce:
1. Locate a system which has at least two PCI devices in the same IOMMU group; call these devices DevA and DevB. [On a POWER8 machine, the different functions of a Broadcom tg3 multi-port NIC are suitable. On x86, this may require a PCI-E to PCI bridge.]
2. Ensure that there are host kernel drivers which bind to both DevA and DevB.
3. In the libvirt configuration, put both DevA and DevB in "managed" mode.
4. Assign DevA and DevB to the same guest.
5. Start the guest.
6. Hot unplug just DevB from the guest.

Actual results:
DevB is rebound to the host kernel driver while DevA is still bound to vfio-pci. This leads to a host kernel crash. Depending on the details of the hardware and driver, this could be a crash during DMA initialization in the host DevB driver, or a BUG_ON() from the vfio code when it detects a different host driver bound to the IOMMU group.

Expected results:
DevB is removed from the guest, but is not rebound to a host kernel driver until all devices from the same IOMMU group are returned to the host. No crash.

Additional info:
I was just pointed at this BZ by Alex after getting a query about the same behavior on public IRC (#virt) from Shivaprasad at IBM, and I looked into the code a bit.

I think the proper way to handle this is to check for other devices in the same IOMMU group that are still in use during virPCIDeviceUnbindFromStub(), and if none are still in use, then go through the entire inactive list doing a reprobe for every inactive device that still has dev->reprobe set to true. This effectively delays the re-bind to the host driver of all devices until none of them are in use (and then rebinds only those that were originally intended to be re-bound).

This of course means that the activeDevs and inactiveDevs lists will need to be passed into virPCIDeviceUnbindFromStub() and virPCIDeviceBindToStub(). And we will also need to start storing the IOMMU group of each device in the _virPCIDevice struct (either that or use virPCIDeviceAddressIOMMUGroupIterate(), which is much less efficient, since it has to do a bunch of reading and parsing of sysfs entries). Since the IOMMU group of a device never changes, I think it's better to store it internally and then just look at our own list.
(In reply to Laine Stump from comment #1)
> I think the proper way to handle this is to check for other devices in the
> same iommu group that are still in use during virPCIDeviceUnbindFromStub(),
> and if there aren't any still in use, then go through the entire inactive
> list doing a reprobe for every inactive device that still has dev->reprobe
> set to true. This effectively delays the re-bind to host driver of all
> devices until none of them are in use (and then rebinds only those that
> were originally intended to be re-bound)

I was looking into doing pretty much what you suggest in virHostdevReAttachPCIDevices(), but doing it even further down the stack could indeed be a better idea to make sure we never run into the issue.

> This of course means that the activeDevs and inactiveDevs lists will need
> to be passed into virPCIDeviceUnbindFromStub() and virPCIDeviceBindToStub().
> And we will also need to start storing the iommu group of each device in
> the _virPCIDevice struct (either that or use
> virPCIDeviceAddressIOMMUGroupIterate(), which is much less efficient, since
> it has to do a bunch of reading/parsing sysfs entries - since iommu group
> of a device never changes I think it's better to store it internally then
> just look at our own list).

We could store the IOMMU group internally, but we'd have to add one more list to virHostdevManager, as the existing lists (activePCIHostdevs and inactivePCIHostdevs) don't store information about devices that are not listed in a guest definition, and we need to be mindful of those as well when reattaching a device to the host...
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.