Bug 1272300

Summary: libvirt management of VFIO devices can lead to host crashes
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: David Gibson <dgibson>
Component: libvirt
Assignee: Andrea Bolognani <abologna>
Status: CLOSED WONTFIX
QA Contact: yalzhang <yalzhang>
Severity: urgent
Priority: urgent
Version: 8.0
CC: abologna, alex.williamson, dyuan, dzheng, gsun, jsuchane, kchamart, knoel, laine, michal.skrivanek, mzhan, xuzhang, yafu, yalzhang
Target Milestone: rc
Keywords: Triaged
Target Release: 8.1
Hardware: All
OS: Unspecified
Type: Bug
Last Closed: 2020-12-15 07:37:53 UTC
Bug Blocks: 1288337, 1313485

Description David Gibson 2015-10-16 02:51:34 UTC
Description of problem:

libvirt's management of host PCI devices which can be assigned to guests using VFIO can lead to host kernel drivers and the vfio-pci driver simultaneously binding to devices in the same IOMMU group.

That's an invalid state, which can lead to a host crash (see bug 1270636).  IMO, that's a host kernel bug, but apparently it's not one that will be fixed soon.
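
For reference, the mixed-driver state described above can be spotted from sysfs: every device in an IOMMU group is listed under the group's devices/ directory, along with whichever driver each one is currently bound to. The standalone sketch below (not libvirt code, written purely for illustration) prints the bound driver for every device sharing a group with a given PCI device:

/* Sketch only: list the driver bound to every PCI device in the same
 * IOMMU group as the device given on the command line.  Relies on the
 * usual sysfs layout under /sys/bus/pci/devices/. */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[PATH_MAX], target[PATH_MAX];
    struct dirent *ent;
    DIR *dir;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pci-address, e.g. 0000:01:00.0>\n", argv[0]);
        return 1;
    }

    /* Every member of the group appears under .../iommu_group/devices/ */
    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/iommu_group/devices", argv[1]);
    if (!(dir = opendir(path))) {
        perror(path);
        return 1;
    }

    while ((ent = readdir(dir))) {
        char drvpath[PATH_MAX];
        const char *name;
        ssize_t len;

        if (ent->d_name[0] == '.')
            continue;

        /* The "driver" symlink points at the currently bound driver, if any */
        snprintf(drvpath, sizeof(drvpath),
                 "/sys/bus/pci/devices/%s/driver", ent->d_name);
        len = readlink(drvpath, target, sizeof(target) - 1);
        if (len < 0) {
            printf("%s: no driver bound\n", ent->d_name);
            continue;
        }
        target[len] = '\0';
        name = strrchr(target, '/');
        printf("%s: %s\n", ent->d_name, name ? name + 1 : target);
    }

    closedir(dir);
    return 0;
}

If the output shows some devices on vfio-pci and others on their regular host drivers, the host is in the state this bug is about.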

Version-Release number of selected component (if applicable):

libvirt-daemon-1.2.17-13.el7.ppc64le

How reproducible:

100%

Steps to Reproduce:
1. Locate a system which has at least two PCI devices in the same IOMMU group; call these devices DevA and DevB
[On a POWER8 machine, the different functions of a Broadcom tg3 multi-port NIC are suitable.  On x86, this may require a PCI-E to PCI bridge]
2. Ensure that there are host kernel drivers which bind to both DevA and DevB
3. In the libvirt configuration, put both DevA and DevB in "managed" mode (see the example hostdev XML after these steps)
4. Assign DevA and DevB to the same guest
5. Start the guest
6. Hot unplug just DevB from the guest.
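
For step 3, "managed" mode is set on the hostdev entry in the guest's domain XML; with managed='yes' libvirt itself unbinds the device from its host driver and binds it to vfio-pci when the guest starts, and reverses that when the device is detached. A rough example (the PCI address below is only a placeholder for DevA or DevB):

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
</hostdev>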

Actual results:

DevB is rebound to the host kernel driver, while DevA is still bound to vfio-pci.  This leads to a host kernel crash.  Depending on the details of hardware and driver, this could be a crash during DMA initialization in the host DevB driver, or a BUG_ON() from the vfio code when it detects a different host driver bound to the iommu group.

Expected results:

DevB is removed from the guest, but is not rebound to a host kernel driver until all devices from the same IOMMU group are returned to the host.  No crash.

Additional info:

Comment 1 Laine Stump 2015-10-28 16:25:02 UTC
I was just pointed at this BZ by Alex after getting a query about the same behavior on public IRC (#virt) from Shivaprasad at IBM, and I looked into the code a bit.

I think the proper way to handle this is to check, during virPCIDeviceUnbindFromStub(), for other devices in the same IOMMU group that are still in use, and if there aren't any still in use, then go through the entire inactive list doing a reprobe for every inactive device that still has dev->reprobe set to true. This effectively delays rebinding all of the devices to their host drivers until none of them are in use (and then rebinds only those that were originally intended to be re-bound).

This of course means that the activeDevs and inactiveDevs lists will need to be passed into virPCIDeviceUnbindFromStub() and virPCIDeviceBindToStub(). And we will also need to start storing the IOMMU group of each device in the _virPCIDevice struct (either that or use virPCIDeviceAddressIOMMUGroupIterate(), which is much less efficient, since it has to do a bunch of reading/parsing of sysfs entries - since the IOMMU group of a device never changes, I think it's better to store it internally and then just look at our own list).
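
A rough sketch of the flow being proposed, in pseudo-C (the type, list, and helper names below are made up for illustration and are not libvirt's actual API; it assumes the caller has already moved dev onto the inactive list):

/* Illustrative sketch only -- not actual libvirt code.  The idea from
 * this comment: after detaching a device from the vfio-pci stub, only
 * reprobe host drivers once no device in the same IOMMU group is still
 * assigned to a guest. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int iommuGroup;   /* group number, read once from sysfs and cached  */
    bool reprobe;     /* should the host driver be reprobed eventually? */
} PCIDevice;

typedef struct {
    PCIDevice **devs;
    size_t count;
} PCIDeviceList;

/* Stand-ins for the real sysfs unbind/reprobe operations. */
static void unbindFromStub(PCIDevice *dev) { (void)dev; }
static void reprobeHostDriver(PCIDevice *dev) { (void)dev; }

static void
unbindFromStubDelayed(PCIDevice *dev,
                      PCIDeviceList *activeDevs,
                      PCIDeviceList *inactiveDevs)
{
    size_t i;

    unbindFromStub(dev);

    /* Some other device in the group still assigned to a guest?  Then
     * don't reprobe anything yet -- rebinding a host driver now is
     * exactly the state that crashes the host kernel. */
    for (i = 0; i < activeDevs->count; i++)
        if (activeDevs->devs[i]->iommuGroup == dev->iommuGroup)
            return;

    /* Group fully released: reprobe host drivers for every inactive
     * device in the group that was originally meant to be handed back
     * (dev itself is assumed to be on the inactive list by now). */
    for (i = 0; i < inactiveDevs->count; i++) {
        PCIDevice *other = inactiveDevs->devs[i];

        if (other->iommuGroup == dev->iommuGroup && other->reprobe)
            reprobeHostDriver(other);
    }
}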

Comment 2 Andrea Bolognani 2015-10-29 12:19:07 UTC
(In reply to Laine Stump from comment #1)
> I think the proper way to handle this is to check for other devices in the
> same iommu group that are still in use during virPCIDeviceUnbindFromStub(),
> and if there aren't any still in use, then go through the entire inactive
> list doing a reprobe for every inactive device that still has dev->reprobe
> set to true. This effectively delays the re-bind to host driver of all
> devices until none of them are in use (and then rebinds only those that were
> originally intended to be re-bound)

I was looking into doing pretty much what you suggest in
virHostdevReAttachPCIDevices(), but doing it even further down the stack
could indeed be a better idea to make sure we never run into the issue.

> This of course means that the activeDevs and inactiveDevs lists will need to
> be passed into virPCIDeviceUnbindFromStub() and virPCIDeviceBindToStub().
> And we will also need to start storing the iommu group of each device in the
> _virPCIDevice struct (either that or use
> virPCIDeviceAddressIOMMUGroupIterate(), which is much less efficient, since
> it has to do a bunch of reading/parsing sysfs entries - since iommu group of
> a device never changes I think it's better to store it internally then just
> look at our own list).

We could store the IOMMU group internally, but we'd have to add one
more list to virHostdevManager, since the existing lists (activePCIHostdevs
and inactivePCIHostdevs) don't store information about devices that are
not listed in a guest definition, and we need to be mindful of those as
well when reattaching a device to the host...
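
As a rough illustration of that concern (the names below are made up; libvirt's real virHostdevManager is laid out differently), the manager state would need some way to track group siblings that never show up in any guest definition:

/* Illustrative sketch only -- not libvirt's actual data structures. */
typedef struct PCIDeviceList PCIDeviceList;  /* opaque list of PCI devices */

typedef struct {
    PCIDeviceList *activePCIHostdevs;    /* devices currently assigned to guests */
    PCIDeviceList *inactivePCIHostdevs;  /* devices detached but not yet rebound */
    /* Hypothetical extra list: devices that appear in no guest definition
     * but share an IOMMU group with an assigned device, and therefore must
     * also be kept away from host drivers until the whole group is free. */
    PCIDeviceList *iommuGroupSiblings;
} HostdevManager;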

Comment 15 RHEL Program Management 2020-12-15 07:37:53 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.