Bug 1141399 - Device 'vfio-pci' could not be initialized when passing through Intel 82599
Summary: Device 'vfio-pci' could not be initialized when passing through Intel 82599
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Alex Williamson
QA Contact: Yulong Pei
URL:
Whiteboard:
Depends On: 1131552
Blocks: 743661 1038706 1156447
 
Reported: 2014-09-13 00:56 UTC by Stephen Gordon
Modified: 2019-02-15 13:45 UTC
CC List: 20 users

Fixed In Version: kernel-3.10.0-193.el7
Doc Type: Bug Fix
Doc Text:
Clone Of: 1113399
Clones: 1156447
Environment:
Last Closed: 2015-03-05 12:43:37 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:0290 0 normal SHIPPED_LIVE Important: kernel security, bug fix, and enhancement update 2015-03-05 16:13:58 UTC

Description Stephen Gordon 2014-09-13 00:56:27 UTC
"PCI passthrough is not working as expected because of a kernel bug. When assigning a PCI device to a VM with KVM
and vfio-pci, some PCI devices will cause the following error message: "Device 'vfio-pci' could not be initialized". As a result, OpenStack will
always fail to boot a virtual machine with PCI passthrough. There is a reference bug in Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1113399 "

+++ This bug was initially created as a clone of Bug #1113399 +++

When assigning a PCIe device to a VM with KVM and vfio-pci, some PCIe devices will cause the following error message:

Error starting domain: internal error: early end of file from monitor: possible problem:
2014-06-26T05:58:47.482875Z qemu-system-x86_64: -device vfio-pci,host=07:00.0,id=hostdev0,bus=pci.0,addr=0x8: vfio: error, group 13 is not viable, please ensure all devices within the iommu_group are bound to their vfio bus driver.
2014-06-26T05:58:47.483075Z qemu-system-x86_64: -device vfio-pci,host=07:00.0,id=hostdev0,bus=pci.0,addr=0x8: vfio: failed to get group 13
2014-06-26T05:58:47.483102Z qemu-system-x86_64: -device vfio-pci,host=07:00.0,id=hostdev0,bus=pci.0,addr=0x8: Device initialization failed.
2014-06-26T05:58:47.483128Z qemu-system-x86_64: -device vfio-pci,host=07:00.0,id=hostdev0,bus=pci.0,addr=0x8: Device 'vfio-pci' could not be initialized


As you can see below, other PCIe devices are present in the same iommu_group. I don't want to pass through these other devices into the VM, but since they are in the same iommu_group, they are interfering with the operation of vfio-pci.

$ ls /sys/kernel/iommu_groups/13/devices/
0000:00:15.0  0000:00:15.2  0000:00:15.3  0000:06:00.0  0000:07:00.0  0000:08:00.0
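For reference, the group-membership check implied by the error message ("ensure all devices within the iommu_group are bound to their vfio bus driver") can be sketched as follows. This is a minimal sketch, not from the report itself; the device address is the reporter's, so substitute your own:

```shell
#!/bin/sh
# Sketch: list every device in the IOMMU group of a target device and
# show which driver each one is currently bound to. All of them must be
# bound to vfio-pci (or unbound) before assignment can succeed.
dev=0000:07:00.0
group_link=/sys/bus/pci/devices/$dev/iommu_group

if [ -e "$group_link" ]; then
    for d in "$group_link"/devices/*; do
        if [ -L "$d/driver" ]; then
            drv=$(basename "$(readlink "$d/driver")")
        else
            drv=none
        fi
        echo "$(basename "$d") -> driver: $drv"
    done
else
    echo "device $dev not present or IOMMU disabled"
fi
```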


It seems that the following patch is required to fix this: https://lkml.org/lkml/2013/5/30/513
The patch allows the user to set a kernel argument that assumes that PCIe devices can be isolated from each other in their own iommu group. Without this patch, vfio-pci is essentially broken for certain PCIe devices which do not correctly utilize ACS functionality.
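For reference only: the out-of-tree patch adds a boot-time kernel parameter (commonly `pcie_acs_override`), so on a kernel that carries it the override would be enabled from the kernel command line, roughly as below. As the comments on this bug make clear, the patch is not in mainline and Red Hat does not carry it, so this is a sketch of the patch's mechanism, not a supported configuration:

```shell
# This parameter exists ONLY on kernels carrying the out-of-tree ACS
# override patch; it is not present in mainline or RHEL kernels.
# Typical values documented with the patch:
#   pcie_acs_override=downstream                 # assume downstream ports are isolated
#   pcie_acs_override=downstream,multifunction   # also assume multifunction endpoints are
# Appended to the kernel command line, e.g. in /etc/default/grub:
GRUB_CMDLINE_LINUX="... pcie_acs_override=downstream"
```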

Is it possible to have this patch added to the Fedora kernel?

--- Additional comment from Josh Boyer on 2014-06-26 08:35:04 EDT ---

(In reply to oakwhiz from comment #0)
> Is it possible to have this patch added to the Fedora kernel?

At the moment, no.  Mostly because it's still in the middle of being discussed and isn't in the upstream tree yet.  We'll keep an eye on it and see if it's applicable for backport once it hits the mainline kernel tree.

--- Additional comment from Alex Williamson on 2014-06-26 10:12:11 EDT ---

The patch is not expected to be accepted upstream but downstreams can obviously choose to carry it.  The argument upstream is that issues that arise from a user overriding known, hardware advertised device isolation can be subtle and incredibly difficult to debug.  The path forward to allowing configurations that are currently prevented is to work with the hardware vendors to determine whether devices are isolated and encourage future products to support PCI ACS so that the hardware advertises this isolation automatically.

--- Additional comment from  on 2014-06-28 12:08:10 EDT ---

ACS override patch for 3.14.8-200.fc20.x86_64
This patch seems to work on the latest Fedora kernel, though I have not verified it thoroughly.

Comment 2 Alex Williamson 2014-09-13 02:24:42 UTC
The ACS override patch should *not* be backported to RHEL.  It's for good reason that it has not been accepted upstream and if Fedora has taken it, it's a mistake.  Without ACS, we must assume that devices are able to do non-IOMMU translated peer-to-peer which creates an unsupportable environment.  A DMA from an assigned device meant for guest memory may instead be considered a peer-to-peer transaction and redirected to another device.  The correct answer is to work with the hardware vendors to confirm device isolation and add quirks to the kernel to expose that through IOMMU groups and thus through VFIO.  If we cannot get confirmation of that isolation from the vendor then we absolutely should not be attempting to support a kernel where the user has been given the privilege to override it.

Comment 3 Alex Williamson 2014-09-13 02:29:23 UTC
Please file feature requests for the specific hardware components which are preventing devices from being isolated and we can attempt to work with the hardware vendors to determine whether sufficient isolation is present.

Comment 4 Stephen Gordon 2014-09-19 13:17:12 UTC
00:00.0 Host bridge: Intel Corporation Xeon E5/Core i7 DMI2 (rev 07)
	Subsystem: Hewlett-Packard Company Device 18a8
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 0
	Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s unlimited, L1 <16us
			ClockPM- Surprise+ LLActRep+ BwNot+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
		RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
		RootCap: CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [e0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
	Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?>
	Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
	Capabilities: [280 v1] Vendor Specific Information: ID=0004 Rev=2 Len=018 <?>

00:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
	I/O behind bridge: 00006000-00006fff
	Memory behind bridge: f7c00000-f7ffffff
	Prefetchable memory behind bridge: 00000000f6800000-00000000f6bfffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Subsystem: Hewlett-Packard Company Device 18a8
	Capabilities: [60] MSI: Enable- Count=1/2 Maskable+ 64bit-
		Address: 00000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 256 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #8, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s unlimited, L1 <16us
			ClockPM- Surprise+ LLActRep+ BwNot+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
		RootCap: CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd+
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [e0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
	Capabilities: [110 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [148 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-

Comment 5 Stephen Gordon 2014-09-19 13:20:05 UTC
The specific device they are trying to pass through is an Intel 82599.

Comment 6 Alex Williamson 2014-09-19 13:38:19 UTC
Please attach the full `lspci -vvv` output as well as the output of `find /sys/kernel/iommu_groups`; the snippet in comment 4 doesn't tell us anything.
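A small script (a sketch, not from the report; assumes pciutils is installed for `lspci`) that gathers exactly what is requested above into one attachable file:

```shell
#!/bin/sh
# Collect the diagnostics requested in comment 6 into a single file.
out=bug-diagnostics.txt
{
    echo "== lspci -vvv =="
    lspci -vvv 2>/dev/null || echo "(lspci not available)"
    echo
    echo "== IOMMU groups =="
    find /sys/kernel/iommu_groups 2>/dev/null || echo "(no IOMMU groups found)"
} > "$out"
echo "wrote $out"
```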

Comment 8 Alex Williamson 2014-09-23 13:03:49 UTC
I'm confused by the before vs after in the lspci and group output.  Before and after what?

I see 82599s here:

04:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable- Count=64 Masked-
	Capabilities: [a0] Express (v2) Endpoint, MSI 00
	Capabilities: [e0] Vital Product Data
	Capabilities: [100 v1] Advanced Error Reporting
	Capabilities: [140 v1] Device Serial Number 38-ea-a7-ff-ff-32-be-f0
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)

04:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable- Count=64 Masked-
	Capabilities: [a0] Express (v2) Endpoint, MSI 00
	Capabilities: [e0] Vital Product Data
	Capabilities: [100 v1] Advanced Error Reporting
	Capabilities: [140 v1] Device Serial Number 38-ea-a7-ff-ff-32-be-f0
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)

These are behind the following root port:

00:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07) (prog-if 00 [Normal decode])
	Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
	Capabilities: [40] Subsystem: Hewlett-Packard Company Device 18a8
	Capabilities: [60] MSI: Enable- Count=1/2 Maskable+ 64bit-
	Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
	Capabilities: [e0] Power Management version 3
	Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
	Capabilities: [110 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
	Capabilities: [148 v1] Advanced Error Reporting
	Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
	Capabilities: [250 v1] #19
	Capabilities: [280 v1] Vendor Specific Information: ID=0004 Rev=2 Len=018 <?>

As shown, the root port does support ACS and isolation is enabled at the root port.  The Multifunction 82599ES does not support ACS, resulting in the functions being grouped together.  This is reflected in the IOMMU groups:

/sys/kernel/iommu_groups/21/devices/0000:04:00.0
/sys/kernel/iommu_groups/21/devices/0000:04:00.1

This is all working as expected and is consistent with what we've heard from Intel: individual PF assignment is not supported on devices supporting SR-IOV.  The VFs produced from these functions should each still be in separate IOMMU groups.
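The VF expectation can be verified after enabling SR-IOV. A sketch (the PF address is the one from this report; the VF count of 2 is arbitrary; requires root and the hardware actually present):

```shell
#!/bin/sh
# Sketch: enable two VFs on a PF via sysfs and print each VF's IOMMU group.
# Gracefully reports and exits when the PF is absent or we lack privileges.
pf=0000:04:00.0
if [ ! -w "/sys/bus/pci/devices/$pf/sriov_numvfs" ]; then
    echo "PF $pf not present or not writable (need root / SR-IOV support)"
    exit 0
fi
echo 2 > "/sys/bus/pci/devices/$pf/sriov_numvfs"
for vf in /sys/bus/pci/devices/$pf/virtfn*; do
    addr=$(basename "$(readlink "$vf")")
    grp=$(basename "$(readlink "/sys/bus/pci/devices/$addr/iommu_group")")
    echo "VF $addr -> IOMMU group $grp"
done
```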

What's the actual request here, to separate the two 82599ES functions into separate groups?  Why?

Comment 10 Alex Williamson 2014-09-23 17:56:33 UTC
The Intel X540-AT2 10G controller does support ACS and is configured into separate IOMMU groups.
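Whether a given adapter advertises ACS can be checked directly from its extended capability list. A sketch (the slot address is a placeholder; pick one from `lspci` on your system):

```shell
#!/bin/sh
# Sketch: report whether a PCI device advertises an ACS capability.
slot=${1:-04:00.0}
if ! command -v lspci >/dev/null 2>&1; then
    echo "lspci not available (install pciutils)"
elif lspci -s "$slot" -vvv 2>/dev/null | grep -q "Access Control Services"; then
    echo "$slot: ACS capability present"
else
    echo "$slot: no ACS capability reported"
fi
```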

Comment 11 Stephen Gordon 2014-09-24 11:32:24 UTC
(In reply to Alex Williamson from comment #10)
> The Intel X540-AT2 10G controller does support ACS and is configured into
> separate IOMMU groups.

Interesting, do you happen to have any insight into whether the Intel X520 adapters support this or is the above the only one? This query comes from another independently raised customer case.

Do we have a specific contact at Intel on the hardware side to work through these with? Up until now I have been working with their software teams on OpenStack, Libvirt, qemu and DPDK.

Comment 13 Alex Williamson 2014-09-25 16:26:52 UTC
Re-opening, Intel is confirming multiple 82599 devices and X520 devices do have isolation between functions.  Quirks will be required to incorporate this into IOMMU grouping.

Comment 22 Jarod Wilson 2014-10-24 13:11:22 UTC
Patch(es) available on kernel-3.10.0-193.el7

Comment 29 errata-xmlrpc 2015-03-05 12:43:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0290.html

