Red Hat Bugzilla – Bug 1141399
Device 'vfio-pci' could not be initialized when passing through Intel 82599
Last modified: 2015-08-02 19:33:41 EDT
"PCI passthrough is not working as expected, because of kernel bug available. When assigning a PCI's device to a VM with KVM and vfio-pci, some PCI devices will cause the following error message: "Device 'vfio-pci' could not be initialized". So openstack will always fail to boot virual machine with PCI passthrough. There is reference bug available in bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1113399 " +++ This bug was initially created as a clone of Bug #1113399 +++ When assigning a PCIe device to a VM with KVM and vfio-pci, some PCIe devices will cause the following error message: Error starting domain: internal error: early end of file from monitor: possible problem: 2014-06-26T05:58:47.482875Z qemu-system-x86_64: -device vfio-pci,host=07:00.0,id=hostdev0,bus=pci.0,addr=0x8: vfio: error, group 13 is not viable, please ensure all devices within the iommu_group are bound to their vfio bus driver. 2014-06-26T05:58:47.483075Z qemu-system-x86_64: -device vfio-pci,host=07:00.0,id=hostdev0,bus=pci.0,addr=0x8: vfio: failed to get group 13 2014-06-26T05:58:47.483102Z qemu-system-x86_64: -device vfio-pci,host=07:00.0,id=hostdev0,bus=pci.0,addr=0x8: Device initialization failed. 2014-06-26T05:58:47.483128Z qemu-system-x86_64: -device vfio-pci,host=07:00.0,id=hostdev0,bus=pci.0,addr=0x8: Device 'vfio-pci' could not be initialized As you can see below, other PCIe devices are present in the same iommu_group. I don't want to pass through these other devices into the VM, but since they are in the same iommu_group, they are interfering with the operation of vfio-pci. $ ls /sys/kernel/iommu_groups/13/devices/ 0000:00:15.0 0000:00:15.2 0000:00:15.3 0000:06:00.0 0000:07:00.0 0000:08:00.0 It seems that the following patch is required to fix this: https://lkml.org/lkml/2013/5/30/513 The patch allows the user to set a kernel argument that assumes that PCIe devices can be isolated from each other in their own iommu group. Without this patch, vfio-pci is essentially broken for certain PCIe devices which do not correctly utilize ACS functionality. Is it possible to have this patch added to the Fedora kernel? --- Additional comment from Josh Boyer on 2014-06-26 08:35:04 EDT --- (In reply to oakwhiz from comment #0) > Is it possible to have this patch added to the Fedora kernel? At the moment, no. Mostly because it's still in the middle of being discussed and isn't in the upstream tree yet. We'll keep an eye on it and see if it's applicable for backport once it hits the mainline kernel tree. --- Additional comment from Alex Williamson on 2014-06-26 10:12:11 EDT --- The patch is not expected to be accepted upstream but downstreams can obviously choose to carry it. The argument upstream is that issues that arise from a user overriding known, hardware advertised device isolation can be subtle and incredibly difficult to debug. The path forward to allowing configurations that are currently prevented is to work with the hardware vendors to determine whether devices are isolated and encourage future products to support PCI ACS so that the hardware advertises this isolation automatically. --- Additional comment from on 2014-06-28 12:08:10 EDT --- ACS override patch for 3.14.8-200.fc20.x86_64 This patch seems like it works on the latest Fedora kernel. I think.
The ACS override patch should *not* be backported to RHEL. It's for good reason that it has not been accepted upstream, and if Fedora has taken it, it's a mistake. Without ACS, we must assume that devices are able to do non-IOMMU-translated peer-to-peer, which creates an unsupportable environment. A DMA from an assigned device meant for guest memory may instead be considered a peer-to-peer transaction and redirected to another device. The correct answer is to work with the hardware vendors to confirm device isolation and add quirks to the kernel to expose that isolation through IOMMU groups, and thus through VFIO. If we cannot get confirmation of that isolation from the vendor, then we absolutely should not be attempting to support a kernel where the user has been given the privilege to override it.
Please file feature requests for the specific hardware components which are preventing devices from being isolated and we can attempt to work with the hardware vendors to determine whether sufficient isolation is present.
lspci -vvv output (excerpt):

00:00.0 Host bridge: Intel Corporation Xeon E5/Core i7 DMI2 (rev 07)
        Subsystem: Hewlett-Packard Company Device 18a8
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 0
        Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s unlimited, L1 <16us
                        ClockPM- Surprise+ LLActRep+ BwNot+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [e0] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
        Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?>
        Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
        Capabilities: [280 v1] Vendor Specific Information: ID=0004 Rev=2 Len=018 <?>

00:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
        I/O behind bridge: 00006000-00006fff
        Memory behind bridge: f7c00000-f7ffffff
        Prefetchable memory behind bridge: 00000000f6800000-00000000f6bfffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Subsystem: Hewlett-Packard Company Device 18a8
        Capabilities: [60] MSI: Enable- Count=1/2 Maskable+ 64bit-
                Address: 00000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s unlimited, L1 <16us
                        ClockPM- Surprise+ LLActRep+ BwNot+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
                RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd+
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd+
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [e0] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
        Capabilities: [110 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [148 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSVi
The specific device they are trying to pass through is an Intel 82599.
Please attach the full `lspci -vvv` output as well as the output of `find /sys/kernel/iommu_groups`; the snippet in comment 4 doesn't tell us anything.
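For completeness, the requested diagnostics can be gathered like this (a sketch; run as root so lspci decodes the extended capabilities, then attach the resulting files to the bug):

# lspci -nn | grep -i 82599          # identify the 82599 functions and their [vendor:device] IDs
# lspci -vvv > lspci.txt
# find /sys/kernel/iommu_groups/ > iommu_groups.txt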
I'm confused by the before vs after in the lspci and group output. Before and after what?

I see 82599s here:

04:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable- Count=64 Masked-
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
        Capabilities: [e0] Vital Product Data
        Capabilities: [100 v1] Advanced Error Reporting
        Capabilities: [140 v1] Device Serial Number 38-ea-a7-ff-ff-32-be-f0
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)

04:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable- Count=64 Masked-
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
        Capabilities: [e0] Vital Product Data
        Capabilities: [100 v1] Advanced Error Reporting
        Capabilities: [140 v1] Device Serial Number 38-ea-a7-ff-ff-32-be-f0
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)

These are behind the following root port:

00:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07) (prog-if 00 [Normal decode])
        Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
        Capabilities: [40] Subsystem: Hewlett-Packard Company Device 18a8
        Capabilities: [60] MSI: Enable- Count=1/2 Maskable+ 64bit-
        Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
        Capabilities: [e0] Power Management version 3
        Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
        Capabilities: [110 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        Capabilities: [148 v1] Advanced Error Reporting
        Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
        Capabilities: [250 v1] #19
        Capabilities: [280 v1] Vendor Specific Information: ID=0004 Rev=2 Len=018 <?>

As shown, the root port does support ACS, and isolation is enabled at the root port. The multifunction 82599ES does not support ACS, resulting in the functions being grouped together. This is reflected in the IOMMU groups:

/sys/kernel/iommu_groups/21/devices/0000:04:00.0
/sys/kernel/iommu_groups/21/devices/0000:04:00.1

This is all working as expected and is consistent with what we've heard from Intel: individual PF assignment is not supported on devices supporting SR-IOV. The VFs produced from these functions should each still be in separate IOMMU groups.

What's the actual request here, to separate the two 82599ES functions into separate groups? Why?
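The point about the VFs can be checked by enabling SR-IOV on one of the PFs and looking at the groups the VFs land in. A sketch, assuming the 0000:04:00.0 address above and a kernel exposing the sriov_numvfs interface (the VF addresses are assigned by the device and will vary):

# echo 2 > /sys/bus/pci/devices/0000:04:00.0/sriov_numvfs
# find /sys/kernel/iommu_groups -name '0000:04:10.*'

Each VF should appear in its own IOMMU group even though the two PFs share one.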
The Intel X540-AT2 10G controller does support ACS and is configured into separate IOMMU groups.
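Whether a given device advertises ACS, and whether it is enabled, can be checked directly from lspci (a sketch; run as root so the extended capabilities are decoded, and substitute the device address of interest):

# lspci -vvv -s 00:01.0 | grep -A2 'Access Control Services'
        Capabilities: [110 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

No 'Access Control Services' capability on a multifunction endpoint means its functions will be grouped together, exactly as seen with the 82599ES above.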
(In reply to Alex Williamson from comment #10)
> The Intel X540-AT2 10G controller does support ACS and is configured into
> separate IOMMU groups.

Interesting, do you happen to have any insight into whether the Intel X520 adapters support this, or is the above the only one? This query comes from another independently raised customer case. Do we have a specific contact at Intel on the hardware side to work through these with? Up until now I have been working with their software teams on OpenStack, Libvirt, QEMU and DPDK.
Re-opening: Intel has confirmed that multiple 82599 devices and X520 devices do have isolation between functions. Quirks will be required to incorporate this into the IOMMU grouping.
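Once a kernel carrying those quirks is running, the effect can be verified by confirming that the two PFs no longer share a group. A sketch using the addresses shown earlier (group numbers are assigned at boot and will vary):

$ find /sys/kernel/iommu_groups -name '0000:04:00.*'
/sys/kernel/iommu_groups/21/devices/0000:04:00.0
/sys/kernel/iommu_groups/22/devices/0000:04:00.1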
Patch(es) available on kernel-3.10.0-193.el7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0290.html