Bug 1141399
| Summary: | Device 'vfio-pci' could not be initialized when passing through Intel 82599 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Stephen Gordon <sgordon> |
| Component: | kernel | Assignee: | Alex Williamson <alex.williamson> |
| kernel sub component: | Other | QA Contact: | Yulong Pei <ypei> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | alex.williamson, dhoward, extras-qa, gansalmon, itamar, jfeeney, jonathan, kboumedh, kdube, kernel-maint, lilu, madhu.chinakonda, mchehab, network-qe, oakwhiz, sgordon, snagar, tvvcox, wdaniel, zshi |
| Version: | 7.1 | Keywords: | Reopened, ZStream |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-3.10.0-193.el7 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1113399 | | |
| : | 1156447 (view as bug list) | Environment: | |
| Last Closed: | 2015-03-05 12:43:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1131552 | | |
| Bug Blocks: | 743661, 1038706, 1156447 | | |
Description
Stephen Gordon
2014-09-13 00:56:27 UTC
The ACS override patch should *not* be backported to RHEL. It's for good reason that it has not been accepted upstream and if Fedora has taken it, it's a mistake. Without ACS, we must assume that devices are able to do non-IOMMU translated peer-to-peer which creates an unsupportable environment. A DMA from an assigned device meant for guest memory may instead be considered a peer-to-peer transaction and redirected to another device. The correct answer is to work with the hardware vendors to confirm device isolation and add quirks to the kernel to expose that through IOMMU groups and thus through VFIO. If we cannot get confirmation of that isolation from the vendor then we absolutely should not be attempting to support a kernel where the user has been given the privilege to override it. Please file feature requests for the specific hardware components which are preventing devices from being isolated and we can attempt to work with the hardware vendors to determine whether sufficient isolation is present. 
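Since the description ties device isolation to IOMMU groups, a small sketch of how the grouping can be inspected may help. It is only an illustration, assuming the sysfs layout `/sys/kernel/iommu_groups/<group>/devices/<address>` that appears later in this report; the function names `group_devices` and `read_sysfs_groups` are hypothetical and not part of any kernel or VFIO tooling.

```python
from collections import defaultdict
from pathlib import Path, PurePosixPath


def group_devices(paths):
    """Map IOMMU group number -> sorted list of PCI addresses.

    `paths` is an iterable of strings shaped like
    /sys/kernel/iommu_groups/<group>/devices/<address>.
    Anything that does not match that shape is ignored.
    """
    groups = defaultdict(list)
    for raw in paths:
        parts = PurePosixPath(raw).parts
        # Expect the tail: 'iommu_groups', '<group>', 'devices', '<address>'
        if (
            len(parts) >= 4
            and parts[-4] == "iommu_groups"
            and parts[-3].isdigit()
            and parts[-2] == "devices"
        ):
            groups[int(parts[-3])].append(parts[-1])
    return {group: sorted(devs) for group, devs in sorted(groups.items())}


def read_sysfs_groups(root="/sys/kernel/iommu_groups"):
    """Collect device paths from sysfs; returns {} if the directory is absent."""
    base = Path(root)
    if not base.is_dir():
        return {}
    return group_devices(p.as_posix() for p in base.glob("*/devices/*"))


if __name__ == "__main__":
    for group, devices in read_sysfs_groups().items():
        # More than one device in a group means they cannot be assigned separately.
        shared = " (shared group)" if len(devices) > 1 else ""
        print(f"group {group}: {', '.join(devices)}{shared}")
```

A group that lists more than one device is the situation this report is about: every device in the group has to be handled together for assignment.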
    00:00.0 Host bridge: Intel Corporation Xeon E5/Core i7 DMI2 (rev 07)
        Subsystem: Hewlett-Packard Company Device 18a8
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 0
        Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
            DevCap: MaxPayload 128 bytes, PhantFunc 0 ExtTag- RBE+
            DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes
            DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
            LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s unlimited, L1 <16us ClockPM- Surprise+ LLActRep+ BwNot+
            LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
            LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
            RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
            RootCap: CRSVisible-
            RootSta: PME ReqID 0000, PMEStatus- PMEPending-
            DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
            DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
            LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB
            LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [e0] Power Management version 3
            Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
            Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
        Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?>
        Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
        Capabilities: [280 v1] Vendor Specific Information: ID=0004 Rev=2 Len=018 <?>

    00:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
        I/O behind bridge: 00006000-00006fff
        Memory behind bridge: f7c00000-f7ffffff
        Prefetchable memory behind bridge: 00000000f6800000-00000000f6bfffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Subsystem: Hewlett-Packard Company Device 18a8
        Capabilities: [60] MSI: Enable- Count=1/2 Maskable+ 64bit-
            Address: 00000000 Data: 0000
            Masking: 00000000 Pending: 00000000
        Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
            DevCap: MaxPayload 256 bytes, PhantFunc 0 ExtTag- RBE+
            DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 256 bytes, MaxReadReq 128 bytes
            DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
            LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s unlimited, L1 <16us ClockPM- Surprise+ LLActRep+ BwNot+
            LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
            LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
            RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
            RootCap: CRSVisible-
            RootSta: PME ReqID 0000, PMEStatus- PMEPending-
            DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd+
            DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd+
            LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB
            LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [e0] Power Management version 3
            Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
            Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
        Capabilities: [110 v1] Access Control Services
            ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
            ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [148 v1] Advanced Error Reporting
            UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
            UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSVi

The specific device they are trying to pass through is an Intel 82599.

Please attach the full lspci -vvv as well as the output of `find /sys/kernel/iommu_groups`, the snippet in comment 4 doesn't tell us anything.

I'm confused by the before vs after in the lspci and group output. Before and after what?
I see 82599s here:

    04:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable- Count=64 Masked-
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
        Capabilities: [e0] Vital Product Data
        Capabilities: [100 v1] Advanced Error Reporting
        Capabilities: [140 v1] Device Serial Number 38-ea-a7-ff-ff-32-be-f0
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
    04:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable- Count=64 Masked-
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
        Capabilities: [e0] Vital Product Data
        Capabilities: [100 v1] Advanced Error Reporting
        Capabilities: [140 v1] Device Serial Number 38-ea-a7-ff-ff-32-be-f0
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)

These are behind the following root port:

    00:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07) (prog-if 00 [Normal decode])
        Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
        Capabilities: [40] Subsystem: Hewlett-Packard Company Device 18a8
        Capabilities: [60] MSI: Enable- Count=1/2 Maskable+ 64bit-
        Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
        Capabilities: [e0] Power Management version 3
        Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
        Capabilities: [110 v1] Access Control Services
            ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
            ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        Capabilities: [148 v1] Advanced Error Reporting
        Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
        Capabilities: [250 v1] #19
        Capabilities: [280 v1] Vendor Specific Information: ID=0004 Rev=2 Len=018 <?>

As shown, the root port does support ACS and isolation is enabled at the root port. The multifunction 82599ES does not support ACS, resulting in the functions being grouped together. This is reflected in the IOMMU groups:

    /sys/kernel/iommu_groups/21/devices/0000:04:00.0
    /sys/kernel/iommu_groups/21/devices/0000:04:00.1

This is all working as expected and is consistent with what we've heard from Intel that individual PF assignment is not supported on devices supporting SR-IOV. The VFs produced from these functions should each still be in separate IOMMU groups. What's the actual request here, to separate the two 82599ES functions into separate groups? Why?

The Intel X540-AT2 10G controller does support ACS and is configured into separate IOMMU groups.

(In reply to Alex Williamson from comment #10)
> The Intel X540-AT2 10G controller does support ACS and is configured into
> separate IOMMU groups.

Interesting, do you happen to have any insight into whether the Intel X520 adapters support this or is the above the only one? This query comes from another independently raised customer case.

Do we have a specific contact at Intel on the hardware side to work through these with? I have up until now been working with their software teams working on OpenStack, Libvirt, qemu and DPDK.

Re-opening, Intel is confirming multiple 82599 devices and X520 devices do have isolation between functions. Quirks will be required to incorporate this into IOMMU grouping.

Patch(es) available on kernel-3.10.0-193.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHSA-2015-0290.html
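
For anyone repeating the ACS comparison made in this report, the check on the `ACSCtl` line of `lspci -vvv` output can be sketched roughly as below. This is an illustration only: the helper names are hypothetical, the input is assumed to have one field per line as lspci prints it, and the set of flags treated as required is an assumption taken from the four control bits shown enabled on the root port in this thread (SrcValid, ReqRedir, CmpltRedir, UpstreamFwd).

```python
import re

# Assumption: the control bits that must be enabled for isolation are the
# four shown as '+' on the root port's ACSCtl line in this report.
REQUIRED_FLAGS = ("SrcValid", "ReqRedir", "CmpltRedir", "UpstreamFwd")


def parse_flags(line):
    """Turn 'ACSCtl: SrcValid+ TransBlk- ...' into {'SrcValid': True, ...}."""
    _, _, rest = line.partition(":")
    return {name: sign == "+" for name, sign in re.findall(r"(\w+)([+-])", rest)}


def acs_isolation_enabled(lspci_text):
    """Check the first ACSCtl line found in the given lspci text.

    Returns True if every flag in REQUIRED_FLAGS is enabled, False if any
    is disabled or missing, and None if there is no ACSCtl line at all
    (for example, an endpoint without an ACS capability).
    """
    for line in lspci_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ACSCtl:"):
            flags = parse_flags(stripped)
            return all(flags.get(name, False) for name in REQUIRED_FLAGS)
    return None
```

Fed the two root port dumps quoted in this report, the first (all `ACSCtl` bits cleared) would come back as not isolated and the second as isolated, while the 82599ES functions have no `ACSCtl` line to check.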