1131552 – Solarflare devices do not provide PCIe ACS support, limiting device assignment use case due to IOMMU grouping

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	7.0
Hardware:	x86_64
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Alex Williamson
QA Contact:	Yulong Pei
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1140344 1141399 1158316

Reported:	2014-08-19 14:40 UTC by Edward Cree
Modified:	2015-08-02 23:33 UTC
CC List:	12 users
Fixed In Version:	kernel-3.10.0-193.el7
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1158316
Environment:
Last Closed:	2015-03-05 12:38:59 UTC
Target Upstream Version:
Embargoed:

Attachments
Domain XML from step 2 (2.34 KB, text/plain) 2014-08-19 14:40 UTC, Edward Cree
lspci output and /proc/cpuinfo (23.48 KB, text/plain) 2014-08-20 11:01 UTC, Edward Cree

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2015:0290	0	normal	SHIPPED_LIVE	Important: kernel security, bug fix, and enhancement update	2015-03-05 16:13:58 UTC

Description Edward Cree 2014-08-19 14:40:35 UTC

Created attachment 928392 [details]
Domain XML from step 2

Description of problem:
When trying to pass non-primary PFs of an sfn7x42q network card (which supports a multi-PF mode of operation) through to a VM, I get a VFIO error "group 1 is not viable, please ensure all devices within the iommu_group are bound to their vfio bus driver."  This may be because the primary PF is being terminated in the host, and not passed through to VMs.  The iommu_group contains all four PFs (the NIC has two ports and is configured with two PFs per port) as well as what appears to be the upstream PCIe port.
So I tried to use legacy KVM device assignment instead, but it appears this has been removed (CONFIG_KVM_DEVICE_ASSIGNMENT=n).
This is a regression from RHEL6.5 in which I am able to create this setup (using legacy KVM device assignment).

Version-Release number of selected component (if applicable):
$ virsh version
Compiled against library: libvirt 1.1.1
Using library: libvirt 1.1.1
Using API: QEMU 1.1.1
Running hypervisor: QEMU 1.5.3

How reproducible:
100% on this machine (a Dell R210 G2), haven't tested on others.

Steps to Reproduce:
1. Configure a Solarflare sfn7x42q NIC in dual-40G mode with two PFs per port (sfboot port-mode=default pf-count=2).  Following steps assume its PCI addresses are 0000:01:00.0 through 0000:01:00.3.
2. 'virsh create' a VM with the following device XML:
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x2'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x3'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
  (The full domain XML for the VM is attached.)
3. 'virsh start' the VM

Actual results:
error: Failed to start domain dibenchvm016
error: internal error: early end of file from monitor: possible problem:
qemu-kvm: -device vfio-pci,host=01:00.2,id=hostdev0,bus=pci.0,addr=0x5: vfio: error, group 1 is not viable, please ensure all devices within the iommu_group are bound to their vfio bus driver.
qemu-kvm: -device vfio-pci,host=01:00.2,id=hostdev0,bus=pci.0,addr=0x5: vfio: failed to get group 1
qemu-kvm: -device vfio-pci,host=01:00.2,id=hostdev0,bus=pci.0,addr=0x5: Device initialization failed.
qemu-kvm: -device vfio-pci,host=01:00.2,id=hostdev0,bus=pci.0,addr=0x5: Device 'vfio-pci' could not be initialized

Expected results:
VM started with a pair of PCIe devices connected to the PFs 0000:01:00.2 and 0000:01:00.3.

Additional info:
When adding <driver name="kvm"/> to the <hostdev/> elements, the 'virsh create' fails with:
  error: Failed to create domain from drop/b.xml
  error: unsupported configuration: host doesn't support legacy PCI passthrough
This is apparently because /boot/rhel7-64/config-3.10.0-123.el7.x86_64 has "# CONFIG_KVM_DEVICE_ASSIGNMENT is not set".

Comment 2 Alex Williamson 2014-08-20 02:40:00 UTC

IOMMU groups are a kernel issue, not libvirt.  Can you provide 'sudo lspci -vvv -s 0000:01:00.'?

IOMMU groups, in part, rely on PCIe ACS (Access Control Services) in order to determine the isolation of devices.  A multifunction device _must_ support ACS in order for the individual functions to be considered isolated.  If it does not, the kernel must make the safe choice and assume the functions are not isolated and place them into the same IOMMU group, introducing the configuration restriction you experience here.

While this change limits configurations that were allowed in RHEL6, those configurations should never have been allowed; they were possible only because ACS was not properly enforced or because the user could disable ACS checking.

We do have support in the kernel for quirking devices that are fully isolated but do not support PCIe ACS.  If Solarflare would like to document PCI device IDs which do not support PCIe ACS and vouch for the isolation between functions, we can update the kernel to take this into account for IOMMU groups.  This bz can be used to provide that documentation and used as reference for upstream patches.

Comment 3 Alex Williamson 2014-08-20 03:01:17 UTC

(In reply to Edward Cree from comment #0)
> iommu_group contains all four PFs (the NIC has two ports and is configured
> with two PFs per port) as well as what appears to be the upstream PCIe port.

If the PCIe root port is included, then the missing ACS support may be not on the multi-function endpoint but on the root port itself.  Without ACS on the root port, we must assume that a transaction can be re-routed at the root port prior to IOMMU translation, therefore breaking isolation of the subordinate devices.  Many Intel root ports fail to provide ACS, but are already quirked, as documented in bug 1037684.  In addition to the device lspci listing, please include lspci -vv of the parent root port as well as lspci -n so that we have the device IDs.  Apologies if I've preemptively mis-titled the bug and endpoint ACS turns out to be present.
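
The reasoning above can be sketched as a toy userspace model: a device counts as isolated only if no hop between it and the IOMMU can redirect its traffic. The names `toy_dev` and `toy_path_isolated` are hypothetical and chosen for illustration; this is not the kernel's implementation, only a sketch of the rule it has to enforce.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one PCIe device on the path towards the IOMMU. */
struct toy_dev {
	const char *name;
	bool isolates;                  /* ACS (or a vendor-backed quirk) guarantees isolation */
	const struct toy_dev *upstream; /* NULL once the top of the hierarchy is reached */
};

/* A device is isolated only if every hop on its upstream path isolates. */
static bool toy_path_isolated(const struct toy_dev *dev)
{
	for (; dev != NULL; dev = dev->upstream)
		if (!dev->isolates)
			return false;
	return true;
}
```

In this model, an endpoint whose functions are isolated still fails the check when its root port has `isolates` set to false, which mirrors the situation where the root port appears in the same iommu_group as the endpoint functions.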

Comment 4 Alex Williamson 2014-08-20 03:27:23 UTC

I also note that the test system is a Dell R210 G2, which supports Intel Xeon E3-1200 series processors.  Intel has updated their processor specifications for the E3-1200 V3, which may apply to this system if one of these processors is used and the device is hosted from a processor-based PCIe root port:

https://www-ssl.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v3-spec-update.pdf

---
HSW2. Intel® Virtualization Technology (Intel® VT) Clarification

Section 3.1 will be modified to include the following paragraph:

It is recommended to avoid device direct assignment to Virtual Machines in virtualized environments with this processor due to the lack of Access Control Services (ACS) support in PCI-Express root ports. Some Operating Systems may check for ACS support and potentially disable direct device assignment (that is, affects SR-IOV setup/configuration within the server) as well.
---

Similar statements may apply to other generations of the E3-1200 series processor if PCIe ACS is not supported by the root port.

Comment 5 Edward Cree 2014-08-20 11:01:11 UTC

Created attachment 928762 [details]
lspci output and /proc/cpuinfo

Comment 6 Alex Williamson 2014-08-20 13:02:51 UTC

(In reply to Edward Cree from comment #5)
> Created attachment 928762 [details]
> lspci output and /proc/cpuinfo

It looks like ACS support is missing at both the Solarflare multi-function endpoint and the PCIe root port.  We can only do something about the former; Intel has specifically recommended against using the latter for device assignment.  Please test the device in a slot where PCIe ACS is supported.  Note that most PCH PCIe root ports do support isolation via quirks rather than ACS support (X79 support is forthcoming).

If Solarflare wishes to pursue ACS quirks for this device, please verify that the device meets the requirements in section 6.12.1.2 of the PCIe 3.0 specification, which in effect means that the device does not allow peer-to-peer between functions and requires all transactions to be forwarded upstream.  Please also provide the set of PCI IDs for which this has been verified.  Future products should include PCIe ACS support to avoid this issue.

Comment 7 Robert Stonehouse 2014-08-22 09:45:21 UTC

Alex,
Many thanks for the detailed info.

This has caused some internal discussion at Solarflare on what the PCIe spec states.
The overall conclusion was that the behaviour of peer-to-peer transactions was only defined if ACS was present.
I will ask for this to be considered for future designs.

I can confirm that Solarflare engineers have examined the requirements in section 6.12.1.2 of the PCIe 3.0 specification, there is no ability for our PCIe core to allow transactions between the functions on the device.

Therefore we would like to apply for a quirk for Vendor:Device
  1924:0903
  1924:0923

Is this best in this bug report? or does the request need to be made elsewhere?

Comment 8 Robert Stonehouse 2014-08-22 09:47:24 UTC

I have also been requested to ask if implementing ACS capability structure, but claiming no ACS support would be sufficient for this case?

Comment 9 Alex Williamson 2014-08-22 16:35:10 UTC

(In reply to Robert Stonehouse from comment #7)
> Alex,
> Many thanks for the detailed info.
> 
> This has caused some internal discussion at Solarflare on what the PCIe spec
> states.
> The overall conclusion was that the behaviour of peer-to-peer transactions
> was only defined if ACS was present.

Yes, that's our understanding as well.  The ability to change behavior via the ACS capability is not required, but the presence of an ACS capability is required for the vendor to indicate that peer-to-peer is isolated.

> I will ask for this to be considered for future designs.
> 
> I can confirm that Solarflare engineers have examined the requirements in
> section 6.12.1.2 of the PCIe 3.0 specification, there is no ability for our
> PCIe core to allow transactions between the functions on the device.
> 
> Therefore we would like to apply for a quirk for Vendor:Device
>   1924:0903
>   1924:0923
> 
> Is this best in this bug report? or does the request need to be made
> elsewhere?

Here is fine if that works for you.  Another option would be to Cc someone from Solarflare on the upstream patch and ask them to reply with an Acked-by or Reviewed-by sign-off.  Often a device will be quirked without vendor approval, but in this case the quirk is making a very low level statement about the behavior of the device internally.  It's difficult to make such a statement with observation alone, so we really want the vendor to be involved.

Here's an (untested) version of what this patch would look like:

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3664,6 +3664,29 @@ static int pci_quirk_intel_pch_acs(struct pci_dev *dev, u
        return acs_flags & ~flags ? 0 : 1;
 }
 
+static int pci_quirk_solarflare_acs(struct pci_dev *dev, u16 acs_flags)
+{
+       int pos;
+
+       /* Allow ACS support to override this quirk */
+       pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
+       if (pos)
+               return -ENOTTY;
+
+       /*
+        * SV, TB, and UF are not relevant to multifunction endpoints.
+        *
+        * Solarflare indicates that peer-to-peer between functions is not
+        * possible, therefore RR, CR, and DT are not implemented.  Mask
+        * these out as if they were clear in the ACS capabilities register.
+        * https://bugzilla.redhat.com/show_bug.cgi?id=1131552
+        */
+       acs_flags &= ~(PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
+                      PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
+
+       return acs_flags ? 0 : 1;
+}
+
 static const struct pci_dev_acs_enabled {
        u16 vendor;
        u16 device;
@@ -3675,6 +3698,8 @@ static const struct pci_dev_acs_enabled {
        { PCI_VENDOR_ID_ATI, 0x439d, pci_quirk_amd_sb_acs },
        { PCI_VENDOR_ID_ATI, 0x4384, pci_quirk_amd_sb_acs },
        { PCI_VENDOR_ID_ATI, 0x4399, pci_quirk_amd_sb_acs },
+       { PCI_VENDOR_ID_SOLARFLARE, 0x0903, pci_quirk_solarflare_acs },
+       { PCI_VENDOR_ID_SOLARFLARE, 0x0923, pci_quirk_solarflare_acs },
        { PCI_VENDOR_ID_INTEL, PCI_ANY_ID, pci_quirk_intel_pch_acs },
        { 0 }
 };

If there's no chance that these device IDs will be used by a product that supports ACS, the first test can be removed.  When creating IOMMU groups we test for the following flags:

(PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)

So the above quirk will return true (>0) when those flags are requested.
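
As a quick check of that claim, the masking logic can be reproduced in userspace. This is a minimal sketch, assuming the ACS bit values below match those in the kernel's include/uapi/linux/pci_regs.h; `model_solarflare_acs` and `IOMMU_GROUP_ACS_FLAGS` are hypothetical names for illustration, and the capability lookup from the patch is left out.

```c
#include <stdint.h>

/* ACS flag bits (assumed to match include/uapi/linux/pci_regs.h). */
#define PCI_ACS_SV 0x0001 /* Source Validation */
#define PCI_ACS_TB 0x0002 /* Translation Blocking */
#define PCI_ACS_RR 0x0004 /* P2P Request Redirect */
#define PCI_ACS_CR 0x0008 /* P2P Completion Redirect */
#define PCI_ACS_UF 0x0010 /* Upstream Forwarding */
#define PCI_ACS_EC 0x0020 /* P2P Egress Control */
#define PCI_ACS_DT 0x0040 /* Direct Translated P2P */

/* Flags requested when creating IOMMU groups, as listed in the comment above. */
#define IOMMU_GROUP_ACS_FLAGS (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)

/* Userspace stand-in for the quirk body: 1 if every requested flag is covered. */
static int model_solarflare_acs(uint16_t acs_flags)
{
	/* Clear the flags the quirk treats as satisfied by the hardware design. */
	acs_flags &= ~(PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
		       PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);

	/* Anything left over (for example egress control) is not covered. */
	return acs_flags ? 0 : 1;
}
```

Calling `model_solarflare_acs(IOMMU_GROUP_ACS_FLAGS)` yields 1, while a request that also includes `PCI_ACS_EC` yields 0, because egress control is not masked out by the quirk.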

(In reply to Robert Stonehouse from comment #8)
> I have also been requested to ask if implementing ACS capability structure,
> but claiming no ACS support would be sufficient for this case?

Yes, when an ACS capability is present, we use this test:

drivers/pci/pci.c:
static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
{
        int pos;
        u16 cap, ctrl;

        pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
        if (!pos)
                return false;

        /*
         * Except for egress control, capabilities are either required
         * or only required if controllable.  Features missing from the
         * capability field can therefore be assumed as hard-wired enabled.
         */
        pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
        acs_flags &= (cap | PCI_ACS_EC);

        pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
        return (ctrl & acs_flags) == acs_flags;
}

Therefore if the ACS capability register reads 0x0, all of those flags are handled as not controllable due to lack of peer-to-peer in the device.  If you approve of the above patch I can build and provide a kernel package for testing and post upstream.
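
The effect of a capability register reading 0x0 can be traced with a small userspace model of the test quoted above. This is a sketch only: `model_acs_flags_enabled` is a hypothetical name, the register values are passed in as parameters instead of being read from config space, and the bit values are assumed to match include/uapi/linux/pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* ACS flag bits (assumed to match include/uapi/linux/pci_regs.h). */
#define PCI_ACS_SV 0x0001 /* Source Validation */
#define PCI_ACS_RR 0x0004 /* P2P Request Redirect */
#define PCI_ACS_CR 0x0008 /* P2P Completion Redirect */
#define PCI_ACS_UF 0x0010 /* Upstream Forwarding */
#define PCI_ACS_EC 0x0020 /* P2P Egress Control */

/*
 * Model of the check for a device that has an ACS capability.
 * cap and ctrl stand in for the ACS capability and control registers.
 */
static bool model_acs_flags_enabled(uint16_t cap, uint16_t ctrl,
				    uint16_t acs_flags)
{
	/*
	 * Features missing from cap (other than egress control) are
	 * treated as hard-wired enabled, so they drop out of the request.
	 */
	acs_flags &= (cap | PCI_ACS_EC);

	/* Every remaining requested feature must be enabled in ctrl. */
	return (ctrl & acs_flags) == acs_flags;
}
```

With `cap` and `ctrl` both 0x0 and a request of `PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF`, the request is masked down to 0 and the function returns true. If `cap` advertised RR and CR as controllable but `ctrl` left them disabled, the same request would return false.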

Comment 10 Robert Stonehouse 2014-08-22 17:05:29 UTC

(In reply to Alex Williamson from comment #9)
> Here is fine if that works for you.  Another option would be to Cc someone
> from Solarflare on the upstream patch and ask them to reply with an Acked-by
> or Reviewed-by sign-off.  Often a device will be quirked without vendor
> approval, but in this case the quirk is making a very low level statement
> about the behavior of the device internally.  It's difficult to make such a
> statement with observation alone, so we really want the vendor to be
> involved.

Thanks; I will be happy to Ack when something goes upstream.

> If there's no chance that these device IDs will be used by a product that
> supports ACS, the first test can be removed.

I can confirm that this is true i.e. there is no chance that ACS will be available on devices with these IDs.

> Therefore if the ACS capability register reads 0x0, all of those flags are
> handled as not controllable due to lack of peer-to-peer in the device.  If
> you approve of the above patch I can build and provide a kernel package for
> testing and post upstream.

Patch looks good.
If it is easy for you to make a RHEL7 test kernel then that would be appreciated; and we can test (but there is no rush on this).

Many thanks for your help

Comment 11 Alex Williamson 2014-08-22 19:49:50 UTC

Kernel for testing:

http://people.redhat.com/~alwillia/bz1131552/

This is RHEL7 GA + the above patch, minus the unnecessary ACS capability test.  I've also included support for Intel X79 PCH root port ACS, which will be included in 7.1, in case it makes testing easier (you'll still need to move the device to a different slot to avoid the lack of ACS support on E3-1200 series processor root ports).  Please let us know when you've been able to confirm this resolves the issue.

Comment 12 Edward Cree 2014-09-12 16:10:15 UTC

(In reply to Alex Williamson from comment #11)
> Kernel for testing:
> 
> http://people.redhat.com/~alwillia/bz1131552/

I tested this kernel, and it appears to resolve the issue.
I was testing on a different machine to avoid the "processor root port lacks ACS" issue.  However, I successfully reproduced the problem when running the stock rhel7.0 kernel on this machine.

Running with the kernel from Comment 11, the PFs appeared in separate iommu_groups to each other, and I was able to start VMs using PCIe hostdev.

So I believe the patch as of Comment 11 fixes this.

Comment 13 Alex Williamson 2014-09-17 15:02:29 UTC

Thanks for the test report, patch posted upstream.  Robert, please ACK the upstream patch you were cc'd on to help speed along the upstreaming process.

Comment 14 Robert Stonehouse 2014-09-17 16:59:53 UTC

(In reply to Alex Williamson from comment #13)
> Thanks for the test report, patch posted upstream.  Robert, please ACK the
> upstream patch you were cc'd on to help speed along the upstreaming process.

Done.

Please excuse any horrid e-mail footers that the company e-mail server appends. I need to spend more time to try and persuade people that they are not appropriate for people that need to post on mailing lists.

Comment 15 Jarod Wilson 2014-10-24 13:11:20 UTC

Patch(es) available on kernel-3.10.0-193.el7

Comment 22 errata-xmlrpc 2015-03-05 12:38:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0290.html
