Bug 523819 - Block assignment of devices below non-ACS switch
Summary: Block assignment of devices below non-ACS switch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Don Dugger
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 526713
 
Reported: 2009-09-16 18:58 UTC by Don Dugger
Modified: 2014-07-25 03:22 UTC (History)

Fixed In Version: xen-3.0.3-97.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 526713 (view as bug list)
Environment:
Last Closed: 2010-03-30 08:58:03 UTC
Target Upstream Version:
Embargoed:


Attachments
dmidecode output (8.40 KB, text/plain)
2010-01-15 03:29 UTC, Don Dugger


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2010:0294 0 normal SHIPPED_LIVE xen bug fix and enhancement update 2010-03-29 14:20:32 UTC

Description Don Dugger 2009-09-16 18:58:12 UTC
Description of problem:
PCIe switches allow peer-to-peer transactions that are routed by the switch and
could bypass the VT-d translation hardware, potentially causing unexpected
behavior in the system.  ACS allows the system to force the PCIe switch to route
all traffic upstream so that the VT-d hardware can validate all transactions.  The virtualization management tools should not allow direct assignment of a device that is below a non-ACS-enabled PCIe switch to a guest.
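
As a rough illustration of the check described above, the sketch below parses a raw PCIe configuration-space dump and reports whether a port has the ACS extended capability with its redirect controls turned on. The capability ID, register offsets, and bit values follow the PCIe ACS extended capability layout. The function names and the choice of which control bits count as "enabled" are assumptions made for illustration; this is not the logic of the actual Xen patch.

```python
import struct

PCI_EXT_CAP_START = 0x100    # extended capabilities start at this offset
PCI_EXT_CAP_ID_ACS = 0x000D  # Access Control Services capability ID
PCI_ACS_CAP = 0x04           # ACS Capability register (16 bits)
PCI_ACS_CTRL = 0x06          # ACS Control register (16 bits)
PCI_ACS_SV = 0x01            # Source Validation
PCI_ACS_RR = 0x04            # P2P Request Redirect
PCI_ACS_CR = 0x08            # P2P Completion Redirect
PCI_ACS_UF = 0x10            # Upstream Forwarding

# Assumption: these four controls must be both supported and enabled.
REQUIRED = PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF


def find_ext_cap(cfg, cap_id):
    """Return the offset of an extended capability, or None if absent."""
    offset = PCI_EXT_CAP_START
    visited = set()  # guards against a malformed, looping capability list
    while (offset >= PCI_EXT_CAP_START and offset + 4 <= len(cfg)
           and offset not in visited):
        visited.add(offset)
        (header,) = struct.unpack_from("<I", cfg, offset)
        if header in (0, 0xFFFFFFFF):
            return None  # no (further) extended capabilities
        if header & 0xFFFF == cap_id:
            return offset
        offset = (header >> 20) & 0xFFC  # next-capability pointer
    return None


def acs_enabled(cfg):
    """True if the port advertises and enables the REQUIRED ACS controls."""
    offset = find_ext_cap(cfg, PCI_EXT_CAP_ID_ACS)
    if offset is None or offset + 8 > len(cfg):
        return False
    cap, ctrl = struct.unpack_from("<HH", cfg, offset + PCI_ACS_CAP)
    return (cap & REQUIRED) == REQUIRED and (ctrl & REQUIRED) == REQUIRED
```

On Linux the raw bytes could be read from `/sys/bus/pci/devices/<BDF>/config`, although reading past the first 256 bytes generally requires root.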


Version-Release number of selected component (if applicable):
RHEL 5.4

Comment 1 Bill Burns 2009-09-24 23:55:29 UTC
Thanks for posting this!

Comment 3 Bill Burns 2009-09-25 12:37:00 UTC
Don, I thought the solution for this was to be common to both Xen and KVM and thus a libvirt patch. This is a Xen patch. Is there an equivalent patch to deal with this for KVM?

Comment 4 Don Dugger 2009-09-25 13:53:52 UTC
We've created a patch for Xen that shows how to do the check.  My understanding is that Chris is aware of the issue and was going to look into creating a similar solution for KVM.

Comment 8 Mark McLoughlin 2009-09-29 10:59:49 UTC
First thing that strikes me is that this non-ACS PCIe switch issue doesn't just affect device assignment, it also affects device isolation - we should at least have the kernel print a warning if this issue is undermining device isolation on a given machine.

Another point is that having a non-ACS PCIe switch is only an issue where there are multiple devices behind that switch and those devices are assigned to different IOMMU domains, correct?

If that's the case, we should treat it similar to some non-FLR device reset scenarios - that is, you can assign these devices to a guest, but only if you assign all devices behind the switch to the same guest.

So, IMHO - it makes sense for this code to go along with the PCI device reset code in xen and libvirt.

i.e. we should have three bugs:

  1) kernel should print a warning about non-ACS PCIe switches where IOMMU
     device isolation is undermined

  2) xen should block assigning devices behind non-ACS PCIe switches, where
     different devices behind the same switch would be assigned to different
     domains

  3) libvirt should do likewise

Comment 9 Don Dugger 2009-09-30 19:23:16 UTC
Not sure what the concerns are about device isolation.  The only way to trigger a fault is to present a bad DMA address.  For VMs the filtering avoids the problem by blocking assignment of affected devices to different VMs.  For the host OS you would need a malicious driver and, if you have a malicious driver in your host, this is the least of your problems.

Comment 10 Mark McLoughlin 2009-10-01 07:12:17 UTC
(In reply to comment #9)
> For the host OS you would need a malicious driver and, if you have a malicious 
> driver in your host, this is the least of your problems.  

Agreed, but what does device isolation protect against then?

I'm more interested in whether you agree that we should allow assignment of devices behind non-ACS switches so long as no other devices behind that switch are assigned to another domain.

Comment 11 Don Dugger 2009-10-01 15:09:52 UTC
My issue with multiple device assignment is the complexity of checking for this condition.  When you consider the tree-like structure of the PCI topology, where a device can be attached to switch A, which is in turn attached to switch B, things get complicated.  Now you have to verify that all devices attached to switch A, all devices attached to switch B, and all devices attached to switches below switch B are assigned to the same guest.  The device assignment filtering code is going to become very complicated, will require persistent state info, and will be potentially error-prone.
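
To make the topology walk concrete, here is a minimal sketch of the "same guest" check over a toy PCI tree. The `PciNode` class, its fields, and the function names are all hypothetical and exist only for illustration. The sketch also assumes that unassigned devices below the switch are acceptable; a stricter policy would reject those as well.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class PciNode:
    bdf: str
    is_switch: bool = False
    acs_enabled: bool = False    # only meaningful for switches
    owner: Optional[str] = None  # guest that owns an assigned endpoint
    parent: Optional["PciNode"] = field(default=None, repr=False)
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach child below this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def topmost_non_acs_switch(dev):
    """Walk towards the root and return the highest non-ACS switch, if any."""
    top = None
    node = dev.parent
    while node is not None:
        if node.is_switch and not node.acs_enabled:
            top = node
        node = node.parent
    return top


def endpoints_below(node):
    """Yield every endpoint in the subtree under node."""
    for child in node.children:
        if child.is_switch:
            yield from endpoints_below(child)
        else:
            yield child


def can_assign(dev, guest):
    """Allow assignment only if no other endpoint under the topmost
    non-ACS switch belongs to a different guest."""
    top = topmost_non_acs_switch(dev)
    if top is None:
        return True
    return all(ep.owner in (None, guest)
               for ep in endpoints_below(top) if ep is not dev)
```

Even in this toy form the result depends on the `owner` of every other endpoint in the subtree, which is the persistent state the comment above is worried about.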

Given that we predicate the filtering code on the `strict-device-assignment' configuration flag, I don't think we need to do this.  For the default user we go the safe route and don't allow assignment of devices below any non-ACS switch.  The more sophisticated user can turn off the `strict-device-assignment' config flag and then they will be able to assign devices at will.  Of course, we should document that devices below non-ACS switches should not be assigned to different guests.
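
The flag-gated policy proposed here can be sketched in a few lines. The function, exception, and parameter names below are hypothetical, and the strict-check setting is modelled as a plain boolean rather than read from the xend configuration. The error text mirrors the message shown in the verification output later in this bug.

```python
class AssignmentBlocked(Exception):
    """Raised when the strict check refuses a device assignment."""


def check_assignment(bdf, behind_non_acs_switch, strict_check=True):
    """Return True if the device may be assigned, else raise.

    strict_check stands in for the strict device assignment config flag:
    when it is off, the filter is skipped and the caller takes the risk.
    """
    if strict_check and behind_non_acs_switch:
        raise AssignmentBlocked(
            "pci: to avoid potential security issue, %s is not allowed "
            "to be assigned to guest since it is behind PCIe switch that "
            "does not support or enable ACS." % bdf)
    return True
```

With the default `strict_check=True`, any device below a non-ACS switch is refused outright; passing `strict_check=False` corresponds to the sophisticated-user case, where assignment proceeds without the filter.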

Comment 12 Chris Wright 2009-10-02 16:04:57 UTC
(In reply to comment #8)
> First thing that strikes me is that this non-ACS PCIe switch issue doesn't just
> affect device assignment, it also affects device isolation - we should at least
> have the kernel print a warning if this issue is undermining device isolation
> on a given machine.

By device isolation, are you referring to an IOMMU protection domain for a device that is not assigned to a guest so that errant DMA can not destabilize the bare metal kernel?  This could allow a device to generate p2p traffic to another device w/out any IOMMU translations allowing it.

> Another point is that having a non-ACS PCIe switch is only an issue where there
> are multiple devices behind that switch and those devices are assigned to
> different IOMMU domains, correct?

No, I don't think so.  Would be a less severe problem, but the guest could find its devices not functioning properly.  So this mode should not be allowed.

> If that's the case, we should treat it similar to some non-FLR device reset
> scenarios - that is, you can assign these devices to a guest, but only if you
> assign all devices behind the switch to the same guest.
> 
> So, IMHO - it makes sense for this code to go along with the PCI device reset
> code in xen and libvirt.

Not sure if it can go directly w/ it, but certainly very similar logic.
 
> i.e. we should have three bugs:
> 
>   1) kernel should print a warning about non-ACS PCIe switches where IOMMU
>      device isolation is undermined
> 
>   2) xen should block assigning devices behind non-ACS PCIe switches, where
>      different devices behind the same switch would be assigned to different
>      domains
> 
>   3) libvirt should do likewise

I agree, this should be cloned for libvirt.  It also needs to pertain to multifunction devices (quite ugly).

Comment 13 Mark McLoughlin 2009-10-02 17:15:23 UTC
(In reply to comment #12)
> (In reply to comment #8)

> > Another point is that having a non-ACS PCIe switch is only an issue where
> > there are multiple devices behind that switch and those devices are assigned to
> > different IOMMU domains, correct?
> 
> No, I don't think so.  Would be a less severe problem, but the guest could find
> its devices not functioning properly.  So this mode should not be allowed.

I'm not sure I'm following correctly, but IMHO if a device behind a non-ACS bridge should never be permitted to be assigned to a guest, the *kernel* should refuse it. If it's an issue of whether different devices behind the bridge are in different domains, then it sounds more like a job for libvirt. That's the core of what I'm trying to get at here.

Comment 14 Chris Wright 2009-10-02 19:17:41 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > (In reply to comment #8)
> > > Another point is that having a non-ACS PCIe switch is only an issue where
> > > there are multiple devices behind that switch and those devices are assigned to
> > > different IOMMU domains, correct?
> > 
> > No, I don't think so.  Would be a less severe problem, but the guest could find
> > its devices not functioning properly.  So this mode should not be allowed.
> 
> I'm not sure I'm following correctly, but IMHO if a device behind a non-ACS bridge
> should never be permitted to be assigned to a guest, the *kernel* should refuse
> it. If it's an issue of whether different devices behind the bridge are in
> different domains, then it sounds more like a job for libvirt. That's the core
> of what I'm trying to get at here.

A device behind a PCIe switch that either does not support ACS (at all) or has not had ACS enabled should not be allowed to be assigned to a guest.  It may be reasonable to filter this from the kernel.  It gets ugly w/ multifunction devices though.

Comment 15 Mark McLoughlin 2009-10-05 09:55:29 UTC
Moving discussion to bug #526713

Comment 18 Jiri Denemark 2009-11-13 22:23:38 UTC
Fix built into xen-3.0.3-97.el5

Comment 24 XinSun 2010-01-08 08:35:54 UTC
Thanks to Don Dugger for verifying this bug on xen-libs-3.0.3-94.el5. His verification steps, copied from his emails, are as follows:


1) Installed a native RHEL5.4 in a new partition;

2) Installed the xen packages:
	a) installed kernel-xen-2.6.18-164.el5bz547980v2.x86_64.rpm: from
		https://bugzilla.redhat.com/show_bug.cgi?id=547980: Comment #20.  
		(BTW, I first tried kernel-xen-2.6.18-164.el5.x86_64.rpm from the
		RHEL5.4 ISO, but dom0 complained "PCI: Cannot map mmconfig aperture
		for segment 0" on my NHM-HEDT host, so I followed BZ 547980 to get
		the good rpm.)

	b) installed xen-libs-3.0.3-94.el5.x86_64.rpm from the RHEL5.4 ISO

	c) rpm -ivh xen-3.0.3-102.x86_64.rpm --nodeps. This rpm is from you. 
	
3) My grub.conf:
title Red Hat Enterprise Linux Server (2.6.18-164.el5bz547980v2xen)
    root (hd0,0)
    kernel /boot/xen.gz-2.6.18-164.el5bz547980v2 iommu=1
    module /boot/vmlinuz-2.6.18-164.el5bz547980v2xen ro root=LABEL=/ pci_pt_e820_access=on
    module /boot/initrd-2.6.18-164.el5bz547980v2xen.img

4) After booting into the new xen/dom0 environment, I unloaded igb driver
and loaded pciback and hid the 4 PFs of the 4-port Kewela NIC.

5) Tried to assign any PF to HVM guest and the ACS filtering code prevented
the guest creation as we expected:

[root@localhost ~]#  xm create 32e_rhel5u2.hvm
Using config file "./32e_rhel5u2.hvm".
Error: pci: to avoid potential security issue, 0000:03:00.1 is not allowed
to be assigned to guest since it is behind PCIe switch that does not support
or enable ACS.

6) I turned off the "pci-dev-assign-strict-check" option in
/etc/xen/xend-config.sxp and did "xend restart", and re-did step 5 and I
could create the hvm guest successfully and the NIC could work fine inside
the hvm guest (of course, this kind of assignment is potentially unsafe.)


Based on his comments, I am changing this bug's status to VERIFIED.

Comment 25 Don Dutile (Red Hat) 2010-01-14 22:03:48 UTC
Question wrt step 2a in C#24:

-- was the machine an HP z800?  
     -- pls attach the dmidecode of the host machine.

The patch in V2 was a hardcode patch for all machines;
the eventual patch for bz547980 was restricted to hpz800's
for avoidance of regressions on other machines for rhel5.

if it's another machine, i need to know *NOW* -- I'm just
about to post the patch for inclusion in rhel5.5.

Thanks... Don (Dutile)

Comment 26 XinSun 2010-01-15 02:08:53 UTC
Sorry, Don Dutile, but I think you should ask Don Dugger, since he helped us verify this bug. Because we don't have a non-ACS PCIe switch, I wrote to Don Dugger (he is the bug reporter and developer). Only he knows the details and can provide the dmidecode output. Thanks!

Comment 27 Don Dugger 2010-01-15 03:29:10 UTC
Created attachment 384513 [details]
dmidecode output

The dmidecode output is attached.

This was not an HP machine; it was a Tylersburg High End Desktop (HEDT), a Nehalem-based machine.

Comment 29 Don Dutile (Red Hat) 2010-01-15 17:11:19 UTC
Don (Dugger):

So, did you have to use the kernel from bz547980 (el5bz547980v2xen)
in order to get the PCI mmconfig space to be mapped to dom0???

or did you use the kernel 'just in case'?  
if just in case, pls. try latest rhel5.5 kernel-xen &
report if pci mmconfig is seen when pci_pt_e820_access=on
is set.

Comment 30 Don Dugger 2010-01-19 23:09:54 UTC
Per the comments from our PRC engineer Dexuan in comment 24, he did try the RHEL 5.4 kernel (version -164) and it failed.  He then followed the instructions from BZ 547980, installed kernel version 2.6.18-164.el5bz547980v2xen, and it worked.

Comment 31 errata-xmlrpc 2010-03-30 08:58:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0294.html

Comment 32 Paolo Bonzini 2010-04-08 15:49:48 UTC
This bug was closed during 5.5 development and it's being removed from the internal tracking bugs (which are now for 5.6).

