Bug 526713 - Block assignment of devices below non-ACS switch for KVM in libvirt or kernel
Summary: Block assignment of devices below non-ACS switch for KVM in libvirt or kernel
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: libvirt
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Jiri Denemark
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On: 523819
Blocks:
 
Reported: 2009-10-01 13:51 UTC by Bill Burns
Modified: 2012-05-22 03:20 UTC (History)

Fixed In Version: libvirt-0.6.3-31.el5
Doc Type: Bug Fix
Doc Text:
Clone Of: 523819
Environment:
Last Closed: 2010-03-30 08:09:06 UTC
Target Upstream Version:


Attachments
Standalone program to show the concept for non-ACS blocking for libvirt (9.39 KB, patch)
2009-12-14 15:46 UTC, Chris Lalancette
Attempt to port Chris' standalone program to libvirt (3.64 KB, patch)
2009-12-21 10:31 UTC, Jiri Denemark
RHEL-5 port of upstream patch (9.36 KB, patch)
2009-12-21 18:37 UTC, Jiri Denemark
Test package (1.97 MB, application/x-rpm)
2009-12-21 18:45 UTC, Jiri Denemark
RHEL-5 port of upstream patch v2 (10.90 KB, patch)
2009-12-22 18:27 UTC, Jiri Denemark


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2010:0205 0 normal SHIPPED_LIVE libvirt bug fix and enhancement update 2010-03-29 12:27:37 UTC

Description Bill Burns 2009-10-01 13:51:19 UTC
+++ This bug was initially created as a clone of Bug #523819 +++

need to implement something in libvirt for KVM here.
 
Original description:

Description of problem:
PCIe switches allow peer-to-peer transactions that are routed by the switch and
could bypass the VT-d translation hardware, potentially causing unexpected
behavior in the system.  ACS allows the system to force the PCIe switch to route
all traffic upstream so that the VT-d hardware can validate all transactions.  The virtualization management tools should not allow direct assignment of a device that is below a non-ACS-enabled PCIe switch to a guest.


Version-Release number of selected component (if applicable):
RHEL 5.4

--- Additional comment from bburns@redhat.com on 2009-09-24 19:55:29 EDT ---

Thanks for posting this!

--- Additional comment from jdenemar@redhat.com on 2009-09-25 03:33:52 EDT ---

Created an attachment (id=362626)
Check if a device is behind PCIe switch that lacks ACS

Patch that is currently on the git patch queue.

--- Additional comment from bburns@redhat.com on 2009-09-25 08:37:00 EDT ---

Don, I thought the solution for this was to be common to both Xen and KVM and thus a libvirt patch. This is a Xen patch. Is there an equivalent patch to deal with this for KVM?

--- Additional comment from ddugger@redhat.com on 2009-09-25 09:53:52 EDT ---

We've created a patch for Xen that show how to do the check.  My understanding is that Chris is aware of the issue and was going to look into creating a similar solution for KVM.

--- Additional comment from bburns@redhat.com on 2009-09-25 10:23:18 EDT ---

Jiri, not sure if you know enough about libvirt yet to know if a common patch to protect us against this is possible. I think that would be preferable to separate Xen and KVM patches. Can you look into it?

--- Additional comment from jdenemar@redhat.com on 2009-09-29 04:31:21 EDT ---

(In reply to comment #5)
> Jiri, not sure if you know enough about libvirt yet to know if a common patch
> to protect us against this is possible. I think that would be preferable to
> separate Xen and KVM patches. Can you look into it?  

I don't know much about PCI handling in libvirt but I don't feel like it should deal with this kind of stuff. libvirt should only provide APIs for users to be able to assign PCI devices to guests but what PCI devices can be assigned under what conditions should really be decided by the underlying hypervisor. That's my opinion...

--- Additional comment from bburns@redhat.com on 2009-09-29 05:04:42 EDT ---

Thanks Jiri. Chris, so it seems that this solution to prevent problematic device assignment is Xen specific. How would we prevent it for KVM? Would that be in QEMU? Any idea who should be looking at this for the KVM side?

--- Additional comment from markmc@redhat.com on 2009-09-29 06:59:49 EDT ---

First thing that strikes me is that this non-ACS PCIe switch issue doesn't just affect device assignment, it also affects device isolation - we should at least have the kernel print a warning if this issue is undermining device isolation on a given machine.

Another point is that having a non-ACS PCIe switch is only an issue where there are multiple devices behind that switch and those devices are assigned to different IOMMU domains, correct?

If that's the case, we should treat it similar to some non-FLR device reset scenarios - that is, you can assign these devices to a guest, but only if you assign all devices behind the switch to the same guest.

So, IMHO - it makes sense for this code to go along with the PCI device reset code in xen and libvirt.

i.e. we should have three bugs:

  1) kernel should print a warning about non-ACS PCIe switches where IOMMU
     device isolation is undermined

  2) xen should block assigning devices behind non-ACS PCIe switches, where
     different devices behind the same switch would be assigned to different
     domains

  3) libvirt should do likewise
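
The rule proposed in items 2 and 3 can be sketched as a small policy check. This is only an illustration of the idea discussed in this comment, not the code from any of the attached patches; the function name, the dictionaries, and the device and guest labels are all hypothetical. It also simplifies one point: devices behind the switch that are still owned by the host are treated as non-conflicting, which is weaker than requiring every device behind the switch to go to the same guest.

```python
from typing import Dict


def may_assign(device: str,
               domain: str,
               switch_of: Dict[str, str],
               acs_enabled: Dict[str, bool],
               owner_of: Dict[str, str]) -> bool:
    """Return True if `device` may be assigned to `domain` (hypothetical policy).

    switch_of   maps a device address to the switch it sits behind
    acs_enabled maps a switch to whether ACS is enabled on it
    owner_of    maps already-assigned devices to their guest domain
    """
    switch = switch_of.get(device)
    # Devices not behind a switch, or behind an ACS-enabled switch, are fine.
    if switch is None or acs_enabled.get(switch, False):
        return True
    # Behind a non-ACS switch: refuse if a sibling belongs to another domain.
    for other, other_switch in switch_of.items():
        if other == device or other_switch != switch:
            continue
        other_owner = owner_of.get(other)
        if other_owner is not None and other_owner != domain:
            return False
    return True
```

With two functions behind the same non-ACS switch, a call for the second function succeeds only when it targets the guest that already owns the first one.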

--- Additional comment from ddugger@redhat.com on 2009-09-30 15:23:16 EDT ---

Not sure what the concerns are about device isolation.  The only way to trigger a fault is to present a bad DMA address.  For VMs the filtering avoids the problem by blocking assignment of affected devices to different VMs.  For the host OS you would need a malicious driver and, if you have a malicious driver in your host, this is the least of your problems.

--- Additional comment from markmc@redhat.com on 2009-10-01 03:12:17 EDT ---

(In reply to comment #9)
> For the host OS you would need a malicious driver and, if you have a malicious 
> driver in your host, this is the least of your problems.  

Agreed, but what does device isolation protect against then?

More interested whether you agree that we should allow devices behind non-ACS switches so long as there are not other devices behind that switch assigned to another domain?

Comment 1 Mark McLoughlin 2009-10-05 09:56:00 UTC
From bug #523819:

> I'm not sure I'm following correctly, but IMHO if a device behind an ACS bridge
> should never be permitted to be assigned to a guest, the *kernel* should refuse
> it. If it's an issue around whether different devices behind the bridge are
> different domains, then it sounds more like a job for libvirt. That's the core
> of what I'm trying to get at here 

A device behind a PCIe switch that either does not support ACS (at all) or has
not had ACS enabled should not be allowed to be assigned to a guest.  It may be
reasonable to filter this from the kernel.  It gets ugly w/ multifunction
devices though.

Comment 2 Mark McLoughlin 2009-11-09 17:33:27 UTC
AFAIR cdub suggests that while all non-ACS devices should be blocked by default, there should be a whitelist since it will be safe for some devices

Perhaps the whitelist could live in the hwdata package and libvirt would use it, rather than libvirt having to be updated every time we want to add a new device

Chris: am I summarizing correctly?

Comment 3 Chris Wright 2009-11-12 17:13:41 UTC
Yes, that's what I was thinking.  The issue being that technically multifunction devices that don't advertise ACS would all fall into the "can't assign to guest" category based on the possibility that they can initiate P2P traffic between functions (whether they do or not is not externally discoverable).  However, the likelihood they do this is relatively low, and a huge number of NICs are multifunction (e.g. function per port)...same NICs that we'd like to allow users to assign to their guests.

Comment 7 Chris Lalancette 2009-12-14 15:45:34 UTC
ChrisW, 
     I'm going to attach here a standalone program that implements the low-level pieces of the code I think we need to put in libvirt to block devices behind non-ACS switches.  Note that at the moment, I don't have any machines with ACS, so I can't test that it really works.  If you have a machine that could be used for testing, could you give us a pointer to it?
     What remains to be implemented is the logic of the whitelist that you mention in comments #2 and #3.  To be honest, I don't love this idea of the whitelist; not only will we have to maintain some kind of table, we will need to make sure the table is up-to-date every time new hardware comes out.  It also breaks the security of the setup without letting the user know about it (because the device is on a magic whitelist that the user probably won't know anything about).
     I have an alternate proposal.  What if we added a new <permissive/> tag to the libvirt XML for device assignment?  In the normal case, we wouldn't allow *any* passthrough of devices behind non-ACS switches.  However, if the user knows what they are doing, and they want to take this risk, they can add the <permissive/> tag to the XML, in which case it would allow the assignment to happen.  This can even be used pretty successfully in virt-manager; it just needs to catch the appropriate exception from the first assignment, pop-up "This is dangerous because of non-ACS, blah, blah.  Are you sure?", and then re-do the assignment with the <permissive/> tag.  What do you think about this?
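
The opt-in flow proposed above (refuse by default, retry with an explicit permissive flag after the user confirms) might look roughly like the following sketch. Everything here is hypothetical: the exception class, the function names, and the `confirm` callback stand in for whatever libvirt and virt-manager would actually use, and the proposal itself was only a suggestion at this point in the discussion.

```python
from typing import Callable


class NonAcsAssignmentError(Exception):
    """Hypothetical error raised when a device sits behind a non-ACS switch."""


def check_assignment(device: str,
                     behind_non_acs_switch: bool,
                     permissive: bool = False) -> None:
    # Default behaviour: refuse any device behind a non-ACS switch.
    if behind_non_acs_switch and not permissive:
        raise NonAcsAssignmentError(
            f"Device {device} is behind a switch lacking ACS")


def assign_with_confirmation(device: str,
                             behind_non_acs_switch: bool,
                             confirm: Callable[[str], bool]) -> str:
    """Mimic a management tool: try strict first, then ask the user."""
    try:
        check_assignment(device, behind_non_acs_switch)
        return "assigned"
    except NonAcsAssignmentError:
        # The tool would pop up a warning here; `confirm` models the answer.
        if not confirm(device):
            return "refused"
        check_assignment(device, behind_non_acs_switch, permissive=True)
        return "assigned (permissive)"
```

The point of the design is that the risky case never succeeds silently: the caller has to catch the refusal and deliberately repeat the request with the permissive flag set.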

Chris Lalancette

Comment 8 Chris Lalancette 2009-12-14 15:46:16 UTC
Created attachment 378251 [details]
Standalone program to show the concept for non-ACS blocking for libvirt

Comment 9 Jiri Denemark 2009-12-21 10:31:32 UTC
Created attachment 379591 [details]
Attempt to port Chris' standalone program to libvirt

Comment 10 Jiri Denemark 2009-12-21 18:37:53 UTC
Created attachment 379671 [details]
RHEL-5 port of upstream patch

Comment 12 Jiri Denemark 2009-12-21 18:45:01 UTC
Created attachment 379675 [details]
Test package

Comment 13 Jiri Denemark 2009-12-22 18:27:57 UTC
Created attachment 379887 [details]
RHEL-5 port of upstream patch v2

Comment 15 Daniel Veillard 2009-12-22 23:30:51 UTC
libvirt-0.6.3-27.el5 has been built in dist-5E-qu-candidate with the fix

Daniel

Comment 17 Gunannan Ren 2010-01-06 09:37:09 UTC
Hi Bill

How can I identify a machine with an ACS-enabled PCIe switch?
For now, I cannot verify the bug.
Could you help me with this?

Comment 18 Chris Wright 2010-01-06 20:18:56 UTC
You can use an updated lspci and look for the ACS PCIe capability.

http://et.redhat.com/~chrisw/rhel5/5.4/bin/lspci

# ./lspci -vvv
...
	Capabilities: [150] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
...

Note that the ACSCtl register should specifically show these four ACS capabilities enabled:

SrcValid+, ReqRedir+, CmpltRedir+, and UpstreamFwd+
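
For anyone scripting this check, the four flags can be pulled out of the patched lspci text output along the following lines. This is a minimal sketch that assumes the output format shown in this comment (an `ACSCtl:` line with `Name+`/`Name-` tokens); the helper names are made up for illustration.

```python
import re
from typing import Dict, Optional

# The four controls this comment says must be enabled.
REQUIRED = ("SrcValid", "ReqRedir", "CmpltRedir", "UpstreamFwd")


def acsctl_flags(lspci_text: str) -> Optional[Dict[str, bool]]:
    """Parse the first 'ACSCtl:' line into {flag name: enabled}."""
    for line in lspci_text.splitlines():
        line = line.strip()
        if line.startswith("ACSCtl:"):
            tokens = re.findall(r"(\w+)([+-])", line[len("ACSCtl:"):])
            return {name: sign == "+" for name, sign in tokens}
    return None  # no ACS control line found in this output


def acs_fully_enabled(lspci_text: str) -> bool:
    """True only if all four required ACS controls show '+'."""
    flags = acsctl_flags(lspci_text)
    return flags is not None and all(flags.get(n, False) for n in REQUIRED)
```

Output that contains no `ACSCtl:` line at all is treated as "not enabled", which matches the case of a device or switch without the ACS capability.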

Comment 19 Gunannan Ren 2010-01-07 10:09:25 UTC
Hi Chris

   Thank you for your updated lspci command tool.
   But I tried the tool on several boxes, and none of them produces output like yours.
   Why? Could you tell me what your box is?

Comment 20 Gunannan Ren 2010-01-11 03:12:49 UTC
The original bug has been verified by Don Dugger
https://bugzilla.redhat.com/show_bug.cgi?id=523819

Don Dugger, there is no test environment; could you please help verify this bug?

Comment 21 Jiri Denemark 2010-01-28 16:17:46 UTC
Fix built in libvirt-0.6.3-31.el5

Comment 23 Gunannan Ren 2010-02-26 07:53:07 UTC
The bug has been fixed in libvirt-0.6.3-31.el5

1) The "lspci" command lists the PCI devices in the system:
<snip>
...
05:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
06:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
06:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
...
</snip>
2) ./lspci -vvv shows that the device has no ACS function:
...

        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
...

3) Use "virsh nodedev-dettach" to hide all PFs of the NIC

For example: virsh nodedev-dettach pci_8086_10e8
...

4) Add the XML description of the PF to be assigned to the VM to the VM's XML description, as follows:
...
    <hostdev mode='subsystem' type='pci'>
      <source>
        <address bus='6' slot='0' function='1'/>
      </source>
    </hostdev>
...
5) Tried to assign any PF to an HVM guest; the ACS filtering code prevented
the guest from starting, as expected

error: Failed to start domain hvm_acs_test
error: this function is not supported by the hypervisor: Device 0000:06:00.1 is behind a switch lacking ACS and cannot be assigned

Comment 24 Gunannan Ren 2010-02-26 07:55:16 UTC
supplement:
The test is performed on kvm hypervisor
# rpm -qa|grep kvm
kmod-kvm-83-105.el5
etherboot-zroms-kvm-5.4.4-10.el5
kvm-qemu-img-83-105.el5
etherboot-roms-kvm-5.4.4-10.el5
kvm-83-105.el5
kvm-tools-83-105.el5

Comment 26 Chris Wright 2010-02-26 19:14:13 UTC
Note that in Comment #23, step 2 is not showing the device's lack of ACS support.  ACS support for a PCIe device is described in its own PCIe capability.  The description in Comment #18 shows this w/ a modified lspci binary (just first line of capability entry shown here):

  Capabilities: [150] Access Control Services

With a standard RHEL 5 lspci, you'd see an unknown PCIe capability such as:

  Capabilities: [150] Unknown (13)

In the above example the '150' is a device specific offset into the PCIe Extended Configuration Space where the Capability is described.  So '150' is not special here and may be different for different PCIe functions (just needs to be greater than 0xFF).  The PCIe Capability ID for ACS is 0xD (13).  So the string "Access Control Services" (using my patched lspci binary) or the string "Unknown (13)" are the important bit here.

If you are not using a patched lspci binary, it's much more difficult to describe what to look for to see whether ACS support is enabled (it's easy to see whether the device is ACS-capable from the presence or absence of "Capabilities: [???] Unknown (13)").  But I can see you are using the patched lspci since it is properly parsing the ARI Capability.
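
Without a patched lspci, the same information can be read from the raw configuration space. The sketch below walks the PCIe extended capability list (starting at offset 0x100) looking for capability ID 0xD, then checks the four control bits named in Comment #18. The register layout used here (16-bit ACS Control register at offset 6 within the capability, with SrcValid, ReqRedir, CmpltRedir and UpstreamFwd as bits 0, 2, 3 and 4) follows the PCIe specification as I understand it; the function names are hypothetical, and on Linux the input is assumed to be the device's sysfs `config` file, which normally has to be read as root to get the full 4096 bytes.

```python
import struct
from typing import Optional

PCI_EXT_CAP_START = 0x100    # extended capabilities live above 0xFF
PCI_EXT_CAP_ID_ACS = 0x000D  # "Unknown (13)" in an unpatched lspci
ACS_CTRL_OFFSET = 0x06       # ACS Control register within the capability
ACS_REQUIRED_BITS = 0x01 | 0x04 | 0x08 | 0x10  # SV, RR, CR, UF


def find_ext_capability(config: bytes, cap_id: int) -> Optional[int]:
    """Return the config-space offset of an extended capability, or None."""
    offset = PCI_EXT_CAP_START
    seen = set()  # guard against malformed, looping capability lists
    while (offset >= PCI_EXT_CAP_START
           and offset + 4 <= len(config)
           and offset not in seen):
        seen.add(offset)
        (header,) = struct.unpack_from("<I", config, offset)
        if header in (0, 0xFFFFFFFF):
            return None  # empty list or unreadable config space
        if header & 0xFFFF == cap_id:
            return offset
        offset = (header >> 20) & 0xFFC  # next pointer, dword aligned
    return None


def acs_enabled(config: bytes) -> bool:
    """True if ACS exists and all four required controls are set."""
    offset = find_ext_capability(config, PCI_EXT_CAP_ID_ACS)
    if offset is None or offset + ACS_CTRL_OFFSET + 2 > len(config):
        return False
    (ctrl,) = struct.unpack_from("<H", config, offset + ACS_CTRL_OFFSET)
    return ctrl & ACS_REQUIRED_BITS == ACS_REQUIRED_BITS
```

A typical call would read the bytes from a path such as `/sys/bus/pci/devices/<address>/config` and pass them to `acs_enabled`; a short read (only the first 256 bytes or fewer) simply yields `False`.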

Comment 27 Gunannan Ren 2010-02-27 11:06:38 UTC
Got your point, thank you. If the patched lspci shows no separate ACS capability for a PCIe device, it means the device has no ACS support, right?
Comment #18 shows that a device with ACS support prints a separate capability description.

Step 5 in Comment #23 reports the expected errors. Does that indicate that the right PCIe device was used for the verification?

Comment 29 errata-xmlrpc 2010-03-30 08:09:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0205.html

