+++ This bug was initially created as a clone of Bug #523819 +++
need to implement something in libvirt for KVM here.
Description of problem:
PCIe switches allow peer-to-peer transactions that are routed by the switch and
could bypass the VTd translation hardware, potentially causing unexpected
behavior in the system. ACS allows the system to force the PCIe switch to route
all traffic upstream so that the VTd hardware can validate all transactions. The virtualization management tools should not allow direct assignment of a device that is below a non-ACS enabled PCIe switch to a guest.
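For illustration only, the check described above can be sketched as a walk from the device toward the root of the PCI tree, stopping at the first upstream bridge that does not have ACS enabled. This is a minimal Python sketch, not the attached patch: the function name, the `parent_of` and `acs_enabled` mappings, and the choice to treat every upstream bridge as needing ACS are all assumptions made for the example.

```python
def first_bridge_lacking_acs(device, parent_of, acs_enabled):
    """Return the first upstream bridge without ACS enabled, or None.

    device:      PCI address of the device to be assigned (hypothetical input)
    parent_of:   dict mapping a PCI address to the address of its upstream
                 bridge, or None when there is no further parent
    acs_enabled: dict mapping a bridge address to True when ACS is enabled;
                 bridges missing from the dict are treated as lacking ACS
    """
    seen = set()
    bridge = parent_of.get(device)
    while bridge is not None and bridge not in seen:
        seen.add(bridge)  # guard against a malformed, cyclic parent map
        if not acs_enabled.get(bridge, False):
            return bridge
        bridge = parent_of.get(bridge)
    return None


def may_assign(device, parent_of, acs_enabled):
    """Allow assignment only when no upstream bridge lacks ACS."""
    return first_bridge_lacking_acs(device, parent_of, acs_enabled) is None
```

A management tool could call `may_assign()` before starting a guest and report the offending bridge address in the error message when it returns False.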
Version-Release number of selected component (if applicable):
--- Additional comment from email@example.com on 2009-09-24 19:55:29 EDT ---
Thanks for posting this!
--- Additional comment from firstname.lastname@example.org on 2009-09-25 03:33:52 EDT ---
Created an attachment (id=362626)
Check if a device is behind PCIe switch that lacks ACS
Patch that is currently on the git patch queue.
--- Additional comment from email@example.com on 2009-09-25 08:37:00 EDT ---
Don, I thought the solution for this was to be common to both Xen and KVM and thus a libvirt patch. This is a Xen patch. Is there an equivalent patch to deal with this for KVM?
--- Additional comment from firstname.lastname@example.org on 2009-09-25 09:53:52 EDT ---
We've created a patch for Xen that shows how to do the check. My understanding is that Chris is aware of the issue and was going to look into creating a similar solution for KVM.
--- Additional comment from email@example.com on 2009-09-25 10:23:18 EDT ---
Jiri, not sure if you know enough about libvirt yet to know if a common patch to protect us against this is possible. I think that would be preferable to separate Xen and KVM patches. Can you look into it?
--- Additional comment from firstname.lastname@example.org on 2009-09-29 04:31:21 EDT ---
(In reply to comment #5)
> Jiri, not sure if you know enough about libvirt yet to know if a common patch
> to protect us against this is possible. I think that would be preferable to
> separate Xen and KVM patches. Can you look into it?
I don't know much about PCI handling in libvirt but I don't feel like it should deal with this kind of stuff. libvirt should only provide APIs for users to be able to assign PCI devices to guests but what PCI devices can be assigned under what conditions should really be decided by the underlying hypervisor. That's my opinion...
--- Additional comment from email@example.com on 2009-09-29 05:04:42 EDT ---
Thanks Jiri. Chris, so it seems that this solution to prevent problematic device assignment is Xen specific. How would we prevent it for KVM? Would that be in QEMU? Any idea who should be looking at this for the KVM side?
--- Additional comment from firstname.lastname@example.org on 2009-09-29 06:59:49 EDT ---
The first thing that strikes me is that this non-ACS PCIe switch issue doesn't just affect device assignment, it also affects device isolation - we should at least have the kernel print a warning if this issue is undermining device isolation on a given machine.
Another point is that having a non-ACS PCIe switch is only an issue where there are multiple devices behind that switch and those devices are assigned to different IOMMU domains, correct?
If that's the case, we should treat it similar to some non-FLR device reset scenarios - that is, you can assign these devices to a guest, but only if you assign all devices behind the switch to the same guest.
So, IMHO - it makes sense for this code to go along with the PCI device reset code in xen and libvirt.
i.e. we should have three bugs:
1) kernel should print a warning about non-ACS PCIe switches where IOMMU
device isolation is undermined
2) xen should block assigning devices behind non-ACS PCIe switches, where
different devices behind the same switch would be assigned to different
3) libvirt should do likewise
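The "all devices behind the switch go to the same guest" rule proposed in this comment could be checked roughly as follows. This is only a sketch under stated assumptions: the function name and both input mappings are hypothetical, and how a tool would discover which switch a device sits behind is left out.

```python
from collections import defaultdict


def conflicting_switches(assignments, switch_of):
    """Return the non-ACS switches whose devices are split across owners.

    assignments: dict mapping device address -> owner (a guest name, or
                 "host" for devices that stay with the host)
    switch_of:   dict mapping device address -> identifier of the non-ACS
                 switch above it, or None if there is no such switch
    """
    owners = defaultdict(set)
    for dev, owner in assignments.items():
        switch = switch_of.get(dev)
        if switch is not None:
            owners[switch].add(owner)
    # A switch is a problem only when more than one owner sits behind it.
    return sorted(sw for sw, who in owners.items() if len(who) > 1)
```

An empty result would mean every non-ACS switch has a single owner behind it, which is the condition under which this comment suggests assignment could be allowed.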
--- Additional comment from email@example.com on 2009-09-30 15:23:16 EDT ---
Not sure what the concerns are about device isolation. The only way to trigger a fault is to present a bad DMA address. For VMs the filtering avoids the problem by blocking assignment of affected devices to different VMs. For the host OS you would need a malicious driver and, if you have a malicious driver in your host, this is the least of your problems.
--- Additional comment from firstname.lastname@example.org on 2009-10-01 03:12:17 EDT ---
(In reply to comment #9)
> For the host OS you would need a malicious driver and, if you have a malicious
> driver in your host, this is the least of your problems.
Agreed, but what does device isolation protect against then?
I'm more interested in whether you agree that we should allow devices behind non-ACS switches so long as no other devices behind that switch are assigned to another domain.
From bug #523819:
> I'm not sure I'm following correctly, but IMHO if a device behind an ACS bridge
> should never be permitted to be assigned to a guest, the *kernel* should refuse
> it. If it's an issue around whether different devices behind the bridge are
> different domains, then it sounds more like a job for libvirt. That's the core
> of what I'm trying to get at here
A device behind a PCIe switch that either does not support ACS (at all) or has
not had ACS enabled should not be allowed to be assigned to a guest. It may be
reasonable to filter this from the kernel. It gets ugly w/ multifunction
AFAIR cdub suggests that while all non-ACS devices should be blocked by default, there should be a whitelist since it will be safe for some devices
Perhaps the whitelist could live in the hwdata package and libvirt would use it, rather than libvirt having to be updated every time we want to add a new device
Chris: am I summarizing correctly?
Yes, that's what I was thinking. The issue being that technically multifunction devices that don't advertise ACS would all fall into the "can't assign to guest" category based on the possibility that they can initiate P2P traffic between functions (whether they do or not is not externally discoverable). However, the likelihood they do this is relatively low, and a huge number of NICs are multifunction (e.g. function per port)...same NICs that we'd like to allow users to assign to their guests.
I'm going to attach here a standalone program that implements the low-level pieces of the code I think we need to put in libvirt to block devices behind non-ACS switches. Note that at the moment, I don't have any machines with ACS, so I can't test that it really works. If you have a machine that could be used for testing, could you give us a pointer to it?
What remains to be implemented is the logic of the whitelist that you mention in comments #2 and #3. To be honest, I don't love this idea of the whitelist; not only will we have to maintain some kind of table, we will need to make sure the table is up-to-date every time new hardware comes out. It also breaks the security of the setup without letting the user know about it (because the device is on a magic whitelist that the user probably won't know anything about).
I have an alternate proposal. What if we added a new <permissive/> tag to the libvirt XML for device assignment? In the normal case, we wouldn't allow *any* passthrough of devices behind non-ACS switches. However, if the user knows what they are doing, and they want to take this risk, they can add the <permissive/> tag to the XML, in which case it would allow the assignment to happen. This can even be used pretty successfully in virt-manager; it just needs to catch the appropriate exception from the first assignment, pop-up "This is dangerous because of non-ACS, blah, blah. Are you sure?", and then re-do the assignment with the <permissive/> tag. What do you think about this?
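To make the proposed flow concrete, here is a rough Python sketch of the "refuse by default, retry with permissive after confirmation" behaviour. Everything in it is hypothetical: `NonACSError`, `assign()` and `assign_with_prompt()` are stand-ins invented for the example and are not libvirt or virt-manager APIs, and the permissive flag only models the proposed tag.

```python
class NonACSError(Exception):
    """Raised when a device behind a non-ACS switch is refused."""


def assign(device, behind_non_acs, permissive=False):
    """Stand-in for the assignment call; refuses non-ACS devices by default."""
    if behind_non_acs and not permissive:
        raise NonACSError(
            "Device %s is behind a switch lacking ACS" % device)
    return "assigned (permissive)" if behind_non_acs else "assigned"


def assign_with_prompt(device, behind_non_acs, confirm):
    """Try a strict assignment first; on refusal ask the user and retry.

    confirm: callable taking a warning string and returning True or False,
             for example a dialog in a graphical tool.
    """
    try:
        return assign(device, behind_non_acs)
    except NonACSError as err:
        if confirm("%s. This is dangerous. Are you sure?" % err):
            return assign(device, behind_non_acs, permissive=True)
        raise
```

The point of the design is that the strict check stays the default and the override is an explicit, visible user decision rather than an entry in a table the user never sees.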
Created attachment 378251 [details]
Standalone program to show the concept for non-ACS blocking for libvirt
Created attachment 379591 [details]
Attempt to port Chris' standalone program to libvirt
Created attachment 379671 [details]
RHEL-5 port of upstream patch
Patch sent upstream:
and RHEL-5 port sent to virtualist:
Created attachment 379675 [details]
Created attachment 379887 [details]
RHEL-5 port of upstream patch v2
Patch v2 sent upstream:
and RHEL-5 port sent to virtualist:
RPMs for testing can be found at http://people.redhat.com/jdenemar/libvirt/
libvirt-0.6.3-27.el5 has been built in dist-5E-qu-candidate with the fix
How can I identify a machine with an ACS-enabled PCIe switch?
For now, I cannot verify the bug.
Could you help me with this?
You can use an updated lspci and look for the ACS PCIe capability.
# ./lspci -vvv
Capabilities: [150] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
Note that the ACSCtl register should specifically be showing these 4 ACS capabilities enabled:
SrcValid+, ReqRedir+, CmpltRedir+, and UpstreamFwd+
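If you want to script this check against the patched lspci output, something like the following Python sketch would do. It assumes the ACSCtl line has exactly the "Name+" / "Name-" token format shown above; the function names are made up for the example.

```python
# The four controls that must be enabled, per the note above.
REQUIRED = ("SrcValid", "ReqRedir", "CmpltRedir", "UpstreamFwd")


def parse_flags(line):
    """Turn 'ACSCtl: SrcValid+ TransBlk- ...' into {'SrcValid': True, ...}."""
    _, _, rest = line.partition(":")
    return {tok[:-1]: tok.endswith("+")
            for tok in rest.split() if tok[-1] in "+-"}


def acs_fully_enabled(acsctl_line):
    """True when all four required controls are shown with a '+'."""
    flags = parse_flags(acsctl_line)
    return all(flags.get(name, False) for name in REQUIRED)
```

Feeding it the ACSCtl line from the example output above should report the switch port as correctly configured, while a line with any of the four flags shown as "-" should not.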
Thank you for your updated lspci tool.
But I tried the tool on several boxes, and none of them produces output like yours.
Why? Could you tell me what machine you are using?
The original bug has been verified by Don Dugger
Don Dugger, we have no test environment; could you please help verify this bug?
Fix built in libvirt-0.6.3-31.el5
The bug has been fixed in libvirt-0.6.3-31.el5
1) "lspci" output showing the PCI devices in the system:
05:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
06:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
06:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
2) "./lspci -vvv" shows that the device has no ACS function:
Capabilities:  Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
3) Use "virsh nodedev-dettach" to hide all PFs of the NIC
For example: virsh nodedev-dettach pci_8086_10e8
4) Add the XML description of the PF to be assigned to the XML description of the VM, as follows:
<hostdev mode='subsystem' type='pci'>
<source><address bus='6' slot='0' function='1'/></source>
</hostdev>
5) Tried to assign a PF to an HVM guest; the ACS filtering code prevented
the guest creation, as we expected:
error: Failed to start domain hvm_acs_test
error: this function is not supported by the hypervisor: Device 0000:06:00.1 is behind a switch lacking ACS and cannot be assigned
The test was performed on the KVM hypervisor.
# rpm -qa|grep kvm
Note that in Comment #23, step 2 is not showing the device's lack of ACS support. ACS support for a PCIe device is described in its own PCIe capability. The description in Comment #18 shows this w/ a modified lspci binary (just first line of capability entry shown here):
Capabilities: [150] Access Control Services
With a standard RHEL 5 lspci, you'd see an unknown PCIe capability such as:
Capabilities: [150] Unknown (13)
In the above example the '150' is a device-specific offset into the PCIe Extended Configuration Space where the Capability is described. So '150' is not special here and may be different for different PCIe functions (it just needs to be greater than 0xFF). The PCIe Capability ID for ACS is 0xD (13). So the string "Access Control Services" (using my patched lspci binary) or the string "Unknown (13)" is the important bit here.
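To show what lspci is doing here, below is a small Python sketch that walks the extended capability list in a raw config space dump and returns the offset of a given capability ID. It relies on the standard extended capability header layout (capability ID in bits 15:0, next pointer in bits 31:20, list starting at offset 0x100); the function name is invented for the example and the input is assumed to be a 4096-byte dump.

```python
import struct

PCI_EXT_CAP_START = 0x100    # first extended capability header
PCI_EXT_CAP_ID_ACS = 0x000D  # Access Control Services (13)


def find_ext_capability(config, cap_id):
    """Return the offset of extended capability cap_id, or None.

    config: raw bytes of the function's PCIe configuration space.
    """
    offset = PCI_EXT_CAP_START
    seen = set()
    while (offset >= PCI_EXT_CAP_START
           and offset + 4 <= len(config)
           and offset not in seen):
        seen.add(offset)  # guard against a looping capability list
        (header,) = struct.unpack_from("<I", config, offset)
        if header in (0, 0xFFFFFFFF):
            return None  # no extended capabilities, or unreadable space
        if header & 0xFFFF == cap_id:
            return offset
        offset = (header >> 20) & 0xFFC  # next pointer, dword aligned
    return None
```

On Linux the dump can typically be read from the device's sysfs `config` file, for example `/sys/bus/pci/devices/0000:06:00.1/config`, though reading beyond the first 64 bytes generally requires root. A result of None for `PCI_EXT_CAP_ID_ACS` corresponds to the "no ACS capability listed" case.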
If you are not using a patched lspci binary it's much more difficult to describe what to look for to see ACS support enabled (it's easy to see whether it's capable or not by the existence, or lack, of "Capabilities: [???] Unknown (13)"). But I can see you are using the patched lspci since it is properly parsing the ARI Capability.
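Without the patched lspci, the enabled bits have to be read from the ACS Control register by hand. As a rough Python sketch: the register is 16 bits wide at offset 0x06 from the start of the ACS capability, and the bit positions below follow the standard ACS register layout in the same order lspci prints the flags. The function names and the assumption that you already know the capability offset (0x150 in the example above) are mine.

```python
import struct

PCI_ACS_CTRL = 0x06  # ACS Control register, relative to the capability

ACS_BITS = {
    "SrcValid": 0x01,
    "TransBlk": 0x02,
    "ReqRedir": 0x04,
    "CmpltRedir": 0x08,
    "UpstreamFwd": 0x10,
    "EgressCtrl": 0x20,
    "DirectTrans": 0x40,
}

# SrcValid, ReqRedir, CmpltRedir and UpstreamFwd must all be set.
REQUIRED_MASK = 0x01 | 0x04 | 0x08 | 0x10


def read_acs_ctrl(config, cap_offset):
    """Read the 16-bit ACS Control register from a raw config dump."""
    (ctrl,) = struct.unpack_from("<H", config, cap_offset + PCI_ACS_CTRL)
    return ctrl


def decode_acs_ctrl(ctrl):
    """Map each flag name to True or False, like the '+'/'-' in lspci."""
    return {name: bool(ctrl & bit) for name, bit in ACS_BITS.items()}


def acs_ctrl_ok(ctrl):
    """True when all four required controls are enabled."""
    return ctrl & REQUIRED_MASK == REQUIRED_MASK
```

With these bit positions, the ACSCtl line shown in Comment #18 would correspond to a register value of 0x1D.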
Got your point, thank you. If the patched lspci shows no separate ACS capability for a PCIe device, it means the device has no ACS support, right?
Comment #18 shows that a device with ACS support prints a separate capability description.
And step 5 in Comment #23 reports the expected errors. Does that indicate that the right PCIe device was used for the verification?
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.