Description of problem:
PCIe switches allow peer-to-peer transactions that are routed by the switch and could bypass the VT-d translation hardware, potentially causing unexpected behavior in the system. ACS allows the system to force the PCIe switch to route all traffic upstream so that the VT-d hardware can validate all transactions. The virtualization management tools should not allow direct assignment of a device that is below a non-ACS-enabled PCIe switch to a guest.

Version-Release number of selected component (if applicable):
RHEL 5.4
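For illustration, here is a minimal Python sketch of how a tool might decide whether a switch port has ACS enabled by parsing a raw dump of its PCIe configuration space. It is not the check from the Xen patch. The capability ID and register offsets follow the PCIe ACS extended capability layout, but the set of bits treated as "required" (request redirect, completion redirect, upstream forwarding) is an assumption made for this example, and the function names are hypothetical.

```python
import struct

PCI_EXT_CAP_START = 0x100      # extended capabilities begin here
PCI_EXT_CAP_ID_ACS = 0x000D    # Access Control Services capability ID
ACS_CAP_OFFSET = 0x04          # ACS Capability register (16 bit)
ACS_RR = 0x04                  # P2P Request Redirect
ACS_CR = 0x08                  # P2P Completion Redirect
ACS_UF = 0x10                  # Upstream Forwarding
# Assumption for this sketch: these three bits are what "ACS enabled" means.
REQUIRED = ACS_RR | ACS_CR | ACS_UF


def find_ext_cap(config, cap_id):
    """Return the offset of an extended capability, or None if absent."""
    offset = PCI_EXT_CAP_START
    seen = set()
    while (offset >= PCI_EXT_CAP_START and offset + 4 <= len(config)
           and offset not in seen):
        seen.add(offset)  # guard against a looping capability list
        (header,) = struct.unpack_from("<I", config, offset)
        if header in (0, 0xFFFFFFFF):
            return None
        if header & 0xFFFF == cap_id:
            return offset
        offset = (header >> 20) & 0xFFC  # next capability pointer
    return None


def acs_redirect_enabled(config):
    """True only if the port both supports and has enabled the required bits."""
    pos = find_ext_cap(config, PCI_EXT_CAP_ID_ACS)
    if pos is None or pos + 8 > len(config):
        return False  # no ACS capability at all
    # The Control register follows the Capability register directly.
    cap, ctrl = struct.unpack_from("<HH", config, pos + ACS_CAP_OFFSET)
    return (cap & REQUIRED) == REQUIRED and (ctrl & REQUIRED) == REQUIRED
```

On Linux the bytes could come from /sys/bus/pci/devices/<BDF>/config, which typically has to be read as root to see the extended configuration space.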
Thanks for posting this!
Don, I thought the solution for this was to be common to both Xen and KVM and thus a libvirt patch. This is a Xen patch. Is there an equivalent patch to deal with this for KVM?
We've created a patch for Xen that shows how to do the check. My understanding is that Chris is aware of the issue and was going to look into creating a similar solution for KVM.
First thing that strikes me is that this non-ACS PCIe switch issue doesn't just affect device assignment, it also affects device isolation - we should at least have the kernel print a warning if this issue is undermining device isolation on a given machine.

Another point is that having a non-ACS PCIe switch is only an issue where there are multiple devices behind that switch and those devices are assigned to different IOMMU domains, correct?

If that's the case, we should treat it similarly to some non-FLR device reset scenarios - that is, you can assign these devices to a guest, but only if you assign all devices behind the switch to the same guest.

So, IMHO - it makes sense for this code to go along with the PCI device reset code in xen and libvirt.

i.e. we should have three bugs:

1) kernel should print a warning about non-ACS PCIe switches where IOMMU device isolation is undermined

2) xen should block assigning devices behind non-ACS PCIe switches, where different devices behind the same switch would be assigned to different domains

3) libvirt should do likewise
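The "same guest only" rule proposed in this comment can be sketched in a few lines of Python. This is only an illustration of the idea, not code from xen or libvirt: the function name, the `siblings_by_switch` mapping (non-ACS switch to the devices behind it) and the `assignments` mapping (device to owning guest) are all hypothetical, and devices that are still owned by the host are simply treated as unassigned.

```python
def conflicting_siblings(device, guest, siblings_by_switch, assignments):
    """Return devices that share a non-ACS switch with `device`
    and are already assigned to a different guest."""
    conflicts = set()
    for devices in siblings_by_switch.values():
        if device not in devices:
            continue
        for other in devices:
            owner = assignments.get(other)  # None means not assigned
            if other != device and owner is not None and owner != guest:
                conflicts.add(other)
    return sorted(conflicts)


def may_assign_same_guest_rule(device, guest, siblings_by_switch, assignments):
    """Allow the assignment only if no sibling belongs to another guest."""
    return not conflicting_siblings(device, guest,
                                    siblings_by_switch, assignments)
```

A management tool would call `may_assign_same_guest_rule` before attaching the device and could report the list from `conflicting_siblings` in its error message.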
Not sure what the concerns are about device isolation. The only way to trigger a fault is to present a bad DMA address. For VMs the filtering avoids the problem by blocking assignment of affected devices to different VMs. For the host OS you would need a malicious driver and, if you have a malicious driver in your host, this is the least of your problems.
(In reply to comment #9)
> For the host OS you would need a malicious driver and, if you have a malicious
> driver in your host, this is the least of your problems.

Agree, but what does device isolation protect against then?

More interested in whether you agree that we should allow devices behind non-ACS switches so long as there are no other devices behind that switch assigned to another domain?
My issue with multiple device assignment is the complexity of checking for this condition. When you consider the tree-like structure of the PCI topology, where a device can be attached to switch A which is attached to switch B, things get complicated. Now you have to verify that all devices attached to switch A, all devices attached to switch B, and all devices attached to switches below switch B are assigned to the same guest. The device assignment filtering code is going to become very complicated, will require persistent state info and will be potentially error prone.

Given that we predicate the filtering code on the `strict-device-assignment' configuration flag, I don't think we need to do this. For the default user we go the safe route and don't allow assignment of devices below any non-ACS switch. The more sophisticated user can turn off the `strict-device-assignment' config flag and then they will be able to assign devices at will. Of course, we should document that devices below non-ACS switches should not be assigned to different guests.
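To make the topology concern concrete, here is a rough Python sketch of the walk that a per-group check would need, next to the much simpler strict-flag policy described above. Everything here is an assumption for illustration: the `parent`, `children` and `acs_enabled` dictionaries are a hypothetical in-memory model of the switch tree (root ports are left out), and none of the function names come from the actual patch.

```python
def topmost_non_acs_switch(device, parent, acs_enabled):
    """Walk upstream and return the highest switch without ACS, if any."""
    top = None
    node = parent.get(device)
    while node is not None:
        if not acs_enabled.get(node, False):
            top = node  # keep climbing, a higher switch may also lack ACS
        node = parent.get(node)
    return top


def endpoints_below(switch, children):
    """Collect every leaf device anywhere below `switch`."""
    found = []
    stack = [switch]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if not kids and node != switch:
            found.append(node)
        stack.extend(kids)
    return sorted(found)


def isolation_group(device, parent, children, acs_enabled):
    """Devices that would all have to belong to the same guest."""
    top = topmost_non_acs_switch(device, parent, acs_enabled)
    return [device] if top is None else endpoints_below(top, children)


def may_assign(device, parent, acs_enabled, strict=True):
    """Strict mode refuses anything below a non-ACS switch;
    with strict off the administrator takes responsibility."""
    if topmost_non_acs_switch(device, parent, acs_enabled) is None:
        return True
    return not strict
```

With switch A (ACS on) below switch B (ACS off), `isolation_group` for a device under A still returns every endpoint under B, which is the extra state the relaxed rule would have to track. The strict policy only needs `may_assign`.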
(In reply to comment #8)
> First thing that strikes me is that this non-ACS PCIe switch issue doesn't just
> affect device assignment, it also affects device isolation - we should at least
> have the kernel print a warning if this issue is undermining device isolation
> on a given machine.

By device isolation, are you referring to an IOMMU protection domain for a device that is not assigned to a guest, so that errant DMA cannot destabilize the bare metal kernel? This could allow a device to generate p2p traffic to another device w/out any IOMMU translations allowing it.

> Another point is that having a non-ACS PCIe switch is only an issue where there
> are multiple devices behind that switch and those devices are assigned to
> different IOMMU domains, correct?

No, I don't think so. Would be a less severe problem, but the guest could find its devices not functioning properly. So this mode should not be allowed.

> If that's the case, we should treat it similar to some non-FLR device reset
> scenarios - that is, you can assign these devices to a guest, but only if you
> assign all devices behind the switch to the same guest.
>
> So, IMHO - it makes sense for this code to go along with the PCI device reset
> code in xen and libvirt.

Not sure if it can go directly w/ it, but certainly very similar logic.

> i.e. we should have three bugs:
>
> 1) kernel should print a warning about non-ACS PCIe switches where IOMMU
> device isolation is undermined
>
> 2) xen should block assigning devices behind non-ACS PCIe switches, where
> different devices behind the same switch would be assigned to different
> domains
>
> 3) libvirt should do likewise

I agree, this should be cloned for libvirt. It also needs to pertain to multifunction devices (quite ugly).
(In reply to comment #12)
> (In reply to comment #8)
> > Another point is that having a non-ACS PCIe switch is only an issue where
> > there are multiple devices behind that switch and those devices are assigned to
> > different IOMMU domains, correct?
>
> No, I don't think so. Would be a less severe problem, but the guest could find
> its devices not functioning properly. So this mode should not be allowed.

I'm not sure I'm following correctly, but IMHO if a device behind a non-ACS bridge should never be permitted to be assigned to a guest, the *kernel* should refuse it. If it's an issue around whether different devices behind the bridge are in different domains, then it sounds more like a job for libvirt. That's the core of what I'm trying to get at here.
(In reply to comment #13)
> (In reply to comment #12)
> > (In reply to comment #8)
> > > Another point is that having a non-ACS PCIe switch is only an issue where
> > > there are multiple devices behind that switch and those devices are assigned to
> > > different IOMMU domains, correct?
> >
> > No, I don't think so. Would be a less severe problem, but the guest could find
> > its devices not functioning properly. So this mode should not be allowed.
>
> I'm not sure I'm following correctly, but IMHO if a device behind a non-ACS bridge
> should never be permitted to be assigned to a guest, the *kernel* should refuse
> it. If it's an issue around whether different devices behind the bridge are in
> different domains, then it sounds more like a job for libvirt. That's the core
> of what I'm trying to get at here.

A device behind a PCIe switch that either does not support ACS (at all) or has not had ACS enabled should not be allowed to be assigned to a guest. It may be reasonable to filter this from the kernel. It gets ugly w/ multifunction devices though.
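As a rough illustration of what a userspace filter has to look at, the Python sketch below derives the chain of upstream bridges for a device from its resolved Linux sysfs path, and tests whether two functions sit in the same slot (the multifunction case). The helper names are hypothetical and the bridge addresses in the usage note are made up for the example. Only the string handling is shown, so the functions can be tried without real hardware.

```python
import re

# Matches a PCI address of the form domain:bus:device.function
BDF_RE = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def upstream_bridges(resolved_path):
    """Return the bridges above a device, topmost first.

    `resolved_path` is expected to be the symlink-resolved sysfs
    path of the device; every path component that looks like a PCI
    address, except the last one, is an upstream bridge.
    """
    parts = [p for p in resolved_path.split("/") if BDF_RE.match(p)]
    return parts[:-1]


def same_slot(bdf_a, bdf_b):
    """True if both addresses are functions of one multifunction device."""
    return bdf_a.rsplit(".", 1)[0] == bdf_b.rsplit(".", 1)[0]
```

For a path such as /sys/devices/pci0000:00/0000:00:01.0/0000:02:00.0/0000:03:00.1 (as returned by `os.path.realpath("/sys/bus/pci/devices/0000:03:00.1")` on a machine with that layout), `upstream_bridges` yields the two bridge addresses, and each of them would then need its ACS state checked.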
Moving discussion to bug #526713
Fix built into xen-3.0.3-97.el5
Thanks to Don Dugger for verifying this bug on xen-libs-3.0.3-94.el5. His checking steps are shown below (copied from his emails):

1) Installed a native RHEL5.4 in a new partition;

2) Installed the xen packages:
   a) installed kernel-xen-2.6.18-164.el5bz547980v2.x86_64.rpm from https://bugzilla.redhat.com/show_bug.cgi?id=547980, Comment #20. (BTW, I first tried kernel-xen-2.6.18-164.el5.x86_64.rpm from the RHEL5.4 ISO, but dom0 complained "PCI: Cannot map mmconfig aperture for segment 0" on my NHM-HEDT host, so I followed BZ 547980 to get the good rpm.)
   b) installed xen-libs-3.0.3-94.el5.x86_64.rpm from the RHEL5.4 ISO
   c) rpm -ivh xen-3.0.3-102.x86_64.rpm --nodeps. This rpm is from you.

3) My grub.conf:
   title Red Hat Enterprise Linux Server (2.6.18-164.el5bz547980v2xen)
           root (hd0,0)
           kernel /boot/xen.gz-2.6.18-164.el5bz547980v2 iommu=1
           module /boot/vmlinuz-2.6.18-164.el5bz547980v2xen ro root=LABEL=/ pci_pt_e820_access=on
           module /boot/initrd-2.6.18-164.el5bz547980v2xen.img

4) After booting into the new xen/dom0 environment, I unloaded the igb driver, loaded pciback and hid the 4 PFs of the 4-port Kewela NIC.

5) Tried to assign any PF to an HVM guest, and the ACS filtering code prevented the guest creation as we expected:
   [root@localhost ~]# xm create 32e_rhel5u2.hvm
   Using config file "./32e_rhel5u2.hvm".
   Error: pci: to avoid potential security issue, 0000:03:00.1 is not allowed to be assigned to guest since it is behind PCIe switch that does not support or enable ACS.

6) I turned off the "pci-dev-assign-strict-check" option in /etc/xen/xend-config.sxp, did "xend restart" and re-did step 5. I could create the hvm guest successfully and the NIC worked fine inside the hvm guest (of course, this kind of assignment is potentially unsafe).

So, according to his comments, I am changing this bug's status to verified.
Question wrt step 2a in C#24:
-- was the machine an HP z800?
-- pls attach the dmidecode of the host machine.

The patch in V2 was a hardcode patch for all machines; the eventual patch for bz547980 was restricted to hpz800's for avoidance of regressions on other machines for rhel5.

If it's another machine, I need to know *NOW* -- I'm just about to post the patch for inclusion in rhel5.5.

Thanks... Don (Dutile)
Sorry Don Dutile, I think you should ask Don Dugger, since he helped us verify this bug. We don't have this non-ACS PCIe switch, so I wrote to Don Dugger (he is the bug reporter and developer). Only he knows this and can provide the dmidecode output. Thanks!
Created attachment 384513
dmidecode output

The dmidecode output is attached. This was not an HP machine; it was a Tylersburg High End DeskTop, a Nehalem-based machine.
Don (Dugger): So, did you have to use the kernel from bz547980 (el5bz547980v2xen) in order to get the PCI mmconfig space to be mapped to dom0??? or did you use the kernel 'just in case'? if just in case, pls. try latest rhel5.5 kernel-xen & report if pci mmconfig is seen when pci_pt_e820_access=on is set.
Per the comments from our PRC engineer Dexuan in comment 24 he did try the RHEL 5.4 kernel ver. -164 and it failed. He then followed the instructions from BZ 547980, installed kernel ver. 2.6.18-164.el5bz547980v2xen and it worked.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0294.html
This bug was closed during 5.5 development and it's being removed from the internal tracking bugs (which are now for 5.6).