Bug 781524
Summary: AMD IOMMU driver hands out dma handles that are in the MSI address range
Product: Red Hat Enterprise Linux 6
Reporter: Joerg Roedel <joerg.roedel>
Component: kernel
Assignee: Don Dutile (Red Hat) <ddutile>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: urgent
Version: 6.2
CC: andreas.herrmann3, dhoward, frank.arnold, jwest, laine, mjenner, pbenas, peterm, ravikiran.thirumalai
Target Milestone: rc
Keywords: OtherQA, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: kernel-2.6.32-238.el6
Doc Type: Bug Fix
Doc Text: Previously, the AMD IOMMU (input/output memory management unit) driver could use the MSI address range for DMA (direct memory access) addresses. As a consequence, DMA could fail and spurious interrupts would occur if this address range was used. With this update, the MSI address range is reserved to prevent the driver from allocating wrong addresses and DMA is now assured to work as expected in the described scenario.
Last Closed: 2012-06-20 08:19:19 UTC
Bug Blocks: 809374
Here's my understanding of the code & related patch; please provide corrections &/or acks:

(1) when DMA address space is allocated, it's done in 128MB chunks:

    dma_dom->aperture_size += APERTURE_RANGE_SIZE;

(2) not sure how that range doesn't run into/overlap device MMIO chunks

(3) when the allocation is the last 128MB chunk below 0xfee00000 & the next one will cross that boundary, then reserve that chunk of address space:

    if (old_size < MSI_ADDR_BASE_LO &&
        dma_dom->aperture_size > MSI_ADDR_BASE_LO) {
            unsigned long spage;
            int pages;

            pages = iommu_num_pages(MSI_ADDR_BASE_LO, 0x10000, PAGE_SIZE);
            spage = MSI_ADDR_BASE_LO >> PAGE_SHIFT;

            dma_ops_reserve_addresses(dma_dom, spage, pages);

(4) I'm not sure how this code handles the case where the last chunk of allocation is exactly 128MB below MSI_ADDR_BASE_LO, b/c the next chunk will then start exactly *at* MSI_ADDR_BASE_LO, and that will cause a problem. So, shouldn't the check be >= MSI_ADDR_BASE_LO?

(5) finally, looking at the whole function, the out_free label seems incorrectly placed:

    dma_dom->aperture[index]->bitmap = (void *)get_zeroed_page(gfp);
    if (!dma_dom->aperture[index]->bitmap)
            goto out_free;
    ...
    out_free:
            update_domain(&dma_dom->domain);

            free_page((unsigned long)dma_dom->aperture[index]->bitmap);

            kfree(dma_dom->aperture[index]);
            dma_dom->aperture[index] = NULL;

            return -ENOMEM;
    }

Now, free_page(), which becomes free_pages(addr, 0), just returns if addr is 0/NULL, but what is there to do in update_domain(), given the failed alloc? Shouldn't out_free be placed *after* the free_page()?

(6) Finally, this looks like linux upstream commit 17f5b569e09cfa3488eaa663cbf9feada2e789f5.

(In reply to comment #2)
> (1) when DMA address space is allocated, it's done in 128MB chunks:
>     dma_dom->aperture_size += APERTURE_RANGE_SIZE;

Correct.

> (2) not sure how that range doesn't run into/overlap device MMIO chunks

Is this important? The ranges only describe DMA address space, not system address space.

> (3) when the allocation is the last 128MB chunk below 0xfee00000 & the next
> one will cross that boundary, then reserve that chunk of address space [...]

Better read it as: when a new range is allocated covering the MSI address range, then reserve that chunk of DMA address space.

> (4) I'm not sure how this code handles the case where the last chunk of
> allocation is exactly 128MB below MSI_ADDR_BASE_LO [...] So, shouldn't the
> check be >= MSI_ADDR_BASE_LO?

This shouldn't happen. The ranges are always aligned to 128MB and 0xfee00000 has an offset of 110MB into the last range, so the check always works.

> (5) finally, looking at the whole function, the out_free label seems
> incorrectly placed [...] Shouldn't out_free be placed *after* the free_page()?

I don't think there is a problem. Calling free_page() is safe with addr == NULL, so an additional label is not necessary. The update_domain() function is also a NOP if the domain hasn't changed.

The purpose of update_domain() is to flush the DTEs that reference the domain if necessary. The AMD IOMMU driver may switch from a 2-level to a 3-level page-table transparently (this happens when the address space is extended above 1GB). This action requires such a flush to tell the IOMMUs in the system the new page-table root.

> (6) Finally, this looks like linux upstream commit
> 17f5b569e09cfa3488eaa663cbf9feada2e789f5.

Right. This is the commit-id I referenced in the patch too.
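The alignment argument in the reply above can be checked with a few lines of arithmetic. The following is a minimal stand-alone sketch (not part of the patch); APERTURE_RANGE_SIZE and MSI_ADDR_BASE_LO are local stand-ins carrying the values cited in this discussion, not the kernel's own headers.

```c
/*
 * Minimal sketch: show which 128MB aperture range contains the MSI base
 * and why the strict '<' / '>' checks in the patch are sufficient.
 */
#include <stdio.h>

#define APERTURE_RANGE_SIZE (128ULL * 1024 * 1024)   /* 128MB per range */
#define MSI_ADDR_BASE_LO    0xfee00000ULL

int main(void)
{
    /* start of the range that contains the MSI base, and its end */
    unsigned long long old_size = MSI_ADDR_BASE_LO & ~(APERTURE_RANGE_SIZE - 1);
    unsigned long long new_size = old_size + APERTURE_RANGE_SIZE;
    unsigned long long offset   = MSI_ADDR_BASE_LO - old_size;

    /* prints: range 0xf8000000-0x100000000, offset 110MB */
    printf("range: %#llx-%#llx, MSI base offset into it: %lluMB\n",
           old_size, new_size, offset >> 20);

    /* the reservation triggers exactly when this range is added */
    if (old_size < MSI_ADDR_BASE_LO && new_size > MSI_ADDR_BASE_LO)
        printf("MSI range would be reserved in this aperture range\n");

    return 0;
}
```

Because 0xfee00000 is not a multiple of 128MB, it can never coincide with a range boundary, which is the reason the strict comparisons are enough as long as those constants keep their current values.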
(In reply to comment #3)
> > (1) when DMA address space is allocated, it's done in 128MB chunks:
> Correct.

Thanks.

> > (2) not sure how that range doesn't run into/overlap device MMIO chunks
> Is this important? The ranges only describe DMA address space, not system
> address space.

Well, if a system puts PCI(e) devices between 3.5G->4.0G *on the PCIe address space*, how does this allocation keep MMIO device ranges out of the DMA (iova) address space? Is that done in another function after allocating that address space?

> Better read it as: when a new range is allocated covering the MSI address
> range, then reserve that chunk of DMA address space.

That's what I tried to say, but failed, I guess. Thanks for the confirmation.

> This shouldn't happen. The ranges are always aligned to 128MB and 0xfee00000
> has an offset of 110MB into the last range, so the check always works.

Right. But if either parameter changes -- the 128MB size, its alignment, or the MSI address alignment -- the logic is flawed. I'm just not fond of flawed logic that depends on knowing that it's 'dodging a bullet'.

> I don't think there is a problem. Calling free_page() is safe with
> addr == NULL, so an additional label is not necessary. The update_domain()
> function is also a NOP if the domain hasn't changed.

But these calls are made on allocation failure, so why make them at all? It seems like unnecessary overhead. Yes, it works due to the NULL checks, but removing overhead related to IOMMU mapping should be a primary goal/effort.

> Right. This is the commit-id I referenced in the patch too.

Yes, when you look at the attachment. We typically like to see that highlighted in the backport request itself (i.e., 'this is an upstream accepted patch, and here's the commit id').

Thanks for the confirmations. Can you provide a test script that shows the failure, which I can include in the bz? This would make QE's acceptance much faster, as would the final approval of the patch in the release.
> > > (2) not sure how that range doesn't run into/overlap device MMIO chunks
> Well, if a system puts PCI(e) devices between 3.5G->4.0G *on the PCIe address
> space*, how does this allocation keep MMIO device ranges out of the DMA (iova)
> address space? Is that done in another function after allocating that address
> space?

I think it is not necessary to remove MMIO ranges from the allocation space. The IOMMU only translates device DMA requests, not CPU requests to PCI address space. Further, when the IOMMU is enabled PCI-ACS is enabled too, so all inter-device requests are routed through and translated by the IOMMU as well.

> Right. But if either parameter changes -- the 128MB size, its alignment, or
> the MSI address alignment -- the logic is flawed.

Okay, I doubt that any of these parameters will change in the RHEL6 life-time, but I can certainly change the check if you want.

> But these calls are made on allocation failure, so why make them at all?
> It seems like unnecessary overhead.

The out_free label is also used when an alloc_pte call fails. The alloc_pte path may change the page-table from 2-level to 3-level, so in case a few alloc_pte calls succeeded the page-table may already have been switched to 3-level. When an alloc_pte call then fails in the middle, the update_domain() call is required because in such a failure case the page-table is not switched back to 2-level.

> We typically like to see that highlighted in the backport request itself
> (i.e., 'this is an upstream accepted patch, and here's the commit id').

Okay, will do that next time.

> Can you provide a test script that shows the failure, which I can include in
> the bz?

Well, I fear there is no simple script for this problem. My test-case which triggered this problem was:

1. Linux with the IOMMU driver enabled
2. FGLRX installed (a buggy version that leaks DMA handles in some cases)
3. VMware Workstation installed, with a Win7 32-bit guest, 3D enabled in the guest
4. Run a special 3D testing application (Dolphin App) in the Win7 guest

This setup ate up all 4GB of DMA address space for the graphics card within minutes.
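For readers following the out_free discussion in the last two comments, here is a generic, user-space sketch of the error-handling idiom being defended: a single cleanup label works because the free routines tolerate NULL and the flush step is a no-op unless earlier steps actually changed state. All names here are illustrative stand-ins, not the driver's code.

```c
#include <stdlib.h>

struct range {
    void *bitmap;
};

/* stands in for update_domain(): only acts if something actually changed */
static void flush_if_changed(int *changed)
{
    if (*changed) {
        /* ... tell the hardware about the new page-table root ... */
        *changed = 0;
    }
}

static int add_range(struct range **slot, int *domain_changed)
{
    struct range *r = calloc(1, sizeof(*r));
    if (!r)
        return -1;              /* nothing to clean up yet */
    *slot = r;

    r->bitmap = calloc(1, 4096);
    if (!r->bitmap)
        goto out_free;

    /*
     * Later steps (page-table extension, etc.) may set *domain_changed and
     * can still fail with goto out_free, which is why the cleanup path
     * has to flush even though this particular allocation failed.
     */
    return 0;

out_free:
    flush_if_changed(domain_changed);   /* NOP if nothing changed */
    free(r->bitmap);                    /* free(NULL) is safe, like free_page(0) */
    free(r);
    *slot = NULL;
    return -1;
}
```

The same shape explains why the kernel function can reach out_free both from the bitmap allocation failure and from a later alloc_pte failure without needing separate labels.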
(In reply to comment #5)
> I think it is not necessary to remove MMIO ranges from the allocation space.
> The IOMMU only translates device DMA requests, not CPU requests to PCI
> address space.

I understand the use & translation of an IOMMU...

> Further, when the IOMMU is enabled PCI-ACS is enabled too, so all
> inter-device requests are routed through and translated by the IOMMU as well.

PCI-ACS is *optional* in a PCIe switch. The call to enable it is in the primary PCI code, but it is optional. There are known PCIe switches that have it, and known PCIe switches that _do not_ have it. Thus, if your code's logic is premised on ACS throughout a PCIe hierarchy, then you should add some checking code for that condition, or modify your code to deal with the situation where ACS is not available. qemu-kvm had to provide a hook that allows device assignment for non-ACS configurations, since without it there is a security problem (one PCIe device can be maliciously programmed to DMA into/out of another PCIe device, causing unknown problems -- minimally, it can easily crash the host kernel).

I had the impression that this alloc was allocating iova (I/O virtual address space, i.e., free bus address space), thus the concern. But I'm not familiar with the AMD-IOMMU code, so why the 128MB heuristic, why just part of the bus address space is allocated, etc., are the reasons for the questions. I'll have to study the AMD-IOMMU code further.

> Okay, I doubt that any of these parameters will change in the RHEL6
> life-time, but I can certainly change the check if you want.

If there's ever a (rhel6) problem, we'll re-assign the bug to you! ;-)

> The out_free label is also used when an alloc_pte call fails. [...] When an
> alloc_pte call then fails in the middle, the update_domain() call is required
> because in such a failure case the page-table is not switched back to 2-level.

OK. I figured the other levels would have done their own flush, but using this point as the last alloc in the chain, and depending on the flush, would be nice to comment in the code.

> Well, I fear there is no simple script for this problem. My test-case which
> triggered this problem was: [...] This setup ate up all 4GB of DMA address
> space for the graphics card within minutes.

Hmmm, I wonder how the RH virt manager will feel when I ask for a VMware license for testing! ;-) Some test setup; we'll have to contrive a simpler setup. Thanks for all the info.
(In reply to comment #6)
> PCI-ACS is *optional* in a PCIe switch. [...] Thus, if your code's logic is
> premised on ACS throughout a PCIe hierarchy, then you should add some
> checking code for that condition, or modify your code to deal with the
> situation where ACS is not available.

Ok, ACS is optional. But all chips with an AMD IOMMU also support ACS, I checked that today. These are the RD890 NB chip and the Trinity APU. So there should be no problem, I think.

> If there's ever a (rhel6) problem, we'll re-assign the bug to you! ;-)

Fair enough ;-)

> Hmmm, I wonder how the RH virt manager will feel when I ask for a VMware
> license for testing! ;-) Some test setup; we'll have to contrive a simpler
> setup. Thanks for all the info.

You are welcome. Let me know if you need any additional info :)
>
> Ok, ACS is optional. But all chips with an AMD IOMMU also support ACS, I
> checked that today. These are the RD890 NB chip and the Trinity APU. So there
> should be no problem, I think.
>
but not *all* PCIe modules (read: a quad-port Intel NIC using an Integrated Devices PCIe bridge) have ACS; the ACS assumption therefore breaks once one of these modules is plugged into an AMD system, because the PCIe up/down bridge pair in a switch chip without ACS support does not force inter-device requests through the IOMMU.
IOW, an AMD system doesn't own/control all of the components in the PCIe fabric.
libvirt currently tests the PCIe hierarchy in the host when doing device assignment, and if ACS support doesn't exist, by default, fails device assignment to a guest. It requires manually setting a run-in-unsecure-acs mode to enable such a configuration for device-assignment. This (optional) ACS issue has been well-known for over a year now.
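As a rough illustration of the kind of checking code suggested above, a driver could probe for the ACS extended capability before relying on it. This is a hedged sketch, not code from the AMD IOMMU driver or libvirt, and it assumes the PCI_EXT_CAP_ID_ACS / PCI_ACS_* definitions present in mainline kernels of that era.

```c
#include <linux/pci.h>

/* Does this switch port / endpoint expose ACS with the isolation bits we need? */
static bool port_has_acs(struct pci_dev *pdev)
{
    int pos;
    u16 cap;

    pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
    if (!pos)
        return false;   /* no ACS capability at all on this port */

    pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);

    /* require source validation, request redirect and upstream forwarding */
    return (cap & (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_UF)) ==
           (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_UF);
}
```

A caller would typically walk every bridge between the device and the root complex and only treat the path as isolated if each hop passes this check, which is essentially what libvirt's host-side test does.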
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

AMD affirmed they will QA based on the provided rpm in c#11, from Ravikiran.Thirumalai:
"Hi Don, yes we commit to test this. Please include the fix in 6.3. Thanks, Kiran"

Update:
We successfully tested the fix here in our environment. The issue is fixed and no new issue was found.

(In reply to comment #14)
> Update:
> We successfully tested the fix here in our environment. The issue is fixed
> and no new issue was found.

Thanks for prompt testing! I'll note it in our internal patch posting so the patches get into 6.3. Q: do you need these backported to 6.2.z? -- sounds like they should be (simple, fixes bugs months in advance of the 6.3 release); if so, set the 6.2.z flag to ?; if you can't, let me know and I will. Thanks.

A backport into 6.2.z makes sense too. The bug also triggers there.

Patch(es) available on kernel-2.6.32-238.el6

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously, the AMD IOMMU (input/output memory management unit) driver could use the MSI address range for DMA (direct memory access) addresses. As a consequence, DMA could fail and spurious interrupts would occur if this address range was used. With this update, the MSI address range is reserved to prevent the driver from allocating wrong addresses and DMA is now assured to work as expected in the described scenario.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2012-0862.html
Created attachment 555096 [details]
Patch that fixes the problem.

Description of problem:
AMD IOMMU driver hands out dma handles that are in the MSI range

Version-Release number of selected component (if applicable):
Kernels in RHEL 6.0 to 6.2

How reproducible:
Let a driver allocate up to 4GB of dma handles to get an address from the MSI range.

Steps to Reproduce:
1.
2.
3.

Actual results:
DMA will not work and spurious interrupts happen

Expected results:
DMA works and data is not corrupted. No spurious interrupts.

Additional info:
The attached patch fixes the bug in the RHEL6 kernel.
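No reproducer script was attached to this bug; the following is a hypothetical test helper sketching the "allocate up to 4GB of dma handles" idea from the description. It assumes a struct device behind the AMD IOMMU (for example &pdev->dev of a suitable PCI device), deliberately leaks pages and mappings, and is therefore only meant for a throwaway test kernel module.

```c
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>

#define MSI_RANGE_START 0xfee00000ULL
#define MSI_RANGE_END   0xfeefffffULL

/* Map pages without ever unmapping them, emulating a driver that leaks
 * DMA handles, and report any handle that lands in the MSI range. */
static int exhaust_dma_space(struct device *dev, unsigned long max_pages)
{
    unsigned long i;

    for (i = 0; i < max_pages; i++) {
        struct page *page = alloc_page(GFP_KERNEL);
        dma_addr_t handle;

        if (!page)
            return -ENOMEM;

        /* intentionally never dma_unmap_page()'d */
        handle = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, handle))
            return -EIO;

        if (handle >= MSI_RANGE_START && handle <= MSI_RANGE_END) {
            dev_err(dev, "got DMA handle 0x%llx inside the MSI range\n",
                    (unsigned long long)handle);
            return 0;   /* bug reproduced on an unpatched kernel */
        }
    }
    return -EAGAIN;     /* DMA address space not exhausted yet */
}
```

On a fixed kernel the loop should exhaust the aperture without ever returning a handle in the reserved 0xfee00000-0xfeefffff window.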