Bug 781524
Summary: AMD IOMMU driver hands out dma handles that are in the MSI address range
Product: Red Hat Enterprise Linux 6
Reporter: Joerg Roedel <joerg.roedel>
Component: kernel
Assignee: Don Dutile (Red Hat) <ddutile>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: urgent
Version: 6.2
CC: andreas.herrmann3, dhoward, frank.arnold, jwest, laine, mjenner, pbenas, peterm, ravikiran.thirumalai
Target Milestone: rc
Keywords: OtherQA, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: kernel-2.6.32-238.el6
Doc Type: Bug Fix
Doc Text: Previously, the AMD IOMMU (input/output memory management unit) driver could use the MSI address range for DMA (direct memory access) addresses. As a consequence, DMA could fail and spurious interrupts would occur if this address range was used. With this update, the MSI address range is reserved to prevent the driver from allocating wrong addresses and DMA is now assured to work as expected in the described scenario.
Last Closed: 2012-06-20 08:19:19 UTC
Bug Blocks: 809374
Here's my understanding of the code & related patch; please provide corrections &/or acks:

(1) when DMA address space is allocated, it's done in 128MB chunks:

    dma_dom->aperture_size += APERTURE_RANGE_SIZE;

(2) not sure how that range doesn't run into/overlap device MMIO chunks

(3) when the allocation is the last 128MB chunk below 0xfee00000 & the next one will cross that boundary, then reserve that chunk of address space:

    if (old_size < MSI_ADDR_BASE_LO &&
        dma_dom->aperture_size > MSI_ADDR_BASE_LO) {
            unsigned long spage;
            int pages;

            pages = iommu_num_pages(MSI_ADDR_BASE_LO, 0x10000, PAGE_SIZE);
            spage = MSI_ADDR_BASE_LO >> PAGE_SHIFT;

            dma_ops_reserve_addresses(dma_dom, spage, pages);

(4) I'm not sure how this code handles the case where the last chunk of allocation is exactly 128MB below MSI_ADDR_BASE_LO, b/c the next chunk will then start exactly *at* MSI_ADDR_BASE_LO, and that will cause a problem. So, shouldn't the check be >= MSI_ADDR_BASE_LO?

(5) finally, looking at the whole function, the out_free label seems incorrectly placed:

    dma_dom->aperture[index]->bitmap = (void *)get_zeroed_page(gfp);
    if (!dma_dom->aperture[index]->bitmap)
            goto out_free;
    ...
    out_free:
            update_domain(&dma_dom->domain);

            free_page((unsigned long)dma_dom->aperture[index]->bitmap);

            kfree(dma_dom->aperture[index]);
            dma_dom->aperture[index] = NULL;

            return -ENOMEM;
    }

Now, free_page(), which becomes free_pages(addr, 0), just returns if addr is 0/NULL, but what is there to do in update_domain(), given the failed alloc? Shouldn't out_free be placed *after* the free_page()?

(6) Finally, this looks like linux upstream commit 17f5b569e09cfa3488eaa663cbf9feada2e789f5.

(In reply to comment #2)
> (1) when DMA address space is allocated, it's done in 128MB chunks:
>     dma_dom->aperture_size += APERTURE_RANGE_SIZE;

Correct.

> (2) not sure how that range doesn't run into/overlap device MMIO chunks

Is this important? The ranges only describe DMA address space, not system address space.

> (3) when the allocation is the last 128MB chunk below 0xfee00000 & the next
> one will cross that boundary, then reserve that chunk of address space [...]

Better read it as: when a new range is allocated covering the MSI address range, then reserve that chunk of DMA address space.

> (4) I'm not sure how this code handles the case where the last chunk of
> allocation is exactly 128MB below MSI_ADDR_BASE_LO [...] So, shouldn't the
> check be >= MSI_ADDR_BASE_LO?

This shouldn't happen. The ranges are always aligned to 128MB and 0xfee00000 has an offset of 110MB into the last range, so the check always works.

> (5) finally, looking at the whole function, the out_free label seems
> incorrectly placed [...] Shouldn't out_free be placed *after* the free_page()?

I don't think there is a problem. Calling free_page() is safe with addr == NULL, so an additional label is not necessary. The update_domain() function is also a NOP if the domain hasn't changed.

The purpose of update_domain() is to flush the DTEs that reference the domain if necessary. The AMD IOMMU driver may switch from a 2-level to a 3-level page-table transparently (this happens when the address space is extended above 1GB). This action requires such a flush to tell the IOMMUs in the system the new page-table root.

> (6) Finally, this looks like linux upstream commit
> 17f5b569e09cfa3488eaa663cbf9feada2e789f5.

Right. This is the commit-id I referenced in the patch too.
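The alignment argument in the reply above can be checked with a few lines of arithmetic. The following is a minimal stand-alone sketch (not part of the patch); APERTURE_RANGE_SIZE and MSI_ADDR_BASE_LO are local stand-ins carrying the values cited in this discussion, not the kernel's own headers.

```c
/*
 * Minimal sketch: show which 128MB aperture range contains the MSI base
 * and why the strict '<' / '>' checks in the patch are sufficient.
 */
#include <stdio.h>

#define APERTURE_RANGE_SIZE (128ULL * 1024 * 1024)   /* 128MB per range */
#define MSI_ADDR_BASE_LO    0xfee00000ULL

int main(void)
{
    /* start of the range that contains the MSI base, and its end */
    unsigned long long old_size = MSI_ADDR_BASE_LO & ~(APERTURE_RANGE_SIZE - 1);
    unsigned long long new_size = old_size + APERTURE_RANGE_SIZE;
    unsigned long long offset   = MSI_ADDR_BASE_LO - old_size;

    /* prints: range 0xf8000000-0x100000000, offset 110MB */
    printf("range: %#llx-%#llx, MSI base offset into it: %lluMB\n",
           old_size, new_size, offset >> 20);

    /* the reservation triggers exactly when this range is added */
    if (old_size < MSI_ADDR_BASE_LO && new_size > MSI_ADDR_BASE_LO)
        printf("MSI range would be reserved in this aperture range\n");

    return 0;
}
```

Because 0xfee00000 is not a multiple of 128MB, it can never coincide with a range boundary, which is the reason the strict comparisons are enough as long as those constants keep their current values.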
(In reply to comment #3)
> > (1) when DMA address space is allocated, it's done in 128MB chunks:
> Correct.

Thanks.

> > (2) not sure how that range doesn't run into/overlap device MMIO chunks
> Is this important? The ranges only describe DMA address space, not system
> address space.

Well, if a system puts PCI(e) devices between 3.5G->4.0G *on the PCIe address space*, how does this allocation keep MMIO device ranges out of the DMA (iova) address space? Is that done in another function after allocating that address space?

> Better read it as: when a new range is allocated covering the MSI address
> range, then reserve that chunk of DMA address space.

That's what I tried to say, but failed, I guess. Thanks for the confirmation.

> This shouldn't happen. The ranges are always aligned to 128MB and 0xfee00000
> has an offset of 110MB into the last range, so the check always works.

Right. But if either parameter changes -- the 128MB size, its alignment, or the MSI address alignment -- the logic is flawed. I'm just not fond of flawed logic that depends on knowing that it's 'dodging a bullet'.

> I don't think there is a problem. Calling free_page() is safe with
> addr == NULL, so an additional label is not necessary. The update_domain()
> function is also a NOP if the domain hasn't changed.

But these calls are made on allocation failure, so why make them at all? It seems like unnecessary overhead. Yes, it works due to the NULL checks, but removing overhead related to IOMMU mapping should be a primary goal/effort.

> Right. This is the commit-id I referenced in the patch too.

Yes, when you look at the attachment. We typically like to see that highlighted in the backport request itself (i.e., 'this is an upstream accepted patch, and here's the commit id').

Thanks for the confirmations. Can you provide a test script that shows the failure, which I can include in the bz? This would make QE's acceptance much faster, as would the final approval of the patch in the release.
> > > (2) not sure how that range doesn't run into/overlap device MMIO chunks
> Well, if a system puts PCI(e) devices between 3.5G->4.0G *on the PCIe address
> space*, how does this allocation keep MMIO device ranges out of the DMA (iova)
> address space? Is that done in another function after allocating that address
> space?

I think it is not necessary to remove MMIO ranges from the allocation space. The IOMMU only translates device DMA requests, not CPU requests to PCI address space. Further, when the IOMMU is enabled PCI-ACS is enabled too, so all inter-device requests are routed through and translated by the IOMMU as well.

> Right. But if either parameter changes -- the 128MB size, its alignment, or
> the MSI address alignment -- the logic is flawed.

Okay, I doubt that any of these parameters will change in the RHEL6 life-time, but I can certainly change the check if you want.

> But these calls are made on allocation failure, so why make them at all?
> It seems like unnecessary overhead.

The out_free label is also used when an alloc_pte call fails. The alloc_pte path may change the page-table from 2-level to 3-level, so in case a few alloc_pte calls succeeded the page-table may already have been switched to 3-level. When an alloc_pte call then fails in the middle, the update_domain() call is required because in such a failure case the page-table is not switched back to 2-level.

> We typically like to see that highlighted in the backport request itself
> (i.e., 'this is an upstream accepted patch, and here's the commit id').

Okay, will do that next time.

> Can you provide a test script that shows the failure, which I can include in
> the bz?

Well, I fear there is no simple script for this problem. My test-case which triggered this problem was:

1. Linux with the IOMMU driver enabled
2. FGLRX installed (a buggy version that leaks DMA handles in some cases)
3. VMware Workstation installed, with a Win7 32-bit guest, 3D enabled in the guest
4. Run a special 3D testing application (Dolphin App) in the Win7 guest

This setup ate up all 4GB of DMA address space for the graphics card within minutes.
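For readers following the out_free discussion in the last two comments, here is a generic, user-space sketch of the error-handling idiom being defended: a single cleanup label works because the free routines tolerate NULL and the flush step is a no-op unless earlier steps actually changed state. All names here are illustrative stand-ins, not the driver's code.

```c
#include <stdlib.h>

struct range {
    void *bitmap;
};

/* stands in for update_domain(): only acts if something actually changed */
static void flush_if_changed(int *changed)
{
    if (*changed) {
        /* ... tell the hardware about the new page-table root ... */
        *changed = 0;
    }
}

static int add_range(struct range **slot, int *domain_changed)
{
    struct range *r = calloc(1, sizeof(*r));
    if (!r)
        return -1;              /* nothing to clean up yet */
    *slot = r;

    r->bitmap = calloc(1, 4096);
    if (!r->bitmap)
        goto out_free;

    /*
     * Later steps (page-table extension, etc.) may set *domain_changed and
     * can still fail with goto out_free, which is why the cleanup path
     * has to flush even though this particular allocation failed.
     */
    return 0;

out_free:
    flush_if_changed(domain_changed);   /* NOP if nothing changed */
    free(r->bitmap);                    /* free(NULL) is safe, like free_page(0) */
    free(r);
    *slot = NULL;
    return -1;
}
```

The same shape explains why the kernel function can reach out_free both from the bitmap allocation failure and from a later alloc_pte failure without needing separate labels.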
(In reply to comment #5)
> I think it is not necessary to remove MMIO ranges from the allocation space.
> The IOMMU only translates device DMA requests, not CPU requests to PCI
> address space.

I understand the use & translation of an IOMMU...

> Further, when the IOMMU is enabled PCI-ACS is enabled too, so all
> inter-device requests are routed through and translated by the IOMMU as well.

PCI-ACS is *optional* in a PCIe switch. The call to enable it is in the primary PCI code, but it is optional. There are known PCIe switches that have it, and known PCIe switches that _do not_ have it. Thus, if your code's logic is premised on ACS throughout a PCIe hierarchy, then you should add some checking code for that condition, or modify your code to deal with the situation where ACS is not available. qemu-kvm had to provide a hook that allows device assignment for non-ACS configurations, since without it there is a security problem (one PCIe device can be maliciously programmed to DMA into/out of another PCIe device, causing unknown problems -- minimally, it can easily crash the host kernel).

I had the impression that this alloc was allocating iova (I/O virtual address space, i.e., free bus address space), thus the concern. But I'm not familiar with the AMD-IOMMU code, so why the 128MB heuristic, why just part of the bus address space is allocated, etc., are the reasons for the questions. I'll have to study the AMD-IOMMU code further.

> Okay, I doubt that any of these parameters will change in the RHEL6
> life-time, but I can certainly change the check if you want.

If there's ever a (rhel6) problem, we'll re-assign the bug to you! ;-)

> The out_free label is also used when an alloc_pte call fails. [...] When an
> alloc_pte call then fails in the middle, the update_domain() call is required
> because in such a failure case the page-table is not switched back to 2-level.

OK. I figured the other levels would have done their own flush, but using this point as the last alloc in the chain, and depending on the flush, would be nice to comment in the code.

> Well, I fear there is no simple script for this problem. My test-case which
> triggered this problem was: [...] This setup ate up all 4GB of DMA address
> space for the graphics card within minutes.

Hmmm, I wonder how the RH virt manager will feel when I ask for a VMware license for testing! ;-) Some test setup; we'll have to contrive a simpler setup. Thanks for all the info.
(In reply to comment #6)
> PCI-ACS is *optional* in a PCIe switch. [...] Thus, if your code's logic is
> premised on ACS throughout a PCIe hierarchy, then you should add some
> checking code for that condition, or modify your code to deal with the
> situation where ACS is not available.

Ok, ACS is optional. But all chips with an AMD IOMMU also support ACS, I checked that today. These are the RD890 NB chip and the Trinity APU. So there should be no problem, I think.

> If there's ever a (rhel6) problem, we'll re-assign the bug to you! ;-)

Fair enough ;-)

> Hmmm, I wonder how the RH virt manager will feel when I ask for a VMware
> license for testing! ;-) Some test setup; we'll have to contrive a simpler
> setup. Thanks for all the info.

You are welcome. Let me know if you need any additional info :)
>
> Ok, ACS is optional. But all chips with an AMD IOMMU also support ACS, I
> checked that today. These are the RD890 NB chip and the Trinity APU. So there
> should be no problem, I think.
>
but not *all* PCIe modules (read: a quad-port Intel NIC using an Integrated Devices PCIe bridge) have ACS; the ACS assumption therefore breaks once one of these modules is plugged into an AMD system, because the PCIe up/down bridge pair in a switch chip without ACS support does not force inter-device requests through the IOMMU.
IOW, an AMD system doesn't own/control all of the components in the PCIe fabric.
libvirt currently tests the PCIe hierarchy in the host when doing device assignment, and if ACS support doesn't exist, by default, fails device assignment to a guest. It requires manually setting a run-in-unsecure-acs mode to enable such a configuration for device-assignment. This (optional) ACS issue has been well-known for over a year now.
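As a rough illustration of the kind of checking code suggested above, a driver could probe for the ACS extended capability before relying on it. This is a hedged sketch, not code from the AMD IOMMU driver or libvirt, and it assumes the PCI_EXT_CAP_ID_ACS / PCI_ACS_* definitions present in mainline kernels of that era.

```c
#include <linux/pci.h>

/* Does this switch port / endpoint expose ACS with the isolation bits we need? */
static bool port_has_acs(struct pci_dev *pdev)
{
    int pos;
    u16 cap;

    pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
    if (!pos)
        return false;   /* no ACS capability at all on this port */

    pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);

    /* require source validation, request redirect and upstream forwarding */
    return (cap & (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_UF)) ==
           (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_UF);
}
```

A caller would typically walk every bridge between the device and the root complex and only treat the path as isolated if each hop passes this check, which is essentially what libvirt's host-side test does.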
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

AMD affirmed they will QA based on the provided rpm in c#11, from Ravikiran.Thirumalai:
"Hi Don, yes we commit to test this. Please include the fix in 6.3. Thanks, Kiran"

Update:
We successfully tested the fix here in our environment. The issue is fixed and no new issue was found.

(In reply to comment #14)
> Update:
> We successfully tested the fix here in our environment. The issue is fixed
> and no new issue was found.

Thanks for prompt testing! I'll note it in our internal patch posting so the patches get into 6.3. Q: do you need these backported to 6.2.z? -- sounds like they should be (simple, fixes bugs months in advance of the 6.3 release); if so, set the 6.2.z flag to ?; if you can't, let me know and I will. Thanks.

A backport into 6.2.z makes sense too. The bug also triggers there.

Patch(es) available on kernel-2.6.32-238.el6

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously, the AMD IOMMU (input/output memory management unit) driver could use the MSI address range for DMA (direct memory access) addresses. As a consequence, DMA could fail and spurious interrupts would occur if this address range was used. With this update, the MSI address range is reserved to prevent the driver from allocating wrong addresses and DMA is now assured to work as expected in the described scenario.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2012-0862.html
Created attachment 555096 [details]
Patch that fixes the problem.

Description of problem:
AMD IOMMU driver hands out dma handles that are in the MSI range

Version-Release number of selected component (if applicable):
Kernels in RHEL 6.0 to 6.2

How reproducible:
Let a driver allocate up to 4GB of dma handles to get an address from the MSI range.

Steps to Reproduce:
1.
2.
3.

Actual results:
DMA will not work and spurious interrupts happen

Expected results:
DMA works and data is not corrupted. No spurious interrupts.

Additional info:
The attached patch fixes the bug in the RHEL6 kernel.
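No reproducer script was attached to this bug; the following is a hypothetical test helper sketching the "allocate up to 4GB of dma handles" idea from the description. It assumes a struct device behind the AMD IOMMU (for example &pdev->dev of a suitable PCI device), deliberately leaks pages and mappings, and is therefore only meant for a throwaway test kernel module.

```c
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>

#define MSI_RANGE_START 0xfee00000ULL
#define MSI_RANGE_END   0xfeefffffULL

/* Map pages without ever unmapping them, emulating a driver that leaks
 * DMA handles, and report any handle that lands in the MSI range. */
static int exhaust_dma_space(struct device *dev, unsigned long max_pages)
{
    unsigned long i;

    for (i = 0; i < max_pages; i++) {
        struct page *page = alloc_page(GFP_KERNEL);
        dma_addr_t handle;

        if (!page)
            return -ENOMEM;

        /* intentionally never dma_unmap_page()'d */
        handle = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, handle))
            return -EIO;

        if (handle >= MSI_RANGE_START && handle <= MSI_RANGE_END) {
            dev_err(dev, "got DMA handle 0x%llx inside the MSI range\n",
                    (unsigned long long)handle);
            return 0;   /* bug reproduced on an unpatched kernel */
        }
    }
    return -EAGAIN;     /* DMA address space not exhausted yet */
}
```

On a fixed kernel the loop should exhaust the aperture without ever returning a handle in the reserved 0xfee00000-0xfeefffff window.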