Bug 433554 - [RHEL5 U2] Kernel-xen PCI-DMA: Out of SW-IOMMU space for 57344 bytes at device 0000:03:04.0
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.2
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Stephen Tweedie
QA Contact: Martin Jenner
URL: http://rhts.redhat.com/testlogs/15796...
Whiteboard:
Duplicates: 436111 437031 438799 440229 441984 442094 442347
Depends On:
Blocks: 391501 445799
Reported: 2008-02-19 23:19 UTC by Jeff Burke
Modified: 2018-10-19 20:17 UTC (History)

Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 15:10:18 UTC
Target Upstream Version:
Embargoed:


Attachments
Boot log for kernel-xen-2.6.18-79.el5 (19.52 KB, text/plain)
2008-02-19 23:19 UTC, Jeff Burke
Boot log for kernel-xen-2.6.18-53.el5 (17.83 KB, text/plain)
2008-02-19 23:20 UTC, Jeff Burke
dmesg output for the machine in question (84.42 KB, text/plain)
2008-02-21 16:10 UTC, Jeff Moyer
Don't perform unnecessarily swiotlb copies (3.12 KB, patch)
2008-03-14 14:27 UTC, Stephen Tweedie
Detect physically-contiguous pages when determining if memory spans a page boundary (3.61 KB, patch)
2008-03-28 13:10 UTC, Stephen Tweedie


Links
System: Red Hat Product Errata
ID: RHBA-2008:0314
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages for Red Hat Enterprise Linux 5.2
Last Updated: 2008-05-20 18:43:34 UTC

Description Jeff Burke 2008-02-19 23:19:43 UTC
Description of problem:
 While booting ibm-ls21-7972-01.lab.boston.redhat.com with
kernel-xen-2.6.18-79.el5, the system reports the following:

PCI-DMA: Out of SW-IOMMU space for 57344 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
printk: 44 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
printk: 757 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 32768 bytes at device 0000:03:04.0


Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-79.el5

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL5.2-Server-20080212.0 on ibm-ls21-7972-01.lab.boston.redhat.com
  
Actual results:
printk: 67 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 57344 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
printk: 44 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
printk: 757 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 32768 bytes at device 0000:03:04.0
printk: 83 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:03:04.0
printk: 65 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 45056 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
printk: 9 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 12288 bytes at device 0000:03:04.0
printk: 77 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 16384 bytes at device 0000:03:04.0

Expected results:
We should not have to rate-limit printks on a normal boot. This is a sign of a
potentially bigger problem.

Additional info:

Comment 1 Jeff Burke 2008-02-19 23:19:43 UTC
Created attachment 295354 [details]
Boot log for kernel-xen-2.6.18-79.el5

Comment 2 Jeff Burke 2008-02-19 23:20:51 UTC
Created attachment 295355 [details]
Boot log for kernel-xen-2.6.18-53.el5

Comment 3 Jeff Burke 2008-02-21 13:53:50 UTC
                        < Notes from Chip Coldwell >

mptsas is causing this.  We may be wrong, but we don't know what other I/O device
might be doing large chunks of DMA.

Actually, not such a big assumption.  The boot log has this:

ACPI: PCI Interrupt 0000:03:04.0[A] -> GSI 19 (level, low) -> IRQ 16
mptbase: ioc0: Initiating bringup
ioc0: LSISAS1064 A3: Capabilities={Initiator}
scsi0 : ioc0: LSISAS1064 A3, FwRev=000a0f00h, Ports=1, MaxQ=511, IRQ=16
  Vendor: IBM-ESXS  Model: MAY2036RC         Rev: T106
  Type:   Direct-Access                      ANSI SCSI revision: 05

and the error messages are

PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0

I think it's pretty clear that the device at 0000:03:04.0 is mptsas.

--------------------------------------------------------------------------------

OK, mptsas_qcmd calls mptscsih_qcmd, which in turn will call either
pci_map_sg or pci_map_single, which are #defines for dma_map_(sg|single). 
That's the end of the code path that leads to that error message.  I cannot see
anywhere else where mptsas is calling into the SW-IOMMU.

mptsas_qcmd is installed as the .queuecommand method in the
mptsas_driver_template (an instance of struct scsi_host_template). This gets
called by scsi_dispatch_cmd, itself called by scsi_request_fn. What this boils
down to is that those requests are coming from I/Os submitted to the HBA.

                     < End of notes from Chip Coldwell >


Comment 4 Jeff Moyer 2008-02-21 16:09:58 UTC
I'm getting quite a few of these errors during boot with kernel 2.6.18-82.el5-xen:

PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:00:1f.2
ata1.00: exception Emask 0x0 SAct 0x40 SErr 0x0 action 0x0
ata1.00: cmd 61/00:30:d7:b7:7b/04:00:02:00:00/40 tag 6 ncq 524288 out
         res 40/00:1c:d7:a7:7b/00:00:02:00:00/40 Emask 0x40 (internal error)
ata1.00: status: { DRDY }
ata1.00: configured for UDMA/133
ata1: EH complete

dmesg shows this for device 0000:00:1f.2:

libata version 3.00 loaded.
ahci 0000:00:1f.2: version 3.0
GSI 22 sharing vector 0xD0 and IRQ 22
ACPI: PCI Interrupt 0000:00:1f.2[C] -> GSI 20 (level, low) -> IRQ 22
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq pm led clo pio slum part 
PCI: Setting latency timer of device 0000:00:1f.2 to 64
scsi0 : ahci
scsi1 : ahci
scsi2 : ahci
scsi3 : ahci
ata1: SATA max UDMA/133 abar m1024@0xff970000 port 0xff970100 irq 22
ata2: SATA max UDMA/133 abar m1024@0xff970000 port 0xff970180 irq 22
ata3: SATA max UDMA/133 abar m1024@0xff970000 port 0xff970200 irq 22
ata4: SATA max UDMA/133 abar m1024@0xff970000 port 0xff970280 irq 22
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7: SAMSUNG HD160JJ/P, ZM100-34, max UDMA7
ata1.00: 312500000 sectors, multi 8: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133


Comment 5 Jeff Moyer 2008-02-21 16:10:34 UTC
Created attachment 295526 [details]
dmesg output for the machine in question

Comment 6 Stephen Tweedie 2008-02-25 16:00:07 UTC
Jeff, for the later errors on PCI address 0000:00:1f.2 from comment #4, what's
the controller in question?

And do we know which was the latest kernel NOT to show these problems?


Comment 7 Jeff Burke 2008-02-25 16:15:32 UTC
Stephen,
   There are two Jeffs on this BZ. I see that you put in NEEDINFO for me, but can
you be a little more specific? Which comment are you asking about?

Thanks,
JeffB

Comment 8 Don Zickus 2008-02-25 16:39:08 UTC
Stephen, this is the chunk that causes the problem.  It was added in the -70.el5
kernel by Bill.  As you can see, all it does is enforce the dma restrictions,
nothing too serious.  The result is magnifying issues in the SCSI layer, it
appears.

Before, address_needs_mapping would fail because the whole 64-bit range was
masked (which is expected).  Now the code is checking to make sure the sg list
is a chain of pages, which the entries in the printks we are seeing clearly are not.

@@ -529,7 +529,9 @@ swiotlb_map_sg(struct device *hwdev, struct scatterlist *sg, int nelems,

        for (i = 0; i < nelems; i++, sg++) {
                dev_addr = SG_ENT_PHYS_ADDRESS(sg);
-               if (address_needs_mapping(hwdev, dev_addr)) {
+               if (range_straddles_page_boundary(page_to_pseudophys(sg->page)
+                                                 + sg->offset, sg->length)
+                   || address_needs_mapping(hwdev, dev_addr)) {
                        buffer.page   = sg->page;
                        buffer.offset = sg->offset;
                        map = map_single(hwdev, buffer, sg->length, dir);
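
For readers following the hunk above, here is a minimal user-space sketch in C of the condition it introduces. It is not the kernel code: `straddles_page_boundary`, `needs_bounce` and the `in_contiguous_region` flag are hypothetical stand-ins, and a 4 KiB page size is assumed. It illustrates why every multi-page segment reported in this bug (57344 bytes, 65536 bytes, and so on) is now routed to the swiotlb even when address_needs_mapping would not have asked for it.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12                       /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/*
 * Hypothetical model of the straddle check: a range straddles a page
 * boundary when its offset within the first page plus its length runs
 * past the end of that page.  The kernel helper also knows about regions
 * that Xen has explicitly made contiguous; this sketch reduces that
 * knowledge to a caller-supplied flag.
 */
static int straddles_page_boundary(unsigned long paddr, size_t len,
                                   int in_contiguous_region)
{
    unsigned long offset = paddr & ~PAGE_MASK;

    assert(len > 0);
    if (offset + len <= PAGE_SIZE)
        return 0;                   /* fits inside a single page */
    return !in_contiguous_region;   /* multi-page: only OK if known contiguous */
}

/*
 * Models the condition after the hunk: bounce through the swiotlb if the
 * range straddles a page boundary OR the device cannot address it.
 * Before the hunk, only the second term was tested.
 */
static int needs_bounce(unsigned long paddr, size_t len,
                        int in_contiguous_region, int addr_needs_mapping)
{
    return straddles_page_boundary(paddr, len, in_contiguous_region)
           || addr_needs_mapping;
}
```

With these definitions, needs_bounce(0x100000, 65536, 0, 0) returns 1 while needs_bounce(0x100000, 4096, 0, 0) returns 0, so a merged 64 KiB segment is bounced purely because of its size.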


Comment 9 Stephen Tweedie 2008-02-25 16:43:05 UTC
My question about the PCI address specifically referred to comment #4, so I'm
asking Jeff M...

But the question "do we know which was the latest kernel NOT to show these
problems?" is a general request applicable to all of the instances of swiotlb in
this BZ, so I'm leaving it open as NEEDINFO(reporter) in general, as I can't set
the request to multiple people in BZ.


Comment 10 Stephen Tweedie 2008-02-25 16:43:59 UTC
And to add yet _another_ person to the virtual NEEDINFO list... Don, do we have
confirmation that backing out that one section eliminates the messages?


Comment 11 Jeff Moyer 2008-02-25 17:31:41 UTC
(In reply to comment #6)
> Jeff, for the later errors on PCI address 0000:00:1f.2 from comment #4, what's
> the controller in question?
> 

00:1f.2 SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) SATA AHCI
Controller (rev 01) (prog-if 01 [AHCI 1.0])
        Subsystem: Dell Unknown device 01de
        Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 22
        I/O ports at fe00 [size=8]
        I/O ports at fe10 [size=4]
        I/O ports at fe20 [size=8]
        I/O ports at fe30 [size=4]
        I/O ports at fec0 [size=16]
        Memory at ff970000 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [80] Message Signalled Interrupts: 64bit- Queue=0/0 Enable-
        Capabilities: [70] Power Management version 2

> And do we know which was the latest kernel NOT to show these problems?

No, I haven't tried earlier kernels.  Would you like me to narrow it down?  It
will mean rebooting my workstation.


Comment 12 Don Zickus 2008-02-25 18:00:03 UTC
(In reply to comment #10)
No, I haven't confirmed it yet.  It seemed obvious, but then again I guess it
could be a combination with another patch (even though -70.el5 is mostly xen
patches).

Comment 13 Jeff Moyer 2008-02-25 18:29:05 UTC
(In reply to comment #8)
> Stephen, this is the chunk that causes the problem.  It was added in the -70.el5
> kernel by Bill.  As you can see, all it does is enforce the dma restrictions,
> nothing too serious.  The result is magnifying issues in the SCSI layer, it
> appears.
> 
> Before, address_needs_mapping would fail because the whole 64-bit range was
> masked (which is expected).  Now the code is checking to make sure the sg list
> is a chain of pages, which the entries in the printks we are seeing clearly are not.
> 
> @@ -529,7 +529,9 @@ swiotlb_map_sg(struct device *hwdev, struct scatterlist *sg, int nelems,
> 
>         for (i = 0; i < nelems; i++, sg++) {
>                 dev_addr = SG_ENT_PHYS_ADDRESS(sg);
> -               if (address_needs_mapping(hwdev, dev_addr)) {
> +               if (range_straddles_page_boundary(page_to_pseudophys(sg->page)
> +                                                 + sg->offset, sg->length)
> +                   || address_needs_mapping(hwdev, dev_addr)) {
>                         buffer.page   = sg->page;
>                         buffer.offset = sg->offset;
>                         map = map_single(hwdev, buffer, sg->length, dir);

Don, please include a patch name in the future so I don't have to go digging.

Looking at the patch
(linux-2.6-xen-handle-multi-page-segments-in-dma_map_sg.patch), the chunk you
mention is applied in arch/i386.  I am running on an x86_64 box.

Comment 14 Chris Lalancette 2008-02-25 18:37:06 UTC
JeffM,
     However, there is a *lot* of crossover between these two arches, especially
in the Xen case.  And, if you look in arch/x86_64/kernel/Makefile, you'll see
that the i386 version is built in the x86_64 case as well.  Another reason the
i386/x86_64 upstream merge was good, but we have to live with it for RHEL-5.

Chris Lalancette

Comment 15 Don Zickus 2008-02-25 18:47:07 UTC
Actually that isn't true.

If you look in lib/Makefile you will find that not only do x86_64 and i386 share
the same swiotlb.c file but it _differs_ from bare-metal.  Which explains why
you don't see it there.

Comment 16 Jeff Moyer 2008-02-26 15:28:46 UTC
I'm changing this back to ASSIGNED as all of the questions have been answered. 
Stephen, if you need to know if backing out that patch will fix things, then
I'll kick off a build.  Don seems convinced that the cause has been identified,
though.

Comment 17 Don Zickus 2008-03-10 13:47:20 UTC
*** Bug 436265 has been marked as a duplicate of this bug. ***

Comment 18 Don Zickus 2008-03-10 13:48:06 UTC
*** Bug 436111 has been marked as a duplicate of this bug. ***

Comment 19 Stephen Tweedie 2008-03-14 14:27:53 UTC
Created attachment 298053 [details]
Don't perform unnecessarily swiotlb copies

Possible fix: when we receive page-spanning scatter-gather segments which
happen to be machine-contiguous already, don't copy them via swiotlb
unnecessarily.

Comment 20 Stephen Tweedie 2008-03-14 14:28:54 UTC
Fuller log for the fix, copied straight from the patch header:

    xen dma: avoid unnecessarily SWIOTLB bounce buffering.
    
    On Xen kernels, BIOVEC_PHYS_MERGEABLE permits merging of disk IOs that
    span multiple pages, provided that the pages are both pseudophysically-
    AND machine-contiguous ---
    
        (((bvec_to_phys((vec1)) + (vec1)->bv_len) == bvec_to_phys((vec2))) && \
         ((bvec_to_pseudophys((vec1)) + (vec1)->bv_len) == \
          bvec_to_pseudophys((vec2))))
    
    However, this best-effort merging of adjacent pages can occur in
    regions of dom0 memory which just happen, by virtue of having been
    initially set up that way, to be machine-contiguous.  Such pages
    which occur outside of a range created by
    xen_create_contiguous_region won't be seen as contiguous by
    range_straddles_page_boundary(), so the pci-dma-xen.c code for
    dma_map_sg() will send these regions to the swiotlb for bounce
    buffering.
    
    In RHEL-5.1 this did not happen, because we did not have the check
    for range_straddles_page_boundary() in that code.  Now that that check
    has been added, these SG ranges --- which ARE machine contiguous and
    which can perfectly well be sent to a dma engine --- are being bounce-
    buffered in the swiotlb instead, causing a performance overhead and
    potentially leading to early swiotlb exhaustion.
    
    This patch adds a new check, check_pages_physically_contiguous(),
    to the swiotlb_map_sg() code to capture these ranges and map them
    directly via virt_to_bus() mapping rather than through the swiotlb.
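
The idea behind the new check can be sketched in user-space C as follows. This is an illustration under stated assumptions, not the attached patch: the `p2m` table, `pfn_to_mfn_model` and `pages_machine_contiguous` are hypothetical names, the table contents are made up, and 4 KiB pages are assumed. The sketch walks the pseudo-physical frames covered by a range and tests whether their machine frames are consecutive.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12                       /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical pseudo-physical -> machine frame table for a tiny domain. */
#define NR_FRAMES 8
static const unsigned long p2m[NR_FRAMES] = {
    100, 101, 102, 103,     /* pfn 0..3 happen to be machine-contiguous */
    200, 57, 58, 300        /* pfn 4..7 are scattered */
};

/* Stand-in for a pfn-to-mfn lookup; bounds-checked for the model only. */
static unsigned long pfn_to_mfn_model(unsigned long pfn)
{
    assert(pfn < NR_FRAMES);
    return p2m[pfn];
}

/*
 * Return 1 if every page touched by [offset, offset + length) starting
 * in frame `pfn` maps to consecutive machine frames, 0 otherwise.
 */
static int pages_machine_contiguous(unsigned long pfn, unsigned int offset,
                                    size_t length)
{
    unsigned long nr_pages = (offset + length + PAGE_SIZE - 1) >> PAGE_SHIFT;
    unsigned long first_mfn = pfn_to_mfn_model(pfn);
    unsigned long i;

    for (i = 1; i < nr_pages; i++) {
        if (pfn_to_mfn_model(pfn + i) != first_mfn + i)
            return 0;       /* gap in machine memory: must be bounced */
    }
    return 1;               /* safe to hand straight to the DMA engine */
}
```

In this model a four-page range starting at pfn 0 is reported as contiguous and could be mapped directly, while a two-page range starting at pfn 4 is not and would still go through the swiotlb.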
    


Comment 22 Jeff Moyer 2008-03-14 17:23:39 UTC
The patched kernel fixes the problem on my system;  I no longer see any of the
messages pertaining to SW-IOMMU exhaustion.

Comment 24 Jeff Burke 2008-03-14 21:35:09 UTC
The 2.6.18-85.el5.swiotlbfix test kernel fixes the issue seen in RHTS as well.


Comment 25 Chris Lalancette 2008-03-25 12:44:23 UTC
*** Bug 438799 has been marked as a duplicate of this bug. ***

Comment 26 Stephen Tweedie 2008-03-28 13:10:55 UTC
Created attachment 299461 [details]
Detect physically-contiguous pages when determining if memory spans a page boundary

Updates the previous patch (attachment 298053 [details]).  The same test is still
performed, but now in the core Xen dma layer, not in the swiotlb code, so the
fix still works if we run with swiotlb=off.
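
To illustrate why the placement matters, here is a small user-space model in C. Everything in it is an assumption made for illustration: the `p2m` contents, the names `machine_contiguous`, `range_needs_splitting` and `map_segment`, and in particular the REJECT outcome, which is only a placeholder for whatever the non-swiotlb path does with a range it cannot map directly. The point is that once the contiguity test lives in the shared predicate, both paths give the same answer for machine-contiguous ranges.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12                       /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical p2m table: pfn 0..3 map to consecutive machine frames. */
static const unsigned long p2m[] = { 40, 41, 42, 43, 90, 17 };
#define NR_FRAMES (sizeof(p2m) / sizeof(p2m[0]))

static int machine_contiguous(unsigned long pfn, unsigned long nr_pages)
{
    unsigned long i;

    assert(pfn + nr_pages <= NR_FRAMES);
    for (i = 1; i < nr_pages; i++)
        if (p2m[pfn + i] != p2m[pfn] + i)
            return 0;
    return 1;
}

/* Shared predicate consulted by every mapping path in this model. */
static int range_needs_splitting(unsigned long paddr, size_t size)
{
    unsigned long pfn = paddr >> PAGE_SHIFT;
    unsigned long offset = paddr & (PAGE_SIZE - 1);

    if (offset + size <= PAGE_SIZE)
        return 0;                           /* single page */
    return !machine_contiguous(pfn,
                (offset + size + PAGE_SIZE - 1) >> PAGE_SHIFT);
}

enum dma_action { MAP_DIRECT, BOUNCE, REJECT };

/* REJECT is a placeholder outcome for the swiotlb=off case. */
static enum dma_action map_segment(unsigned long paddr, size_t size,
                                   int swiotlb_enabled)
{
    if (!range_needs_splitting(paddr, size))
        return MAP_DIRECT;
    return swiotlb_enabled ? BOUNCE : REJECT;
}
```

Here map_segment(0, 4 * PAGE_SIZE, 0) and map_segment(0, 4 * PAGE_SIZE, 1) both return MAP_DIRECT, which is the behaviour the updated patch is described as preserving when swiotlb is off.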

Comment 29 Bill Burns 2008-03-28 17:29:59 UTC
Setting flags.


Comment 31 Chris Lalancette 2008-04-01 14:02:21 UTC
*** Bug 437031 has been marked as a duplicate of this bug. ***

Comment 33 Qian Cai 2008-04-03 06:56:28 UTC
*** Bug 440229 has been marked as a duplicate of this bug. ***

Comment 36 Don Zickus 2008-04-09 18:44:19 UTC
in kernel-2.6.18-89.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 37 Gurhan Ozen 2008-04-11 14:50:36 UTC
*** Bug 441984 has been marked as a duplicate of this bug. ***

Comment 39 Don Zickus 2008-04-16 20:47:38 UTC
in kernel-2.6.18-90.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 40 Don Zickus 2008-04-16 20:55:30 UTC
Sorry, disregard previous comment

Comment 41 Bill Burns 2008-04-17 14:07:17 UTC
*** Bug 442347 has been marked as a duplicate of this bug. ***

Comment 42 Bill Burns 2008-04-18 16:20:00 UTC
*** Bug 442094 has been marked as a duplicate of this bug. ***

Comment 43 Andrius Benokraitis 2008-04-18 18:24:38 UTC
Adding QLogic and EMC to this bug.

Comment 46 errata-xmlrpc 2008-05-21 15:10:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html


